Currently, the implementations of fromList for Data.Set and Data.Map
try to take advantage of inputs with some prefix in strictly increasing
order. They use the fast (O(n)) fromDistinctAscList algorithm for the
strictly ascending prefix, and then fall back on the slower (O(n log n))
naive algorithm for the rest. For Data.Set, this works roughly like the
implementation sketched at the bottom of this message.

This whole thing strikes me as a bit odd: changing just the first
element of the input list can have a substantial impact on the overall
performance.

In my opinion, there are two reasonable ways to implement this function:

1. Don't bother to try to take advantage of preexisting ordering. This
is the simplest option, and will probably be fastest for lists with
little preexisting order.

2. Try to take advantage of sorted ascending and descending "runs". A
simple version would be to break up the input list into ascending and
descending segments (merging duplicates on the fly), use
fromDistinctAscList or fromDistinctDescList to convert each segment,
and combine the segments using set unions. This option will be slower
for totally unsorted lists, but should degrade much more gradually
than the current algorithm as disorder is added. Wren suggests that
there may be other variations on the same theme that we could consider.

Should we

A. Switch to 1 (and offer 2 by another name)?
B. Switch to 2 (and offer 1 by another name)?
C. Leave fromList alone and offer 2 by another name?

Note that option 1 can be implemented quite easily outside the library
using insert and foldl'; option 2 is not something users should be
expected to write themselves.

If we add names, what should those be?
Thanks,
David Feuer

Current implementation sketch (using first from Data.Bifunctor):

fromList :: Ord a => [a] -> Set a
fromList xs = foldl' (flip insert) sortedSet unsortedList
  where
    sortedSet = fromDistinctAscList sortedList
    (sortedList, unsortedList) = spanStrictlyAscending xs

spanStrictlyAscending :: Ord a => [a] -> ([a], [a])
spanStrictlyAscending [] = ([], [])
spanStrictlyAscending (x : xs) = first (x :) (go x xs)
  where
    go prev (n : ns)
      | prev < n = first (n :) $ go n ns
    go _ ns = ([], ns)

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
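As the message notes, option 1 is essentially a one-liner outside the library. A minimal sketch using Data.Set's public API (the name naiveFromList is mine, not library code):

```haskell
import Data.List (foldl')
import qualified Data.Set as Set

-- Option 1: ignore any preexisting order and insert elements one at a
-- time. O(n log n) comparisons regardless of input order.
naiveFromList :: Ord a => [a] -> Set.Set a
naiveFromList = foldl' (flip Set.insert) Set.empty
```

For example, `Set.toList (naiveFromList [3,1,2,3])` yields `[1,2,3]`.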
The current implementation was really just an attempt to improve over
the old implementation, which did the naive construction. It did improve
things for the sorted list case to the point where there is very little
marginal benefit to using fromDistinctAscList any more, but it very much
isn't the final word on the topic. =)

Finding something that handles the middle cases where you have "mostly
sorted" data better would be a good idea, but it definitely needs to be
benchmarked to death, and it would help to know if the merges of runs
affect the overall asymptotic performance.

Timsort is both fast and popular and includes a similar pass (it also
happens to include handling strictly descending runs as mentioned), so
it seems that there is room to do something similar here.

-Edward

On Thu, Feb 9, 2017 at 3:20 PM, David Feuer <[hidden email]> wrote:
Merging the sorted runs will not hurt big-O complexity. When I replaced the union algorithms, Ross Paterson noted that the time bounds on the new ones were good enough that starting from singletons we can take the unions in any order and build the set/map in O(n log n) time. My main concern is about constant factors. Like you said, I'll have to benchmark a lot. On Feb 9, 2017 8:16 PM, "Edward Kmett" <[hidden email]> wrote:
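The "unions in any order" observation can be illustrated with a toy bottom-up build from singletons (fromListViaUnions is my name for this sketch, not library code):

```haskell
import qualified Data.Set as Set

-- Toy illustration: build a set by repeatedly unioning adjacent pairs.
-- With union's time bounds, starting from singletons this stays
-- O(n log n) however the unions are associated.
fromListViaUnions :: Ord a => [a] -> Set.Set a
fromListViaUnions = go . map Set.singleton
  where
    go []   = Set.empty
    go [s]  = s
    go sets = go (pairUp sets)

    -- union neighbouring sets, halving the number of sets per pass
    pairUp (a : b : rest) = Set.union a b : pairUp rest
    pairUp sets           = sets
```

The run-based proposal is this same idea, except the starting pieces are whole sorted runs rather than singletons.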
Dear all,
yes, as Edward says, the current implementation just alleviated the need
to use fromDistinctAscList -- it was quite a common case of using
fromList (because many functions produce sorted lists).

Some adaptive algorithm depending on the number of inversions (by
inversion I mean a pair $(x, y)$ such that $x$ precedes $y$ and
$x > y$), ideally with complexity $O(n \log(2 + I/n))$ where $I$ is the
number of inversions, would definitely be desirable.

I tried to experiment with it when I implemented this "use linear-time
algorithm if sorted" approach, but as noted in the e4e9a97 commit, I
have no idea how to keep the constant factors small.

Cheers,
Milan Straka

> -----Original message-----
> From: David Feuer <[hidden email]>
> Sent: 9 Feb 2017, 20:26
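For reference, the inversion count that the adaptive bound is stated in terms of can itself be computed in O(n log n) with a merge-sort-style pass. A sketch (not part of the proposal; `inversions` is my own name):

```haskell
-- Count the number of inversions in a list with a merge-sort-style
-- pass, O(n log n). An inversion is a pair (x, y) where x precedes y
-- in the list but x > y.
inversions :: Ord a => [a] -> Int
inversions = snd . sortCount
  where
    sortCount xs
      | length xs < 2 = (xs, 0)
      | otherwise =
          let (l, r)   = splitAt (length xs `div` 2) xs
              (sl, cl) = sortCount l
              (sr, cr) = sortCount r
              (m, cm)  = merge sl (length sl) sr
          in (m, cl + cr + cm)

    -- nl tracks how many left-hand elements remain; whenever a
    -- right-hand element goes first, it forms an inversion with every
    -- remaining left-hand element.
    merge xs _ [] = (xs, 0)
    merge [] _ ys = (ys, 0)
    merge (x:xs) nl (y:ys)
      | x <= y    = let (m, c) = merge xs (nl - 1) (y:ys) in (x : m, c)
      | otherwise = let (m, c) = merge (x:xs) nl ys in (y : m, c + nl)
```

For example, `inversions [3,1,2]` is 2 (the pairs (3,1) and (3,2)), and a sorted list has 0 inversions.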
I tried about the simplest thing that could possibly work, and it
doesn't seem terrible. As expected, it's pretty close to the current
implementation for strictly increasing lists, and a good deal better
for mostly-ordered lists (in either direction). Also as expected, it's
worse for random inputs, but not as much worse as I feared: it takes
about 26% longer. I suspect it's probably possible to do better. I
tried to eliminate the intermediate lists, but unfortunately that seems
to make it really messy to deal with merging duplicates. It may be
worth doing anyway, or there may be some other optimization I haven't
thought of.

spanIncreasing :: Ord a => [a] -> ([a], [a])
spanIncreasing [] = ([], [])
spanIncreasing (a : as) = first (a:) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(a : as) = case compare prev a of
      LT -> first (a :) $ go a as
      EQ -> go a as
      GT -> ([], q)

spanDecreasing :: Ord a => [a] -> ([a], [a])
spanDecreasing [] = ([], [])
spanDecreasing (a : as) = first (a:) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(a : as) = case compare prev a of
      GT -> first (a :) (go a as)
      EQ -> go a as
      LT -> ([], q)

fromList :: Ord a => [a] -> Set a
fromList = up empty
  where
    up !acc [] = acc
    up acc xs = case spanIncreasing xs of
      ([x], rest) -> down (insert x acc) rest
      (ups, rest) -> down (union (fromDistinctAscList ups) acc) rest

    down !acc [] = acc
    down acc xs = case spanDecreasing xs of
      ([x], rest) -> up (insert x acc) rest
      (downs, rest) -> up (union (fromDistinctDescList downs) acc) rest

On Fri, Feb 10, 2017 at 3:59 AM, Milan Straka <[hidden email]> wrote:
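One quick way to gain confidence in the run-based sketch above is to check it against the library's own fromList on a handful of inputs. The code from the message is repeated here so the snippet stands alone, with fromList renamed fromListRuns to avoid clashing with Data.Set.fromList (the rename is mine):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bifunctor (first)
import qualified Data.Set as Set

-- Split off a strictly increasing prefix, merging duplicates on the fly.
spanIncreasing :: Ord a => [a] -> ([a], [a])
spanIncreasing [] = ([], [])
spanIncreasing (a : as) = first (a :) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(b : bs) = case compare prev b of
      LT -> first (b :) (go b bs)
      EQ -> go b bs
      GT -> ([], q)

-- Mirror image: split off a strictly decreasing prefix.
spanDecreasing :: Ord a => [a] -> ([a], [a])
spanDecreasing [] = ([], [])
spanDecreasing (a : as) = first (a :) (go a as)
  where
    go _prev [] = ([], [])
    go prev q@(b : bs) = case compare prev b of
      GT -> first (b :) (go b bs)
      EQ -> go b bs
      LT -> ([], q)

-- Alternate between ascending and descending runs, unioning each run
-- into the accumulator.
fromListRuns :: Ord a => [a] -> Set.Set a
fromListRuns = up Set.empty
  where
    up !acc [] = acc
    up acc xs = case spanIncreasing xs of
      ([x], rest) -> down (Set.insert x acc) rest
      (ups, rest) -> down (Set.union (Set.fromDistinctAscList ups) acc) rest

    down !acc [] = acc
    down acc xs = case spanDecreasing xs of
      ([x], rest) -> up (Set.insert x acc) rest
      (downs, rest) -> up (Set.union (Set.fromDistinctDescList downs) acc) rest
```

On any input, fromListRuns should produce the same set as Data.Set.fromList; only the work done to build it differs.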
On Sat, Feb 11, 2017 at 8:56 PM, David Feuer <[hidden email]> wrote:
> spanIncreasing :: Ord a => [a] -> ([a], [a])
> spanIncreasing [] = ([], [])
> spanIncreasing (a : as) = first (a:) (go a as)
>   where
>     go _prev [] = ([], [])
>     go prev q@(a : as) = case compare prev a of
>       LT -> first (a :) $ go a as
>       EQ -> go a as
>       GT -> ([], q)

Suggest CPSing away the intermediate list:

spanIncreasing :: Ord a => r -> (r -> a -> r) -> [a] -> (r, [a])
spanIncreasing z _ [] = (z, [])
spanIncreasing z k (x:xs) = go z x xs
  where
    go !acc !x [] = (k acc x, [])
    go acc x yys@(y:ys)
      | x < y     = go (k acc x) y ys
      | x == y    = go acc x ys
      | otherwise = (k acc x, yys)

> spanDecreasing :: Ord a => [a] -> ([a], [a])
> spanDecreasing [] = ([], [])
> spanDecreasing (a : as) = first (a:) (go a as)
>   where
>     go _prev [] = ([], [])
>     go prev q@(a : as) = case compare prev a of
>       GT -> first (a :) (go a as)
>       EQ -> go a as
>       LT -> ([], q)

Ditto.

> fromList :: Ord a => [a] -> Set a
> fromList = up empty
>   where
>     up !acc [] = acc
>     up acc xs = case spanIncreasing xs of
>       ([x], rest) -> down (insert x acc) rest
>       (ups, rest) -> down (union (fromDistinctAscList ups) acc) rest
>
>     down !acc [] = acc
>     down acc xs = case spanDecreasing xs of
>       ([x], rest) -> up (insert x acc) rest
>       (downs, rest) -> up (union (fromDistinctDescList downs) acc) rest

Here, I suggest inlining the above to dynamically choose whether to
call `up` or `down`. E.g., something like (untested):

fromList :: Ord a => [a] -> Set a
fromList [] = empty
fromList (x0:xs0) = start empty x0 xs0
  where
    start !acc !x [] = insert x acc
    start acc x (y:ys) = case compare x y of
      LT -> up acc (singletonUp x) y ys
      EQ -> start acc x ys
      GT -> down acc (singletonDown x) y ys

    up !acc1 !acc2 !x [] = union acc1 (upSet (insertUp x acc2))
    up acc1 acc2 x (y:ys) = case compare x y of
      LT -> up acc1 (insertUp x acc2) y ys
      EQ -> up acc1 acc2 x ys
      GT -> start (union acc1 (upSet (insertUp x acc2))) y ys

    down !acc1 !acc2 !x [] = union acc1 (downSet (insertDown x acc2))
    down acc1 acc2 x (y:ys) = case compare x y of
      GT -> down acc1 (insertDown x acc2) y ys
      EQ -> down acc1 acc2 x ys
      LT -> start (union acc1 (downSet (insertDown x acc2))) y ys

where `insertUp` and `insertDown` are the incremental steps of
fromAscList/fromDescList, and `acc2` has whatever appropriate
intermediate type it needs for that to work, and upSet/downSet does the
conversion from that intermediate type into a standard Set. A naive but
workable intermediate type gives us:

singletonUp = (:[])
singletonDown = (:[])
insertUp = (:)
insertDown = (:)
upSet = fromDescList
downSet = fromAscList

(with (:) each run comes out reversed, hence the Asc/Desc swap). Though
we should be able to do better than that.

--
Live well,
~wren
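The sketch above can be made concrete by filling in the naive list-based intermediate type: insertUp/insertDown become (:), so each run accumulates in reverse and upSet/downSet are fromDistinctDescList/fromDistinctAscList respectively. A runnable version (fromListWren is my name for it, and this is my instantiation of the untested sketch, not its original form):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Set as Set

-- wren's sketch with plain lists as the intermediate run type.
fromListWren :: Ord a => [a] -> Set.Set a
fromListWren [] = Set.empty
fromListWren (x0:xs0) = start Set.empty x0 xs0
  where
    start !acc !x [] = Set.insert x acc
    start acc x (y:ys) = case compare x y of
      LT -> up acc [x] y ys
      EQ -> start acc x ys
      GT -> down acc [x] y ys

    -- acc2 is the current run; (:) stores it in reverse order
    up !acc1 !acc2 !x [] = Set.union acc1 (upSet (x : acc2))
    up acc1 acc2 x (y:ys) = case compare x y of
      LT -> up acc1 (x : acc2) y ys
      EQ -> up acc1 acc2 x ys
      GT -> start (Set.union acc1 (upSet (x : acc2))) y ys

    down !acc1 !acc2 !x [] = Set.union acc1 (downSet (x : acc2))
    down acc1 acc2 x (y:ys) = case compare x y of
      GT -> down acc1 (x : acc2) y ys
      EQ -> down acc1 acc2 x ys
      LT -> start (Set.union acc1 (downSet (x : acc2))) y ys

    -- ascending runs arrive reversed (strictly descending), and
    -- descending runs arrive reversed (strictly ascending)
    upSet   = Set.fromDistinctDescList
    downSet = Set.fromDistinctAscList
```

Unlike the two-pass version, this never materializes a run as a separate list before deciding whether to insert or union; the run and the decision are built in a single traversal.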