# Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

25 messages
12
Open this post in threaded view
|
Report Content as Inappropriate

## Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

 Motivation: ---------- Data.List.sort is a very important functionality in Haskell. I believe that the proposed implementation is: - significantly faster than the current implementation on unsorted lists, typically 14% to 27% faster - more laziness-friendly, i.e.:     take 3 \$ sort l   will require significantly less comparisons than the current implementation Proposed Implementation ----------------------- sort :: (Ord a) => [a] -> [a] sort =  sortBy compare sortBy cmp [] = [] sortBy cmp xs = head \$ until (null.tail) reduce (pair xs)   where     pair (x:y:t) | x `cmp` y == GT  = [y, x] : pair t                  | otherwise        = [x, y] : pair t     pair [x] = [[x]]     pair []  = []     reduce (v:w:x:y:t) = merge v' x' : reduce t                          where v' = merge v w                                x' = merge x y     reduce (x:y:t) = merge x y : reduce t     reduce xs      = xs     merge xs []           = xs     merge []  ys          = ys     merge xs@(x:xs') ys@(y:ys')          | x `cmp` y == GT  = y : merge xs  ys'          | otherwise        = x : merge xs' ys Effect and Interactions ----------------------- I have a stack project with a criterion test for this new implementation, available at https://github.com/greg7mdp/ghc-sort. I ran the tests on an Ubuntu 14.0.2 VM and GHC 8.0.2, and had the following results: - sorting of random lists of integers is 27% faster - sorting of random lists of strings is 14% faster - sorting of already sorted lists is significantly slower, but still much faster than sorting random lists - proposed version is more laziness friendly. For example this version of sortBy requires 11 comparisons to find   the smallest element of a 15 element list, while the default Data.List.sortBy requires 15 comparisons.   (see https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs) Test results ------------ Criterion output (descending/ascending results are for already sorted lists). I barely understand what Criterion does, and I am puzzled with the various "T" output - maybe there is a bug in my bench code: vagrant@vagrant-ubuntu-trusty-64:/vagrant\$ stack exec ghc-sort benchmarking ascending ints/ghc TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 160.6 ms   (153.4 ms .. 167.8 ms)                      0.997 R²   (0.986 R² .. 1.000 R²) mean                 161.7 ms   (158.3 ms .. 165.9 ms) std dev              5.210 ms   (3.193 ms .. 7.006 ms) variance introduced by outliers: 12% (moderately inflated) benchmarking ascending ints/greg TTTTTTTTTTTTTTTTtime                 473.8 ms   (398.6 ms .. 554.9 ms)                      0.996 R²   (0.987 R² .. 1.000 R²) mean                 466.2 ms   (449.0 ms .. 475.0 ms) std dev              14.94 ms   (0.0 s .. 15.29 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking descending ints/ghc TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 165.1 ms   (148.2 ms .. 178.2 ms)                      0.991 R²   (0.957 R² .. 1.000 R²) mean                 158.7 ms   (154.0 ms .. 164.3 ms) std dev              7.075 ms   (4.152 ms .. 9.903 ms) variance introduced by outliers: 12% (moderately inflated) benchmarking descending ints/greg TTTTTTTTTTTTTTTTtime                 471.7 ms   (419.8 ms .. 508.3 ms)                      0.999 R²   (0.995 R² .. 1.000 R²) mean                 476.0 ms   (467.5 ms .. 480.0 ms) std dev              7.447 ms   (67.99 as .. 7.865 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking random ints/ghc TTTTTTTTTTTTTTTTtime                 2.852 s    (2.564 s .. 3.019 s)                      0.999 R²   (0.997 R² .. 1.000 R²) mean                 2.812 s    (2.785 s .. 2.838 s) std dev              44.06 ms   (543.9 as .. 44.97 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking random ints/greg TTTTTTTTTTTTTTTTtime                 2.032 s    (1.993 s .. 2.076 s)                      1.000 R²   (1.000 R² .. 1.000 R²) mean                 2.028 s    (2.019 s .. 2.033 s) std dev              7.832 ms   (0.0 s .. 8.178 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking shakespeare/ghc TTTTTTTTTTTTTTTTtime                 6.504 s    (6.391 s .. 6.694 s)                      1.000 R²   (1.000 R² .. 1.000 R²) mean                 6.499 s    (6.468 s .. 6.518 s) std dev              28.85 ms   (0.0 s .. 32.62 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking shakespeare/greg TTTTTTTTTTTTTTTTtime                 5.560 s    (5.307 s .. 5.763 s)                      1.000 R²   (0.999 R² .. 1.000 R²) mean                 5.582 s    (5.537 s .. 5.607 s) std dev              39.30 ms   (0.0 s .. 43.49 ms) variance introduced by outliers: 19% (moderately inflated) Costs and Drawbacks ------------------- The only cost I see is the reduced performance when sorting already sorted lists. However, since this remains quite efficient, indeed over 4 times faster than sorting unsorted lists, I think it is an acceptable tradeoff. Final note ---------- My Haskell is very rusty. I worked on this a couple years ago when I was learning Haskell, and meant to propose it to the Haskell community, but never got to it at the time. _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Open this post in threaded view
|
Report Content as Inappropriate

## Re: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

 Thank you! This identifies a space leak in base which went unnoticed for 7 years.Your implementation can be improved further. Instead of splitting into pairs, you could instead split into lists of sorted sublists by replacing the pairs function with the following    pair = foldr f []      where        f x [] = [[x]]        f x (y:ys)          | x `cmp` head y == LT = (x:y):ys          | otherwise            = [x]:y:ysThis should give you the same performance improvements for sorting random lists, but better performance while sorting ascending lists.The version in base takes it one step further by using a DList to handle the descending case efficiently as well, except there's a space leak right now because of which it is slower.On Sun, Mar 26, 2017 at 7:21 AM, Gregory Popovitch wrote: Motivation: ---------- Data.List.sort is a very important functionality in Haskell. I believe that the proposed implementation is: - significantly faster than the current implementation on unsorted lists, typically 14% to 27% faster - more laziness-friendly, i.e.:     take 3 \$ sort l   will require significantly less comparisons than the current implementation Proposed Implementation ----------------------- sort :: (Ord a) => [a] -> [a] sort =  sortBy compare sortBy cmp [] = [] sortBy cmp xs = head \$ until (null.tail) reduce (pair xs)   where     pair (x:y:t) | x `cmp` y == GT  = [y, x] : pair t                  | otherwise        = [x, y] : pair t     pair [x] = [[x]]     pair []  = []     reduce (v:w:x:y:t) = merge v' x' : reduce t                          where v' = merge v w                                x' = merge x y     reduce (x:y:t) = merge x y : reduce t     reduce xs      = xs     merge xs []           = xs     merge []  ys          = ys     merge xs@(x:xs') ys@(y:ys')          | x `cmp` y == GT  = y : merge xs  ys'          | otherwise        = x : merge xs' ys Effect and Interactions ----------------------- I have a stack project with a criterion test for this new implementation, available at https://github.com/greg7mdp/ghc-sort. I ran the tests on an Ubuntu 14.0.2 VM and GHC 8.0.2, and had the following results: - sorting of random lists of integers is 27% faster - sorting of random lists of strings is 14% faster - sorting of already sorted lists is significantly slower, but still much faster than sorting random lists - proposed version is more laziness friendly. For example this version of sortBy requires 11 comparisons to find   the smallest element of a 15 element list, while the default Data.List.sortBy requires 15 comparisons.   (see https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs) Test results ------------ Criterion output (descending/ascending results are for already sorted lists). I barely understand what Criterion does, and I am puzzled with the various "T" output - maybe there is a bug in my bench code: vagrant@vagrant-ubuntu-trusty-64:/vagrant\$ stack exec ghc-sort benchmarking ascending ints/ghc TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 160.6 ms   (153.4 ms .. 167.8 ms)                      0.997 R²   (0.986 R² .. 1.000 R²) mean                 161.7 ms   (158.3 ms .. 165.9 ms) std dev              5.210 ms   (3.193 ms .. 7.006 ms) variance introduced by outliers: 12% (moderately inflated) benchmarking ascending ints/greg TTTTTTTTTTTTTTTTtime                 473.8 ms   (398.6 ms .. 554.9 ms)                      0.996 R²   (0.987 R² .. 1.000 R²) mean                 466.2 ms   (449.0 ms .. 475.0 ms) std dev              14.94 ms   (0.0 s .. 15.29 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking descending ints/ghc TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 165.1 ms   (148.2 ms .. 178.2 ms)                      0.991 R²   (0.957 R² .. 1.000 R²) mean                 158.7 ms   (154.0 ms .. 164.3 ms) std dev              7.075 ms   (4.152 ms .. 9.903 ms) variance introduced by outliers: 12% (moderately inflated) benchmarking descending ints/greg TTTTTTTTTTTTTTTTtime                 471.7 ms   (419.8 ms .. 508.3 ms)                      0.999 R²   (0.995 R² .. 1.000 R²) mean                 476.0 ms   (467.5 ms .. 480.0 ms) std dev              7.447 ms   (67.99 as .. 7.865 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking random ints/ghc TTTTTTTTTTTTTTTTtime                 2.852 s    (2.564 s .. 3.019 s)                      0.999 R²   (0.997 R² .. 1.000 R²) mean                 2.812 s    (2.785 s .. 2.838 s) std dev              44.06 ms   (543.9 as .. 44.97 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking random ints/greg TTTTTTTTTTTTTTTTtime                 2.032 s    (1.993 s .. 2.076 s)                      1.000 R²   (1.000 R² .. 1.000 R²) mean                 2.028 s    (2.019 s .. 2.033 s) std dev              7.832 ms   (0.0 s .. 8.178 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking shakespeare/ghc TTTTTTTTTTTTTTTTtime                 6.504 s    (6.391 s .. 6.694 s)                      1.000 R²   (1.000 R² .. 1.000 R²) mean                 6.499 s    (6.468 s .. 6.518 s) std dev              28.85 ms   (0.0 s .. 32.62 ms) variance introduced by outliers: 19% (moderately inflated) benchmarking shakespeare/greg TTTTTTTTTTTTTTTTtime                 5.560 s    (5.307 s .. 5.763 s)                      1.000 R²   (0.999 R² .. 1.000 R²) mean                 5.582 s    (5.537 s .. 5.607 s) std dev              39.30 ms   (0.0 s .. 43.49 ms) variance introduced by outliers: 19% (moderately inflated) Costs and Drawbacks ------------------- The only cost I see is the reduced performance when sorting already sorted lists. However, since this remains quite efficient, indeed over 4 times faster than sorting unsorted lists, I think it is an acceptable tradeoff. Final note ---------- My Haskell is very rusty. I worked on this a couple years ago when I was learning Haskell, and meant to propose it to the Haskell community, but never got to it at the time. _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Open this post in threaded view
|
Report Content as Inappropriate

## RE: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Open this post in threaded view
|
Report Content as Inappropriate

## Re: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Open this post in threaded view
|
Report Content as Inappropriate

## Re: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

 In reply to this post by Gregory Popovitch On Sun, Mar 26, 2017 at 9:19 AM, Gregory Popovitch <[hidden email]> wrote: > Is there a real need to optimize the sort for already sorted list? It's useful for data constructors with the precondition that the input is sorted.  I generally keep things in sorted order, but since sortedness is not tracked in the type and I'm not willing to face corruption if a transformation slightly de-sorts the list I just unconditionally sort the input.  It's nice to have that be relatively cheap for the common case. I'm just giving a case where it's useful, not advocating for the status quo.  I don't know how to weight the trade-off for a general purpose sort.  I'd personally be willing to switch to a specialized sort that works well on mostly sorted input if it made a performance difference. _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Open this post in threaded view
|
Report Content as Inappropriate

## RE: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Open this post in threaded view
|
Report Content as Inappropriate

## Re: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Open this post in threaded view
|
Report Content as Inappropriate

## Re: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

 In reply to this post by Gregory Popovitch Hi! > Test results See e.g.  https://github.com/rust-lang/rust/pull/38192for a good suite of test vectors for sort functions. I still remember first benchmarking a sort I wrote in university which performed well, but broke down (relatively) with test cases that are not random, sorted, rev sorted. Great work you two btw! Cheers,  Tob(ias Florek) _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries signature.asc (856 bytes) Download Attachment
Open this post in threaded view
|
Report Content as Inappropriate

## Re: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

 On Sun, Mar 26, 2017 at 2:05 PM, Tobias Florek wrote:Hi! > Test results See e.g.  https://github.com/rust-lang/rust/pull/38192That's a good idea actually. Maybe it would be a welcomed addition to the https://github.com/haskell-perf/sequences benchmarks.  for a good suite of test vectors for sort functions. I still remember first benchmarking a sort I wrote in university which performed well, but broke down (relatively) with test cases that are not random, sorted, rev sorted. Great work you two btw! Cheers,  Tob(ias Florek) _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Open this post in threaded view
|
Report Content as Inappropriate

## Re: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Open this post in threaded view
|
Report Content as Inappropriate

## RE: Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Open this post in threaded view
|
Report Content as Inappropriate

## RE: fix for Data.List.sortBy

Open this post in threaded view
|
Report Content as Inappropriate

## RE: fix for Data.List.sortBy

Open this post in threaded view
|
Report Content as Inappropriate

## Re: fix for Data.List.sortBy

Open this post in threaded view
|
Report Content as Inappropriate

## Re: fix for Data.List.sortBy

Open this post in threaded view
|
Report Content as Inappropriate

## RE: fix for Data.List.sortBy

Open this post in threaded view
|
Report Content as Inappropriate

## Re: fix for Data.List.sortBy

Open this post in threaded view
|
Report Content as Inappropriate

## RE: fix for Data.List.sortBy

 OK, so current proposed changes are below:   test results look good.   Thanks,   greg _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Open this post in threaded view
|
Report Content as Inappropriate

## Re: fix for Data.List.sortBy

 mergePairs (a:b:xs) = merge a b `seq` merge a b : mergePairs xsas well.On Tue, Mar 28, 2017 at 11:43 AM, Gregory Popovitch wrote: OK, so current proposed changes are below:   test results look good.   Thanks,   greg _______________________________________________ Libraries mailing list [hidden email] http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries