tldr: nub is abnormally slow, we shouldn't use it, but we do.
As you might know, Data.List.nub is O(n²). (*) As you might not know, almost *all* practical Haskell projects use it, and that in places where an Ord instance is given, e.g. happy, Xmonad, ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 more (see https://github.com/nh2/haskell-ordnub). I've taken the Ord-based O(n * log n) implementation from yi using a Set: ordNub :: (Ord a) => [a] -> [a] ordNub l = go empty l where go _ [] = [] go s (x:xs) = if x `member` s then go s xs else x : go (insert x s) xs and put benchmarks on http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html (compare `nub` vs `ordNub`). `ordNub` is not only in a different complexity class, but even seems to perform better than nub for very small numbers of actually different list elements (that's the numbers before the benchmark names). (The benchmark also shows some other potential problem: Using a state monad to keep the set instead of a function argument can be up to 20 times slower. Should that happen?) What do you think about ordNub? I've seen a proposal from 5 years ago about adding a *sort*Nub function started by Neil, but it just died. (*) The mentioned complexity is for the (very common) worst case, in which the number of different elements in the list grows with the list (alias you don't have an N element list with always only 5 different things inside). _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Similarly, I've always used: import qualified Data.HashSet as S nub :: Hashable a => [a] -> [a] And i can't think of any type which i can't write a Hashable instance, so this is extremely practical. On Jul 14, 2013 7:24 AM, "Niklas Hambüchen" <[hidden email]> wrote:
tldr: nub is abnormally slow, we shouldn't use it, but we do. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
One of my main points is:
Should we not add such a function (ord-based, same output as nub, stable, no sorting) to base? As the package counting shows, if we don't offer an alternative, people obviously use it, and not to our benefit. (Not to say it this way: We could make the Haskell world fast with smarter fusion, strictness analysis and LLVM backends. Or we could stop using quadratic algorithms.) _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Clark Gaebel-2
At Sun, 14 Jul 2013 07:31:05 -0400,
Clark Gaebel wrote: > Similarly, I've always used: > > import qualified Data.HashSet as S > > nub :: Hashable a => [a] -> [a] > nub = S.toList . S.fromList > > And i can't think of any type which i can't write a Hashable instance, so > this is extremely practical. Well, the above is not stable while Niklas’ is. But I guess that’s not the point of your message :). I’ve always avoided “nub” too, and FWIW I’d like a constrained version too—maybe avoiding Data.Set so that it could live in Data.List. I think Ord would be much better than Hashable, since it is 1. in “base” 2. much more established and understood. Although if you find yourself using “nub” too much you’re probably doing something wrong... Francesco _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Niklas Hambüchen
Oops sorry I guess my point wasn't clear. Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a free function in Data.HashSet. If stability is a concern then there's a way to easily account for that using HashMap. - Clark On Jul 14, 2013 7:48 AM, "Niklas Hambüchen" <[hidden email]> wrote:
One of my main points is: _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Niklas Hambüchen
Something like that should definitely be included in Data.List.
Thanks for working on it. Roman * Niklas Hambüchen <[hidden email]> [2013-07-14 19:20:52+0800] > tldr: nub is abnormally slow, we shouldn't use it, but we do. > > > As you might know, Data.List.nub is O(n²). (*) > > As you might not know, almost *all* practical Haskell projects use it, > and that in places where an Ord instance is given, e.g. happy, Xmonad, > ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 > more (see https://github.com/nh2/haskell-ordnub). > > I've taken the Ord-based O(n * log n) implementation from yi using a Set: > > ordNub :: (Ord a) => [a] -> [a] > ordNub l = go empty l > where > go _ [] = [] > go s (x:xs) = if x `member` s then go s xs > else x : go (insert x s) xs > > > and put benchmarks on > http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html > (compare `nub` vs `ordNub`). > > `ordNub` is not only in a different complexity class, but even seems to > perform better than nub for very small numbers of actually different > list elements (that's the numbers before the benchmark names). > > (The benchmark also shows some other potential problem: Using a state > monad to keep the set instead of a function argument can be up to 20 > times slower. Should that happen?) > > What do you think about ordNub? > > I've seen a proposal from 5 years ago about adding a *sort*Nub function > started by Neil, but it just died. > > > (*) The mentioned complexity is for the (very common) worst case, in > which the number of different elements in the list grows with the list > (alias you don't have an N element list with always only 5 different > things inside). > > _______________________________________________ > Haskell-Cafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Clark Gaebel-2
On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel <[hidden email]> wrote:
This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x' : _|_), but Niklas' ordNub will. His ordNub can be translated directly to HashSet and still have the stability and laziness properties.
A difficulty with putting ordNub in Data.List is that it depends on containers, which is outside of the base package. Some options: * Move the implementation of Set to base.
* Implement a lean version of Set in base that only provides 'insert' and 'member'. * Define ordNub in Data.Set instead. Adding a Hashable-based nub to base would be even more problematic, since you'd need Hashable in base.
_______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
On 15 July 2013 09:54, Joey Adams <[hidden email]> wrote:
> On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel <[hidden email]> wrote: >> >> Similarly, I've always used: >> >> import qualified Data.HashSet as S >> >> nub :: Hashable a => [a] -> [a] >> nub = S.toList . S.fromList >> >> And i can't think of any type which i can't write a Hashable instance, so >> this is extremely practical. > > This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x' > : _|_), but Niklas' ordNub will. His ordNub can be translated directly to > HashSet and still have the stability and laziness properties. > > A difficulty with putting ordNub in Data.List is that it depends on > containers, which is outside of the base package. Some options: > > * Move the implementation of Set to base. > > * Implement a lean version of Set in base that only provides 'insert' and > 'member'. > > * Define ordNub in Data.Set instead. > > Adding a Hashable-based nub to base would be even more problematic, since > you'd need Hashable in base. Right, I suggest the following community course of action: 1a) add ordNub to Data.Set 1b) add ordNub to Data.Hashable (1 day) 2) make a libraries@ proposal to include a stripped-down Data.Set-like balanced binary tree implementation to base. (2 weeks) 3) bikeshed about the name, eg.: * is "nub" really intuitive? how about "uniq", like in perl/ruby/underscore.js? * but "uniq" in unix only removes _adjacent_ duplicates, confusing! * how about "distinct"? "sole"? "unique"? "azygous"? (7 weeks) 4) Failing consensus on technical grounds (that the stripped-down Data.Set implementation is overkill for one library function), agree that anyone who really cares should just use the version from containers or hashable. Only newbs and textbook authors actually use base anyway, and it's impossible to change the language definition. Prelude will continue to fulfil its role of avoiding success at all costs, quadratic or otherwise. (Please, let's have both 1a and 1b :) Conrad. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Niklas Hambüchen
Just so people are aware - five years ago the notion of nubOrd and
nubWith was discussed and a consensus reached on including nubOrd. I think Bart got too busy, didn't submit a final patch, and no one with commit access actually commited any code. http://haskell.1045720.n5.nabble.com/GHC-2717-Add-nubWith-nubOrd-td3159919.html I fully support an efficient nub implementation making its way into base - it's far past time. Using Set seems sensible. Cheers, Thomas On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen <[hidden email]> wrote: > tldr: nub is abnormally slow, we shouldn't use it, but we do. > > > As you might know, Data.List.nub is O(n²). (*) > > As you might not know, almost *all* practical Haskell projects use it, > and that in places where an Ord instance is given, e.g. happy, Xmonad, > ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 > more (see https://github.com/nh2/haskell-ordnub). > > I've taken the Ord-based O(n * log n) implementation from yi using a Set: > > ordNub :: (Ord a) => [a] -> [a] > ordNub l = go empty l > where > go _ [] = [] > go s (x:xs) = if x `member` s then go s xs > else x : go (insert x s) xs > > > and put benchmarks on > http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html > (compare `nub` vs `ordNub`). > > `ordNub` is not only in a different complexity class, but even seems to > perform better than nub for very small numbers of actually different > list elements (that's the numbers before the benchmark names). > > (The benchmark also shows some other potential problem: Using a state > monad to keep the set instead of a function argument can be up to 20 > times slower. Should that happen?) > > What do you think about ordNub? > > I've seen a proposal from 5 years ago about adding a *sort*Nub function > started by Neil, but it just died. > > > (*) The mentioned complexity is for the (very common) worst case, in > which the number of different elements in the list grows with the list > (alias you don't have an N element list with always only 5 different > things inside). > > _______________________________________________ > Haskell-Cafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Clark Gaebel-2
On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <[hidden email]> wrote:
Did the point about "stable" fly overhead? brandon s allbery kf8nh sine nomine associates
unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Niklas Hambüchen
On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen <[hidden email]> wrote:
> tldr: nub is abnormally slow, we shouldn't use it, but we do. > > > As you might know, Data.List.nub is O(n²). (*) > > As you might not know, almost *all* practical Haskell projects use it, > and that in places where an Ord instance is given, e.g. happy, Xmonad, > ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 > more (see https://github.com/nh2/haskell-ordnub). > > I've taken the Ord-based O(n * log n) implementation from yi using a Set: > > ordNub :: (Ord a) => [a] -> [a] > ordNub l = go empty l > where > go _ [] = [] > go s (x:xs) = if x `member` s then go s xs > else x : go (insert x s) xs > > > and put benchmarks on > http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html > (compare `nub` vs `ordNub`). Richard Bird has a book, "Pearls of Functional Algorithm Design" that is meant to teach a form of deriving algorithms from the properties we ask of them. In this book, he gives a possible derivation of ordNub, simply called nub in the book, following the methodology he is teaching. He notes in the text that this derivation feels more complicated than it ought. Here is his version: http://lpaste.net/87625 I just sent you a pull request to add that one and S.toList . S.fromList that was suggested in this thread. I don't think those two implementations are faster than the others but it's nice to have them for completeness. Jason _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Hey Jason,
would you mind giving a short idea of what the point of Bird's implementation is / from what properties it is derived? Also, running the QuickCheck tests you added, it doesn't give the same output (order) as nub. On 15/07/13 13:26, Jason Dagit wrote: > Richard Bird has a book, "Pearls of Functional Algorithm Design" that > is meant to teach a form of deriving algorithms from the properties we > ask of them. In this book, he gives a possible derivation of ordNub, > simply called nub in the book, following the methodology he is > teaching. He notes in the text that this derivation feels more > complicated than it ought. > > Here is his version: http://lpaste.net/87625 > > I just sent you a pull request to add that one and S.toList . > S.fromList that was suggested in this thread. I don't think those two > implementations are faster than the others but it's nice to have them > for completeness. > > Jason > _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Brandon Allbery
Apologies. I was being lazy. Here's a stable version:
import qualified Data.HashSet as S hashNub :: (Ord a) => [a] -> [a] hashNub l = go S.empty l where go _ [] = [] go s (x:xs) = if x `S.member` s then go s xs else x : go (S.insert x s) xs Which, again, will probably be faster than the one using Ord, and I can't think of any cases where I'd want the one using Ord instead. I may just not be creative enough, though. - Clark On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery <[hidden email]> wrote: > On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <[hidden email]> wrote: >> >> Oops sorry I guess my point wasn't clear. >> >> Why ord based when hashable is faster? Then there's no reason this has to >> be in base, it can just be a > > Did the point about "stable" fly overhead? > > -- > brandon s allbery kf8nh sine nomine associates > [hidden email] [hidden email] > unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage. I'm sure the relative costs of comparison vs hashing would affect this also. But both are dramatically better than the current nub.
Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)? On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel <[hidden email]> wrote: Apologies. I was being lazy. Here's a stable version: _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
On 16 July 2013 11:46, John Lato <[hidden email]> wrote:
> In my tests, using unordered-containers was slightly slower than using Ord, > although as the number of repeated elements grows unordered-containers > appears to have an advantage. I'm sure the relative costs of comparison vs > hashing would affect this also. But both are dramatically better than the > current nub. > > Has anyone looked at Bart's patches to see how difficult it would be to > apply them (or re-write them)? If I understand correctly, this function is proposed to be added to Data.List which lives in base... but the proposals here are about using either Sets from containers or HashSet from unordered-containers; I thought base wasn't supposed to depend on any other package :/ > > > > On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel <[hidden email]> wrote: >> >> Apologies. I was being lazy. Here's a stable version: >> >> import qualified Data.HashSet as S >> >> hashNub :: (Ord a) => [a] -> [a] >> hashNub l = go S.empty l >> where >> go _ [] = [] >> go s (x:xs) = if x `S.member` s then go s xs >> else x : go (S.insert x s) xs >> >> Which, again, will probably be faster than the one using Ord, and I >> can't think of any cases where I'd want the one using Ord instead. I >> may just not be creative enough, though. >> >> >> - Clark >> >> On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery <[hidden email]> >> wrote: >> > On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <[hidden email]> >> > wrote: >> >> >> >> Oops sorry I guess my point wasn't clear. >> >> >> >> Why ord based when hashable is faster? Then there's no reason this has >> >> to >> >> be in base, it can just be a >> > >> > Did the point about "stable" fly overhead? >> > >> > -- >> > brandon s allbery kf8nh sine nomine >> > associates >> > [hidden email] >> > [hidden email] >> > unix, openafs, kerberos, infrastructure, xmonad >> > http://sinenomine.net >> >> _______________________________________________ >> Haskell-Cafe mailing list >> [hidden email] >> http://www.haskell.org/mailman/listinfo/haskell-cafe > > > > _______________________________________________ > Haskell-Cafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskell-cafe > -- Ivan Lazar Miljenovic [hidden email] http://IvanMiljenovic.wordpress.com _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
On Tue, Jul 16, 2013 at 10:31 AM, Ivan Lazar Miljenovic <[hidden email]> wrote:
That was one of the points up for discussion: is it worth including a subset of Set functionality to enable a much better nub in base? Is it even worth having Data.List.nub if it has quadratic complexity?
As an alternative, Bart's proposal was for both including ordNub in containers and an improved nub (with no dependencies outside base) in Data.List. Unfortunately the patches are quite old (darcs format), and I don't know how they'd apply to the current situation.
_______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
I'm procrastinating something else, so I wrote the patch to
unordered-containers. Feel free to comment on the github link: https://github.com/tibbe/unordered-containers/pull/67 I'm still against having an Ord version, since my intuition tells me that hash-based data structures are faster than ordered ones. Someone else can write the patch, though! As a tangent, can anyone think of a data structure for which you can write an Ord instance but Hashable/Eq is impossible (or prove otherwise)? How about the converse? Regards, - Clark On Mon, Jul 15, 2013 at 10:40 PM, John Lato <[hidden email]> wrote: > On Tue, Jul 16, 2013 at 10:31 AM, Ivan Lazar Miljenovic > <[hidden email]> wrote: >> >> On 16 July 2013 11:46, John Lato <[hidden email]> wrote: >> > In my tests, using unordered-containers was slightly slower than using >> > Ord, >> > although as the number of repeated elements grows unordered-containers >> > appears to have an advantage. I'm sure the relative costs of comparison >> > vs >> > hashing would affect this also. But both are dramatically better than >> > the >> > current nub. >> > >> > Has anyone looked at Bart's patches to see how difficult it would be to >> > apply them (or re-write them)? >> >> If I understand correctly, this function is proposed to be added to >> Data.List which lives in base... but the proposals here are about >> using either Sets from containers or HashSet from >> unordered-containers; I thought base wasn't supposed to depend on any >> other package :/ > > > That was one of the points up for discussion: is it worth including a subset > of Set functionality to enable a much better nub in base? Is it even worth > having Data.List.nub if it has quadratic complexity? > > As an alternative, Bart's proposal was for both including ordNub in > containers and an improved nub (with no dependencies outside base) in > Data.List. Unfortunately the patches are quite old (darcs format), and I > don't know how they'd apply to the current situation. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Ivan Lazar Miljenovic
On Mon, Jul 15, 2013 at 10:31 PM, Ivan Lazar Miljenovic <[hidden email]> wrote:
As I understand it, currently we have a double proposal: simple ordNub in Data.List without external dependencies, and the other one in containers and/or unordered-containers as appropriate.
brandon s allbery kf8nh sine nomine associates
unix, openafs, kerberos, infrastructure, xmonad http://sinenomine.net _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Clark Gaebel-2
On 16/07/2013, at 3:21 PM, Clark Gaebel wrote: > > I'm still against having an Ord version, since my intuition tells me > that hash-based data structures are faster than ordered ones. There are at least four different things that "an Ord version" might mean: - first sort a list, then eliminate duplicates - sort a list eliminating duplicates stably as you go (think 'merge sort', using 'union' instead of 'merge') - build a balanced tree set as you go - having a list that is already sorted, use that to eliminated duplicates cheaply. These things have different costs. For example, if there are N elements of which U are unique, the first as O(N.log N) cost, the third has O(N.log U) cost, and the fourth has O(N) cost. What I want is more often ordNubBy than ordNub, though. > Someone > else can write the patch, though! > > As a tangent, can anyone think of a data structure for which you can > write an Ord instance but Hashable/Eq is impossible (or prove > otherwise)? How about the converse? Since Ord has Eq as a superclass, and since 0 is a functionally correct hash value for anything, if you can implement Ord you can obviously implement Hashable/Eq. Whether it is *useful* to do so is another question. It turns out that it _is_ possible to define good quality hash functions on sets, but most code in the field to do so is pretty bad. (Just a modular sum or exclusive or.) _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
nubBy is a very good suggestion. Added!
Regarding good hash functions: if your data structure is algebraic, you can derive generic and Hashable will give you a pretty good hash function: > data ADT a = C0 Int String | C1 [a] > deriving Generic > > instance Hashable a => Hashable (ADT a) It's magic! - Clark On Mon, Jul 15, 2013 at 11:35 PM, Richard A. O'Keefe <[hidden email]> wrote: > > On 16/07/2013, at 3:21 PM, Clark Gaebel wrote: >> >> I'm still against having an Ord version, since my intuition tells me >> that hash-based data structures are faster than ordered ones. > > There are at least four different things that "an Ord version" might > mean: > > - first sort a list, then eliminate duplicates > - sort a list eliminating duplicates stably as you go > (think 'merge sort', using 'union' instead of 'merge') > - build a balanced tree set as you go > - having a list that is already sorted, use that to > eliminated duplicates cheaply. > > These things have different costs. For example, if there are N > elements of which U are unique, the first as O(N.log N) cost, > the third has O(N.log U) cost, and the fourth has O(N) cost. > > What I want is more often ordNubBy than ordNub, though. > >> Someone >> else can write the patch, though! >> >> As a tangent, can anyone think of a data structure for which you can >> write an Ord instance but Hashable/Eq is impossible (or prove >> otherwise)? How about the converse? > > Since Ord has Eq as a superclass, and since 0 is a functionally > correct hash value for anything, if you can implement Ord you > can obviously implement Hashable/Eq. Whether it is *useful* to > do so is another question. > > It turns out that it _is_ possible to define good quality hash > functions on sets, but most code in the field to do so is pretty bad. > (Just a modular sum or exclusive or.) > > _______________________________________________ > Haskell-Cafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Free forum by Nabble | Edit this page |