On 16 July 2013 10:31, Ivan Lazar Miljenovic <[hidden email]> wrote:
> On 16 July 2013 11:46, John Lato <[hidden email]> wrote: >> In my tests, using unorderedcontainers was slightly slower than using Ord, >> although as the number of repeated elements grows unorderedcontainers >> appears to have an advantage. I'm sure the relative costs of comparison vs >> hashing would affect this also. But both are dramatically better than the >> current nub. >> >> Has anyone looked at Bart's patches to see how difficult it would be to >> apply them (or rewrite them)? > > If I understand correctly, this function is proposed to be added to > Data.List which lives in base... but the proposals here are about > using either Sets from containers or HashSet from > unorderedcontainers; I thought base wasn't supposed to depend on any > other package :/ This discussion (on cafe@) is just about what course of action to take; adding such functions to containers or unorderedcontainers would not require a libraries@ proposal. Conrad. > >> >> >> >> On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel <[hidden email]> wrote: >>> >>> Apologies. I was being lazy. Here's a stable version: >>> >>> import qualified Data.HashSet as S >>> >>> hashNub :: (Ord a) => [a] > [a] >>> hashNub l = go S.empty l >>> where >>> go _ [] = [] >>> go s (x:xs) = if x `S.member` s then go s xs >>> else x : go (S.insert x s) xs >>> >>> Which, again, will probably be faster than the one using Ord, and I >>> can't think of any cases where I'd want the one using Ord instead. I >>> may just not be creative enough, though. >>> >>> >>>  Clark >>> >>> On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery <[hidden email]> >>> wrote: >>> > On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <[hidden email]> >>> > wrote: >>> >> >>> >> Oops sorry I guess my point wasn't clear. >>> >> >>> >> Why ord based when hashable is faster? Then there's no reason this has >>> >> to >>> >> be in base, it can just be a >>> > >>> > Did the point about "stable" fly overhead? >>> > >>> >  >>> > brandon s allbery kf8nh sine nomine >>> > associates >>> > [hidden email] >>> > [hidden email] >>> > unix, openafs, kerberos, infrastructure, xmonad >>> > http://sinenomine.net >>> >>> _______________________________________________ >>> HaskellCafe mailing list >>> [hidden email] >>> http://www.haskell.org/mailman/listinfo/haskellcafe >> >> >> >> _______________________________________________ >> HaskellCafe mailing list >> [hidden email] >> http://www.haskell.org/mailman/listinfo/haskellcafe >> > > > >  > Ivan Lazar Miljenovic > [hidden email] > http://IvanMiljenovic.wordpress.com > > _______________________________________________ > HaskellCafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskellcafe > _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Francesco Mazzoli
Francesco Mazzoli <[hidden email]> writes: >> import qualified Data.HashSet as S >> >> nub :: Hashable a => [a] > [a] >> nub = S.toList . S.fromList > Well, the above is not stable while Niklas’ is. But I guess that’s not > the point of your message :). We could also implement Data.BloomFilter.nub, which removes equal elements probabilistically (with a small but nonzero chance of removing some unique elements) :) k  If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Niklas Hambüchen
On 14.07.2013 13:20, Niklas Hambüchen wrote:
> I've taken the Ordbased O(n * log n) implementation from yi using a Set: > > ordNub :: (Ord a) => [a] > [a] > ordNub l = go empty l > where > go _ [] = [] > go s (x:xs) = if x `member` s then go s xs > else x : go (insert x s) xs > > (The benchmark also shows some other potential problem: Using a state > monad to keep the set instead of a function argument can be up to 20 > times slower. Should that happen?) I cannot say whether this should happen, but your code about can be straightforwardly refactored using a *Reader* monad. import Control.Monad.Reader import Data.Functor ((<$>)) import qualified Data.Set as Set  ifM still not in Control.Monad ifM mc md me = do { c < mc; if c then md else me } ordNub :: (Ord a) => [a] > [a] ordNub l = runReader (go l) Set.empty where go [] = return [] go (x:xs) = ifM ((x `Set.member`) <$> ask) (go xs) ((x :) <$> local (Set.insert x) (go xs)) test = ordNub [1,2,4,1,3,5,2,3,4,5,6,1] Of course, this does not lend itself to an application of filterM. In fact, your implementation is already in the (Set a >) reader monad, in normalized form. It looks already optimal to me. Cheers, Andreas  Andreas Abel <>< Du bist der geliebte Mensch. Theoretical Computer Science, University of Munich Oettingenstr. 67, D80538 Munich, GERMANY [hidden email] http://www2.tcs.ifi.lmu.de/~abel/ _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Richard A. O'Keefe
> Richard A. O'Keefe <ok <at> cs.otago.ac.nz> writes:
> > There are at least four different things that "an Ord version" might > mean: > >  first sort a list, then eliminate duplicates >  sort a list eliminating duplicates stably as you go > (think 'merge sort', using 'union' instead of 'merge') >  build a balanced tree set as you go >  having a list that is already sorted, use that to > eliminated duplicates cheaply. > > These things have different costs. For example, ... > > What I want is more often ordNubBy than ordNub, though. > (ordNubBy you can get via a suitable Ord instance for the element type?) The bigger problem is that you might not have a suitable Ord instance. After all, sets are defined by equality/equivalence relation, not necessarily by Ord. There are many other things you might want to do with a set/collection than just remove duplicates. I notice that Data.Set.map = fromList . (map stuff) . toList That is, build two lists (to be GC'd), as well as the set result. So does the performane cost of from/to List outrun the benefit of Data.Set.union? Depends how much you're mapping vs inserting and checking membership. _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Niklas Hambüchen
On 14/07/13 20:20, Niklas Hambüchen wrote:
> As you might not know, almost *all* practical Haskell projects use it, > and that in places where an Ord instance is given, e.g. happy, Xmonad, > ghcmod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600 > more (see https://github.com/nh2/haskellordnub). GHC uses nub. Also let me stress again that the n² case happens even if there are no duplicates. _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Roman Cheplyaka2
I would like to come back to the original question:
How can ordNub be added to base? I guess we agree that Data.List is the right module for a function of type Ord a => [a] > [a], but this introduces * a cyclic dependency between Data.List and Data.Set * a base dependency on containers. What is the right way to go with that? Should ordNub be introduced as part of Data.Set, as Conrad suggested? It does not really have anything to do with Set, apart from being implemented with it. On 14/07/13 14:12, Roman Cheplyaka wrote: > Something like that should definitely be included in Data.List. > Thanks for working on it. > > Roman _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen <[hidden email]> wrote: > > I would like to come back to the original question: > > How can ordNub be added to base? > > I guess we agree that Data.List is the right module for a function of > type Ord a => [a] > [a], but this introduces > > * a cyclic dependency between Data.List and Data.Set > * a base dependency on containers. > > What is the right way to go with that? > > Should ordNub be introduced as part of Data.Set, as Conrad suggested? > > It does not really have anything to do with Set, apart from being > implemented with it. I think nub's behavior is rather setrelated, so I don't really understand the objection to putting it in Data.Set. Anthony > >> On 14/07/13 14:12, Roman Cheplyaka wrote: >> Something like that should definitely be included in Data.List. >> Thanks for working on it. >> >> Roman > _______________________________________________ > HaskellCafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskellcafe HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
On 12/10/13 20:43, Anthony Cowley wrote:
> I think nub's behavior is rather setrelated, so I don't really understand the objection to putting it in Data.Set. In sets, the order does not matter, while for nub it does. nub :: Eq a => [a] > [a] ordNub :: Ord a => [a] > [a] both do not mention Set, and the fact that Set is used inside my proposed ordNub implementation is a detail not visible to the caller. That's why it looks like a Data.List function to me. _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Anthony Cowley
* Anthony Cowley <[hidden email]> [20131012 15:43:570400]
> > On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen <[hidden email]> wrote: > > > > I would like to come back to the original question: > > > > How can ordNub be added to base? > > > > I guess we agree that Data.List is the right module for a function of > > type Ord a => [a] > [a], but this introduces > > > > * a cyclic dependency between Data.List and Data.Set > > * a base dependency on containers. > > > > What is the right way to go with that? > > > > Should ordNub be introduced as part of Data.Set, as Conrad suggested? > > > > It does not really have anything to do with Set, apart from being > > implemented with it. > > I think nub's behavior is rather setrelated, so I don't really > understand the objection to putting it in Data.Set. because it clearly acts on lists. Therefore, it belongs to Data.List. Besides, we already have the precedent of the slow nub being in Data.List. Roman _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe signature.asc (853 bytes) Download Attachment 
In reply to this post by Niklas Hambüchen
> Niklas Hambüchen <mail <at> nh2.me> writes:
> > In sets, the order does not matter, while for nub it does. > Let's be careful here!. Niklas, when you say "order", do you mean: * the _ordering_ from the Ord instance? Or * the relative sequence of elements in the list? > ... the fact that Set is used inside my proposed > ordNub implementation is a detail not visible to the caller. If you use the Set library, that fact may be very visible! Because Set resequences the whole list, as per its Ord instance. But List.nub preserves the list sequence (except for omitting duplicates). Furthermore, the Ord instance might compare two elements as EQ, even though their Eq instance says they're not equal. So a Setbased ordNub could end up returning: * not the same elements as List.nub * and/or not in the same list sequence I'd call that very much *visible* to the caller. > > That's why it looks like a Data.List function to me. > [BTW I am still less than convinced that overall a Setbased ordNub is significantly more efficient. I suspect it depends on how big is your list.] _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
On 13/10/13 21:42, AntC wrote:
>> Niklas Hambüchen <mail <at> nh2.me> writes: >> >> In sets, the order does not matter, while for nub it does. >> > > Let's be careful here!. Niklas, when you say "order", do you mean: > * the _ordering_ from the Ord instance? Or > * the relative sequence of elements in the list? > >> ... the fact that Set is used inside my proposed >> ordNub implementation is a detail not visible to the caller. > > If you use the Set library, that fact may be very visible! > Because Set resequences the whole list, as per its Ord instance. > > But List.nub preserves the list sequence (except for omitting duplicates). I mean *exactly* what you say here. ordNub behaves has the same behaviour as nub, while (Set.toList . Set.fromList) doesn't. > [BTW I am still less than convinced that overall a Setbased ordNub is > significantly more efficient. I suspect it depends on how big is your > list.] What do you mean? ordNub is clearly in a different complexity class, and the benchmarks that I provided show not only this, but also that ordNub is *always* faster than nub, even for singleton lists. _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
> Niklas Hambüchen <mail <at> nh2.me> writes:
> > > On 13/10/13 21:42, AntC wrote: > > ... > > If you use the Set library, that fact may be very visible! > > Because Set resequences the whole list, as per its Ord instance. > > > > But List.nub preserves the list sequence > > (except for omitting duplicates). > > I mean *exactly* what you say here. > > ordNub behaves has the same behaviour as nub, while (Set.toList . > Set.fromList) doesn't. > That's great, thank you. > > [BTW I am still less than convinced that overall a Setbased ordNub is > > significantly more efficient. I suspect it depends on how big is your > > list.] > > What do you mean? > > ordNub is clearly in a different complexity class, ... Yes, I'm not disputing that. > ... and the benchmarks that I provided show not only this, > but also that ordNub is *always* faster than nub, > even for singleton lists. Thanks Niklas, I hadn't spotted those benchmarks back in July. I'm surprised at that result for singletons (and for very small numbers of elements which are in fact each different). Especially because List's `nub` uses `filter` ==> fold, which should be tailrecursive. It seems to me that for small numbers, the Setbased approach still requires comparing each element to each other. Plus there's the overhead for building the Set and inserting each element into it  where `insert` again walks the Set to find the insertion point. Then here's a further possible optimisation, instead of making separate calls to `member` and `insert`: * Make a single call to insert' :: (Ord a) => a > Set a > (Bool, Set a) * The Bool returns True if already a member. * Else returns an updated Set in the snd, with the element inserted. _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
On 14/10/13 03:20, AntC wrote:
> Thanks Niklas, I hadn't spotted those benchmarks back in July. No worries :) > I'm surprised at that result for singletons > (and for very small numbers of elements which are in fact each different). I think one of the main reasons for the performance difference is that a list node and a Set binary tree node have pretty much the same performance, with the difference that in http://hackage.haskell.org/package/containers0.5.2.1/docs/src/DataSetBase.html data Set a = Bin {# UNPACK #} !Size !a !(Set a) !(Set a)  Tip there are strictness and unpack annotations, while for data [a] = []  a : [a]  pseudo syntax there are not. Good for us in this case, I guess. > It seems to me that for small numbers, the Setbased approach still > requires comparing each element to each other. This I don't understand. > Then here's a further possible optimisation, instead of making separate > calls to `member` and `insert`: This I understand again. Where do you get insert' from? containers doesn't seem to have it. Do you suggest adding it? Niklas _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
> Niklas Hambüchen <mail <at> nh2.me> writes:
> > > On 14/10/13 03:20, AntC wrote: > > ... > > Then here's a further possible optimisation, instead of making > > separate calls to `member` and `insert`: > > This I understand again. Where do you get insert' from? containers > doesn't seem to have it. Do you suggest adding it? > err, well I didn't have any specific library in mind. More there's a kind of 'folk idiom' for managing data structures, (this applies more for imperative code/updateinsitu than functional) that if you know the next thing you're going to do after failing to find an element is insert it, you might as well get on with the insert there and then. (It's a higherlevel analogue of a machine instruction decrementand branchifzero.) I'm looking at all the remarks about managing libraries and dependencies. Would it make sense to build a standalone version of Set purely to support ordNub? Then it needs only methods `empty` and `insertIfAbsent`. _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Niklas Hambüchen
On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen <[hidden email]> wrote:
> ordNub behaves has the same behaviour as nub, while (Set.toList . > Set.fromList) doesn't. Perhaps you mean that they produce exactly the same output when fully evaluated. But I doubt that their behavior is *exactly* the same in every aspect. Given the differences in strictness between sets and lists, can you prove that nub and nubOrd have *exactly* the same strictness properties in every possible case? > ...ordNub is *always* faster than nub, even for singleton lists. This is also a claim that I doubt. You say that you tested the case of singletons. How about this: 2 : replicate 100000 1 ++ [3] Well, you could optimize for that by having nubOrd squeeze runs of of equal elements in the input. But then how about something like this: [3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4] Or variations on that, cycling some other fairly small number of elements. Set containers must do extra comparisons, whereas the of cost running down the first few hundred elements of a linked list (as long as the compiler fuses the iteration down to a tight loop in the output code and you remain within the cache) is near zero. How could nubOrd be as fast as ord in these kinds of cases? I'm not saying that it wouldn't be worthwhile to add a standard welloptimized implementation of nubOrd somewhere. But nub is a very useful function. The only advantage of nubOrd is for very large lists where the complexity advantage overtakes the excellent constants of nub to provide better speed. In practice, the cases where that's worthwhile are unusual. I rarely bother with nubOrd over nub. Yitz _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
Administrator

On Wed, Oct 16, 2013 at 11:46 PM, Yitzchak Gale <[hidden email]> wrote:
+1  KimEe _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
In reply to this post by Yitzchak Gale
On 16/10/13 17:46, Yitzchak Gale wrote:
> On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen <[hidden email]> wrote: >> ordNub behaves has the same behaviour as nub, while (Set.toList . >> Set.fromList) doesn't. > > Perhaps you mean that they produce exactly the same > output when fully evaluated. But I doubt that their behavior > is *exactly* the same in every aspect. Given the differences > in strictness between sets and lists, can you prove that > nub and nubOrd have *exactly* the same strictness properties > in every possible case? Yes. The elements are are evaluated in the same order as in `nub`. We can see that by comparing the implementations. If you have objects that need not be fully evaluated to use (==) on them, and need not fully evaluated *in a different way* when using `compare` on them, then of course the strictness properties are different. That is why one function has "Eq a =>" and the other one "Ord a =>" at the front. >> ...ordNub is *always* faster than nub, even for singleton lists. > > This is also a claim that I doubt. You say that you tested the > case of singletons. How about this: > > 2 : replicate 100000 1 ++ [3] > > Well, you could optimize for that by having nubOrd squeeze > runs of of equal elements in the input. But then how about > something like this: > > [3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4] > > Or variations on that, cycling some other fairly small number of > elements. Set containers must do extra comparisons, whereas > the of cost running down the first few hundred elements of a linked > list (as long as the compiler fuses the iteration down to a tight > loop in the output code and you remain within the cache) is near zero. > How could nubOrd be as fast as ord in these kinds of cases? I make the following, really cool proposal: * You add your test cases that you just mentioned to the benchmark list in "ordnub.hs" * run `cabal build` * run `dist/build/ordnub/ordnub o report.html`. Then you will see that ordNub is, indeed, faster. > (as long as the compiler fuses the iteration down to a tight > loop in the output code and you remain within the cache) Even though a list is conceptually simpler than a set, a list node and a set node are not that different (as I explained in the last email). Both have a content and a pointer to the next node. The only difference is that the Set node is optimised with strictness and unboxing, and the list one is not. > The only advantage of nubOrd > is for very large lists where the complexity advantage overtakes > the excellent constants of nub to provide better speed. Sorry, I think this is absolutely invented. What "excellent constants" do you mean? How large are "very large lists"? I find lists of length 1000 a very common case. An `n log2 n` algorithm is already 100 times faster on those than an n² one. I do not dispute that there exist n² algorithms that are faster than others with better complexity for small inputs. nub is not one of them. _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
Niklas Hambüchen wrote:
> I make the following, really cool proposal:> > * You add your test cases that you just mentioned to the benchmark list > in "ordnub.hs"... No thanks. Those were not meant to be test cases to add to a benchmark suite  they were meant to get you thinking a little more deeply about execution models of Haskell in general, and the GHC execution model in particular, given your claim that nubOrd is better than ord is *every* case, in *every* sense of "better". But that is a side issue. Everyone agrees with you that it is useful to add a good implementation of nubOrd to the standard libraries in some convenient and logical place. Let's get back to focusing on that. Thanks, Yitz _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
On 17/10/13 17:56, Yitzchak Gale wrote:
> Let's get back to focusing on that. Agreed. My main question is still: Can the Data.List <> Data.Set issue be resolved? Is it OK to introduce a cyclic dependency between the two, making use of a .boot.hs file? _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
On 17.10.2013 20:52, Niklas Hambüchen wrote:
> Can the Data.List <> Data.Set issue be resolved? > > Is it OK to introduce a cyclic dependency between the two, making use of > a .boot.hs file? As long as the users of Data.List and Data.Set do not have to fiddle with {# SOURCE #} imports, I do not see an issue, so please go ahead. Cheers, Andreas P.S.: The Agda source has tons of .hsboot files, I hate them, but what can you do?  Andreas Abel <>< Du bist der geliebte Mensch. Theoretical Computer Science, University of Munich Oettingenstr. 67, D80538 Munich, GERMANY [hidden email] http://www2.tcs.ifi.lmu.de/~abel/ _______________________________________________ HaskellCafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskellcafe 
Free forum by Nabble  Edit this page 