Performance horrors


Performance horrors

Adrian Hey
Hello Folks,

I was looking at the definitions of nub (and hence nubBy) recently
in connection with a trac ticket and realised that this has O(n^2)
complexity! Ouch!

I was going to say I sure hope nobody uses this in real programs,
but I thought I'd better check first and I see that darcs seems to
use it in a few places. Hmm..

How did we ever get stuff like this in the standard libs? I can only
imagine this is a relic from the days when Haskell was used purely for
research or pedagogical purposes (not for real programs).

Seeing as practically all Eq instances are also Ord instances, at
the very least we could have O(n*(log n)) definitions for ..

nub :: Ord a => [a] -> [a]
nub = nubBy compare

nubBy :: (a -> a -> Ordering) -> [a] -> [a]
nubBy cmp xs = -- something using an AVL tree perhaps.

..or even better using the generalised trie stuff Jamie Brandon
has been working on.
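For concreteness, here is one way such an Ord-based nub might look,
sketched with Data.Set rather than AVL (the name ordNub is
illustrative, not a proposal for a particular implementation):

```haskell
import qualified Data.Set as Set

-- Order-preserving, lazy nub for Ord types: O(n*(log n)) overall.
-- A sketch only; an AVL- or trie-based version has the same shape.
ordNub :: Ord a => [a] -> [a]
ordNub = go Set.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `Set.member` seen = go seen xs            -- duplicate: skip it
      | otherwise           = x : go (Set.insert x seen) xs
```

Because each element is emitted before the rest of the list is
examined, this stays lazy and keeps first-occurrence order, unlike a
sort-based approach.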

Of course I'm not actually advocating this as it's probably bad policy
to have a simple standard lib dependent on any complex non-standard lib.
But if it just isn't possible to implement some functions with
reasonable efficiency using simple definitions only then I think they
really shouldn't be there at all. Instead we should make users "roll
their own" using whatever complex non-standard lib they want.

So could nub and nubBy be deprecated and an appropriate health warning
added to the Haddock?

In the meantime I think I'll put appropriate alternative definitions
in the next AVL release and ask Jamie Brandon to consider doing the same
in his generalised tries lib (GMap).

Also, looking at a few other functions there like union(By) and
intersection(By) makes me quite suspicious. Maybe we need a thorough
performance audit to get rid of implementations that are so insanely
inefficient they should *never* be used.

Regards
--
Adrian Hey




_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

Re: Performance horrors

Neil Mitchell
Hi

> I was looking at the definitions of nub (and hence nubBy) recently
> in connection with a trac ticket and realised that this is O(n^2)
> complexity! Ouch!

Complexity theory plus the Eq context makes that inevitable: with only
equality tests to work with, you can't have nub over Eq in anything
less than O(n^2).

> I was going to say I sure hope nobody uses this in real programs,
> but I thought I'd better check first and I see that darcs seems to
> use it in a few places. Hmm..

I use it all the time; it's dead handy. There are 12 instances in
Hoogle, for example. If profiling later shows it to be a problem, I'd
fix it - but I can't ever actually remember that being the case.
Premature optimisation is the root of all evil.

> Seeing as practically all Eq instances are also Ord instances, at
> the very least we could have O(n*(log n)) definitions for ..
>
> nub :: Ord a => [a] -> [a]
> nub = nubBy compare
>
> nubBy :: (a -> a -> Ordering) -> [a] -> [a]
> nubBy cmp xs = -- something using an AVL tree perhaps.
>
> ..or even better using the generalised trie stuff Jamie Brandon
> has been working on.

Having nubOrd is a great idea, I even proposed it previously, but
people disagreed with me. Propose it and add it by all means.

> So could nub and nubBy be deprecated and an appropriate health warning
> added to the Haddock?

No. They should say O(n^2) in the haddock documentation, but O(n^2) /= useless.

Thanks

Neil

RE: Performance horrors

Bayley, Alistair-3
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of Neil Mitchell
>
> Complexity theory plus the Eq context makes that inevitable. You can't
> have nub over Eq in anything less than O(n^2)
>
> > I was going to say I sure hope nobody uses this in real programs,
> > but I thought I'd better check first and I see that darcs seems to
> > use it in a few places. Hmm..
>
> > nub :: Ord a => [a] -> [a]
> > nub = nubBy compare
> >
> Having nubOrd is a great idea, I even proposed it previously, but
> people disagreed with me. Propose it and add it by all means.


The name is... well, pessimal might be a bit strong, but few programmers
would think to look for something called "nub". Personally, when I first
looked for it I expected uniq or unique (because that's what the unix
utility that does the same thing is called). Distinct (from SQL) is
another name that occurred to me, but never nub... it's not even a
synonym for unique: http://thesaurus.reference.com/browse/unique

See the redefinition of nub as uniq here (which I assume is because John
didn't know about nub):
http://hackage.haskell.org/packages/archive/MissingH/1.0.0/doc/html/Data-List-Utils.html


The folklore (such as it is) for uniq is that it is trivially defined
like so (for lists):

> uniq = map head . group . sort

and so perhaps is not worthy of library inclusion? BTW, is this a
suitably performant definition, or would we benefit from a lower-level
implementation?


Alistair


Re: Performance horrors

Neil Mitchell
Hi

> The folklore (such as it is) for uniq is that it is trivially defined
> like so (for lists):
>
>> uniq = map head . group . sort
>
> and so perhaps is not worthy of library inclusion? BTW, is this a
> suitably performant definition, or would we benefit from a lower-level
> implementation?

A much better definition would use a Data.Set, then you get laziness
and the order of elements is not permuted. Having a sortNub as well is
a good idea though.

Thanks

Neil

Re: Performance horrors

Adrian Hey
In reply to this post by Neil Mitchell
Neil Mitchell wrote:
> There are 12 instances in
> Hoogle, for example. If profiling later shows it to be a problem, I'd
> fix it - but I can't ever actually remember that being the case.
> Premature optimisation is the root of all evil.

I'm sure most of the community would agree with you, but I have to
say that if the consequence of this philosophy is horrors like this
in the standard libs, it should come as no surprise that Haskell has
a reputation for being "slow". What else is lurking there I wonder?
What's really bad is that the terrible performance isn't even
documented. It may be obvious, but it should still be documented.

Has anybody even the remotest clue why darcs is (apparently) so slow?
Maybe the community itself should share some of the blame for this. It
certainly wasn't obvious to me that the uses of nub I saw in darcs
could rely on very short lists (<=1 element :-)

> Having nubOrd is a great idea, I even proposed it previously, but
> people disagreed with me. Propose it and add it by all means.

Like I said, I'm not proposing it, as it doesn't seem to be possible
to implement it efficiently using anything else in the standard libs.
You could do nubOrd (but not nubOrdBy) using Data.Set.

But there are 2 problems..
1- This not only introduces a cyclic dependency between modules, but
also packages. I'm not sure how well various compilers and cabal would
deal with this between them, but I'm not optimistic.

2- Data.Set is not obviously the best underlying implementation (in
fact it is obviously not the best underlying implementation, this and
Data.Map etc really should be pensioned off to hackage along with the
rest of the badly documented, unreliable, inefficient and unstable
junk :-)

So I still think they should be deprecated. It seems like the least bad
option if we can agree that their use should be strongly discouraged.

>> So could nub and nubBy be deprecated and an appropriate health warning
>> added to the Haddock?
>
> No. They should say O(n^2) in the haddock documentation, but O(n^2) /= useless.

I would say it is if there are many obvious O(n*(log n)) or better
implementations that can be used in 99%+ of cases. I mean so
obvious that users can quite easily roll their own in 3 lines
of code or less.

Regards
--
Adrian Hey


Re: Performance horrors

Neil Mitchell
Hi

> Has anybody even the remotest clue why darcs is (apparently) so slow?
> Maybe the community itself should share some of the blame for this. Like
> it wasn't obvious to me that the uses of nub I saw in darcs could rely
> on very short lists (<=1 element :-)

I'd be very surprised if it has anything to do with nub!

> You could do nubOrd (but not nubOrdBy) using Data.Set.

You can do nubOrdBy, just use the data type:

data Elem a = Elem (a -> a -> Ordering) a
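Spelled out as a sketch (assuming Data.Set as the backing set; the
point of the trick is that both the Eq and Ord instances delegate to
the stored comparison function):

```haskell
import qualified Data.Set as Set

-- Pair each element with the comparison so Data.Set can order it.
data Elem a = Elem (a -> a -> Ordering) a

instance Eq (Elem a) where
  Elem cmp x == Elem _ y = cmp x y == EQ

instance Ord (Elem a) where
  compare (Elem cmp x) (Elem _ y) = cmp x y

-- nubOrdBy via Data.Set, using wrapped elements as the "seen" set.
nubOrdBy :: (a -> a -> Ordering) -> [a] -> [a]
nubOrdBy cmp = go Set.empty
  where
    go _ [] = []
    go seen (x:xs)
      | Elem cmp x `Set.member` seen = go seen xs
      | otherwise = x : go (Set.insert (Elem cmp x) seen) xs
```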

> 1- This not only introduces a cyclic dependency between modules, but
> also packages. I'm not sure how well various compilers and cabal would
> deal with this between them, but I'm not optimistic.

Yep, that would be a pain.

> 2- Data.Set is not obviously the best underlying implementation (in
> fact it is obviously not the best underlying implementation, this and
> Data.Map etc really should be pensioned off to hackage along with the
> rest of the badly documented, unreliable, inefficient and unstable
> junk :-)

Data.Set is an interface, which you seem to think is not the fastest
implementation of that interface. If you can expose the same
interface, but improve the performance, and demonstrate that the
performance is better, then go for it! I'd support such a library
change proposal.

> So I still think they should be deprecated. It seems like the least bad
> option if we can agree that their use should be strongly discouraged.

Nope. I love nub. It's a beautiful function, which does exactly what
it says on the tin (albeit the tin is labelled in Latin).

> I would say it is if there are many obvious O(n.(log n)) or better
> implementations that can be used in in 99%+ of cases. I mean so
> obvious that users can quite easily roll their own in 3 lines
> of code or less.

None of them are Eq a => [a] -> [a].

If designing things from scratch I'd say it's fine to have a debate
about whether you make nub require Eq or Ord, and then provide the
other as an option. But that time has passed.

Thanks

Neil

Re: Performance horrors

Jeremy Shaw-3
In reply to this post by Adrian Hey
At Tue, 26 Aug 2008 08:38:45 +0100,
Adrian Hey wrote:

> I was looking at the definitions of nub (and hence nubBy) recently
> in connection with a trac ticket and realised that this is O(n^2)
> complexity! Ouch!

Can we modify Data.List to include the big-O notation for all the
functions similar to Data.Set, Data.Map, and bytestring?

j.

Re: Performance horrors

Gwern Branwen
In reply to this post by Bayley, Alistair-3
On 2008.08.26 12:09:05 +0100, "Bayley, Alistair" <[hidden email]> scribbled 1.5K characters:

> The name is... well, pessimal might be a bit strong, but few programmers
> would think to look for something called "nub". Personally, when I first
> looked for it I expected uniq or unique (because that's what the unix
> utility that does the same thing is called). Distinct (from SQL) is
> another name that occurred to me, but never nub... it's not even a
> synonym for unique: http://thesaurus.reference.com/browse/unique
>
> See the redefinition of nub as uniq here (which I assume is because John
> didn't know about nub):
> http://hackage.haskell.org/packages/archive/MissingH/1.0.0/doc/html/Data-List-Utils.html
>
> The folklore (such as it is) for uniq is that it is trivially defined
> like so (for lists):
>
> > uniq = map head . group . sort
>
> and so perhaps is not worthy of library inclusion? BTW, is this a
> suitably performant definition, or would we benefit from a lower-level
> implementation?
>
> Alistair
FWIW, when I tested out a number of nub/uniq definitions, the map head version performed the worst, much worse than the O(n^2) one everyone is complaining about. It's a neat definition, but not performant.

--
gwern
Iraq Hercules Bosnia Summercon Compsec 20 Albright EuroFed RDI encryption


Re: Performance horrors

Adrian Hey
Gwern Branwen wrote:
> FWIW, when I tested out a number of nub/uniq definitions, the map head version performed the worst, much worse than the O(n^2) one everyone is complaining about. It's a neat definition, but not performant.

When did you try this? IIRC even the old sort was O(n^2), but
someone had the sense to replace it a while ago.

With ghci now on my machine..
  length $ map head $ group $ sort [1..100000]
finishes "instantly", but..
  length $ nub [1..100000]
takes about 90 seconds.

Regards
--
Adrian Hey


Re: Performance horrors

John Meacham
In reply to this post by Adrian Hey
nub has a couple of advantages over the n log n version, the main one
being that it only requires an 'Eq' constraint, not an 'Ord' one.
Another is that it is fully lazy: it produces results even for an
infinite list. A third is that the results come out in the order they
went in.

That said, I have a 'snub' (sorted nub) routine I use pretty commonly
as well, defined in the standard way. If you have something like

setnub xs = f Set.empty xs where
        f _ [] = []
        f s (x:xs) = if x `Set.member` s
                       then f s xs
                       else x : f (Set.insert x s) xs

then you can use 'RULES' pragmas to replace nub with setnub when it is
allowed.
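One wrinkle: a RULES pragma cannot conjure up an Ord dictionary at a
call site that only has an Eq constraint, so such a rule has to be
stated once per concrete type. A sketch (restating setnub so the
fragment stands alone; the rule name is illustrative):

```haskell
import Data.List (nub)
import qualified Data.Set as Set

-- Ord-based nub: O(n*(log n)), lazy, order-preserving.
setnub :: Ord a => [a] -> [a]
setnub = f Set.empty
  where
    f _ [] = []
    f s (x:xs)
      | x `Set.member` s = f s xs
      | otherwise        = x : f (Set.insert x s) xs

-- The rewrite can only be expressed where the Ord dictionary is
-- known, i.e. at a monomorphic type:
{-# RULES "nub/Int" nub = (setnub :: [Int] -> [Int]) #-}
```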

Though, I have reservations about using RULES to change the order of
complexity.

        John


--
John Meacham - ⑆repetae.net⑆john⑈

Re: Performance horrors

Duncan Coutts
In reply to this post by Adrian Hey
On Tue, 2008-08-26 at 22:52 +0100, Adrian Hey wrote:

> Gwern Branwen wrote:
> > FWIW, when I tested out a number of nub/uniq definitions, the map head version performed the worst, much worse than the O(n^2) one everyone is complaining about. It's a neat definition, but not performant.
>
> When did you try this? IIRC correctly even the old sort was O(n^2), but
> someone had the sense to replace it a while ago.
>
> With ghci now on my machine..
>   length $ map head $ group $ sort [1..100000]
> finishes "instantly", but..
>   length $ nub [1..100000]
> takes about 90 seconds.

Also, sorting followed by grouping is unnecessary extra work. Sorts that
discard duplicates are usually simple modifications of the sort
algorithm.
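For instance, a merge sort whose merge step drops equal elements gives
a sortNub directly (a sketch; the name is illustrative):

```haskell
-- Bottom-up merge sort that discards duplicates as it merges,
-- giving sorted, duplicate-free output without a separate group pass.
sortNub :: Ord a => [a] -> [a]
sortNub = mergeAll . map (:[])
  where
    mergeAll []  = []
    mergeAll [r] = r
    mergeAll rs  = mergeAll (mergePairs rs)

    mergePairs (a:b:rest) = merge a b : mergePairs rest
    mergePairs rs         = rs

    merge xs [] = xs
    merge [] ys = ys
    merge xxs@(x:xs) yys@(y:ys) = case compare x y of
      LT -> x : merge xs  yys
      EQ -> x : merge xs  ys   -- equal: keep one, drop the other
      GT -> y : merge xxs ys
```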

Though as people have pointed out, nub is nice because it is lazy, so
sorting is out. An Ord-based nub should accumulate previously seen
values so that it can operate lazily too.

Duncan


Re: Performance horrors

Adrian Hey
In reply to this post by John Meacham
John Meacham wrote:
> nub has a couple advantages over the n log n version, the main one being
> that it only requires an 'Eq' constraint, not an 'Ord' one.

This is only an advantage for the tiny minority of types that have Eq
but no Ord instances, such as TypeRep. But there's no good reason
why even this cannot have an Ord instance AFAICS (though not a derived
one, obviously).

> another
> being that it is fully lazy, it produces results even for an infinite
> list.

As does the O(n*(log n)) AVL based nub I just wrote.

> a third is that the results come out in the order they went in.

As does the O(n*(log n)) AVL based nub I just wrote :-)

Regards
--
Adrian Hey


Re: Performance horrors

Adrian Hey
In reply to this post by Jeremy Shaw-3
Jeremy Shaw wrote:
> At Tue, 26 Aug 2008 08:38:45 +0100,
> Adrian Hey wrote:
>
>> I was looking at the definitions of nub (and hence nubBy) recently
>> in connection with a trac ticket and realised that this is O(n^2)
>> complexity! Ouch!
>
> Can we modify Data.List to include the big-O notation for all the
> functions similar to Data.Set, Data.Map, and bytestring?

I'm sure this is possible. I have no idea how to do this though.

Perhaps someone else can explain. But patches submitted by ordinary
users (even bug fixes) tend to languish in obscurity for some reason.
I don't even know where the source code is, or whether we're using
darcs or git or what these days (but I dare say I could find out if I
was sufficiently motivated :-).

Regards
--
Adrian Hey


Re: Performance horrors

Johan Tibell-2
In reply to this post by Neil Mitchell
On Tue, Aug 26, 2008 at 11:11 AM, Neil Mitchell <[hidden email]> wrote:

>> 2- Data.Set is not obviously the best underlying implementation (in
>> fact it is obviously not the best underlying implementation, this and
>> Data.Map etc really should be pensioned off to hackage along with the
>> rest of the badly documented, unreliable, inefficient and unstable
>> junk :-)
>
> Data.Set is an interface, which you seem to think is not the fastest
> implementation of that interface. If you can expose the same
> interface, but improve the performance, and demonstrate that the
> performance is better, then go for it! I'd support such a library
> change proposal.

The problem with Data.Set and other collection types is that they're
*not* defined as interfaces. If Data.Set were defined using a type
class and maybe some associated data types, it would be easier to
provide an alternative implementation.

Cheers,

Johan

Re: Performance horrors

David House
In reply to this post by Adrian Hey
2008/8/27 Adrian Hey <[hidden email]>:
> As does the O(n*(log n)) AVL based nub I just wrote.

Care to share?

WDYT about using RULES to rewrite to nubOrd if an Ord context is
available, as John Meacham mentioned?

John: you said you were uneasy about changing the complexity of an
algorithm using RULES, but this is exactly what list fusion does
(albeit for space, not time).

--
-David

Re: Performance horrors

John Meacham
On Wed, Aug 27, 2008 at 04:48:41PM +0100, David House wrote:

> 2008/8/27 Adrian Hey <[hidden email]>:
> > As does the O(n*(log n)) AVL based nub I just wrote.
>
> Care to share?
>
> WDYT about using RULES to rewrite to nubOrd if an Ord context is
> available, as John Meacham mentioned?
>
> John: you said you were uneasy about changing the complexity of an
> algorithm using RULES, but this is exactly what list fusion does
> (albeit for space, not time).

Indeed, and I am a little uneasy about that too :) Not that I don't
think it is an excellent idea (and I do reap the benefits of fast list
operations).. O(n^2) to O(n log n) just feels like a bigger jump to me
for some reason. I think in the end, the RULES are a good idea, but I
personally try to make which one I am using explicit when possible,
making my assumptions about the time/space requirements of an
operation clear to both the compiler and a future reader of the code.

        John


--
John Meacham - ⑆repetae.net⑆john⑈

Re: Performance horrors

Christopher Lane Hinson

>> WDYT about using RULES to rewrite to nubOrd if an Ord context is
>> available, as John Meacham mentioned?
>>
>> John: you said you were uneasy about changing the complexity of an
>> algorithm using RULES, but this is exactly what list fusion does
>> (albeit for space, not time).
>
> Indeed. and I am a little uneasy about that too :) not that I don't
> think it is an excellent idea and reap the benefits of fast list
> operations.. O(n^2) to O(n log n) just feels like a bigger jump to me
> for some reason. I think in the end, the RULES are a good idea, but I
> personally try to make which one I am using explicit when possible.
> Hence making my ideas as to the time/space requirements for an operation
> to both the compiler and a future reader of the code

Using RULES in this way could be a pessimization.  I can't imagine a
nubOrd that would ever be as performant as nub on very small data
sets (two or three elements).

--Lane

Re: Performance horrors

Adrian Hey
Christopher Lane Hinson wrote:

>
>>> WDYT about using RULES to rewrite to nubOrd if an Ord context is
>>> available, as John Meacham mentioned?
>>>
>>> John: you said you were uneasy about changing the complexity of an
>>> algorithm using RULES, but this is exactly what list fusion does
>>> (albeit for space, not time).
>>
>> Indeed. and I am a little uneasy about that too :) not that I don't
>> think it is an excellent idea and reap the benefits of fast list
>> operations.. O(n^2) to O(n log n) just feels like a bigger jump to me
>> for some reason. I think in the end, the RULES are a good idea, but I
>> personally try to make which one I am using explicit when possible.
>> Hence making my ideas as to the time/space requirements for an operation
>> to both the compiler and a future reader of the code
>
> Using RULES in this way could be a pessimization.  I can't imagine a
> nubOrd that would ever be as performant as nub on very small data sets
> (two or three elements).

Why not? Comparison doesn't cost any more than equality testing, and I
think we can safely say that nubOrd will never require *more*
comparisons than the equality tests needed by nub.

Regards
--
Adrian Hey


Re: Performance horrors

Brandon S Allbery KF8NH
In reply to this post by Christopher Lane Hinson
On 2008 Aug 27, at 17:09, Christopher Lane Hinson wrote:

>>> WDYT about using RULES to rewrite to nubOrd if an Ord context is
>>> available, as John Meacham mentioned?
>>>
>>> John: you said you were uneasy about changing the complexity of an
>>> algorithm using RULES, but this is exactly what list fusion does
>>> (albeit for space, not time).
>>
>> Indeed. and I am a little uneasy about that too :) not that I don't
>> think it is an excellent idea and reap the benefits of fast list
>> operations.. O(n^2) to O(n log n) just feels like a bigger jump to me
>>
> Using RULES in this way could be a pessimization.  I can't imagine a  
> nubOrd that would ever be as performant as nub on very small data  
> sets (two or three elements).


I think if you're in a situation where the performance of nub on such  
a small dataset matters, you're already well into the realm of  
controlling everything manually to get performance.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [hidden email]
system administrator [openafs,heimdal,too many hats] [hidden email]
electrical and computer engineering, carnegie mellon university    KF8NH



Re: Performance horrors

Ross Paterson
In reply to this post by Adrian Hey
On Tue, Aug 26, 2008 at 06:32:40PM +0100, Adrian Hey wrote:
> 2- Data.Set is not obviously the best underlying implementation (in
> fact it is obviously not the best underlying implementation, this and
> Data.Map etc really should be pensioned off to hackage along with the
> rest of the badly documented, unreliable, inefficient and unstable
> junk :-)

Do you have benchmarks to quantify how bad Data.Set and Data.Map are?