Function to detect duplicates

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

Function to detect duplicates

Rafael Gustavo da Cunha Pereira Pinto-2

Hi folks,

While solving a puzzle, I was posed the problem of finding if there was no duplicates on a list.

First I used:

noneRepeated=null.(filter (>1)).(map length).group.sort


But this seemed very unneficient, so I thought that I could detect the duplicates while sorting, and devised this:

import Control.Monad
import Data.Maybe

noneRepeated=isNothing . (foldl merge (Just [])) . (map sort) . pairs

pairs []=[]
pairs [x]=[[x]]
pairs (x:y:xs)=[x,y]:pairs xs

sort []=Just []
sort [x]=Just [x]
sort [x,y] | x>y=Just [y,x]
           | y>x=Just [x,y]
           | x==y=Nothing

merge::(Eq a, Ord a) => Maybe [a]->Maybe [a]->Maybe[a]
merge _ Nothing = Nothing
merge Nothing _ = Nothing
merge (Just []) (Just xs)=Just xs
merge (Just xs) (Just [])=Just xs
merge (Just (x:xs)) (Just (y:ys)) | x==y = Nothing
                                  | x>y  = (Just y) +? (merge (Just (x:xs)) (Just ys))
                                  | x<y  = (Just x) +? (merge (Just xs) (Just (y:ys)))

(+?) = liftM2 (:)



My version of the merge sort returns Nothing whenever it finds two equal entries, aborting all subsequent comparisons.

I have a few questions for the friendly people at this cafe:

1) Is there any improvement I can make?
2) Can it be parallelized (par, pseq)?


Best regards,

Rafael

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Function to detect duplicates

Jonas Almström Duregård
Hi Rafael,

I assume you will perform this operation on some very large lists, or
performance would not be an issue. Have you tested if your optimized
version is better than your initial one?

You should compare your implementation against something like this:

import qualified Data.Set as Set
noneRepeated :: (Ord a) => [a] -> Bool
noneRepeated = accum Set.empty where
  accum _ [] = True
  accum s (x:xs)
    | Set.member x s = False
    | otherwise      = accum (Set.insert x s) xs

Also there is some discussion about the nub function that relates to
this topic, e.g. http://buffered.io/2008/07/28/a-better-nub/.

/Jonas

On 23 February 2010 12:30, Rafael Gustavo da Cunha Pereira Pinto
<[hidden email]> wrote:

>
> Hi folks,
>
> While solving a puzzle, I was posed the problem of finding if there was no
> duplicates on a list.
>
> First I used:
>
> noneRepeated=null.(filter (>1)).(map length).group.sort
>
>
> But this seemed very unneficient, so I thought that I could detect the
> duplicates while sorting, and devised this:
>
> import Control.Monad
> import Data.Maybe
>
> noneRepeated=isNothing . (foldl merge (Just [])) . (map sort) . pairs
>
> pairs []=[]
> pairs [x]=[[x]]
> pairs (x:y:xs)=[x,y]:pairs xs
>
> sort []=Just []
> sort [x]=Just [x]
> sort [x,y] | x>y=Just [y,x]
>            | y>x=Just [x,y]
>            | x==y=Nothing
>
> merge::(Eq a, Ord a) => Maybe [a]->Maybe [a]->Maybe[a]
> merge _ Nothing = Nothing
> merge Nothing _ = Nothing
> merge (Just []) (Just xs)=Just xs
> merge (Just xs) (Just [])=Just xs
> merge (Just (x:xs)) (Just (y:ys)) | x==y = Nothing
>                                   | x>y  = (Just y) +? (merge (Just (x:xs))
> (Just ys))
>                                   | x<y  = (Just x) +? (merge (Just xs)
> (Just (y:ys)))
>
> (+?) = liftM2 (:)
>
>
>
> My version of the merge sort returns Nothing whenever it finds two equal
> entries, aborting all subsequent comparisons.
>
> I have a few questions for the friendly people at this cafe:
>
> 1) Is there any improvement I can make?
> 2) Can it be parallelized (par, pseq)?
>
>
> Best regards,
>
> Rafael
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Function to detect duplicates

Ertugrul Söylemez
In reply to this post by Rafael Gustavo da Cunha Pereira Pinto-2
Rafael Gustavo da Cunha Pereira Pinto <[hidden email]> wrote:

> While solving a puzzle, I was posed the problem of finding if there
> was no duplicates on a list.
>
> First I used:
>
> noneRepeated=null.(filter (>1)).(map length).group.sort
>
> But this seemed very unneficient, so I thought that I could detect the
> duplicates while sorting, and devised this:
>
> import Control.Monad
> import Data.Maybe
>
> noneRepeated=isNothing . (foldl merge (Just [])) . (map sort) . pairs

import Data.List

noneRepeated xs = xs == nub xs


Greets
Ertugrul


--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Daniel Fischer-4
Am Dienstag 23 Februar 2010 13:03:45 schrieb Ertugrul Soeylemez:

> Rafael Gustavo da Cunha Pereira Pinto <[hidden email]> wrote:
> > While solving a puzzle, I was posed the problem of finding if there
> > was no duplicates on a list.
> >
> > First I used:
> >
> > noneRepeated=null.(filter (>1)).(map length).group.sort
> >
> > But this seemed very unneficient, so I thought that I could detect the
> > duplicates while sorting, and devised this:
> >
> > import Control.Monad
> > import Data.Maybe
> >
> > noneRepeated=isNothing . (foldl merge (Just [])) . (map sort) . pairs
>
> import Data.List
>
> noneRepeated xs = xs == nub xs

Talk about inefficiency :)

import Data.Set (Set)
import qualified Data.Set as Set

noneRepeated = go 0 Set.empty
      where
        go ct st (x:xs)
            | Set.size st < ct = False
            | otherwise = go (ct+1) (Set.insert x st) xs
        go ct st [] = ct == Set.size st

>
>
> Greets
> Ertugrul

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Jonas Almström Duregård
In reply to this post by Ertugrul Söylemez
Ertugrul: while your solution is minimalistic, Rafael deemed his
~n*log n implementation too inefficient. Thus your ~n^3 implementation
is hardly an improvement...

/Jonas

On 23 February 2010 13:03, Ertugrul Soeylemez <[hidden email]> wrote:

> Rafael Gustavo da Cunha Pereira Pinto <[hidden email]> wrote:
>
>> While solving a puzzle, I was posed the problem of finding if there
>> was no duplicates on a list.
>>
>> First I used:
>>
>> noneRepeated=null.(filter (>1)).(map length).group.sort
>>
>> But this seemed very unneficient, so I thought that I could detect the
>> duplicates while sorting, and devised this:
>>
>> import Control.Monad
>> import Data.Maybe
>>
>> noneRepeated=isNothing . (foldl merge (Just [])) . (map sort) . pairs
>
> import Data.List
>
> noneRepeated xs = xs == nub xs
>
>
> Greets
> Ertugrul
>
>
> --
> nightmare = unsafePerformIO (getWrongWife >>= sex)
> http://blog.ertes.de/
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Function to detect duplicates

Ketil Malde-5
In reply to this post by Rafael Gustavo da Cunha Pereira Pinto-2
Rafael Gustavo da Cunha Pereira Pinto <[hidden email]>
writes:

> First I used:
>
> noneRepeated=null.(filter (>1)).(map length).group.sort

> But this seemed very unneficient, so I thought that I could detect the
> duplicates while sorting, and devised this:
   [...]
> 1) Is there any improvement I can make?

Well - it's a bit long, don't you think?  Long enough that from a
cursory glance, I'd say it's in the "no obvious errors" category.  How
about (inspired by quicksort, as you no doubt can tell):

   noneRepeated [] = True
   noneRepeated (x:xs) = noneRepeated lt && singleton eq && noneRepeated gt
        where lt = filter (<x) xs
              gt = filter (>x) xs
              eq = x:filter (==x) xs
              singleton [_] = True
              singleton _   = False

> 2) Can it be parallelized (par, pseq)?

You could force each of the sublists in parallel, but you might lose
some laziness properties, so I'd carefully benchmark it.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Function to detect duplicates

Ertugrul Söylemez
In reply to this post by Jonas Almström Duregård
Jonas Almström Duregård <[hidden email]> wrote:

> Ertugrul: while your solution is minimalistic, Rafael deemed his
> ~n*log n implementation too inefficient. Thus your ~n^3 implementation
> is hardly an improvement...

My variant has an advantage, though.  It is completely lazy, so it will
take a shortcut, as soon as a duplicate is found.  Depending on his
application, this may be useful or not.

I think the nub-based solution is the best one in general, but it's the
base library implementation of nub, which is unfortunate.  In fact, with
a better nub implementation, this becomes an O(n * log n) time
algorithm, too, but with the additional laziness advantage.  The article
you linked to contains such an implementation, I think.


Greets
Ertugrul


--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Daniel Fischer-4
Am Dienstag 23 Februar 2010 13:59:49 schrieb Ertugrul Soeylemez:
> Jonas Almström Duregård <[hidden email]> wrote:
> > Ertugrul: while your solution is minimalistic, Rafael deemed his
> > ~n*log n implementation too inefficient. Thus your ~n^3 implementation
> > is hardly an improvement...

Not quite as bad, nub is O(n^2).

>
> My variant has an advantage, though.  It is completely lazy, so it will
> take a shortcut, as soon as a duplicate is found.  Depending on his
> application, this may be useful or not.
>
> I think the nub-based solution is the best one in general, but it's the
> base library implementation of nub, which is unfortunate.  In fact, with
> a better nub implementation, this becomes an O(n * log n) time

How can you nub in O(n*log n)? Remember, you only have Eq for nub.

> algorithm, too, but with the additional laziness advantage.  The article
> you linked to contains such an implementation, I think.
>
>
> Greets
> Ertugrul

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Jonas Almström Duregård
>>noneRepeated xs = xs == nub xs

> Not quite as bad, nub is O(n^2)

You are correct of course. Still, it will probably be a bit less
inefficient if the length of the lists are compared (as opposed to the
elements):

noneRepeated xs = length xs == length (nub xs)

On 23 February 2010 14:09, Daniel Fischer <[hidden email]> wrote:

> Am Dienstag 23 Februar 2010 13:59:49 schrieb Ertugrul Soeylemez:
>> Jonas Almström Duregård <[hidden email]> wrote:
>> > Ertugrul: while your solution is minimalistic, Rafael deemed his
>> > ~n*log n implementation too inefficient. Thus your ~n^3 implementation
>> > is hardly an improvement...
>
> Not quite as bad, nub is O(n^2).
>
>>
>> My variant has an advantage, though.  It is completely lazy, so it will
>> take a shortcut, as soon as a duplicate is found.  Depending on his
>> application, this may be useful or not.
>>
>> I think the nub-based solution is the best one in general, but it's the
>> base library implementation of nub, which is unfortunate.  In fact, with
>> a better nub implementation, this becomes an O(n * log n) time
>
> How can you nub in O(n*log n)? Remember, you only have Eq for nub.
>
>> algorithm, too, but with the additional laziness advantage.  The article
>> you linked to contains such an implementation, I think.
>>
>>
>> Greets
>> Ertugrul
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Daniel Fischer-4
Am Dienstag 23 Februar 2010 14:54:36 schrieb Jonas Almström Duregård:
> You are correct of course. Still, it will probably be a bit less
> inefficient if the length of the lists are compared (as opposed to the
> elements):
>
> noneRepeated xs = length xs == length (nub xs)

Only if no repeated elements appear early.
For xs = 1 : [1 .. 10^7], xs == nub xs will return False without noticeable
delay, length xs == length (nub xs) will take VERY long.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Ketil Malde-5
In reply to this post by Rafael Gustavo da Cunha Pereira Pinto-2

| Am Freitag 26 Februar 2010 00:57:48 schrieb Rafael Gustavo da Cunha Pereira
| Pinto:
|
|> There is a single 10 digit number that:
|>
|> 1) uses all ten digits [0..9], with no repetitions
|> 2) the number formed by the first digit (right to left, most
|> significant) is divisible by one
|> 3) the number formed by the first 2 digits (again right to left) is
|> divisible by two
|> 4) the number formed by the first 3 digits is divisible by three
|>  and so on, until:
|> 11) the number formed by the first 10 digits (all!) is by 10

Since Ishaaq Chandy just posted about how to generalize nested list
comprehensions, I thought this was an interesting way to approach this.

First a couple of simple helper functions:

> val = foldl (\x y -> x*10+y) 0
> divides d n = n `mod` d == 0

So you could solve it using a set of list comprehensions:

> solutions = [[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10]
>           | x1 <- [0..9]
>           , x2 <- [0..9], divides 2 $ val [x1,x2]
>           , x3 <- [0..9], divides 3 $ val [x1,x2,x3]
>           , x4 <- [0..9], divides 4 $ val [x1,x2,x3,x4]
>           , x5 <- [0..9], divides 5 $ val [x1,x2,x3,x4,x5]
>           , x6 <- [0..9], divides 6 $ val [x1,x2,x3,x4,x5,x6]
>           , x7 <- [0..9], divides 7 $ val [x1,x2,x3,x4,x5,x6,x7]
>           , x8 <- [0..9], divides 8 $ val [x1,x2,x3,x4,x5,x6,x7,x8]
>           , x9 <- [0..9], divides 9 $ val [x1,x2,x3,x4,x5,x6,x7,x8,x9]
>           , x10 <- [0]
>           , length (nub [x1,x2,x3,x4,x5,x6,x7,x8,x9,x10]) == 10
>           ]                    

This is a nicely declarative way to do it, and a pretty clear way to
formulate the original problem statement.  But it's a bit tedious with
all the repetitions, so you would rather recurse to make it more
general.  Since list comprehensions are just a different way to work in
the list monad (where | becomes 'guard'), I managed to come up with this:

> solve :: [Int] -> [[Int]]
> solve prefix = do
>   let l = length prefix
>   if l == 10
>     then return prefix
>     else do
>       x <- [0..9]
>       let r = prefix++[x]
>       guard (divides (l+1) (val r) && nub r == r)
>       solve r

-k

(PS: I'm happy to hear any comments regarding style or other issues)
--
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Daniel Fischer-4
Am Freitag 26 Februar 2010 16:50:42 schrieb Ketil Malde:

> | Am Freitag 26 Februar 2010 00:57:48 schrieb Rafael Gustavo da Cunha
> | Pereira
> |
> | Pinto:
> |> There is a single 10 digit number that:
> |>
> |> 1) uses all ten digits [0..9], with no repetitions
> |> 2) the number formed by the first digit (right to left, most
> |> significant) is divisible by one
> |> 3) the number formed by the first 2 digits (again right to left) is
> |> divisible by two
> |> 4) the number formed by the first 3 digits is divisible by three
> |>  and so on, until:
> |> 11) the number formed by the first 10 digits (all!) is by 10
>
> Since Ishaaq Chandy just posted about how to generalize nested list
> comprehensions, I thought this was an interesting way to approach this.

Yes. But it approaches the border, for 20 digits it would become annoying
to type.

>
> First a couple of simple helper functions:
> > val = foldl (\x y -> x*10+y) 0
> > divides d n = n `mod` d == 0
>
> So you could solve it using a set of list comprehensions:
> > solutions = [[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10]
> >
> >           | x1 <- [0..9]

First digit can't be 0, so make it [1 .. 9].
Since you use the fact that the last digit must be the 0, pull all others
from [1 .. 9].

> >
> >           , x2 <- [0..9], divides 2 $ val [x1,x2]
            , x1 /= x2
> >           , x3 <- [0..9], divides 3 $ val [x1,x2,x3]
            , x3 `notElem` [x1,x2] -- etc.
> >           , x4 <- [0..9], divides 4 $ val [x1,x2,x3,x4]
> >           , x5 <- [0..9], divides 5 $ val [x1,x2,x3,x4,x5]
> >           , x6 <- [0..9], divides 6 $ val [x1,x2,x3,x4,x5,x6]
> >           , x7 <- [0..9], divides 7 $ val [x1,x2,x3,x4,x5,x6,x7]
> >           , x8 <- [0..9], divides 8 $ val [x1,x2,x3,x4,x5,x6,x7,x8]
> >           , x9 <- [0..9], divides 9 $ val [x1,x2,x3,x4,x5,x6,x7,x8,x9]
> >           , x10 <- [0]
> >           , length (nub [x1,x2,x3,x4,x5,x6,x7,x8,x9,x10]) == 10
> >           ]

Doesn't look as nice, but early pruning saves a lot of work (in this case,
for very small values of "a lot").

>
> This is a nicely declarative way to do it, and a pretty clear way to
> formulate the original problem statement.

A very direct translation :)

> But it's a bit tedious with
> all the repetitions, so you would rather recurse to make it more
> general.  Since list comprehensions are just a different way to work in
>
> the list monad (where | becomes 'guard'), I managed to come up with this:
> > solve :: [Int] -> [[Int]]

Not on a 32-bit system. Word would suffice there, but you don't know that
in advance, so it'd be Int64 or Integer

> > solve prefix = do
> >   let l = length prefix
> >   if l == 10
> >     then return prefix
> >     else do
> >       x <- [0..9]

You can

       guard (x `notElem` prefix)

here, or use x `notElem` prefix below, but don't use nub r == r when you
know that only the new element may be duplicated.

> >       let r = prefix++[x]
> >       guard (divides (l+1) (val r) && nub r == r)
> >       solve r
>
> -k
>
> (PS: I'm happy to hear any comments regarding style or other issues)

I would make the length of the prefix a parameter of solve.
It's admittedly less elegant, but all those calls to length hurt me :)
Regarding style, I think I prefer

solve prefix =
      case length prefix of
        10 -> return prefix
        l -> do
           x <- [0 .. 9]
            ...

over the if-then-else.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Ketil Malde-5
Daniel Fischer <[hidden email]> skrev:

> Am Freitag 26 Februar 2010 16:50:42 schrieb Ketil Malde:
>>> solutions = [[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10]
>>>           | x1 <- [0..9]

> First digit can't be 0, so make it [1 .. 9].
> Since you use the fact that the last digit must be the 0, pull all others
> from [1 .. 9].

Originally, I pulled from alternating odds (x1 <- [1,3..9] etc) and
evens, since this is fairly easy to deduce...  I reverted this since the
point was to use brute force.

>>> solve :: [Int] -> [[Int]]
>
> Not on a 32-bit system. Word would suffice there, but you don't know that
> in advance, so it'd be Int64 or Integer

Hm?  The Ints are just individual digits here.

> I would make the length of the prefix a parameter of solve.

I thought about generating a list with solutions for increasing lenghts,
so that e.g. 'solve [] !! 10' would solve this particular problem.

> solve prefix =
>       case length prefix of
>         10 -> return prefix
>         l -> do
>            x <- [0 .. 9]
>             ...
>
> over the if-then-else.

Yes, much nicer.  Thanks for the feedback!

-k
--
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Function to detect duplicates

Daniel Fischer-4
Am Freitag 26 Februar 2010 21:34:28 schrieb Ketil Malde:

> Daniel Fischer <[hidden email]> skrev:
> > Am Freitag 26 Februar 2010 16:50:42 schrieb Ketil Malde:
> >>> solutions = [[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10]
> >>>
> >>>           | x1 <- [0..9]
> >
> > First digit can't be 0, so make it [1 .. 9].
> > Since you use the fact that the last digit must be the 0, pull all
> > others from [1 .. 9].
>
> Originally, I pulled from alternating odds (x1 <- [1,3..9] etc) and
> evens, since this is fairly easy to deduce...  I reverted this since the
> point was to use brute force.

Yes, but did you forget x10 or did you think that one was too obvious?

>
> >>> solve :: [Int] -> [[Int]]
> >
> > Not on a 32-bit system. Word would suffice there, but you don't know
> > that in advance, so it'd be Int64 or Integer
>
> Hm?  The Ints are just individual digits here.
>

Yup. I didn't realise that you don't call val for the 10-digit number(s).
If you also did x10 <- [0 .. 9] and checked
val [x1, x2, ..., x10] `mod` 10 == 0, it would overflow, that's what I was
thinking of.

> > I would make the length of the prefix a parameter of solve.
>
> I thought about generating a list with solutions for increasing lenghts,
> so that e.g. 'solve [] !! 10' would solve this particular problem.
>

That's nice, but I think it'd be ugly with a DFS, much nicer with a BFS,
like Rafael did.

> > solve prefix =
> >       case length prefix of
> >         10 -> return prefix
> >         l -> do
> >            x <- [0 .. 9]
> >             ...
> >
> > over the if-then-else.
>
> Yes, much nicer.  Thanks for the feedback!
>
> -k

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe