hamming distance allocation

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

hamming distance allocation

Arnoldo Muller
Hello all:

I want to generate some hamming distance statistics about a set of strings.
As explained in another e-mail in this list, I used the following code to call the
functions:
(exampl holds the list of strings of size w)
filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]

I have two hamming functions:
-- hamming distance for variable length strings
hamming :: String -> String -> Int
hamming x y = hamming' x y 0
    where      
      hamming' [] _ !c = c
      hamming' _ [] !c = c
      hamming' (x:xs) (y:ys) !c
          | x == y = hamming' xs ys c
          | otherwise = hamming' xs ys (c + 1)

-- function posted in this mailing list
hamming2 :: String -> String -> Int
hamming2 xs ys = length (filter not (zipWith (==) xs ys))

I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with  +RTS -K400M -p -RTS

The costlier function is the hamming distance
COST CENTRE                    MODULE               %time %alloc

hamming                        Distances             66.6   41.9

It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%.  I could understand that there are allocations in "hamming2" because we are creating pairs, but in the case of "hamming" there should be no allocation.

How can I execute my hamming functions without allocating memory?

Best regards,

Arnoldo Muller


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Daniel Fischer-4
Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:

> Hello all:
>
> I want to generate some hamming distance statistics about a set of
> strings. As explained in another e-mail in this list, I used the
> following code to call the
> functions:
> (exampl holds the list of strings of size w)
> filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl,
> ys <- exampl]
>
> I have two hamming functions:
> -- hamming distance for variable length strings
> hamming :: String -> String -> Int
> hamming x y = hamming' x y 0
>     where
>       hamming' [] _ !c = c
>       hamming' _ [] !c = c
>       hamming' (x:xs) (y:ys) !c
>           | x == y = hamming' xs ys c
>           | otherwise = hamming' xs ys (c + 1)
>
> -- function posted in this mailing list
> hamming2 :: String -> String -> Int
> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>
> I am executing these functions millions of times and the bottleneck of
> my program is in them as explained by running in profiling mode with
> +RTS -K400M -p -RTS
>
> The costlier function is the hamming distance
> COST CENTRE                    MODULE               %time %alloc
>
> hamming                        Distances             66.6   41.9
>
> It says that it is performing 41% of the allocations. In the case of
> hamming2 the allocations go as far as 52%.

Allocations are cheap, so that's not necessarily a problem. More important
is, what's the maximum residency and how much is copied during GC?
Are you compiling with -O2 ?

> I could understand that
> there are allocations in "hamming2" because we are creating pairs, but
> in the case of "hamming" there should be no allocation.

Why not? I don't know how GHC counts allocations, but everytime you go from
(x:xs) to xs, you need a new pointer to the tail. If that counts as
allocation, hamming must allocate a lot, too.

>
> How can I execute my hamming functions without allocating memory?
>
> Best regards,
>
> Arnoldo Muller

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Heinrich Apfelmus
In reply to this post by Arnoldo Muller
Arnoldo Muller wrote:

> I want to generate some hamming distance statistics about a set of strings.
>
>   filter (\x -> x /= 0) $
>     map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
>
> [...]
>
> -- function posted in this mailing list
> hamming2 :: String -> String -> Int
> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>
> I am executing these functions millions of times and the bottleneck of my
> program is in them as explained by running in profiling mode with  +RTS
> -K400M -p -RTS
>
> The costlier function is the hamming distance
> COST CENTRE                    MODULE               %time %alloc
>
> hamming                        Distances             66.6   41.9

Another way to look at it is that you shouldn't optimize  hamming
itself, but rather make sure that it's called less often!

For instance, your expression can be replaced by

   filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]

which cuts the total running time in half. It's still quadratic in the
length of  example . I'm sure there are faster algorithms out there that
can bring it down to O(n log n) if you want.


Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: hamming distance allocation

Daniel Fischer-4
Am Montag 19 April 2010 14:13:53 schrieb Heinrich Apfelmus:

> Arnoldo Muller wrote:
> > I want to generate some hamming distance statistics about a set of
> > strings.
> >
> >   filter (\x -> x /= 0) $
> >     map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
> >
> > [...]
> >
> > -- function posted in this mailing list
> > hamming2 :: String -> String -> Int
> > hamming2 xs ys = length (filter not (zipWith (==) xs ys))
> >
> > I am executing these functions millions of times and the bottleneck of
> > my program is in them as explained by running in profiling mode with
> > +RTS -K400M -p -RTS
> >
> > The costlier function is the hamming distance
> > COST CENTRE                    MODULE               %time %alloc
> >
> > hamming                        Distances             66.6   41.9
>
> Another way to look at it is that you shouldn't optimize  hamming
> itself, but rather make sure that it's called less often!
>
> For instance, your expression can be replaced by
>
>    filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>
> which cuts the total running time in half. It's still quadratic in the
> length of  example . I'm sure there are faster algorithms out there that

If it's likely that there are many repetitions, collect the Strings in a
Set/Map (depending on whether you're interested in the count) and call
hamming only on the distinct pairs.

> can bring it down to O(n log n) if you want.

I don't think so. You can't calculate the Hamming distance of x and z from
the distances between x and y and y and z, so you'd have to consider all
pairs of distinct strings, wouldn't you?

>
>
> Regards,
> Heinrich Apfelmus

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

John Lato-2
In reply to this post by Arnoldo Muller
> Subject: Re: [Haskell-cafe] hamming distance allocation
>
> Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
>> Hello all:
>>
>> I want to generate some hamming distance statistics about a set of
>> strings. As explained in another e-mail in this list, I used the
>> following code to call the
>> functions:
>> (exampl holds the list of strings of size w)
>> filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl,
>> ys <- exampl]
>>
>> I have two hamming functions:
>> -- hamming distance for variable length strings
>> hamming :: String -> String -> Int
>> hamming x y = hamming' x y 0
>>     where
>>       hamming' [] _ !c = c
>>       hamming' _ [] !c = c
>>       hamming' (x:xs) (y:ys) !c
>>           | x == y = hamming' xs ys c
>>           | otherwise = hamming' xs ys (c + 1)
>>
>> -- function posted in this mailing list
>> hamming2 :: String -> String -> Int
>> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>>
>> I am executing these functions millions of times and the bottleneck of
>> my program is in them as explained by running in profiling mode with
>> +RTS -K400M -p -RTS
>>
>> The costlier function is the hamming distance
>> COST CENTRE                    MODULE               %time %alloc
>>
>> hamming                        Distances             66.6   41.9
>>
>> It says that it is performing 41% of the allocations. In the case of
>> hamming2 the allocations go as far as 52%.
>
> Allocations are cheap, so that's not necessarily a problem. More important
> is, what's the maximum residency and how much is copied during GC?
> Are you compiling with -O2 ?
>
>> I could understand that
>> there are allocations in "hamming2" because we are creating pairs, but
>> in the case of "hamming" there should be no allocation.
>
> Why not? I don't know how GHC counts allocations, but everytime you go from
> (x:xs) to xs, you need a new pointer to the tail. If that counts as
> allocation, hamming must allocate a lot, too.

Is it really necessary to use Strings?  I think a packed type, e.g.
Vector or ByteString, would be much more efficient here.  Of course
this is only likely to be a benefit if you can move away from String
entirely.

I suspect that "hamming2" would perform better then.

John
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Daniel Fischer-4
Am Montag 19 April 2010 14:37:33 schrieb John Lato:
> Is it really necessary to use Strings?  I think a packed type, e.g.
> Vector or ByteString, would be much more efficient here.

Not very much if the strings are fairly short (and the list isn't too long,
so there's not a big difference in cache-friendliness).
If eight-bit characters aren't enough, packing the strings into
UArray Int Char gives performance quite close to ByteStrings.

> Of course this is only likely to be a benefit if you can move away from
> String entirely.
>
> I suspect that "hamming2" would perform better then.
>
> John

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

John Lato-2
In reply to this post by Arnoldo Muller
> From: Daniel Fischer <[hidden email]>
>
> Am Montag 19 April 2010 14:37:33 schrieb John Lato:
>> Is it really necessary to use Strings?  I think a packed type, e.g.
>> Vector or ByteString, would be much more efficient here.
>
> Not very much if the strings are fairly short (and the list isn't too long,
> so there's not a big difference in cache-friendliness).
> If eight-bit characters aren't enough, packing the strings into
> UArray Int Char gives performance quite close to ByteStrings.

I agree, provided that the Strings are used only once.  I missed the
first part of this thread, so there may be context I don't have, but
if Strings are retained and the Hamming distance calculated for
multiple pairings, I think it would be a noticeable difference.
Although probably not as much as reducing the number of calls to
Hamming as you and H. Apfelmus suggest.

John
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Arnoldo Muller
In reply to this post by John Lato-2
Hello John:

Well I could use a packed type. The only letters that will be found in the string are "ATCG" so yeah I don't need unicode and those things.

Will try out with vector or ByteString. Thanks! :)

On Mon, Apr 19, 2010 at 2:37 PM, John Lato <[hidden email]> wrote:
> Subject: Re: [Haskell-cafe] hamming distance allocation
>
> Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
>> Hello all:
>>
>> I want to generate some hamming distance statistics about a set of
>> strings. As explained in another e-mail in this list, I used the
>> following code to call the
>> functions:
>> (exampl holds the list of strings of size w)
>> filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl,
>> ys <- exampl]
>>
>> I have two hamming functions:
>> -- hamming distance for variable length strings
>> hamming :: String -> String -> Int
>> hamming x y = hamming' x y 0
>>     where
>>       hamming' [] _ !c = c
>>       hamming' _ [] !c = c
>>       hamming' (x:xs) (y:ys) !c
>>           | x == y = hamming' xs ys c
>>           | otherwise = hamming' xs ys (c + 1)
>>
>> -- function posted in this mailing list
>> hamming2 :: String -> String -> Int
>> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>>
>> I am executing these functions millions of times and the bottleneck of
>> my program is in them as explained by running in profiling mode with
>> +RTS -K400M -p -RTS
>>
>> The costlier function is the hamming distance
>> COST CENTRE                    MODULE               %time %alloc
>>
>> hamming                        Distances             66.6   41.9
>>
>> It says that it is performing 41% of the allocations. In the case of
>> hamming2 the allocations go as far as 52%.
>
> Allocations are cheap, so that's not necessarily a problem. More important
> is, what's the maximum residency and how much is copied during GC?
> Are you compiling with -O2 ?
>
>> I could understand that
>> there are allocations in "hamming2" because we are creating pairs, but
>> in the case of "hamming" there should be no allocation.
>
> Why not? I don't know how GHC counts allocations, but everytime you go from
> (x:xs) to xs, you need a new pointer to the tail. If that counts as
> allocation, hamming must allocate a lot, too.

Is it really necessary to use Strings?  I think a packed type, e.g.
Vector or ByteString, would be much more efficient here.  Of course
this is only likely to be a benefit if you can move away from String
entirely.

I suspect that "hamming2" would perform better then.

John
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Arnoldo Muller
In reply to this post by Daniel Fischer-4

The strings will not be longer than 30 characters.
I am doing sets of 2000  (total of 2000^2 distance computations)

I am expecting that all the operations will be lazyly performed but at some point I get a memory error.

Most of the memory is being allocated for the hamming distance and I am still unable to find the source of my memory leak.

Regards,

Arnoldo

On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer <[hidden email]> wrote:
Am Montag 19 April 2010 14:37:33 schrieb John Lato:
> Is it really necessary to use Strings?  I think a packed type, e.g.
> Vector or ByteString, would be much more efficient here.

Not very much if the strings are fairly short (and the list isn't too long,
so there's not a big difference in cache-friendliness).
If eight-bit characters aren't enough, packing the strings into
UArray Int Char gives performance quite close to ByteStrings.

> Of course this is only likely to be a benefit if you can move away from
> String entirely.
>
> I suspect that "hamming2" would perform better then.
>
> John

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Arnoldo Muller
In reply to this post by Daniel Fischer-4
Hello Daniel:

My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.
Thank you for clarifying about the pointers.

Slowly my memory grows up and eventually it explodes. I would expect that the list comprehension is lazily evaluated and therefore at any given time I am only executing one hamming distance. The result of the hamming distance is stored into a small statistics datatype I built (only stores sums and sum of squares and the counts). This datatype is updated using a foldr.

I have no idea where the leak is. What do you see in a .prof file to find a leak (hamming distance has the largest amount of time and %alloc)  ? From my .prof file where would you start looking at?


Best Regards,

Arnoldo Muller



On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer <[hidden email]> wrote:
Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
> Hello all:
>
> I want to generate some hamming distance statistics about a set of
> strings. As explained in another e-mail in this list, I used the
> following code to call the
> functions:
> (exampl holds the list of strings of size w)
> filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl,
> ys <- exampl]
>
> I have two hamming functions:
> -- hamming distance for variable length strings
> hamming :: String -> String -> Int
> hamming x y = hamming' x y 0
>     where
>       hamming' [] _ !c = c
>       hamming' _ [] !c = c
>       hamming' (x:xs) (y:ys) !c
>           | x == y = hamming' xs ys c
>           | otherwise = hamming' xs ys (c + 1)
>
> -- function posted in this mailing list
> hamming2 :: String -> String -> Int
> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>
> I am executing these functions millions of times and the bottleneck of
> my program is in them as explained by running in profiling mode with
> +RTS -K400M -p -RTS
>
> The costlier function is the hamming distance
> COST CENTRE                    MODULE               %time %alloc
>
> hamming                        Distances             66.6   41.9
>
> It says that it is performing 41% of the allocations. In the case of
> hamming2 the allocations go as far as 52%.

Allocations are cheap, so that's not necessarily a problem. More important
is, what's the maximum residency and how much is copied during GC?
Are you compiling with -O2 ?

> I could understand that
> there are allocations in "hamming2" because we are creating pairs, but
> in the case of "hamming" there should be no allocation.

Why not? I don't know how GHC counts allocations, but everytime you go from
(x:xs) to xs, you need a new pointer to the tail. If that counts as
allocation, hamming must allocate a lot, too.

>
> How can I execute my hamming functions without allocating memory?
>
> Best regards,
>
> Arnoldo Muller



_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

jasparReader.prof (13K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Arnoldo Muller
Hello all:

I found my leak after adding some bang patterns in a different part of the program. The compiler was generating all the combinations of the list comprehensions and therefore the performance dropped very badly.

BTW, hamming is 2 times faster than hamming2.

Thank you as always!

Arnoldo

On Mon, Apr 19, 2010 at 5:53 PM, Arnoldo Muller <[hidden email]> wrote:
Hello Daniel:

My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.
Thank you for clarifying about the pointers.

Slowly my memory grows up and eventually it explodes. I would expect that the list comprehension is lazily evaluated and therefore at any given time I am only executing one hamming distance. The result of the hamming distance is stored into a small statistics datatype I built (only stores sums and sum of squares and the counts). This datatype is updated using a foldr.

I have no idea where the leak is. What do you see in a .prof file to find a leak (hamming distance has the largest amount of time and %alloc)  ? From my .prof file where would you start looking at?


Best Regards,

Arnoldo Muller




On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer <[hidden email]> wrote:
Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
> Hello all:
>
> I want to generate some hamming distance statistics about a set of
> strings. As explained in another e-mail in this list, I used the
> following code to call the
> functions:
> (exampl holds the list of strings of size w)
> filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl,
> ys <- exampl]
>
> I have two hamming functions:
> -- hamming distance for variable length strings
> hamming :: String -> String -> Int
> hamming x y = hamming' x y 0
>     where
>       hamming' [] _ !c = c
>       hamming' _ [] !c = c
>       hamming' (x:xs) (y:ys) !c
>           | x == y = hamming' xs ys c
>           | otherwise = hamming' xs ys (c + 1)
>
> -- function posted in this mailing list
> hamming2 :: String -> String -> Int
> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>
> I am executing these functions millions of times and the bottleneck of
> my program is in them as explained by running in profiling mode with
> +RTS -K400M -p -RTS
>
> The costlier function is the hamming distance
> COST CENTRE                    MODULE               %time %alloc
>
> hamming                        Distances             66.6   41.9
>
> It says that it is performing 41% of the allocations. In the case of
> hamming2 the allocations go as far as 52%.

Allocations are cheap, so that's not necessarily a problem. More important
is, what's the maximum residency and how much is copied during GC?
Are you compiling with -O2 ?

> I could understand that
> there are allocations in "hamming2" because we are creating pairs, but
> in the case of "hamming" there should be no allocation.

Why not? I don't know how GHC counts allocations, but everytime you go from
(x:xs) to xs, you need a new pointer to the tail. If that counts as
allocation, hamming must allocate a lot, too.

>
> How can I execute my hamming functions without allocating memory?
>
> Best regards,
>
> Arnoldo Muller




_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Daniel Fischer-4
In reply to this post by Arnoldo Muller
Am Montag 19 April 2010 17:17:11 schrieb Arnoldo Muller:
> The strings will not be longer than 30 characters.

For 20 -30 character strings, using ByteStrings should be better, in my
tests about 40% faster, allocation figures slightly lower, resident memory
much lower and bytes copied during GC very much lower.
For a sample of english text (many short words, few long), ByteStrings were
about 25% faster, allocation figures very slightly lower, resident memory
much lower, bytes copied much lower (relative difference not as large as
for longer strings).

> I am doing sets of 2000  (total of 2000^2 distance computations)

That shouldn't give memory problems either way.

>
> I am expecting that all the operations will be lazyly performed but at
> some point I get a memory error.

My guess is a bad consumption pattern.

>
> Most of the memory is being allocated for the hamming distance and I am
> still unable to find the source of my memory leak.

Allocation as such is not a problem, resident memory is the important
thing.
Try heap profiling to see what holds on to memory (+RTS -hc would be a good
first run).

>
> Regards,
>
> Arnoldo
>
> On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer
<[hidden email]>wrote:

> > Am Montag 19 April 2010 14:37:33 schrieb John Lato:
> > > Is it really necessary to use Strings?  I think a packed type, e.g.
> > > Vector or ByteString, would be much more efficient here.
> >
> > Not very much if the strings are fairly short (and the list isn't too
> > long, so there's not a big difference in cache-friendliness).
> > If eight-bit characters aren't enough, packing the strings into
> > UArray Int Char gives performance quite close to ByteStrings.
> >
> > > Of course this is only likely to be a benefit if you can move away
> > > from String entirely.
> > >
> > > I suspect that "hamming2" would perform better then.
> > >
> > > John
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Daniel Fischer-4
In reply to this post by Arnoldo Muller
Am Montag 19 April 2010 17:53:27 schrieb Arnoldo Muller:
> Hello Daniel:
>
> My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.

Very bad. Can I see the code?

> Thank you for clarifying about the pointers.

Not to forget the Ints for counting.

>
> Slowly my memory grows up and eventually it explodes. I would expect
> that the list comprehension is lazily evaluated and therefore at any
> given time I am only executing one hamming distance. The result of the
> hamming distance is stored into a small statistics datatype I built
> (only stores sums and sum of squares and the counts). This datatype is
> updated using a foldr.

That might very well be the problem, if you update it with a foldr, you
must construct the entire list of 2000^2 hamming-thunks before the work can
begin.
It's probably better to use foldl' (and make the type strict) so you can
start the work immediately.

>
> I have no idea where the leak is. What do you see in a .prof file to
> find a leak (hamming distance has the largest amount of time and %alloc)

For finding leaks, heap profiling (-h*) gives more info than -p. The .prof
says more about where you spend your time than what hangs on to memory.

>  ? From my .prof file where would you start looking at?

- use hamming instead of hamming2
- convertIntToDouble looks suspicious
- calculating a few million Hamming distances takes some time, but what
about getMyStats, should that really take 25%?


> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > exampl, ys <- exampl]
> > >

filter (/= 0) [hamming xs ys | xs <- example, ys <- example]

And of course, you can trivially avoid half of the work.

>
>
> Best Regards,
>
> Arnoldo Muller
>
> On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer
<[hidden email]>wrote:

> > Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
> > > Hello all:
> > >
> > > I want to generate some hamming distance statistics about a set of
> > > strings. As explained in another e-mail in this list, I used the
> > > following code to call the
> > > functions:
> > > (exampl holds the list of strings of size w)
> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > exampl, ys <- exampl]
> > >
> > > I have two hamming functions:
> > > -- hamming distance for variable length strings
> > > hamming :: String -> String -> Int
> > > hamming x y = hamming' x y 0
> > >     where
> > >       hamming' [] _ !c = c
> > >       hamming' _ [] !c = c
> > >       hamming' (x:xs) (y:ys) !c
> > >
> > >           | x == y = hamming' xs ys c
> > >           | otherwise = hamming' xs ys (c + 1)
> > >
> > > -- function posted in this mailing list
> > > hamming2 :: String -> String -> Int
> > > hamming2 xs ys = length (filter not (zipWith (==) xs ys))
> > >
> > > I am executing these functions millions of times and the bottleneck
> > > of my program is in them as explained by running in profiling mode
> > > with +RTS -K400M -p -RTS
> > >
> > > The costlier function is the hamming distance
> > > COST CENTRE                    MODULE               %time %alloc
> > >
> > > hamming                        Distances             66.6   41.9
> > >
> > > It says that it is performing 41% of the allocations. In the case of
> > > hamming2 the allocations go as far as 52%.
> >
> > Allocations are cheap, so that's not necessarily a problem. More
> > important is, what's the maximum residency and how much is copied
> > during GC? Are you compiling with -O2 ?
> >
> > > I could understand that
> > > there are allocations in "hamming2" because we are creating pairs,
> > > but in the case of "hamming" there should be no allocation.
> >
> > Why not? I don't know how GHC counts allocations, but everytime you go
> > from (x:xs) to xs, you need a new pointer to the tail. If that counts
> > as allocation, hamming must allocate a lot, too.
> >
> > > How can I execute my hamming functions without allocating memory?
> > >
> > > Best regards,
> > >
> > > Arnoldo Muller

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Arnoldo Muller
Daniel thank you for all your advice.

An additional ! bang pattern in convertIntToDouble fixed the issue! Also using a foldl'
did the trick.

Now the program runs as it should with a constant amount of memory and in a very small amount of time.

I believe these problems are one of the major sources of frustration for Haskell newbies. Things that could work in <X> language easily suddenly become problems in Haskell. When you overcome these issues then you feel happy again that you chose Haskell as the main programming language of your research project.

Is there any guide that explains more about the "bad consumption pattern". Are there any general rules defined to avoid these issues? It helped me to re-read the chapter on profiling in the Real World Haskell book to sorta understand the problem. Is there a more detailed definition of the problem than in RWH?

Regards,

Arnoldo

On Tue, Apr 20, 2010 at 2:49 AM, Daniel Fischer <[hidden email]> wrote:
Am Montag 19 April 2010 17:53:27 schrieb Arnoldo Muller:
> Hello Daniel:
>
> My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.

Very bad. Can I see the code?

> Thank you for clarifying about the pointers.

Not to forget the Ints for counting.

>
> Slowly my memory grows up and eventually it explodes. I would expect
> that the list comprehension is lazily evaluated and therefore at any
> given time I am only executing one hamming distance. The result of the
> hamming distance is stored into a small statistics datatype I built
> (only stores sums and sum of squares and the counts). This datatype is
> updated using a foldr.

That might very well be the problem, if you update it with a foldr, you
must construct the entire list of 2000^2 hamming-thunks before the work can
begin.
It's probably better to use foldl' (and make the type strict) so you can
start the work immediately.

>
> I have no idea where the leak is. What do you see in a .prof file to
> find a leak (hamming distance has the largest amount of time and %alloc)

For finding leaks, heap profiling (-h*) gives more info than -p. The .prof
says more about where you spend your time than what hangs on to memory.

>  ? From my .prof file where would you start looking at?

- use hamming instead of hamming2
- convertIntToDouble looks suspicious
- calculating a few million Hamming distances takes some time, but what
about getMyStats, should that really take 25%?


> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > exampl, ys <- exampl]
> > >

filter (/= 0) [hamming xs ys | xs <- example, ys <- example]

And of course, you can trivially avoid half of the work.

>
>
> Best Regards,
>
> Arnoldo Muller
>
> On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer
<[hidden email]>wrote:
> > Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
> > > Hello all:
> > >
> > > I want to generate some hamming distance statistics about a set of
> > > strings. As explained in another e-mail in this list, I used the
> > > following code to call the
> > > functions:
> > > (exampl holds the list of strings of size w)
> > > filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
> > > exampl, ys <- exampl]
> > >
> > > I have two hamming functions:
> > > -- hamming distance for variable length strings
> > > hamming :: String -> String -> Int
> > > hamming x y = hamming' x y 0
> > >     where
> > >       hamming' [] _ !c = c
> > >       hamming' _ [] !c = c
> > >       hamming' (x:xs) (y:ys) !c
> > >
> > >           | x == y = hamming' xs ys c
> > >           | otherwise = hamming' xs ys (c + 1)
> > >
> > > -- function posted in this mailing list
> > > hamming2 :: String -> String -> Int
> > > hamming2 xs ys = length (filter not (zipWith (==) xs ys))
> > >
> > > I am executing these functions millions of times and the bottleneck
> > > of my program is in them as explained by running in profiling mode
> > > with +RTS -K400M -p -RTS
> > >
> > > The costlier function is the hamming distance
> > > COST CENTRE                    MODULE               %time %alloc
> > >
> > > hamming                        Distances             66.6   41.9
> > >
> > > It says that it is performing 41% of the allocations. In the case of
> > > hamming2 the allocations go as far as 52%.
> >
> > Allocations are cheap, so that's not necessarily a problem. More
> > important is, what's the maximum residency and how much is copied
> > during GC? Are you compiling with -O2 ?
> >
> > > I could understand that
> > > there are allocations in "hamming2" because we are creating pairs,
> > > but in the case of "hamming" there should be no allocation.
> >
> > Why not? I don't know how GHC counts allocations, but everytime you go
> > from (x:xs) to xs, you need a new pointer to the tail. If that counts
> > as allocation, hamming must allocate a lot, too.
> >
> > > How can I execute my hamming functions without allocating memory?
> > >
> > > Best regards,
> > >
> > > Arnoldo Muller



_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Heinrich Apfelmus
In reply to this post by Daniel Fischer-4
Daniel Fischer wrote:

> Heinrich Apfelmus:
>>
>> For instance, your expression can be replaced by
>>
>>    filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>>
>> which cuts the total running time in half. It's still quadratic in the
>> length of  example . I'm sure there are faster algorithms out there that
>> can bring it down to O(n log n) if you want.
>
> I don't think so. You can't calculate the Hamming distance of x and z from
> the distances between x and y and y and z, so you'd have to consider all
> pairs of distinct strings, wouldn't you?

And there I was sure about something once, only to see that it's
actually really doubtful... ;)

The thing about the Hamming distance is that it's not a black box, so
you can't get a lower bound by counting the number of minimum calls to
hamming  that have to be made. You are essentially arguing that the
different Hamming distances are independent, which they are not. Not to
mention that there are also "black-box" restrictions like the triangle
inequality

    d(x,z) <= d(x,y) + d(y,z)

but that one is probably harmless. In short, the situation is similar to
how the sorting bound O(n*log n) does not apply to radix sort.


Still, you are right to question my O(n*log n) claim; so far, my
attempts at finding such an algorithm myself have failed.

More precisely, the goal is to make a histogram of the possible hamming
distances. We need at least O(n*w) time to do that, where n is the
number of strings and w their maximum length; after all, we have to
"touch" every character. For simplicity, that the characters are just
one bit each. Furthermore, we can assume that w <= log n, otherwise
there are lots of duplicate strings which can be grouped together. In
this notation, the simple algorithm takes O(n^2*w) time.


I did find a straightforward divide-and-conquer algorithm to tally small
Hamming distances, but it's not good enough for the whole histogram.
Here's the specification:

    countHemming :: Int -> [Bool] -> [Bool]
    countHemming d xs ys = length [() | x<-xs, y<-ys, hamming x y == d]

In other words,  countHemming d xs ys  counts the number of pairings
(x,y) whose Hamming distance is exactly  d .

Now, the idea is that this can be made faster for small  d . For
instance, for  d == 0 , we are essentially just calculating the number
of different elements of  xs  and  ys . By requiring that  xs  and  ys
be sorted, this can be done in linear time

    countHemming 0 xs ys = ... a variation on  merge xs ys

And for slightly larger  d , we can partition the lists by their first
bits and continue recursively

    countHemming _ [] [] = 0
    countHemming d xs ys =
          countHemming (d-1) x0 y1 + countHemming (d-1) x1 y0
        + countHemming d     x0 y0 + countHemming d     x1 y1
        where
        (x0,x1) = partitionByHead xs
        (y0,y1) = partitionByHead ys

    partitionByHead xs = (behead True xs, behead False xs)
    behead b xs = [bs | (b':bs) <- xs, b == b']

To estimate the running time, we set  n = length (xs ++ ys) and let

    T(d,n) = running time of  countHamming d xs ys

We started with

    T(0,n) = O(n)

and want to discuss the recursive case. The idea is that each list is
divided in half, so we get

    T(d,n) = 2*T(d-1,n/2) + 2*T(d,n/2)

>From this, we can calculate

    T(1,n) = 2*T(0,n/2) + 2*T(1,n/2)
           = O(n) + 2*T(1,n/2)     -- remember quicksort!
           = O(n*log n)
    T(2,n) = O(n*log n) + 2*T(2,n/2)
           = O(n*(log n)^2)

and so on, yielding

    T(d,n) = O(n*(log n)^d)


Alas, this can be used to search a dictionary while accounting for
spelling errors, but it's no good to calculate a full histogram because
it becomes prohibitively expensive when  d ~ w/2 ~ (log n)/2  .



Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: hamming distance allocation

Heinrich Apfelmus
In reply to this post by Arnoldo Muller
Arnoldo Muller wrote:
>
> I believe these problems are one of the major sources of frustration for
> Haskell newbies. Things that could work in <X> language easily suddenly
> become problems in Haskell. When you overcome these issues then you feel
> happy again that you chose Haskell as the main programming language of your
> research project.

Well, the difference between <X> and Haskell is pretty much unavoidable.
If you care about space and time usage, then there is no way around
learning about lazy evaluation and Haskell's execution model.

> Is there any guide that explains more about the "bad consumption pattern".
> Are there any general rules defined to avoid these issues? It helped me to
> re-read the chapter on profiling in the Real World Haskell book to sorta
> understand the problem. Is there a more detailed definition of the problem
> than in RWH?

Two of the most commonly occurring patterns are

  1) foldl' vs foldl

  2) average xs = sum xs / length xs
     vs
     average = uncurry (/) . foldl' (\(!s,!n) x -> (s+x,n+1)) (0,0)

Other than that, most Haskell books offer a clear exposition of the
reduction model. For instance, there is

   Graham Hutton. Programming in Haskell, chapter 12.
   Richard Bird. Introduction to Functional Programming using Haskell
       2nd edition, chapter 7.

The wikibook contains some preliminary material, too.

   http://en.wikibooks.org/wiki/Haskell/Graph_reduction



Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe