# hamming distance allocation

16 messages
Open this post in threaded view
|

## hamming distance allocation

 Hello all:I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call thefunctions:(exampl holds the list of strings of size w) filter (\x -> x /= 0) \$ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]I have two hamming functions:-- hamming distance for variable length stringshamming :: String -> String -> Int hamming x y = hamming' x y 0    where             hamming' [] _ !c = c      hamming' _ [] !c = c      hamming' (x:xs) (y:ys) !c           | x == y = hamming' xs ys c          | otherwise = hamming' xs ys (c + 1) -- function posted in this mailing listhamming2 :: String -> String -> Inthamming2 xs ys = length (filter not (zipWith (==) xs ys))I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with  +RTS -K400M -p -RTS The costlier function is the hamming distanceCOST CENTRE                    MODULE               %time %allochamming                        Distances             66.6   41.9It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%.  I could understand that there are allocations in "hamming2" because we are creating pairs, but in the case of "hamming" there should be no allocation. How can I execute my hamming functions without allocating memory?Best regards,Arnoldo Muller _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: > Hello all: > > I want to generate some hamming distance statistics about a set of > strings. As explained in another e-mail in this list, I used the > following code to call the > functions: > (exampl holds the list of strings of size w) > filter (\x -> x /= 0) \$ map (uncurry hammingX) [(xs, ys) | xs <- exampl, > ys <- exampl] > > I have two hamming functions: > -- hamming distance for variable length strings > hamming :: String -> String -> Int > hamming x y = hamming' x y 0 >     where >       hamming' [] _ !c = c >       hamming' _ [] !c = c >       hamming' (x:xs) (y:ys) !c >           | x == y = hamming' xs ys c >           | otherwise = hamming' xs ys (c + 1) > > -- function posted in this mailing list > hamming2 :: String -> String -> Int > hamming2 xs ys = length (filter not (zipWith (==) xs ys)) > > I am executing these functions millions of times and the bottleneck of > my program is in them as explained by running in profiling mode with > +RTS -K400M -p -RTS > > The costlier function is the hamming distance > COST CENTRE                    MODULE               %time %alloc > > hamming                        Distances             66.6   41.9 > > It says that it is performing 41% of the allocations. In the case of > hamming2 the allocations go as far as 52%. Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ? > I could understand that > there are allocations in "hamming2" because we are creating pairs, but > in the case of "hamming" there should be no allocation. Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too. > > How can I execute my hamming functions without allocating memory? > > Best regards, > > Arnoldo Muller _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 In reply to this post by Arnoldo Muller Arnoldo Muller wrote: > I want to generate some hamming distance statistics about a set of strings. > >   filter (\x -> x /= 0) \$ >     map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl] > > [...] > > -- function posted in this mailing list > hamming2 :: String -> String -> Int > hamming2 xs ys = length (filter not (zipWith (==) xs ys)) > > I am executing these functions millions of times and the bottleneck of my > program is in them as explained by running in profiling mode with  +RTS > -K400M -p -RTS > > The costlier function is the hamming distance > COST CENTRE                    MODULE               %time %alloc > > hamming                        Distances             66.6   41.9 Another way to look at it is that you shouldn't optimize  hamming itself, but rather make sure that it's called less often! For instance, your expression can be replaced by    filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs] which cuts the total running time in half. It's still quadratic in the length of  example . I'm sure there are faster algorithms out there that can bring it down to O(n log n) if you want. Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com_______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: Re: hamming distance allocation

 Am Montag 19 April 2010 14:13:53 schrieb Heinrich Apfelmus: > Arnoldo Muller wrote: > > I want to generate some hamming distance statistics about a set of > > strings. > > > >   filter (\x -> x /= 0) \$ > >     map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl] > > > > [...] > > > > -- function posted in this mailing list > > hamming2 :: String -> String -> Int > > hamming2 xs ys = length (filter not (zipWith (==) xs ys)) > > > > I am executing these functions millions of times and the bottleneck of > > my program is in them as explained by running in profiling mode with > > +RTS -K400M -p -RTS > > > > The costlier function is the hamming distance > > COST CENTRE                    MODULE               %time %alloc > > > > hamming                        Distances             66.6   41.9 > > Another way to look at it is that you shouldn't optimize  hamming > itself, but rather make sure that it's called less often! > > For instance, your expression can be replaced by > >    filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs] > > which cuts the total running time in half. It's still quadratic in the > length of  example . I'm sure there are faster algorithms out there that If it's likely that there are many repetitions, collect the Strings in a Set/Map (depending on whether you're interested in the count) and call hamming only on the distinct pairs. > can bring it down to O(n log n) if you want. I don't think so. You can't calculate the Hamming distance of x and z from the distances between x and y and y and z, so you'd have to consider all pairs of distinct strings, wouldn't you? > > > Regards, > Heinrich Apfelmus _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 In reply to this post by Arnoldo Muller > Subject: Re: [Haskell-cafe] hamming distance allocation > > Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: >> Hello all: >> >> I want to generate some hamming distance statistics about a set of >> strings. As explained in another e-mail in this list, I used the >> following code to call the >> functions: >> (exampl holds the list of strings of size w) >> filter (\x -> x /= 0) \$ map (uncurry hammingX) [(xs, ys) | xs <- exampl, >> ys <- exampl] >> >> I have two hamming functions: >> -- hamming distance for variable length strings >> hamming :: String -> String -> Int >> hamming x y = hamming' x y 0 >>     where >>       hamming' [] _ !c = c >>       hamming' _ [] !c = c >>       hamming' (x:xs) (y:ys) !c >>           | x == y = hamming' xs ys c >>           | otherwise = hamming' xs ys (c + 1) >> >> -- function posted in this mailing list >> hamming2 :: String -> String -> Int >> hamming2 xs ys = length (filter not (zipWith (==) xs ys)) >> >> I am executing these functions millions of times and the bottleneck of >> my program is in them as explained by running in profiling mode with >> +RTS -K400M -p -RTS >> >> The costlier function is the hamming distance >> COST CENTRE                    MODULE               %time %alloc >> >> hamming                        Distances             66.6   41.9 >> >> It says that it is performing 41% of the allocations. In the case of >> hamming2 the allocations go as far as 52%. > > Allocations are cheap, so that's not necessarily a problem. More important > is, what's the maximum residency and how much is copied during GC? > Are you compiling with -O2 ? > >> I could understand that >> there are allocations in "hamming2" because we are creating pairs, but >> in the case of "hamming" there should be no allocation. > > Why not? I don't know how GHC counts allocations, but everytime you go from > (x:xs) to xs, you need a new pointer to the tail. If that counts as > allocation, hamming must allocate a lot, too. Is it really necessary to use Strings?  I think a packed type, e.g. Vector or ByteString, would be much more efficient here.  Of course this is only likely to be a benefit if you can move away from String entirely. I suspect that "hamming2" would perform better then. John _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 Am Montag 19 April 2010 14:37:33 schrieb John Lato: > Is it really necessary to use Strings?  I think a packed type, e.g. > Vector or ByteString, would be much more efficient here. Not very much if the strings are fairly short (and the list isn't too long, so there's not a big difference in cache-friendliness). If eight-bit characters aren't enough, packing the strings into UArray Int Char gives performance quite close to ByteStrings. > Of course this is only likely to be a benefit if you can move away from > String entirely. > > I suspect that "hamming2" would perform better then. > > John _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 In reply to this post by Arnoldo Muller > From: Daniel Fischer <[hidden email]> > > Am Montag 19 April 2010 14:37:33 schrieb John Lato: >> Is it really necessary to use Strings?  I think a packed type, e.g. >> Vector or ByteString, would be much more efficient here. > > Not very much if the strings are fairly short (and the list isn't too long, > so there's not a big difference in cache-friendliness). > If eight-bit characters aren't enough, packing the strings into > UArray Int Char gives performance quite close to ByteStrings. I agree, provided that the Strings are used only once.  I missed the first part of this thread, so there may be context I don't have, but if Strings are retained and the Hamming distance calculated for multiple pairings, I think it would be a noticeable difference. Although probably not as much as reducing the number of calls to Hamming as you and H. Apfelmus suggest. John _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 In reply to this post by John Lato-2 Hello John:Well I could use a packed type. The only letters that will be found in the string are "ATCG" so yeah I don't need unicode and those things. Will try out with vector or ByteString. Thanks! :) On Mon, Apr 19, 2010 at 2:37 PM, John Lato wrote: > Subject: Re: [Haskell-cafe] hamming distance allocation > > Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: >> Hello all: >> >> I want to generate some hamming distance statistics about a set of >> strings. As explained in another e-mail in this list, I used the >> following code to call the >> functions: >> (exampl holds the list of strings of size w) >> filter (\x -> x /= 0) \$ map (uncurry hammingX) [(xs, ys) | xs <- exampl, >> ys <- exampl] >> >> I have two hamming functions: >> -- hamming distance for variable length strings >> hamming :: String -> String -> Int >> hamming x y = hamming' x y 0 >>     where >>       hamming' [] _ !c = c >>       hamming' _ [] !c = c >>       hamming' (x:xs) (y:ys) !c >>           | x == y = hamming' xs ys c >>           | otherwise = hamming' xs ys (c + 1) >> >> -- function posted in this mailing list >> hamming2 :: String -> String -> Int >> hamming2 xs ys = length (filter not (zipWith (==) xs ys)) >> >> I am executing these functions millions of times and the bottleneck of >> my program is in them as explained by running in profiling mode with >> +RTS -K400M -p -RTS >> >> The costlier function is the hamming distance >> COST CENTRE                    MODULE               %time %alloc >> >> hamming                        Distances             66.6   41.9 >> >> It says that it is performing 41% of the allocations. In the case of >> hamming2 the allocations go as far as 52%. > > Allocations are cheap, so that's not necessarily a problem. More important > is, what's the maximum residency and how much is copied during GC? > Are you compiling with -O2 ? > >> I could understand that >> there are allocations in "hamming2" because we are creating pairs, but >> in the case of "hamming" there should be no allocation. > > Why not? I don't know how GHC counts allocations, but everytime you go from > (x:xs) to xs, you need a new pointer to the tail. If that counts as > allocation, hamming must allocate a lot, too. Is it really necessary to use Strings?  I think a packed type, e.g. Vector or ByteString, would be much more efficient here.  Of course this is only likely to be a benefit if you can move away from String entirely. I suspect that "hamming2" would perform better then. John _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 In reply to this post by Daniel Fischer-4 The strings will not be longer than 30 characters. I am doing sets of 2000  (total of 2000^2 distance computations)I am expecting that all the operations will be lazyly performed but at some point I get a memory error. Most of the memory is being allocated for the hamming distance and I am still unable to find the source of my memory leak.Regards,ArnoldoOn Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer wrote: Am Montag 19 April 2010 14:37:33 schrieb John Lato: > Is it really necessary to use Strings?  I think a packed type, e.g. > Vector or ByteString, would be much more efficient here. Not very much if the strings are fairly short (and the list isn't too long, so there's not a big difference in cache-friendliness). If eight-bit characters aren't enough, packing the strings into UArray Int Char gives performance quite close to ByteStrings. > Of course this is only likely to be a benefit if you can move away from > String entirely. > > I suspect that "hamming2" would perform better then. > > John _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 In reply to this post by Daniel Fischer-4 Hello Daniel:My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2. Thank you for clarifying about the pointers.Slowly my memory grows up and eventually it explodes. I would expect that the list comprehension is lazily evaluated and therefore at any given time I am only executing one hamming distance. The result of the hamming distance is stored into a small statistics datatype I built (only stores sums and sum of squares and the counts). This datatype is updated using a foldr. I have no idea where the leak is. What do you see in a .prof file to find a leak (hamming distance has the largest amount of time and %alloc)  ? From my .prof file where would you start looking at?Best Regards, Arnoldo MullerOn Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer wrote: Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: > Hello all: > > I want to generate some hamming distance statistics about a set of > strings. As explained in another e-mail in this list, I used the > following code to call the > functions: > (exampl holds the list of strings of size w) > filter (\x -> x /= 0) \$ map (uncurry hammingX) [(xs, ys) | xs <- exampl, > ys <- exampl] > > I have two hamming functions: > -- hamming distance for variable length strings > hamming :: String -> String -> Int > hamming x y = hamming' x y 0 >     where >       hamming' [] _ !c = c >       hamming' _ [] !c = c >       hamming' (x:xs) (y:ys) !c >           | x == y = hamming' xs ys c >           | otherwise = hamming' xs ys (c + 1) > > -- function posted in this mailing list > hamming2 :: String -> String -> Int > hamming2 xs ys = length (filter not (zipWith (==) xs ys)) > > I am executing these functions millions of times and the bottleneck of > my program is in them as explained by running in profiling mode with > +RTS -K400M -p -RTS > > The costlier function is the hamming distance > COST CENTRE                    MODULE               %time %alloc > > hamming                        Distances             66.6   41.9 > > It says that it is performing 41% of the allocations. In the case of > hamming2 the allocations go as far as 52%. Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ? > I could understand that > there are allocations in "hamming2" because we are creating pairs, but > in the case of "hamming" there should be no allocation. Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too. > > How can I execute my hamming functions without allocating memory? > > Best regards, > > Arnoldo Muller _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe jasparReader.prof (13K) Download Attachment
Open this post in threaded view
|

## Re: hamming distance allocation

 Hello all:I found my leak after adding some bang patterns in a different part of the program. The compiler was generating all the combinations of the list comprehensions and therefore the performance dropped very badly. BTW, hamming is 2 times faster than hamming2. Thank you as always!ArnoldoOn Mon, Apr 19, 2010 at 5:53 PM, Arnoldo Muller wrote: Hello Daniel:My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2. Thank you for clarifying about the pointers.Slowly my memory grows up and eventually it explodes. I would expect that the list comprehension is lazily evaluated and therefore at any given time I am only executing one hamming distance. The result of the hamming distance is stored into a small statistics datatype I built (only stores sums and sum of squares and the counts). This datatype is updated using a foldr. I have no idea where the leak is. What do you see in a .prof file to find a leak (hamming distance has the largest amount of time and %alloc)  ? From my .prof file where would you start looking at?Best Regards, Arnoldo MullerOn Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer wrote: Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: > Hello all: > > I want to generate some hamming distance statistics about a set of > strings. As explained in another e-mail in this list, I used the > following code to call the > functions: > (exampl holds the list of strings of size w) > filter (\x -> x /= 0) \$ map (uncurry hammingX) [(xs, ys) | xs <- exampl, > ys <- exampl] > > I have two hamming functions: > -- hamming distance for variable length strings > hamming :: String -> String -> Int > hamming x y = hamming' x y 0 >     where >       hamming' [] _ !c = c >       hamming' _ [] !c = c >       hamming' (x:xs) (y:ys) !c >           | x == y = hamming' xs ys c >           | otherwise = hamming' xs ys (c + 1) > > -- function posted in this mailing list > hamming2 :: String -> String -> Int > hamming2 xs ys = length (filter not (zipWith (==) xs ys)) > > I am executing these functions millions of times and the bottleneck of > my program is in them as explained by running in profiling mode with > +RTS -K400M -p -RTS > > The costlier function is the hamming distance > COST CENTRE                    MODULE               %time %alloc > > hamming                        Distances             66.6   41.9 > > It says that it is performing 41% of the allocations. In the case of > hamming2 the allocations go as far as 52%. Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ? > I could understand that > there are allocations in "hamming2" because we are creating pairs, but > in the case of "hamming" there should be no allocation. Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too. > > How can I execute my hamming functions without allocating memory? > > Best regards, > > Arnoldo Muller _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

 In reply to this post by Arnoldo Muller Am Montag 19 April 2010 17:17:11 schrieb Arnoldo Muller: > The strings will not be longer than 30 characters. For 20 -30 character strings, using ByteStrings should be better, in my tests about 40% faster, allocation figures slightly lower, resident memory much lower and bytes copied during GC very much lower. For a sample of english text (many short words, few long), ByteStrings were about 25% faster, allocation figures very slightly lower, resident memory much lower, bytes copied much lower (relative difference not as large as for longer strings). > I am doing sets of 2000  (total of 2000^2 distance computations) That shouldn't give memory problems either way. > > I am expecting that all the operations will be lazyly performed but at > some point I get a memory error. My guess is a bad consumption pattern. > > Most of the memory is being allocated for the hamming distance and I am > still unable to find the source of my memory leak. Allocation as such is not a problem, resident memory is the important thing. Try heap profiling to see what holds on to memory (+RTS -hc would be a good first run). > > Regards, > > Arnoldo > > On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer <[hidden email]>wrote: > > Am Montag 19 April 2010 14:37:33 schrieb John Lato: > > > Is it really necessary to use Strings?  I think a packed type, e.g. > > > Vector or ByteString, would be much more efficient here. > > > > Not very much if the strings are fairly short (and the list isn't too > > long, so there's not a big difference in cache-friendliness). > > If eight-bit characters aren't enough, packing the strings into > > UArray Int Char gives performance quite close to ByteStrings. > > > > > Of course this is only likely to be a benefit if you can move away > > > from String entirely. > > > > > > I suspect that "hamming2" would perform better then. > > > > > > John _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe
Open this post in threaded view
|

## Re: hamming distance allocation

Open this post in threaded view
|