Hello all:
I want to generate some Hamming distance statistics about a set of
strings. As explained in another e-mail on this list, I used the
following code to call the functions (exampl holds the list of strings
of size w):

    filter (\x -> x /= 0) $
        map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]

I have two hamming functions:

    -- hamming distance for variable length strings
    hamming :: String -> String -> Int
    hamming x y = hamming' x y 0
      where
        hamming' [] _ !c = c
        hamming' _ [] !c = c
        hamming' (x:xs) (y:ys) !c
          | x == y    = hamming' xs ys c
          | otherwise = hamming' xs ys (c + 1)

    -- function posted in this mailing list
    hamming2 :: String -> String -> Int
    hamming2 xs ys = length (filter not (zipWith (==) xs ys))

I am executing these functions millions of times, and the bottleneck of
my program is in them, as shown by running in profiling mode with
+RTS -K400M -p -RTS.

The costliest function is the hamming distance:

    COST CENTRE  MODULE     %time  %alloc
    hamming      Distances   66.6    41.9

It says that it is performing 41% of the allocations. In the case of
hamming2 the allocations go as high as 52%. I could understand that
there are allocations in "hamming2" because we are creating pairs, but
in the case of "hamming" there should be no allocation.

How can I execute my hamming functions without allocating memory?

Best regards,

Arnoldo Muller

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
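For readers trying this at home, here is a self-contained version of the two functions and the driver expression from the post above, as a sketch. The module layout and the helper name `nonzeroDistances` are mine, and a tiny hypothetical sample stands in for `exampl`, since the real data isn't shown.

```haskell
{-# LANGUAGE BangPatterns #-}

-- hamming distance for variable-length strings, with a strict counter
hamming :: String -> String -> Int
hamming x y = go x y 0
  where
    go [] _ !c = c
    go _ [] !c = c
    go (a:as) (b:bs) !c
      | a == b    = go as bs c
      | otherwise = go as bs (c + 1)

-- the zipWith-based variant from the thread
hamming2 :: String -> String -> Int
hamming2 xs ys = length (filter not (zipWith (==) xs ys))

-- all pairwise nonzero distances, as in the original driver expression
nonzeroDistances :: [String] -> [Int]
nonzeroDistances exampl =
    filter (/= 0) (map (uncurry hamming) [(xs, ys) | xs <- exampl, ys <- exampl])
```

On equal-length strings the two variants agree; they differ only in how they traverse the input.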
On Monday 19 April 2010 01:03:14, Arnoldo Muller wrote:
> [...]
> It says that it is performing 41% of the allocations. In the case of
> hamming2 the allocations go as far as 52%.

Allocations are cheap, so that's not necessarily a problem. More
important is: what's the maximum residency, and how much is copied
during GC? Are you compiling with -O2?

> I could understand that there are allocations in "hamming2" because we
> are creating pairs, but in the case of "hamming" there should be no
> allocation.

Why not? I don't know how GHC counts allocations, but every time you go
from (x:xs) to xs, you need a new pointer to the tail. If that counts as
allocation, hamming must allocate a lot, too.

> How can I execute my hamming functions without allocating memory?
In reply to this post by Arnoldo Muller
Arnoldo Muller wrote:
> I want to generate some hamming distance statistics about a set of
> strings.
>
> filter (\x -> x /= 0) $
>     map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
>
> [...]
>
> -- function posted in this mailing list
> hamming2 :: String -> String -> Int
> hamming2 xs ys = length (filter not (zipWith (==) xs ys))
>
> I am executing these functions millions of times and the bottleneck of
> my program is in them as explained by running in profiling mode with
> +RTS -K400M -p -RTS
>
> The costlier function is the hamming distance
>
>     COST CENTRE  MODULE     %time  %alloc
>     hamming      Distances   66.6    41.9

Another way to look at it is that you shouldn't optimize hamming
itself, but rather make sure that it's called less often! For instance,
your expression can be replaced by

    filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]

which cuts the total running time in half. It's still quadratic in the
length of example. I'm sure there are faster algorithms out there that
can bring it down to O(n log n) if you want.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
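The rewrite above visits each unordered pair exactly once. A minimal runnable sketch, with a simple stand-in `hamming` (the original's accumulator version appears earlier in the thread):

```haskell
import Data.List (tails)

-- stand-in Hamming distance over the common prefix
hamming :: String -> String -> Int
hamming xs ys = length (filter not (zipWith (==) xs ys))

-- each unordered pair (x, y), with x occurring before y in the list,
-- is visited exactly once thanks to the tails pattern
nonzeroDistances :: [String] -> [Int]
nonzeroDistances exampl =
    filter (/= 0) [hamming x y | (x:xs) <- tails exampl, y <- xs]
```

For n strings this does n*(n-1)/2 distance computations instead of n^2, halving the work while keeping the same nonzero distances (each one reported once rather than twice).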
On Monday 19 April 2010 14:13:53, Heinrich Apfelmus wrote:
> Arnoldo Muller wrote:
>> [...]
>
> Another way to look at it is that you shouldn't optimize hamming
> itself, but rather make sure that it's called less often!
>
> For instance, your expression can be replaced by
>
>     filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>
> which cuts the total running time in half.

If it's likely that there are many repetitions, collect the Strings in a
Set/Map (depending on whether you're interested in the count) and call
hamming only on the distinct pairs.

> It's still quadratic in the length of example. I'm sure there are
> faster algorithms out there that can bring it down to O(n log n) if
> you want.

I don't think so. You can't calculate the Hamming distance of x and z
from the distances between x and y and y and z, so you'd have to
consider all pairs of distinct strings, wouldn't you?

> Regards,
> Heinrich Apfelmus
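Daniel's suggestion of collapsing duplicates first could be sketched like this. The function name and the choice to carry each string's multiplicity in a `Map` are mine; the multiplicities let you recover the counts for the full histogram.

```haskell
import qualified Data.Map as M
import Data.List (tails)

hamming :: String -> String -> Int
hamming xs ys = length (filter not (zipWith (==) xs ys))

-- distances between distinct strings only, each paired with the number
-- of original (ordered-pair-free) pairings it represents
distinctDistances :: [String] -> [(Int, Int)]
distinctDistances strs =
    [ (hamming x y, nx * ny)
    | ((x, nx) : rest) <- tails (M.toList counts)
    , (y, ny) <- rest ]
  where
    -- multiplicity of each distinct string
    counts = M.fromListWith (+) [(s, 1) | s <- strs]
```

If the 2000 strings contain many repeats, this can shrink the quadratic factor from the list length down to the number of distinct strings.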
In reply to this post by Arnoldo Muller
> Subject: Re: [Haskell-cafe] hamming distance allocation
> > On Monday 19 April 2010 01:03:14, Arnoldo Muller wrote:
>> [...]
>> I am executing these functions millions of times and the bottleneck of
>> my program is in them as explained by running in profiling mode with
>> +RTS -K400M -p -RTS
>> [...]
>
> Why not? I don't know how GHC counts allocations, but every time you
> go from (x:xs) to xs, you need a new pointer to the tail. If that
> counts as allocation, hamming must allocate a lot, too.

Is it really necessary to use Strings? I think a packed type, e.g.
Vector or ByteString, would be much more efficient here.
Of course this is only likely to be a benefit if you can move away from
String entirely. I suspect that "hamming2" would perform better then.

John
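A ByteString version of "hamming2" could look like the following sketch. `Data.ByteString.Char8.zipWith` produces an ordinary list, so the counting step is unchanged; like the list version, it stops at the end of the shorter string.

```haskell
import qualified Data.ByteString.Char8 as B

-- Hamming distance over packed 8-bit strings
hammingBS :: B.ByteString -> B.ByteString -> Int
hammingBS xs ys = length (filter not (B.zipWith (==) xs ys))
```

Usage would be, e.g., `hammingBS (B.pack "ATCG") (B.pack "ATGG")`; for a DNA alphabet such as "ATCG", 8-bit characters are plenty.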
On Monday 19 April 2010 14:37:33, John Lato wrote:
> Is it really necessary to use Strings? I think a packed type, e.g.
> Vector or ByteString, would be much more efficient here.

Not very much if the strings are fairly short (and the list isn't too
long, so there's not a big difference in cache-friendliness). If
eight-bit characters aren't enough, packing the strings into
UArray Int Char gives performance quite close to ByteStrings.

> Of course this is only likely to be a benefit if you can move away
> from String entirely.
>
> I suspect that "hamming2" would perform better then.
>
> John
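The UArray packing mentioned above might be sketched as follows; the helper names are mine. Like the thread's list-based `hamming`, it stops at the end of the shorter string.

```haskell
import Data.Array.Unboxed (UArray, listArray, bounds, (!))

type Packed = UArray Int Char

-- pack a String into a 0-indexed unboxed array
packStr :: String -> Packed
packStr s = listArray (0, length s - 1) s

-- Hamming distance over packed strings
hammingU :: Packed -> Packed -> Int
hammingU a b = length [i | i <- [0 .. min ha hb], a ! i /= b ! i]
  where
    (_, ha) = bounds a
    (_, hb) = bounds b
```

Packing once and comparing many times is the point: the pairwise loop then does unboxed indexing instead of walking cons cells.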
In reply to this post by Arnoldo Muller
> From: Daniel Fischer <[hidden email]>
> > On Monday 19 April 2010 14:37:33, John Lato wrote:
>> Is it really necessary to use Strings? I think a packed type, e.g.
>> Vector or ByteString, would be much more efficient here.
>
> Not very much if the strings are fairly short (and the list isn't too
> long, so there's not a big difference in cache-friendliness).
> If eight-bit characters aren't enough, packing the strings into
> UArray Int Char gives performance quite close to ByteStrings.

I agree, provided that the Strings are used only once. I missed the
first part of this thread, so there may be context I don't have, but if
Strings are retained and the Hamming distance is calculated for
multiple pairings, I think it would make a noticeable difference,
although probably not as much as reducing the number of calls to
hamming, as you and H. Apfelmus suggest.

John
In reply to this post by John Lato-2
Hello John:
Well, I could use a packed type. The only letters that will be found in
the strings are "ATCG", so I don't need Unicode and those things. I
will try it out with Vector or ByteString.

Thanks! :)

On Mon, Apr 19, 2010 at 2:37 PM, John Lato <[hidden email]> wrote:
> Subject: Re: [Haskell-cafe] hamming distance allocation
In reply to this post by Daniel Fischer-4
The strings will not be longer than 30 characters. I am doing sets of
2000 (a total of 2000^2 distance computations).

I am expecting that all the operations will be lazily performed, but at
some point I get a memory error. Most of the memory is being allocated
for the hamming distance, and I am still unable to find the source of
my memory leak.

Regards,

Arnoldo

On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer <[hidden email]> wrote:
> On Monday 19 April 2010 14:37:33, John Lato wrote:
In reply to this post by Daniel Fischer-4
Hello Daniel:
My %GC time is 75.0% (81.4% elapsed), and I am compiling with -O2.

Thank you for clarifying about the pointers.

Slowly my memory grows, and eventually it explodes. I would expect the
list comprehension to be lazily evaluated, and therefore at any given
time I am only executing one hamming distance. The result of the
hamming distance is stored in a small statistics datatype I built (it
only stores the sums, the sums of squares, and the counts). This
datatype is updated using a foldr.

I have no idea where the leak is. What do you look at in a .prof file
to find a leak (hamming distance has the largest amount of time and
%alloc)? From my .prof file, where would you start looking?

Best regards,

Arnoldo Muller

On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer <[hidden email]> wrote:
> On Monday 19 April 2010 01:03:14, Arnoldo Muller wrote:
Hello all:
I found my leak after adding some bang patterns in a different part of
the program. The compiler was generating all the combinations of the
list comprehensions, and therefore the performance dropped very badly.

BTW, hamming is 2 times faster than hamming2.

Thank you as always!

Arnoldo

On Mon, Apr 19, 2010 at 5:53 PM, Arnoldo Muller <[hidden email]> wrote:
> Hello Daniel:
In reply to this post by Arnoldo Muller
On Monday 19 April 2010 17:17:11, Arnoldo Muller wrote:
> The strings will not be longer than 30 characters.

For 20-30 character strings, using ByteStrings should be better: in my
tests about 40% faster, allocation figures slightly lower, resident
memory much lower, and bytes copied during GC very much lower. For a
sample of English text (many short words, few long), ByteStrings were
about 25% faster, allocation figures very slightly lower, resident
memory much lower, bytes copied much lower (the relative difference not
as large as for longer strings).

> I am doing sets of 2000 (total of 2000^2 distance computations)

That shouldn't give memory problems either way.

> I am expecting that all the operations will be lazily performed but at
> some point I get a memory error.

My guess is a bad consumption pattern.

> Most of the memory is being allocated for the hamming distance and I
> am still unable to find the source of my memory leak.

Allocation as such is not a problem; resident memory is the important
thing. Try heap profiling to see what holds on to memory (+RTS -hc
would be a good first run).

> [...]
In reply to this post by Arnoldo Muller
On Monday 19 April 2010 17:53:27, Arnoldo Muller wrote:
> Hello Daniel:
>
> My % GC time is: 75.0% (81.4% elapsed) and I am compiling with -O2.

Very bad. Can I see the code?

> Thank you for clarifying about the pointers.

Not to forget the Ints for counting.

> Slowly my memory grows up and eventually it explodes. I would expect
> that the list comprehension is lazily evaluated and therefore at any
> given time I am only executing one hamming distance. The result of the
> hamming distance is stored into a small statistics datatype I built
> (only stores sums and sum of squares and the counts). This datatype is
> updated using a foldr.

That might very well be the problem: if you update it with a foldr, you
must construct the entire list of 2000^2 hamming-thunks before the work
can begin. It's probably better to use foldl' (and make the type
strict), so you can start the work immediately.

> I have no idea where the leak is. What do you see in a .prof file to
> find a leak (hamming distance has the largest amount of time and
> %alloc)? From my .prof file where would you start looking at?

For finding leaks, heap profiling (-h*) gives more info than -p. The
.prof says more about where you spend your time than about what hangs
on to memory.

- use hamming instead of hamming2
- convertIntToDouble looks suspicious
- calculating a few million Hamming distances takes some time, but what
  about getMyStats, should that really take 25%?

>>> filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <-
>>> exampl, ys <- exampl]

    filter (/= 0) [hamming xs ys | xs <- exampl, ys <- exampl]

And of course, you can trivially avoid half of the work.

> [...]
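Daniel's advice (a strict accumulator folded with foldl') might look like the sketch below. The `Stats` type and its fields are my guess at the shape of Arnoldo's statistics datatype, which isn't shown in the thread.

```haskell
import Data.List (foldl')

-- running count, sum, and sum of squares; the strict (!) fields keep
-- the accumulator fully evaluated at every step
data Stats = Stats !Int !Int !Int
  deriving (Eq, Show)

addSample :: Stats -> Int -> Stats
addSample (Stats n s q) x = Stats (n + 1) (s + x) (q + x * x)

-- foldl' forces the accumulator as it goes, so no chain of thunks
-- builds up over the 2000^2 distances
statsOf :: [Int] -> Stats
statsOf = foldl' addSample (Stats 0 0 0)
```

With foldr instead, every intermediate `addSample` application would be suspended until the whole list had been generated, which matches the "memory grows and then explodes" symptom.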
Daniel, thank you for all your advice.

An additional bang pattern in convertIntToDouble fixed the issue! Also,
using foldl' did the trick. Now the program runs as it should, with a
constant amount of memory and in a very small amount of time.

I believe these problems are one of the major sources of frustration
for Haskell newbies. Things that work easily in <X> language suddenly
become problems in Haskell. When you overcome these issues, you feel
happy again that you chose Haskell as the main programming language of
your research project.

Is there any guide that explains more about the "bad consumption
pattern"? Are there any general rules for avoiding these issues? It
helped me to re-read the chapter on profiling in the Real World Haskell
book to sorta understand the problem. Is there a more detailed
treatment of the problem than in RWH?

Regards,

Arnoldo

On Tue, Apr 20, 2010 at 2:49 AM, Daniel Fischer <[hidden email]> wrote:
> On Monday 19 April 2010 17:53:27, Arnoldo Muller wrote:
In reply to this post by Daniel Fischer-4
Daniel Fischer wrote:
> Heinrich Apfelmus:
>> For instance, your expression can be replaced by
>>
>>     filter (/=0) [hammingX x y | (x:xs) <- tails example, y <- xs]
>>
>> which cuts the total running time in half. It's still quadratic in
>> the length of example. I'm sure there are faster algorithms out there
>> that can bring it down to O(n log n) if you want.
>
> I don't think so. You can't calculate the Hamming distance of x and z
> from the distances between x and y and y and z, so you'd have to
> consider all pairs of distinct strings, wouldn't you?

And there I was sure about something once, only to see that it's
actually really doubtful... ;)

The thing about the Hamming distance is that it's not a black box, so
you can't get a lower bound by counting the minimum number of calls to
hamming that have to be made. You are essentially arguing that the
different Hamming distances are independent, which they are not. Not to
mention that there are also "black-box" restrictions like the triangle
inequality

    d(x,z) <= d(x,y) + d(y,z)

but that one is probably harmless. In short, the situation is similar
to how the sorting bound O(n*log n) does not apply to radix sort.

Still, you are right to question my O(n*log n) claim; so far, my
attempts at finding such an algorithm myself have failed. More
precisely, the goal is to make a histogram of the possible Hamming
distances. We need at least O(n*w) time to do that, where n is the
number of strings and w their maximum length; after all, we have to
"touch" every character. For simplicity, assume that the characters are
just one bit each. Furthermore, we can assume that w <= log n;
otherwise, there are lots of duplicate strings which can be grouped
together. In this notation, the simple algorithm takes O(n^2*w) time.

I did find a straightforward divide-and-conquer algorithm to tally
small Hamming distances, but it's not good enough for the whole
histogram.
Here's the specification:

    countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
    countHamming d xs ys = length [() | x <- xs, y <- ys, hamming x y == d]

In other words, countHamming d xs ys counts the number of pairings
(x,y) whose Hamming distance is exactly d.

Now, the idea is that this can be made faster for small d. For
instance, for d == 0, we are essentially just calculating the number of
common elements of xs and ys. By requiring that xs and ys be sorted,
this can be done in linear time:

    countHamming 0 xs ys = ... a variation on  merge xs ys

And for slightly larger d, we can partition the lists by their first
bits and continue recursively:

    countHamming _ [] [] = 0
    countHamming d xs ys =
          countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
        + countHamming d x0 y0 + countHamming d x1 y1
        where
        (x0,x1) = partitionByHead xs
        (y0,y1) = partitionByHead ys

        partitionByHead xs = (behead True xs, behead False xs)
        behead b xs = [bs | (b':bs) <- xs, b == b']

To estimate the running time, we set n = length (xs ++ ys) and let

    T(d,n) = running time of countHamming d xs ys

We started with T(0,n) = O(n) and want to discuss the recursive case.
The idea is that each list is divided in half, so we get

    T(d,n) = 2*T(d-1,n/2) + 2*T(d,n/2)

From this, we can calculate

    T(1,n) = 2*T(0,n/2) + 2*T(1,n/2)
           = O(n) + 2*T(1,n/2)        -- remember quicksort!
           = O(n*log n)

    T(2,n) = O(n*log n) + 2*T(2,n/2)
           = O(n*(log n)^2)

and so on, yielding

    T(d,n) = O(n*(log n)^d)

Alas, this can be used to search a dictionary while accounting for
spelling errors, but it's no good for calculating the full histogram,
because it becomes prohibitively expensive when d ~ w/2 ~ (log n)/2.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
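For the curious, the divide-and-conquer counter can be completed into a runnable sketch along these lines. The names are mine; it assumes all strings have the same length, and it spells out the base cases the outline leaves implicit (negative d, and the point where all strings are exhausted).

```haskell
-- naive specification: count pairings at exactly distance d
hamming :: [Bool] -> [Bool] -> Int
hamming xs ys = length (filter not (zipWith (==) xs ys))

countNaive :: Int -> [[Bool]] -> [[Bool]] -> Int
countNaive d xss yss = length [() | x <- xss, y <- yss, hamming x y == d]

-- divide and conquer on the first bit; assumes equal-length strings
countDC :: Int -> [[Bool]] -> [[Bool]] -> Int
countDC d xss yss
  | d < 0 || null xss || null yss = 0
  | all null xss                  -- strings exhausted: distance 0 remains
      = if d == 0 then length xss * length yss else 0
  | otherwise
      = countDC (d - 1) x0 y1 + countDC (d - 1) x1 y0  -- heads differ
      + countDC d x0 y0 + countDC d x1 y1              -- heads agree
  where
    behead b zss = [bs | (b' : bs) <- zss, b == b']
    (x0, x1) = (behead False xss, behead True xss)
    (y0, y1) = (behead False yss, behead True yss)
```

A quick sanity check is to compare `countDC` against `countNaive` on small inputs; they should agree for every d.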
In reply to this post by Arnoldo Muller
Arnoldo Muller wrote:
> I believe these problems are one of the major sources of frustration
> for Haskell newbies. Things that work easily in <X> language suddenly
> become problems in Haskell. When you overcome these issues, you feel
> happy again that you chose Haskell as the main programming language of
> your research project.

Well, the difference between <X> and Haskell is pretty much
unavoidable. If you care about space and time usage, then there is no
way around learning about lazy evaluation and Haskell's execution
model.

> Is there any guide that explains more about the "bad consumption
> pattern"? Are there any general rules defined to avoid these issues?
> It helped me to re-read the chapter on profiling in the Real World
> Haskell book to sorta understand the problem. Is there a more detailed
> definition of the problem than in RWH?

Two of the most commonly occurring patterns are

1) foldl' vs foldl

2)  average xs = sum xs / length xs

    vs

    average = uncurry (/) . foldl' (\(!s,!n) x -> (s+x,n+1)) (0,0)

Other than that, most Haskell books offer a clear exposition of the
reduction model. For instance, there is

    Graham Hutton. Programming in Haskell, chapter 12.

    Richard Bird. Introduction to Functional Programming using Haskell,
    2nd edition, chapter 7.

The wikibook contains some preliminary material, too.

    http://en.wikibooks.org/wiki/Haskell/Graph_reduction

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
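The second pattern above can be made into a runnable comparison as follows. Note the naive one-liner needs a `fromIntegral` to typecheck (since `length` returns an Int), which the version in the post elides.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- two traversals; the whole list is kept alive between sum and length
averageNaive :: [Double] -> Double
averageNaive xs = sum xs / fromIntegral (length xs)

-- one strict traversal; the bang patterns force both accumulators at
-- every step, so the list can be consumed (and GC'd) incrementally
average :: [Double] -> Double
average = uncurry (/) . foldl' (\(!s, !n) x -> (s + x, n + 1)) (0, 0)
```

On a long list produced lazily, the first version holds the entire list in memory between the two passes, while the second runs in constant space.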