After reading Daniel Fischer's message about trying to use EnumSet in
his Sudoku.hs and finding that it was slower when processing a large data set, I decided to do some profiling. I rewrote his solver to use EnumSets and profiled it. The culprit turns out to be the following function which is responsible for 22% of the allocating in the run. Is there a more efficient way to write this function? foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a foldbits _ z 0 = z foldBits f z bs = foldBits' f 0 bs z foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a foldBits' f i bs z | bs == 0 = z | otherwise = z' `seq` foldBits' f i' bs' z' where z' | 1 == bs .&. 1 = f z i | otherwise = z i' = i + 1 bs' = bs `shiftR` 1 ps. I was impressed with how hairy DF's algorithm is and I am not really enough interested in Sudoku to spend the time needed to grok it. So, I decided to try an experiment to see if I could restructure it without understanding it very deeply. Step 1. comment out all the type signatures. Step 2. find the main place that I wanted to change from [Int] to (Set Int) Step 3. compile; make obvious edits; repeat until 0 errors I had it running in a few minutes. I can't imagine doing that in any other programming environment! Cheers, David -------------------------------- David F. Place mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Am Freitag, 7. April 2006 22:57 schrieb David F. Place:
> After reading Daniel Fischer's message about trying to use EnumSet in > his Sudoku.hs and finding that it was slower when processing a large > data set, I decided to do some profiling. I rewrote his solver to > use EnumSets and profiled it. The culprit turns out to be the The main & evil culprit, methinks now, was DiffArray and the small allocation area. Care to re-profile with my SudokuSet.hs ? Unless I overlooked something, I use foldBits only via size (though that's used a lot). > following function which is responsible for 22% of the allocating in > the run. Is there a more efficient way to write this function? > > foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a > foldbits _ z 0 = z > foldBits f z bs = foldBits' f 0 bs z > > foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a > foldBits' f i bs z > > | bs == 0 = z > | otherwise = z' `seq` foldBits' f i' bs' z' > > where z' | 1 == bs .&. 1 = f z i testbit bs 0 ? and foldbits(') is only used for c = Word, so why the polymorphism? > > | otherwise = z > > i' = i + 1 > bs' = bs `shiftR` 1 > > ps. I was impressed with how hairy DF's algorithm is and I am not Now there are comments, I hope they explain what I do. > really enough interested in Sudoku to spend the time needed to grok > it. So, I decided to try an experiment to see if I could restructure > it without understanding it very deeply. > > Step 1. comment out all the type signatures. > Step 2. find the main place that I wanted to change from [Int] to > (Set Int) > Step 3. compile; make obvious edits; repeat until 0 errors > > I had it running in a few minutes. I can't imagine doing that in any > other programming environment! Great! Triple Cheer for Haskell!!! I wonder how different your translation is to mine (I've no experience with bit-twiddling, but I tried to be as cheap as possible) > > Cheers, David > Cheers, Daniel -- "In My Egotistical Opinion, most people's C programs should be indented six feet downward and covered with dirt." -- Blair P. Houghton _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Hello Daniel,
Saturday, April 8, 2006, 4:21:14 AM, you wrote: > Unless I overlooked something, I use foldBits only via size (though that's > used a lot). size of set? there is much faster method - use a table [0..255] -> number of bits in this number seen as set then we split Word to the bytes and count total size of set by adding number of bits set in each byte foldBits can be made faster (may be) by adding strict annotations: foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a foldbits _ z bs | z `seq` bs `seq` False = undefined foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a foldbits' _ i bs z | i `seq` bs `seq` z `seq` False = undefined moreover, GHC don't inline recursive functions! so foldbits' is out of luck and it seems that GHC generates polymorphic version that is of course very-very slow. what you can do? 1. use SPECIALIZE pragma. this allow to make faster version at least for typical cases (a=c=Int, for example) 2. use recursion on the internal foldbits' function. may be this will help to inline and therefore specialize each call to foldbits'. it's better to ask Simon Marlow about this -- Best regards, Bulat mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
On Apr 8, 2006, at 4:24 AM, Bulat Ziganshin wrote: > Hello Daniel, > > Saturday, April 8, 2006, 4:21:14 AM, you wrote: > >> Unless I overlooked something, I use foldBits only via size >> (though that's >> used a lot). > > size of set? there is much faster method - use a table > > [0..255] -> number of bits in this number seen as set Or: bitcount :: Word -> Int bitcount 0 = 0 bitcount x = 1 + bitcount (x .&. (x-1)) -- | /O(1)/. The number of elements in the set. size :: Set a -> Int size (Set w) = bitcount w Taking a look at the generated core (with -O2) , bitcount gets unboxed the way I'd expect, so this might do the trick. > then we split Word to the bytes and count total size of set > by adding number of bits set in each byte > > foldBits can be made faster (may be) by adding strict annotations: > > foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a > foldbits _ z bs | z `seq` bs `seq` False = undefined > > foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a > foldbits' _ i bs z | i `seq` bs `seq` z `seq` False = undefined > > moreover, GHC don't inline recursive functions! so foldbits' is out of > luck and it seems that GHC generates polymorphic version that is of > course very-very slow. what you can do? > > 1. use SPECIALIZE pragma. this allow to make faster version at least > for typical cases (a=c=Int, for example) > > 2. use recursion on the internal foldbits' function. may be this will > help to inline and therefore specialize each call to foldbits'. it's > better to ask Simon Marlow about this > > -- > Best regards, > Bulat mailto:[hidden email] > > _______________________________________________ > Haskell-Cafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskell-cafe Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Thanks Bulat and Robert. I implemented Bulat's idea as the
following. It tests faster than Roberts. I use Robert's to compute the table. The performance seems satisfactory now. size :: Set a -> Int size (Set w) = countBits w where countBits w | w == 0 = 0 | otherwise = countBits (w `shiftR` 8) + bitsTable!(w .&. 0xFF) bitsTable :: Array Word Int bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] bitcount :: Word -> Int bitcount 0 = 0 bitcount x = 1 + bitcount (x .&. (x-1)) On Apr 8, 2006, at 1:21 PM, Robert Dockins wrote: > On Apr 8, 2006, at 4:24 AM, Bulat Ziganshin wrote: > >> Hello Daniel, >> >> Saturday, April 8, 2006, 4:21:14 AM, you wrote: >> >>> Unless I overlooked something, I use foldBits only via size >>> (though that's >>> used a lot). >> >> size of set? there is much faster method - use a table >> >> [0..255] -> number of bits in this number seen as set > > Or: > > bitcount :: Word -> Int > bitcount 0 = 0 > bitcount x = 1 + bitcount (x .&. (x-1)) > > -- | /O(1)/. The number of elements in the set. > size :: Set a -> Int > size (Set w) = bitcount w -------------------------------- David F. Place mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Bulat Ziganshin-2
On Apr 8, 2006, at 4:24 AM, Bulat Ziganshin wrote: > foldBits can be made faster (may be) by adding strict annotations: > > foldBits :: Bits c => (a -> Int -> a) -> a -> c -> a > foldbits _ z bs | z `seq` bs `seq` False = undefined > > foldBits' :: Bits c => (a -> Int -> a) -> Int -> c -> a -> a > foldbits' _ i bs z | i `seq` bs `seq` z `seq` False = undefined Indeed, I had tried this. It is slower for reasons that are mysterious to me. -------------------------------- David F. Place mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by David Place
Hello David,
Saturday, April 8, 2006, 9:58:56 PM, you wrote: > bitsTable :: Array Word Int > bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] bitsTable :: UArray Word Int bitsTable = listArray (0,255) $ map bitcount [0..255] UArray is much faster than Array but can be used only for simple datatypes (integral, floating point, Char, Bool). more info at http://haskell.org/haskellwiki/Arrays btw: you can use "UArray a Bool" to represent sets of ANY type `a` belonging to class Ix. It uses only one bit per elemnt in current implementation -- Best regards, Bulat mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by David Place
On Apr 8, 2006, at 1:58 PM, David F. Place wrote: > Thanks Bulat and Robert. I implemented Bulat's idea as the > following. It tests faster than Roberts. I use Robert's to > compute the table. The performance seems satisfactory now. > > size :: Set a -> Int > size (Set w) = countBits w > where > countBits w > | w == 0 = 0 > | otherwise = countBits (w `shiftR` 8) + bitsTable!(w .&. > 0xFF) > > bitsTable :: Array Word Int > bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] > > bitcount :: Word -> Int > bitcount 0 = 0 > bitcount x = 1 + bitcount (x .&. (x-1)) There's a couple of other nice bit-twiddily things you can do: countBits :: Word -> Int countBits w | w == 0 = 0 | otherwise = countBits (w `shiftR` 8) + bitsTable!(w .&. 0xFF) bitsTable :: Array Word Int bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] bitcount :: Word -> Int bitcount 0 = 0 bitcount x = 1 + bitcount (x .&. (x-1)) lsb :: Word -> Int lsb x = countBits ((x-1) .&. (complement x)) -- stolen from http://aggregate.org/MAGIC/ msb :: Word -> Int msb x0 = let x1 = x0 .|. (x0 `shiftR` 1) x2 = x1 .|. (x1 `shiftR` 2) x3 = x2 .|. (x2 `shiftR` 4) x4 = x3 .|. (x3 `shiftR` 8) x5 = x4 .|. (x4 `shiftR` 16) in countBits x5 - 1 findMinIndex :: Word -> Int findMinIndex 0 = error "EnumSet.findMin: empty set has no minimal element" findMinIndex w = lsb w findMaxIndex :: Word -> Int findMaxIndex 0 = error "EnumSet.findMax: empty set has no maximal element" findMaxIndex w = msb w Which should make all access to the greatest or least element O(1). I guess, come to think of it, all operations on EnumSet are O(1) by virtue of the set size being upper-bounded. At any rate this turns recursion into unboxable straight-line code and I think it does less allocations. Rob Dockins Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Hum,
oddly, these actually slow things down. While the new size brought the sudoku17 time from ~570s down to ~490s, the new findMinIndex/findMaxIndex increased the time to ~515s, although hardly used. Why? Cheers, Daniel Am Sonntag, 9. April 2006 00:54 schrieb Robert Dockins: > On Apr 8, 2006, at 1:58 PM, David F. Place wrote: > > Thanks Bulat and Robert. I implemented Bulat's idea as the > > following. It tests faster than Roberts. I use Robert's to > > compute the table. The performance seems satisfactory now. > > > > size :: Set a -> Int > > size (Set w) = countBits w > > where > > countBits w > > > > | w == 0 = 0 > > | otherwise = countBits (w `shiftR` 8) + bitsTable!(w .&. 0xFF) > > > > bitsTable :: Array Word Int > > bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] > > > > bitcount :: Word -> Int > > bitcount 0 = 0 > > bitcount x = 1 + bitcount (x .&. (x-1)) > > There's a couple of other nice bit-twiddily things you can do: > > countBits :: Word -> Int > countBits w > > | w == 0 = 0 > | otherwise = countBits (w `shiftR` 8) + bitsTable!(w .&. 0xFF) > > bitsTable :: Array Word Int > bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] > > bitcount :: Word -> Int > bitcount 0 = 0 > bitcount x = 1 + bitcount (x .&. (x-1)) > > lsb :: Word -> Int > lsb x = countBits ((x-1) .&. (complement x)) > > -- stolen from http://aggregate.org/MAGIC/ > msb :: Word -> Int > msb x0 = let > x1 = x0 .|. (x0 `shiftR` 1) > x2 = x1 .|. (x1 `shiftR` 2) > x3 = x2 .|. (x2 `shiftR` 4) > x4 = x3 .|. (x3 `shiftR` 8) > x5 = x4 .|. (x4 `shiftR` 16) > in countBits x5 - 1 > > > findMinIndex :: Word -> Int > findMinIndex 0 = > error "EnumSet.findMin: empty set has no minimal element" > findMinIndex w = lsb w > > findMaxIndex :: Word -> Int > findMaxIndex 0 = > error "EnumSet.findMax: empty set has no maximal element" > findMaxIndex w = msb w > > > > Which should make all access to the greatest or least element O(1). > I guess, come to think of it, all operations on EnumSet are O(1) by > virtue of the set size being upper-bounded. At any rate this turns > recursion into unboxable straight-line code and I think it does less > allocations. > > > > Rob Dockins > > Speak softly and drive a Sherman tank. > Laugh hard; it's a long way to the bank. > -- TMBG > _______________________________________________ > Haskell-Cafe mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/haskell-cafe -- "In My Egotistical Opinion, most people's C programs should be indented six feet downward and covered with dirt." -- Blair P. Houghton _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Bugzilla from robdockins@fastmail.fm
Hello Robert,
Sunday, April 9, 2006, 2:54:58 AM, you wrote: > findMinIndex :: Word -> Int > findMaxIndex :: Word -> Int on the other side, these procedures can use the same divide-to-bytes technique as `size` findMinIndex 0 = undefined findMinIndex n = case (n `shiftR` 8) of 0 -> minIndexInByte ! (n .&. 255) b -> 8 + findMinIndex b something like this should work -- Best regards, Bulat mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Hi Bulat,
On Apr 9, 2006, at 6:31 AM, Bulat Ziganshin wrote: > on the other side, these procedures can use the same divide-to-bytes > technique as `size` > > findMinIndex 0 = undefined > findMinIndex n = case (n `shiftR` 8) of > 0 -> minIndexInByte ! (n .&. 255) > b -> 8 + findMinIndex b > > something like this should work In an email to me, Jean-Philippe Bernardy expressed a concern that a large table could needlessly fill the data cache. He proposed checking 4 bits at a time and using a small table of 16 elements. Not surprisingly, it isn't as fast. I wonder what you think of this. Also, I wonder what would be a good test to demonstrate this possible interaction with the cache. Cheers, David ps. Thanks for the tip about UArray. -------------------------------- David F. Place mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Hello David,
Sunday, April 9, 2006, 5:47:03 PM, you wrote: > In an email to me, Jean-Philippe Bernardy expressed a concern that a > large table could needlessly fill the data cache. He proposed > checking 4 bits at a time and using a small table of 16 elements. > Not surprisingly, it isn't as fast. I wonder what you think of > this. Also, I wonder what would be a good test to demonstrate this > possible interaction with the cache. it depends on the actual task, CPU and other programs running at the same time :) i just can say that it will be better to use array of Word8 (with `fromIntegral` for conversion) and use `unsafeAt` for indexing -- Best regards, Bulat mailto:[hidden email] _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Daniel Fischer-4
On Apr 8, 2006, at 9:30 PM, Daniel Fischer wrote: > Hum, > oddly, these actually slow things down. > While the new size brought the sudoku17 time from ~570s down to ~490s, > the new findMinIndex/findMaxIndex increased the time to ~515s, > although hardly > used. > Why? Hard to say. I'd expect that if the bit twiddly parts get turned directly into the corresponding opcodes, it would help (but that's not certain). It's possible that GHC munges things somewhere in the pipe and accidently unoptimizes; its possible that strictness isn't correctly discovered in 'msb'. Or, it could be that whatever (probably superscalar) chip you're running does a better job with the loop (although that's hard to believe...), or its some sort of cache effect... or who knows. You'd probably have to study the core and/or disassembly to figure out exactly what's happened. I suppose its possible that the change had some bizarre ripple effect that somehow suppressed a helpful optimization in some other part of the program. At this point it sort of becomes black magic, and I must confess I'm no magician. Its disappointing that those lovely, inscrutable algorithms don't help, though ;-) Rob Dockins > Cheers, > Daniel > > Am Sonntag, 9. April 2006 00:54 schrieb Robert Dockins: >> On Apr 8, 2006, at 1:58 PM, David F. Place wrote: >>> Thanks Bulat and Robert. I implemented Bulat's idea as the >>> following. It tests faster than Roberts. I use Robert's to >>> compute the table. The performance seems satisfactory now. >>> >>> size :: Set a -> Int >>> size (Set w) = countBits w >>> where >>> countBits w >>> >>> | w == 0 = 0 >>> | otherwise = countBits (w `shiftR` 8) + bitsTable! >>> (w .&. 0xFF) >>> >>> bitsTable :: Array Word Int >>> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] >>> >>> bitcount :: Word -> Int >>> bitcount 0 = 0 >>> bitcount x = 1 + bitcount (x .&. (x-1)) >> >> There's a couple of other nice bit-twiddily things you can do: >> >> countBits :: Word -> Int >> countBits w >> >> | w == 0 = 0 >> | otherwise = countBits (w `shiftR` 8) + bitsTable!(w .&. 0xFF) >> >> bitsTable :: Array Word Int >> bitsTable = array (0,255) $ [(i,bitcount i) | i <- [0..255]] >> >> bitcount :: Word -> Int >> bitcount 0 = 0 >> bitcount x = 1 + bitcount (x .&. (x-1)) >> >> lsb :: Word -> Int >> lsb x = countBits ((x-1) .&. (complement x)) >> >> -- stolen from http://aggregate.org/MAGIC/ >> msb :: Word -> Int >> msb x0 = let >> x1 = x0 .|. (x0 `shiftR` 1) >> x2 = x1 .|. (x1 `shiftR` 2) >> x3 = x2 .|. (x2 `shiftR` 4) >> x4 = x3 .|. (x3 `shiftR` 8) >> x5 = x4 .|. (x4 `shiftR` 16) >> in countBits x5 - 1 >> >> >> findMinIndex :: Word -> Int >> findMinIndex 0 = >> error "EnumSet.findMin: empty set has no minimal element" >> findMinIndex w = lsb w >> >> findMaxIndex :: Word -> Int >> findMaxIndex 0 = >> error "EnumSet.findMax: empty set has no maximal element" >> findMaxIndex w = msb w >> >> >> >> Which should make all access to the greatest or least element O(1). >> I guess, come to think of it, all operations on EnumSet are O(1) by >> virtue of the set size being upper-bounded. At any rate this turns >> recursion into unboxable straight-line code and I think it does less >> allocations. >> >> >> >> Rob Dockins >> >> Speak softly and drive a Sherman tank. >> Laugh hard; it's a long way to the bank. >> -- TMBG >> _______________________________________________ >> Haskell-Cafe mailing list >> [hidden email] >> http://www.haskell.org/mailman/listinfo/haskell-cafe > > -- > > "In My Egotistical Opinion, most people's C programs should be > indented six feet downward and covered with dirt." > -- Blair P. Houghton > Speak softly and drive a Sherman tank. Laugh hard; it's a long way to the bank. -- TMBG _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Free forum by Nabble | Edit this page |