|
Hi,
for one of my applications, I am generating huge IntSets (and IntMaps), and I am regularly hitting the memory constraints of my machine. Hence I am looking for a better implementation of IntSets with smaller memory foot print, at least in the case of dense sets, e.g. those where it is likely that a few values are equal in all but the lower 5 or 6 bits. So instead of a simple tree, I re-implemented IntSet as a tree where the nodes contain bit maps of one machine word (32 or 64 entries): Instead of Data.IntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] * +--* | +--* | | +--* | | | +-- 1 | | | +-- 2 | | +-- 42 | +-- 100 +--* +-- -3 +-- -1 we have Data.DenesIntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] * +--* | +-- 0 0000000000000000000001000000000000000000000000000000000000000110 | +-- 64 0000000000000000000000000001000000000000000000000000000000000000 +-- -64 1010000000000000000000000000000000000000000000000000000000000000 The results are promising. I have attached a "progress" graph, comparing the performance of IntMap with DenseIntMap. As you can see, performance is better for most common operations, greatly better for intersection and union, and comparable for fold/filter/findMax/findMin. And here is the memory usage of an IntSet with one million members and varying distance between the entries: $ cat membench.hs import qualified Data.IntSet as S import qualified Data.DenseIntSet as DS import System.Environment main = do [what, sizeS, stepS] <- getArgs let list = [0,read stepS .. read stepS * read sizeS] if what `elem` ["r","R"] then S.size ( S.fromList list) `seq` return () else DS.size (DS.fromList list) `seq` return () For IntSet, the memory consumption does not change notably with the step size: $ ./membench +RTS -t -RTS r 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 80M $ ./membench +RTS -t -RTS r 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 80M $ ./membench +RTS -t -RTS r 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 79M $ ./membench +RTS -t -RTS r 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 79M But for DenseIntMap, we see that dense maps are, as desired, much more compact: $ ./membench +RTS -t -RTS d 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 2M $ ./membench +RTS -t -RTS d 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 3M $ ./membench +RTS -t -RTS d 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 18M $ ./membench +RTS -t -RTS d 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' 77M I have put the code here: https://github.com/nomeata/containers/blob/denseintmap/Data/DenseIntSet.hs It certainly needs a bit more cleanup, e.g. documenting, commenting and maybe improving the test coverage (as, after all, I added quite a bit of logic that can now go wrong). And I did not test the code on a 32 bit system yet. But I’d like to hear from you whether this could in principle go into the containers library (as IntSet, not as DenseIntSet) or if there is a reason not to use this implementation. Thanks, Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
On Sat, 17 Sep 2011, Joachim Breitner wrote: > for one of my applications, I am generating huge IntSets (and IntMaps), > and I am regularly hitting the memory constraints of my machine. Hence I > am looking for a better implementation of IntSets with smaller memory > foot print, at least in the case of dense sets, e.g. those where it is > likely that a few values are equal in all but the lower 5 or 6 bits. It sounds reasonable for me. I guess in my cases the tree will often only contain a few bit-vector nodes. _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Joachim Breitner-2
Hi Joachim,
> for one of my applications, I am generating huge IntSets (and IntMaps), > and I am regularly hitting the memory constraints of my machine. Hence I > am looking for a better implementation of IntSets with smaller memory > foot print, at least in the case of dense sets, e.g. those where it is > likely that a few values are equal in all but the lower 5 or 6 bits. > > So instead of a simple tree, I re-implemented IntSet as a tree where the > nodes contain bit maps of one machine word (32 or 64 entries): > > Instead of > Data.IntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] > * > +--* > | +--* > | | +--* > | | | +-- 1 > | | | +-- 2 > | | +-- 42 > | +-- 100 > +--* > +-- -3 > +-- -1 > > we have > > Data.DenesIntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100] > * > +--* > | +-- 0 0000000000000000000001000000000000000000000000000000000000000110 > | +-- 64 0000000000000000000000000001000000000000000000000000000000000000 > +-- -64 1010000000000000000000000000000000000000000000000000000000000000 > > > The results are promising. I have attached a "progress" graph, comparing > the performance of IntMap with DenseIntMap. As you can see, performance > is better for most common operations, greatly better for intersection > and union, and comparable for fold/filter/findMax/findMin. > > And here is the memory usage of an IntSet with one million members and > varying distance between the entries: > > $ cat membench.hs > import qualified Data.IntSet as S > import qualified Data.DenseIntSet as DS > import System.Environment > > main = do > [what, sizeS, stepS] <- getArgs > let list = [0,read stepS .. read stepS * read sizeS] > if what `elem` ["r","R"] then S.size ( S.fromList list) `seq` return () > else DS.size (DS.fromList list) `seq` return () > > For IntSet, the memory consumption does not change notably with the step > size: > > $ ./membench +RTS -t -RTS r 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 80M > $ ./membench +RTS -t -RTS r 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 80M > $ ./membench +RTS -t -RTS r 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 79M > $ ./membench +RTS -t -RTS r 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 79M > > But for DenseIntMap, we see that dense maps are, as desired, much more compact: > > $ ./membench +RTS -t -RTS d 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 2M > $ ./membench +RTS -t -RTS d 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 3M > $ ./membench +RTS -t -RTS d 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 18M > $ ./membench +RTS -t -RTS d 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"' > 77M > > > I have put the code here: > https://github.com/nomeata/containers/blob/denseintmap/Data/DenseIntSet.hs > > It certainly needs a bit more cleanup, e.g. documenting, commenting and > maybe improving the test coverage (as, after all, I added quite a bit of > logic that can now go wrong). And I did not test the code on a 32 bit > system yet. > > But I’d like to hear from you whether this could in principle go into > the containers library (as IntSet, not as DenseIntSet) or if there is a > reason not to use this implementation. Great job! Yes, this can definitely go into containers as IntSet. It is a neat idea, which vastly decreases memory consumption for dense sets. In case the set is dense many operations are also faster as the tree is of lower depth. The operations which are searching for nonexact value (find minimum, find neighbor of an element, also list elements in ascending order) can go slower, as visible on your findMax. Could you please also generate comparison of and IntSet and DenseIntSet, when the set is sparse (eg [1, 65 .. N])? I would like to see the time complexities then. Also please include toList. If the results are fine, I currently see no reason why not to push the code in. When you are happy with the code, could you please send a pull request on the github containers repo? I (and/or Johan, if he wants) will look at the code. Of course, having good comments and test coverage would be a great help. I am quite busy till Sun 25, but have some time after that. Cheers, Milan _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
Hi,
Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka: > Great job! thanks for the positive feedback. > Could you please also generate comparison of and IntSet and DenseIntSet, > when the set is sparse (eg [1, 65 .. N])? I would like to see the time > complexities then. Also please include toList. If the results are fine, > I currently see no reason why not to push the code in. I have attached some benchmarking result. While results are good for member, great for insert, intersection and union, toList is slower for sparse maps. toList is basically a foldr, so I think the culprit is this function: foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a foldrBits shift f x = go shift where STRICT_1_OF_2(go) go bi 0 = x go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) | otherwise = go (succ bi) (n `shiftRL` 1) I’ll try to optimize this function individually now, any suggestions are welcome. Greetings, Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
On Sun, 18 Sep 2011, Joachim Breitner wrote: > I have attached some benchmarking result. While results are good for > member, great for insert, intersection and union, toList is slower for > sparse maps. toList is basically a foldr, so I think the culprit is this > function: > > foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a > foldrBits shift f x = go shift > where STRICT_1_OF_2(go) > go bi 0 = x > go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) > | otherwise = go (succ bi) (n `shiftRL` 1) > > I’ll try to optimize this function individually now, any suggestions are > welcome. patterns like 1 `shiftL` 32 - 1 `shiftL` 16 1 `shiftL` 16 - 1 `shiftL` 0 _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
On Sun, Sep 18, 2011 at 4:13 PM, Henning Thielemann <[hidden email]> wrote: You can also gain some speed by switching to unsafeShiftRL, to drop the needless comparison. -Edward
_______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Joachim Breitner-2
> Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
> > Great job! > > thanks for the positive feedback. > > > Could you please also generate comparison of and IntSet and DenseIntSet, > > when the set is sparse (eg [1, 65 .. N])? I would like to see the time > > complexities then. Also please include toList. If the results are fine, > > I currently see no reason why not to push the code in. > > I have attached some benchmarking result. Thank you very much. I am quite surprised by the slowdown in the intersection 100 case -- I would expect the times to be nearly identical for the DenseIntSet and IntSet, as seen in the union case. Any ideas? > While results are good for member, great for insert, intersection and > union, toList is slower for sparse maps. toList is basically a foldr, > so I think the culprit is this function: > > foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a > foldrBits shift f x = go shift > where STRICT_1_OF_2(go) > go bi 0 = x > go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) > | otherwise = go (succ bi) (n `shiftRL` 1) > > I’ll try to optimize this function individually now, any suggestions are > welcome. We have to be sure there is no unneeded slowdown because of inefficient Core code -- that all numbers are unpacked, succ is unpacked to primop, a primop is used for the shift, and so on. Also, you can use some additional bit tricks -- for example the following heuristic, which skips last 6 bits if they are all zeroes: > go bi 0 = x > {- heuristic -} go bi n | n .&. 0x3f == 0 = go (bi + 6) (n `shiftRL` 6) > go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) > | otherwise = go (succ bi) (n `shiftRL` 1) Of course, the 6 used can be changed to some other number - some measurement should be made. I expected a square root of the number of bits to behave reasonably well and 6 seemed to be a good compromise for 32 and 64. Cheers, Milan _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Henning Thielemann
Hi,
Am Sonntag, den 18.09.2011, 22:13 +0200 schrieb Henning Thielemann: > On Sun, 18 Sep 2011, Joachim Breitner wrote: > > I have attached some benchmarking result. While results are good for > > member, great for insert, intersection and union, toList is slower for > > sparse maps. toList is basically a foldr, so I think the culprit is this > > function: > > > > foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a > > foldrBits shift f x = go shift > > where STRICT_1_OF_2(go) > > go bi 0 = x > > go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) > > | otherwise = go (succ bi) (n `shiftRL` 1) > > > > I’ll try to optimize this function individually now, any suggestions are > > welcome. > > You can certainly do some binary search by masking and comparing with bit > patterns like > 1 `shiftL` 32 - 1 `shiftL` 16 > 1 `shiftL` 16 - 1 `shiftL` 0 sets. Milan’s suggestion of shifts by 6 might be a good compromise. Another approach might be to first use lowestBitSet to start with the lowest bit. In case of only one bit set, it will not iterate further then. I tried adding some strictness annotation to go and see if that helps. According to the attachement, it does, instead of 4 times slower in the worst case (Size 4 million, step 100) it is only ~2,2x slower. What does "Str=DmdType U(L)U(L)m" in -fdump-stranal mean? Acccording to the core, GHCziPrim.uncheckedShiftRLzh is used, and also succ gets properly resolved to GHCziPrim.zpzh. Intersection is slower because the intersection of a Tip with a Bin cannot just be resolved by a single lookup. Might this be related to the following and if yes, what I can do about it? SpecConstr Function `main:Data.DenseIntSet.intersection{v r1dK} [lidx]' has four call patterns, but the limit is 3 Use -fspec-constr-count=n to set the bound Use -dppr-debug to see specialisations Greetings, Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
On Sun, 18 Sep 2011, Joachim Breitner wrote: > I’d like to avoid the binary search, as it is more expensive for dense > sets. Milan’s suggestion of shifts by 6 might be a good compromise. > Another approach might be to first use lowestBitSet to start with the > lowest bit. In case of only one bit set, it will not iterate further > then. If lowestBitSet is a fast CPU instruction you can iterate through all bits by clearing the lowest set bit after visit and call lowestBitSet again. However, I am not aware of such a machine instruction on x86. _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Joachim Breitner-2
> Am Sonntag, den 18.09.2011, 22:13 +0200 schrieb Henning Thielemann:
> > On Sun, 18 Sep 2011, Joachim Breitner wrote: > > > I have attached some benchmarking result. While results are good for > > > member, great for insert, intersection and union, toList is slower for > > > sparse maps. toList is basically a foldr, so I think the culprit is this > > > function: > > > > > > foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a > > > foldrBits shift f x = go shift > > > where STRICT_1_OF_2(go) > > > go bi 0 = x > > > go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1)) > > > | otherwise = go (succ bi) (n `shiftRL` 1) > > > > > > I’ll try to optimize this function individually now, any suggestions are > > > welcome. > > > > You can certainly do some binary search by masking and comparing with bit > > patterns like > > 1 `shiftL` 32 - 1 `shiftL` 16 > > 1 `shiftL` 16 - 1 `shiftL` 0 > > I’d like to avoid the binary search, as it is more expensive for dense > sets. Milan’s suggestion of shifts by 6 might be a good compromise. > Another approach might be to first use lowestBitSet to start with the > lowest bit. In case of only one bit set, it will not iterate further > then. You could use lowestBitSet iteratively, but I am afraid it will be too slow. > > I tried adding some strictness annotation to go and see if that helps. > According to the attachement, it does, instead of 4 times slower in the > worst case (Size 4 million, step 100) it is only ~2,2x slower. > > > What does "Str=DmdType U(L)U(L)m" in -fdump-stranal mean? U(L) is the best -- the outer U means "unboxed". The (L) is kinda wrong in this unboxed case -- just consider U(L) as U -- that is the way GHC always dumps unboxed ints. > Acccording to the core, GHCziPrim.uncheckedShiftRLzh is used, and also > succ gets properly resolved to GHCziPrim.zpzh. Shouldn't succ be resolved to +1? If I look correctly, you are incrementing an Int. Aha -- succ is checking, whether the Int is 2^32-1. You should probably use (bi+1) instead of (succ bi). > Intersection is slower because the intersection of a Tip with a Bin > cannot just be resolved by a single lookup. You could improve the two Bin vs Tip cases. When you are in that case, call a function 'lookupTip' which will lookup a corresponding Tip in the larger tree (or returns Nil) and then do the .&.. Currently you are doing unnecessary pattern matchings each time you call intersection. (This is actually how the Data.IntSet.intersection works -- the Bin vs Tip case calls member, which patter matches only the bigger tree.) When you mimic the Data.IntSet.intersection more closely, you should get nearly same complexity. Cheers, Milan > Might this be related to the > following and if yes, what I can do about it? > > SpecConstr > Function `main:Data.DenseIntSet.intersection{v r1dK} [lidx]' > has four call patterns, but the limit is 3 > Use -fspec-constr-count=n to set the bound > Use -dppr-debug to see specialisations > > Greetings, > Joachim > > -- > Joachim "nomeata" Breitner > [hidden email] | [hidden email] | GPG: 0x4743206C > xmpp: [hidden email] | http://www.joachim-breitner.de/ > > _______________________________________________ > Libraries mailing list > [hidden email] > http://www.haskell.org/mailman/listinfo/libraries _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Joachim Breitner-2
There are some neat tricks you can use using deBruijn multiplication to optimize finding the least significant set bit.
My geometric coalgebra code in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs#L72 uses the following:
On Sun, Sep 18, 2011 at 4:01 PM, Joachim Breitner <[hidden email]> wrote: Hi, _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
I should have mentioned. This function asumes that the input n is not equal to 0.
You can use it or a nicely ByteArray#'d and uncheckedShiftRL64#'d version together with the iterated masking of bits to walk through the bits in order.
-Edward On Sun, Sep 18, 2011 at 5:07 PM, Edward Kmett <[hidden email]> wrote: There are some neat tricks you can use using deBruijn multiplication to optimize finding the least significant set bit. _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Joachim Breitner-2
Hi,
Am Sonntag, den 18.09.2011, 22:47 +0200 schrieb Joachim Breitner: > I’d like to avoid the binary search, as it is more expensive for dense > sets. Milan’s suggestion of shifts by 6 might be a good compromise. > Another approach might be to first use lowestBitSet to start with the > lowest bit. In case of only one bit set, it will not iterate further > then. > > I tried adding some strictness annotation to go and see if that helps. > According to the attachement, it does, instead of 4 times slower in the > worst case (Size 4 million, step 100) it is only ~2,2x slower. first shifting by lowestBitSet seems to help noticably, at least in my benchmarks, see attachment. Slowdown in the worst case is only 1.25 by now. That is enough for today, the benchmarks take too long and it is late :-) lowestSetBit could maybe be optimized further. There seems to be an operation for that (bsr and bsf), it remains to be checked if this is sufficiently fast. If it is indeed, then this should indeed by done inside the loop, and not just one, as Henning suggests. Can library code easily call such primitive ops directly or would that require support in GHC? Changing succ bi to bi + 1 (not included in attached benchmark yet) seems to have a minor benefit on a pure foldrBits benchmark and a minor degression on a pure toList benchmark (see plot.png) Not sure what that means, probably nothing. > You could improve the two Bin vs Tip cases. When you are in that case, > call a function 'lookupTip'... I’ll do the intersection optimization, although I wonder if ghc would not do that for me if it were allowed to introduce enough call patterns? Good night for now Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Edward A Kmett
Hi,
Am Sonntag, den 18.09.2011, 17:11 -0400 schrieb Edward Kmett: > I should have mentioned. This function asumes that the input n is not > equal to 0. that is fine, that is in invariant that holds for my bit array. But I guess we’d need to use CPP magic to separate bitness – my code uses whatever "Word" means. Or does it work gracefully on 32? How about adding it to Data.Word (then I don’t feel responsible for getting it fast and can just use it :-)) Thanks, Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
On Sun, Sep 18, 2011 at 5:20 PM, Joachim Breitner <[hidden email]> wrote:
Hi, You can use a smaller DeBruijn multiplication table to handle 32 bits, but the one that I gave puts the answer in the most significant 6 bits of a 64 bit word, so if you use it with a 32 bit word, they'll be setting bits you don't have and all answers will be 0.
Also, if you don't care about the particular order in which you visit the bits you could store the bits in a different order in the Word(64) than you would if you just set them directly by preshuffling them. I was able to derive the following operations, which move the repeated lookups out of the fold:
magic = 0x07EDD5E59A4E28C2 offset = 58 preshuffleTable :: UArray Int Word8 preshuffleTable = listArray (0,63) [63,0,58,1,59,47,53,2,60,39,48,27,54,33,42,3,61,51,37,40,49,18,28,20,55,30,34,11,43,14,22,4,
62,57,46,52,38,26,32,41,50,36,17,19,29,10,13,21,56,45,25,31,35,16,9,12,44,24,15,8,23,7,6,5] preshuffle :: Int -> Int preshuffle n = fromIntegral (preshuffleTable ! n) setShuffled :: Word64 -> Int -> Word64 setShuffled w n = setBit w (preshuffle n) testShuffled :: Word64 -> Int -> Bool testShuffled w n = testBit w (preshuffle n)
clearShuffled :: Word64 -> Int -> Word64 clearShuffled w n = clearBit w (preshuffle n) -- NB: the bits do not come out in order! foldShuffledBits :: (Int -> a -> a) -> a -> Word64 -> a
foldShuffledBits f z 0 = z foldShuffledBits f z n | t <- n .&. negate n = f (fromIntegral (shiftR (t * magic) offset)) (foldShuffledBits f z (xor n t)) unshuffleTable :: UArray Int Word8
unshuffleTable = listArray (0,63) [1,3,7,15,31,63,62,61,59,54,45,27,55,46,29,58,53,42,21,43,23,47,30,60,57,50,37,11,22,44,25, 51,38,13,26,52,41,18,36,9,19,39,14,28,56,49,34,5,10,20,40,17,35,6,12,24,48,33,2,4,8,16,32,0]
I can derive a similar magic, offset and preShuffle table could be derived for 32 bits, I just haven't needed it. The caveat is that you'd get the bits out in the wrong order for toAscList, but if you are concerned with the performance of foldBits it may be a win in other areas and you can loop and test shuffled bits for when you really do need the values in ascending order.
-Edward _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
Probably the simplest version is
magic = 0x07EDD5E59A4E28C2 offset = 58 lsbTable :: UArray Word64 Word8 lsbTable = listArray (0, 63)
[ 63, 0, 58, 1, 59, 47, 53, 2 , 60, 39, 48, 27, 54, 33, 42, 3 , 61, 51, 37, 40, 49, 18, 28, 20 , 55, 30, 34, 11, 43, 14, 22, 4 , 62, 57, 46, 52, 38, 26, 32, 41
, 50, 36, 17, 19, 29, 10, 13, 21 , 56, 45, 25, 31, 35, 16, 9, 12 , 44, 24, 15, 8, 23, 7, 6, 5 ] foldrBits :: (Int -> a -> a) -> a -> Word64 -> a
foldrBits f z 0 = z foldrBits f z n | t <- n .&. negate n = f (fromIntegral (lsbTable ! shiftR (t * magic) offset)) (foldrBits f z (xor n t)) swapping to lsbTable = listArray (0,31) [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9] magic = 0x077CB531 offset = 27 for the 32 bit case. -Edward_______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Milan Straka
Hi Milan,
Am Sonntag, den 18.09.2011, 23:06 +0200 schrieb Milan Straka: > > Intersection is slower because the intersection of a Tip with a Bin > > cannot just be resolved by a single lookup. > > You could improve the two Bin vs Tip cases. When you are in that case, > call a function 'lookupTip' which will lookup a corresponding Tip in the > larger tree (or returns Nil) and then do the .&.. Currently you are > doing unnecessary pattern matchings each time you call intersection. > (This is actually how the Data.IntSet.intersection works -- the Bin > vs Tip case calls member, which patter matches only the bigger tree.) > > When you mimic the Data.IntSet.intersection more closely, you should get > nearly same complexity. the loookupTip (here called intersectBM, for consistency with similar helpers). If you compare comparison-3-spec-patterns.pdf (the base case) with comparison-intersectBM.pdf, it shows that it is even slower than my code, as follows: -- The intersection of one tip with a map intersectBM :: Prefix -> BitMap -> IntSet -> IntSet intersectBM kx bm (Bin p2 m2 l2 r2) | nomatch kx p2 m2 = Nil | zero kx m2 = intersectBM kx bm l2 | otherwise = intersectBM kx bm r2 intersectBM kx bm (Tip kx' bm') | kx == kx' = tip kx (bm .&. bm') | otherwise = Nil intersectBM kx bm Nil = Nil Inlining did not help much. Floating the constant parameters and having an explicit "go" function die not change anything, I guess I can rely on ghc to do that for me. Adding strictness helps a bit, so that for up to medium sized maps performance is equal, even for sparse maps. Not sure what goes wrong for large sets. BTW, note the great performance boost for very dense maps when it comes to intersection. It’s barely readable any more :-) I have attached the full statistics for the current version (comparison.pdf). The solid line is IntMap, the long dashed line is DenseIntMap and the short dashed line (left axis) is the speed-up in percent, so lower is better. Greetings, Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Edward A Kmett
Hi,
Am Sonntag, den 18.09.2011, 18:46 -0400 schrieb Edward Kmett: > You can use a smaller DeBruijn multiplication table to handle 32 bits, > but the one that I gave puts the answer in the most significant 6 bits > of a 64 bit word, so if you use it with a 32 bit word, they'll be > setting bits you don't have and all answers will be 0. I’m not sure how to cleanly select the right implementation inside Data.IntMap. Also, we have architectures with 31 bit words (s390, at least). If anything, then the functions ought to move into the Bits type class and into GHC.Word. There it makes also more sense to select architecture-specific implementations, e.g. the primitive operation on x86. > Also, if you don't care about the particular order in which you visit > the bits you could store the bits in a different order in the Word(64) > than you would if you just set them directly by preshuffling them. The order is unfortunately relevant, and I am not sure I have a use for an arbitrarily ordered fold. Greetings, Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
In reply to this post by Edward A Kmett
On Sun, Sep 18, 2011 at 11:07 PM, Edward Kmett <[hidden email]> wrote:
> There are some neat tricks you can use using deBruijn multiplication to > optimize finding the least significant set bit. > My geometric coalgebra code > in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs#L72 > uses the following: I use this in hashtables also :) https://github.com/gregorycollins/hashtables/blob/master/src/Data/HashTable/Internal/CacheLine.hs#L273 -- Gregory Collins <[hidden email]> _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
|
Hi,
Am Montag, den 19.09.2011, 16:03 +0200 schrieb Gregory Collins: > On Sun, Sep 18, 2011 at 11:07 PM, Edward Kmett <[hidden email]> wrote: > > There are some neat tricks you can use using deBruijn multiplication to > > optimize finding the least significant set bit. > > My geometric coalgebra code > > in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs#L72 > > uses the following: > > I use this in hashtables also :) > > https://github.com/gregorycollins/hashtables/blob/master/src/Data/HashTable/Internal/CacheLine.hs#L273 -- only works with 32-bit values -- ok for us here firstBitSet# :: Int# -> Int# That’s why I feel uneasy about this method. But now we have already three users of a firstBitSet function, all of which had to re-implement the wheel. This is really a strong argument for putting this as a method into Data.Bits and the code into GHC.Word, where machine-specific instructions can be used if found to be faster. Greetings, Joachim -- Joachim "nomeata" Breitner [hidden email] | [hidden email] | GPG: 0x4743206C xmpp: [hidden email] | http://www.joachim-breitner.de/ _______________________________________________ Libraries mailing list [hidden email] http://www.haskell.org/mailman/listinfo/libraries |
| Powered by Nabble | Edit this page |
