Quantcast

Faster IntSet by using BitMaps in the lower branches

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
Hi,

for one of my applications, I am generating huge IntSets (and IntMaps),
and I am regularly hitting the memory constraints of my machine. Hence I
am looking for a better implementation of IntSets with smaller memory
foot print, at least in the case of dense sets, e.g. those where it is
likely that a few values are equal in all but the lower 5 or 6 bits.

So instead of a simple tree, I re-implemented IntSet as a tree where the
nodes contain bit maps of one machine word (32 or 64 entries):

Instead of
Data.IntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100]
*
+--*
|  +--*
|  |  +--*
|  |  |  +-- 1
|  |  |  +-- 2
|  |  +-- 42
|  +-- 100
+--*
   +-- -3
   +-- -1

we have

Data.DenesIntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100]
*
+--*
|  +-- 0 0000000000000000000001000000000000000000000000000000000000000110
|  +-- 64 0000000000000000000000000001000000000000000000000000000000000000
+-- -64 1010000000000000000000000000000000000000000000000000000000000000


The results are promising. I have attached a "progress" graph, comparing
the performance of IntMap with DenseIntMap. As you can see, performance
is better for most common operations, greatly better for intersection
and union, and comparable for fold/filter/findMax/findMin.

And here is the memory usage of an IntSet with one million members and
varying distance between the entries:

$ cat membench.hs
import qualified Data.IntSet as S
import qualified Data.DenseIntSet as DS
import System.Environment

main = do
    [what, sizeS, stepS] <- getArgs
    let list = [0,read stepS .. read stepS * read sizeS]
    if what `elem` ["r","R"] then  S.size ( S.fromList list) `seq` return ()
                             else DS.size (DS.fromList list) `seq` return ()

For IntSet, the memory consumption does not change notably with the step
size:

$ ./membench +RTS -t -RTS r 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
80M
$ ./membench +RTS -t -RTS r 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
80M
$ ./membench +RTS -t -RTS r 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
79M
$ ./membench +RTS -t -RTS r 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
79M

But for DenseIntMap, we see that dense maps are, as desired, much more compact:

$ ./membench +RTS -t -RTS d 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
2M
$ ./membench +RTS -t -RTS d 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
3M
$ ./membench +RTS -t -RTS d 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
18M
$ ./membench +RTS -t -RTS d 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
77M


I have put the code here:
https://github.com/nomeata/containers/blob/denseintmap/Data/DenseIntSet.hs

It certainly needs a bit more cleanup, e.g. documenting, commenting and
maybe improving the test coverage (as, after all, I added quite a bit of
logic that can now go wrong). And I did not test the code on a 32 bit
system yet.

But I’d like to hear from you whether this could in principle go into
the containers library (as IntSet, not as DenseIntSet) or if there is a
reason not to use this implementation.

Thanks,
Joachim

--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

plot.png (18K) Download Attachment
signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Henning Thielemann

On Sat, 17 Sep 2011, Joachim Breitner wrote:

> for one of my applications, I am generating huge IntSets (and IntMaps),
> and I am regularly hitting the memory constraints of my machine. Hence I
> am looking for a better implementation of IntSets with smaller memory
> foot print, at least in the case of dense sets, e.g. those where it is
> likely that a few values are equal in all but the lower 5 or 6 bits.

It sounds reasonable for me. I guess in my cases the tree will often only
contain a few bit-vector nodes.

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Milan Straka
In reply to this post by Joachim Breitner-2
Hi Joachim,

> for one of my applications, I am generating huge IntSets (and IntMaps),
> and I am regularly hitting the memory constraints of my machine. Hence I
> am looking for a better implementation of IntSets with smaller memory
> foot print, at least in the case of dense sets, e.g. those where it is
> likely that a few values are equal in all but the lower 5 or 6 bits.
>
> So instead of a simple tree, I re-implemented IntSet as a tree where the
> nodes contain bit maps of one machine word (32 or 64 entries):
>
> Instead of
> Data.IntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100]
> *
> +--*
> |  +--*
> |  |  +--*
> |  |  |  +-- 1
> |  |  |  +-- 2
> |  |  +-- 42
> |  +-- 100
> +--*
>    +-- -3
>    +-- -1
>
> we have
>
> Data.DenesIntSet> putStr $ showTree $ fromList [-3, -1,1,2,42,100]
> *
> +--*
> |  +-- 0 0000000000000000000001000000000000000000000000000000000000000110
> |  +-- 64 0000000000000000000000000001000000000000000000000000000000000000
> +-- -64 1010000000000000000000000000000000000000000000000000000000000000
>
>
> The results are promising. I have attached a "progress" graph, comparing
> the performance of IntMap with DenseIntMap. As you can see, performance
> is better for most common operations, greatly better for intersection
> and union, and comparable for fold/filter/findMax/findMin.
>
> And here is the memory usage of an IntSet with one million members and
> varying distance between the entries:
>
> $ cat membench.hs
> import qualified Data.IntSet as S
> import qualified Data.DenseIntSet as DS
> import System.Environment
>
> main = do
>     [what, sizeS, stepS] <- getArgs
>     let list = [0,read stepS .. read stepS * read sizeS]
>     if what `elem` ["r","R"] then  S.size ( S.fromList list) `seq` return ()
>                              else DS.size (DS.fromList list) `seq` return ()
>
> For IntSet, the memory consumption does not change notably with the step
> size:
>
> $ ./membench +RTS -t -RTS r 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 80M
> $ ./membench +RTS -t -RTS r 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 80M
> $ ./membench +RTS -t -RTS r 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 79M
> $ ./membench +RTS -t -RTS r 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 79M
>
> But for DenseIntMap, we see that dense maps are, as desired, much more compact:
>
> $ ./membench +RTS -t -RTS d 1000000 1 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 2M
> $ ./membench +RTS -t -RTS d 1000000 2 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 3M
> $ ./membench +RTS -t -RTS d 1000000 10 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 18M
> $ ./membench +RTS -t -RTS d 1000000 100 2>&1 | perl -n -e '/, (\d+M) in use/ && print "$1\n"'
> 77M
>
>
> I have put the code here:
> https://github.com/nomeata/containers/blob/denseintmap/Data/DenseIntSet.hs
>
> It certainly needs a bit more cleanup, e.g. documenting, commenting and
> maybe improving the test coverage (as, after all, I added quite a bit of
> logic that can now go wrong). And I did not test the code on a 32 bit
> system yet.
>
> But I’d like to hear from you whether this could in principle go into
> the containers library (as IntSet, not as DenseIntSet) or if there is a
> reason not to use this implementation.

Great job! Yes, this can definitely go into containers as IntSet. It is
a neat idea, which vastly decreases memory consumption for dense sets.
In case the set is dense many operations are also faster as the tree is
of lower depth. The operations which are searching for nonexact value
(find minimum, find neighbor of an element, also list elements in
ascending order) can go slower, as visible on your findMax.

Could you please also generate comparison of and IntSet and DenseIntSet,
when the set is sparse (eg [1, 65 .. N])? I would like to see the time
complexities then. Also please include toList. If the results are fine,
I currently see no reason why not to push the code in.

When you are happy with the code, could you please send a pull request
on the github containers repo? I (and/or Johan, if he wants) will look
at the code. Of course, having good comments and test coverage would be
a great help. I am quite busy till Sun 25, but have some time after that.

Cheers,
Milan

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
Hi,

Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
> Great job!

thanks for the positive feedback.

> Could you please also generate comparison of and IntSet and DenseIntSet,
> when the set is sparse (eg [1, 65 .. N])? I would like to see the time
> complexities then. Also please include toList. If the results are fine,
> I currently see no reason why not to push the code in.

I have attached some benchmarking result. While results are good for
member, great for insert, intersection and union, toList is slower for
sparse maps. toList is basically a foldr, so I think the culprit is this
function:

foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
foldrBits shift f x = go shift
  where STRICT_1_OF_2(go)
        go bi 0 = x
        go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
                | otherwise     =       go (succ bi) (n `shiftRL` 1)

I’ll try to optimize this function individually now, any suggestions are
welcome.

Greetings,
Joachim

--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

comparison2.pdf (24K) Download Attachment
signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Henning Thielemann

On Sun, 18 Sep 2011, Joachim Breitner wrote:

> I have attached some benchmarking result. While results are good for
> member, great for insert, intersection and union, toList is slower for
> sparse maps. toList is basically a foldr, so I think the culprit is this
> function:
>
> foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
> foldrBits shift f x = go shift
>  where STRICT_1_OF_2(go)
>        go bi 0 = x
>        go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
>                | otherwise     =       go (succ bi) (n `shiftRL` 1)
>
> I’ll try to optimize this function individually now, any suggestions are
> welcome.
You can certainly do some binary search by masking and comparing with bit
patterns like
    1 `shiftL` 32 - 1 `shiftL` 16
    1 `shiftL` 16 - 1 `shiftL` 0


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Edward A Kmett

On Sun, Sep 18, 2011 at 4:13 PM, Henning Thielemann <[hidden email]> wrote:

On Sun, 18 Sep 2011, Joachim Breitner wrote:

I have attached some benchmarking result. While results are good for
member, great for insert, intersection and union, toList is slower for
sparse maps. toList is basically a foldr, so I think the culprit is this
function:

foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
foldrBits shift f x = go shift
 where STRICT_1_OF_2(go)
      go bi 0 = x
      go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
              | otherwise     =       go (succ bi) (n `shiftRL` 1)

I’ll try to optimize this function individually now, any suggestions are
welcome.

You can certainly do some binary search by masking and comparing with bit patterns like
  1 `shiftL` 32 - 1 `shiftL` 16
  1 `shiftL` 16 - 1 `shiftL` 0

You can also gain some speed by switching to unsafeShiftRL, to drop the needless comparison.

-Edward

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Milan Straka
In reply to this post by Joachim Breitner-2
> Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
> > Great job!
>
> thanks for the positive feedback.
>
> > Could you please also generate comparison of and IntSet and DenseIntSet,
> > when the set is sparse (eg [1, 65 .. N])? I would like to see the time
> > complexities then. Also please include toList. If the results are fine,
> > I currently see no reason why not to push the code in.
>
> I have attached some benchmarking result.

Thank you very much. I am quite surprised by the slowdown in the
intersection 100 case -- I would expect the times to be nearly identical for
the DenseIntSet and IntSet, as seen in the union case. Any ideas?

> While results are good for member, great for insert, intersection and
> union, toList is slower for sparse maps. toList is basically a foldr,
> so I think the culprit is this function:
>
> foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
> foldrBits shift f x = go shift
>   where STRICT_1_OF_2(go)
>         go bi 0 = x
>         go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
>                 | otherwise     =       go (succ bi) (n `shiftRL` 1)
>
> I’ll try to optimize this function individually now, any suggestions are
> welcome.

We have to be sure there is no unneeded slowdown because of inefficient
Core code -- that all numbers are unpacked, succ is unpacked to primop,
a primop is used for the shift, and so on.

Also, you can use some additional bit tricks -- for example the
following heuristic, which skips last 6 bits if they are all zeroes:

>                  go bi 0 = x
> {- heuristic -}  go bi n | n .&. 0x3f == 0 = go (bi + 6) (n `shiftRL` 6)
>                  go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
>                          | otherwise     =       go (succ bi) (n `shiftRL` 1)

Of course, the 6 used can be changed to some other number - some
measurement should be made. I expected a square root of the number of
bits to behave reasonably well and 6 seemed to be a good compromise for
32 and 64.

Cheers,
Milan

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
In reply to this post by Henning Thielemann
Hi,

Am Sonntag, den 18.09.2011, 22:13 +0200 schrieb Henning Thielemann:

> On Sun, 18 Sep 2011, Joachim Breitner wrote:
> > I have attached some benchmarking result. While results are good for
> > member, great for insert, intersection and union, toList is slower for
> > sparse maps. toList is basically a foldr, so I think the culprit is this
> > function:
> >
> > foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
> > foldrBits shift f x = go shift
> >  where STRICT_1_OF_2(go)
> >        go bi 0 = x
> >        go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
> >                | otherwise     =       go (succ bi) (n `shiftRL` 1)
> >
> > I’ll try to optimize this function individually now, any suggestions are
> > welcome.
>
> You can certainly do some binary search by masking and comparing with bit
> patterns like
>     1 `shiftL` 32 - 1 `shiftL` 16
>     1 `shiftL` 16 - 1 `shiftL` 0
I’d like to avoid the binary search, as it is more expensive for dense
sets. Milan’s suggestion of shifts by 6 might be a good compromise.
Another approach might be to first use lowestBitSet to start with the
lowest bit. In case of only one bit set, it will not iterate further
then.

I tried adding some strictness annotation to go and see if that helps.
According to the attachement, it does, instead of 4 times slower in the
worst case (Size 4 million, step 100) it is only ~2,2x slower.


What does "Str=DmdType U(L)U(L)m" in -fdump-stranal mean?


Acccording to the core, GHCziPrim.uncheckedShiftRLzh is used, and also
succ gets properly resolved to GHCziPrim.zpzh.


Intersection is slower because the intersection of a Tip with a Bin
cannot just be resolved by a single lookup. Might this be related to the
following and if yes, what I can do about it?

SpecConstr
    Function `main:Data.DenseIntSet.intersection{v r1dK} [lidx]'
      has four call patterns, but the limit is 3
    Use -fspec-constr-count=n to set the bound
    Use -dppr-debug to see specialisations

Greetings,
Joachim

--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Henning Thielemann

On Sun, 18 Sep 2011, Joachim Breitner wrote:

> I’d like to avoid the binary search, as it is more expensive for dense
> sets. Milan’s suggestion of shifts by 6 might be a good compromise.
> Another approach might be to first use lowestBitSet to start with the
> lowest bit. In case of only one bit set, it will not iterate further
> then.

If lowestBitSet is a fast CPU instruction you can iterate through all bits
by clearing the lowest set bit after visit and call lowestBitSet again.
However, I am not aware of such a machine instruction on x86.


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Milan Straka
In reply to this post by Joachim Breitner-2
> Am Sonntag, den 18.09.2011, 22:13 +0200 schrieb Henning Thielemann:
> > On Sun, 18 Sep 2011, Joachim Breitner wrote:
> > > I have attached some benchmarking result. While results are good for
> > > member, great for insert, intersection and union, toList is slower for
> > > sparse maps. toList is basically a foldr, so I think the culprit is this
> > > function:
> > >
> > > foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
> > > foldrBits shift f x = go shift
> > >  where STRICT_1_OF_2(go)
> > >        go bi 0 = x
> > >        go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
> > >                | otherwise     =       go (succ bi) (n `shiftRL` 1)
> > >
> > > I’ll try to optimize this function individually now, any suggestions are
> > > welcome.
> >
> > You can certainly do some binary search by masking and comparing with bit
> > patterns like
> >     1 `shiftL` 32 - 1 `shiftL` 16
> >     1 `shiftL` 16 - 1 `shiftL` 0
>
> I’d like to avoid the binary search, as it is more expensive for dense
> sets. Milan’s suggestion of shifts by 6 might be a good compromise.
> Another approach might be to first use lowestBitSet to start with the
> lowest bit. In case of only one bit set, it will not iterate further
> then.

You could use lowestBitSet iteratively, but I am afraid it will be too
slow.

>
> I tried adding some strictness annotation to go and see if that helps.
> According to the attachement, it does, instead of 4 times slower in the
> worst case (Size 4 million, step 100) it is only ~2,2x slower.
>
>
> What does "Str=DmdType U(L)U(L)m" in -fdump-stranal mean?

U(L) is the best -- the outer U means "unboxed". The (L) is kinda wrong
in this unboxed case -- just consider U(L) as U -- that is the way GHC
always dumps unboxed ints.

> Acccording to the core, GHCziPrim.uncheckedShiftRLzh is used, and also
> succ gets properly resolved to GHCziPrim.zpzh.

Shouldn't succ be resolved to +1? If I look correctly, you are
incrementing an Int. Aha -- succ is checking, whether the Int is 2^32-1.
You should probably use (bi+1) instead of (succ bi).

> Intersection is slower because the intersection of a Tip with a Bin
> cannot just be resolved by a single lookup.

You could improve the two Bin vs Tip cases. When you are in that case,
call a function 'lookupTip' which will lookup a corresponding Tip in the
larger tree (or returns Nil) and then do the .&.. Currently you are
doing unnecessary pattern matchings each time you call intersection.
(This is actually how the Data.IntSet.intersection works -- the Bin
vs Tip case calls member, which patter matches only the bigger tree.)

When you mimic the Data.IntSet.intersection more closely, you should get
nearly same complexity.

Cheers,
Milan

> Might this be related to the
> following and if yes, what I can do about it?
>
> SpecConstr
>     Function `main:Data.DenseIntSet.intersection{v r1dK} [lidx]'
>       has four call patterns, but the limit is 3
>     Use -fspec-constr-count=n to set the bound
>     Use -dppr-debug to see specialisations
>
> Greetings,
> Joachim
>
> --
> Joachim "nomeata" Breitner
>   [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
>   xmpp: [hidden email] | http://www.joachim-breitner.de/
>



> _______________________________________________
> Libraries mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/libraries


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Edward A Kmett
In reply to this post by Joachim Breitner-2
There are some neat tricks you can use using deBruijn multiplication to optimize finding the least significant set bit.


lsb :: Word64 -> Int
lsb n = fromIntegral $ ix ! shiftR ((n .&. (-n)) * 0x07EDD5E59A4E28C2) 58
  where
    -- a 64 bit deBruijn multiplication table
    ix :: UArray Word64 Word8
    ix = listArray (0, 63)
      [ 63, 0, 58, 1, 59, 47, 53, 2
      , 60, 39, 48, 27, 54, 33, 42, 3
      , 61, 51, 37, 40, 49, 18, 28, 20
      , 55, 30, 34, 11, 43, 14, 22, 4
      , 62, 57, 46, 52, 38, 26, 32, 41
      , 50, 36, 17, 19, 29, 10, 13, 21
      , 56, 45, 25, 31, 35, 16, 9, 12
      , 44, 24, 15, 8, 23, 7, 6, 5
      ]

which could be sped up slightly by using a byteArray# directly for the backing store to find the least significant set bit in a Word64.

-Edward

On Sun, Sep 18, 2011 at 4:01 PM, Joachim Breitner <[hidden email]> wrote:
Hi,

Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
> Great job!

thanks for the positive feedback.

> Could you please also generate comparison of and IntSet and DenseIntSet,
> when the set is sparse (eg [1, 65 .. N])? I would like to see the time
> complexities then. Also please include toList. If the results are fine,
> I currently see no reason why not to push the code in.

I have attached some benchmarking result. While results are good for
member, great for insert, intersection and union, toList is slower for
sparse maps. toList is basically a foldr, so I think the culprit is this
function:

foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
foldrBits shift f x = go shift
 where STRICT_1_OF_2(go)
       go bi 0 = x
       go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
               | otherwise     =       go (succ bi) (n `shiftRL` 1)

I’ll try to optimize this function individually now, any suggestions are
welcome.

Greetings,
Joachim

--
Joachim "nomeata" Breitner
 [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
 xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries



_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Edward A Kmett
I should have mentioned. This function asumes that the input n is not equal to 0.

You can use it or a nicely ByteArray#'d and uncheckedShiftRL64#'d version together with the iterated masking of bits to walk through the bits in order.

-Edward

On Sun, Sep 18, 2011 at 5:07 PM, Edward Kmett <[hidden email]> wrote:
There are some neat tricks you can use using deBruijn multiplication to optimize finding the least significant set bit.


lsb :: Word64 -> Int
lsb n = fromIntegral $ ix ! shiftR ((n .&. (-n)) * 0x07EDD5E59A4E28C2) 58
  where
    -- a 64 bit deBruijn multiplication table
    ix :: UArray Word64 Word8
    ix = listArray (0, 63)
      [ 63, 0, 58, 1, 59, 47, 53, 2
      , 60, 39, 48, 27, 54, 33, 42, 3
      , 61, 51, 37, 40, 49, 18, 28, 20
      , 55, 30, 34, 11, 43, 14, 22, 4
      , 62, 57, 46, 52, 38, 26, 32, 41
      , 50, 36, 17, 19, 29, 10, 13, 21
      , 56, 45, 25, 31, 35, 16, 9, 12
      , 44, 24, 15, 8, 23, 7, 6, 5
      ]

which could be sped up slightly by using a byteArray# directly for the backing store to find the least significant set bit in a Word64.

-Edward

On Sun, Sep 18, 2011 at 4:01 PM, Joachim Breitner <[hidden email]> wrote:
Hi,

Am Samstag, den 17.09.2011, 23:37 +0200 schrieb Milan Straka:
> Great job!

thanks for the positive feedback.

> Could you please also generate comparison of and IntSet and DenseIntSet,
> when the set is sparse (eg [1, 65 .. N])? I would like to see the time
> complexities then. Also please include toList. If the results are fine,
> I currently see no reason why not to push the code in.

I have attached some benchmarking result. While results are good for
member, great for insert, intersection and union, toList is slower for
sparse maps. toList is basically a foldr, so I think the culprit is this
function:

foldrBits :: Int -> (Int -> a -> a) -> a -> Word -> a
foldrBits shift f x = go shift
 where STRICT_1_OF_2(go)
       go bi 0 = x
       go bi n | n `testBit` 0 = f bi (go (succ bi) (n `shiftRL` 1))
               | otherwise     =       go (succ bi) (n `shiftRL` 1)

I’ll try to optimize this function individually now, any suggestions are
welcome.

Greetings,
Joachim

--
Joachim "nomeata" Breitner
 [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
 xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries




_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
In reply to this post by Joachim Breitner-2
Hi,

Am Sonntag, den 18.09.2011, 22:47 +0200 schrieb Joachim Breitner:
> I’d like to avoid the binary search, as it is more expensive for dense
> sets. Milan’s suggestion of shifts by 6 might be a good compromise.
> Another approach might be to first use lowestBitSet to start with the
> lowest bit. In case of only one bit set, it will not iterate further
> then.
>
> I tried adding some strictness annotation to go and see if that helps.
> According to the attachement, it does, instead of 4 times slower in the
> worst case (Size 4 million, step 100) it is only ~2,2x slower.

first shifting by lowestBitSet seems to help noticably, at least in my
benchmarks, see attachment. Slowdown in the worst case is only 1.25 by
now. That is enough for today, the benchmarks take too long and it is
late :-)

lowestSetBit could maybe be optimized further. There seems to be an
operation for that (bsr and bsf), it remains to be checked if this is
sufficiently fast. If it is indeed, then this should indeed by done
inside the loop, and not just one, as Henning suggests. Can library code
easily call such primitive ops directly or would that require support in
GHC?

Changing succ bi to bi + 1 (not included in attached benchmark yet)
seems to have a minor benefit on a pure foldrBits benchmark and a minor
degression on a pure toList benchmark (see plot.png)  Not sure what that
means, probably nothing.


> You could improve the two Bin vs Tip cases. When you are in that case,
> call a function 'lookupTip'...

I’ll do the intersection optimization, although I wonder if ghc would
not do that for me if it were allowed to introduce enough call patterns?

Good night for now
Joachim

--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

comparison.pdf (16K) Download Attachment
plot.png (9K) Download Attachment
signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
In reply to this post by Edward A Kmett
Hi,

Am Sonntag, den 18.09.2011, 17:11 -0400 schrieb Edward Kmett:
> I should have mentioned. This function asumes that the input n is not
> equal to 0.

that is fine, that is in invariant that holds for my bit array.

But I guess we’d need to use CPP magic to separate bitness – my code
uses whatever "Word" means. Or does it work gracefully on 32?

How about adding it to Data.Word (then I don’t feel responsible for
getting it fast and can just use it :-))

Thanks,
Joachim

--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Edward A Kmett
On Sun, Sep 18, 2011 at 5:20 PM, Joachim Breitner <[hidden email]> wrote:
Hi,

Am Sonntag, den 18.09.2011, 17:11 -0400 schrieb Edward Kmett:
> I should have mentioned. This function asumes that the input n is not
> equal to 0.

that is fine, that is in invariant that holds for my bit array.

But I guess we’d need to use CPP magic to separate bitness – my code
uses whatever "Word" means. Or does it work gracefully on 32?

How about adding it to Data.Word (then I don’t feel responsible for
getting it fast and can just use it :-))


You can use a smaller DeBruijn multiplication table to handle 32 bits, but the one that I gave puts the answer in the most significant 6 bits of a 64 bit word, so if you use it with a 32 bit word, they'll be setting bits you don't have and all answers will be 0.

Also, if you don't care about the particular order in which you visit the bits you could store the bits in a different order in the Word(64) than you would if you just set them directly by preshuffling them. I was able to derive the following operations, which move the repeated lookups out of the fold:

magic = 0x07EDD5E59A4E28C2
offset = 58

preshuffleTable :: UArray Int Word8
preshuffleTable = listArray (0,63)
  [63,0,58,1,59,47,53,2,60,39,48,27,54,33,42,3,61,51,37,40,49,18,28,20,55,30,34,11,43,14,22,4,
  62,57,46,52,38,26,32,41,50,36,17,19,29,10,13,21,56,45,25,31,35,16,9,12,44,24,15,8,23,7,6,5]

preshuffle :: Int -> Int
preshuffle n = fromIntegral (preshuffleTable ! n)

setShuffled :: Word64 -> Int -> Word64
setShuffled w n = setBit w (preshuffle n)

testShuffled :: Word64 -> Int -> Bool
testShuffled w n = testBit w (preshuffle n)

clearShuffled :: Word64 -> Int -> Word64
clearShuffled w n = clearBit w (preshuffle n)

-- NB: the bits do not come out in order!
foldShuffledBits :: (Int -> a -> a) -> a -> Word64 -> a
foldShuffledBits f z 0 = z
foldShuffledBits f z n | t <- n .&. negate n = f (fromIntegral (shiftR (t * magic) offset)) (foldShuffledBits f z (xor n t))

unshuffleTable :: UArray Int Word8
unshuffleTable = listArray (0,63) 
  [1,3,7,15,31,63,62,61,59,54,45,27,55,46,29,58,53,42,21,43,23,47,30,60,57,50,37,11,22,44,25,
  51,38,13,26,52,41,18,36,9,19,39,14,28,56,49,34,5,10,20,40,17,35,6,12,24,48,33,2,4,8,16,32,0]

I can derive a similar magic, offset and preShuffle table could be derived for 32 bits, I just haven't needed it.

The caveat is that you'd get the bits out in the wrong order for toAscList, but if you are concerned with the performance of foldBits it may be a win in other areas and you can loop and test shuffled bits for when you really do need the values in ascending order.

-Edward

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Edward A Kmett
Probably the simplest version is

magic = 0x07EDD5E59A4E28C2
offset = 58

lsbTable :: UArray Word64 Word8
lsbTable = listArray (0, 63)
   [ 63, 0,  58, 1,  59, 47, 53,  2
   , 60, 39, 48, 27, 54, 33, 42,  3
   , 61, 51, 37, 40, 49, 18, 28, 20
   , 55, 30, 34, 11, 43, 14, 22,  4
   , 62, 57, 46, 52, 38, 26, 32, 41
   , 50, 36, 17, 19, 29, 10, 13, 21
   , 56, 45, 25, 31, 35, 16,  9, 12
   , 44, 24, 15,  8, 23,  7,  6,  5
   ]

foldrBits :: (Int -> a -> a) -> a -> Word64 -> a
foldrBits f z 0 = z
foldrBits f z n | t <- n .&. negate n = f (fromIntegral (lsbTable ! shiftR (t * magic) offset)) (foldrBits f z (xor n t))

swapping to 

lsbTable = listArray (0,31) [0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9]
magic = 0x077CB531
offset = 27
for the 32 bit case.

-Edward

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
In reply to this post by Milan Straka
Hi Milan,

Am Sonntag, den 18.09.2011, 23:06 +0200 schrieb Milan Straka:

> > Intersection is slower because the intersection of a Tip with a Bin
> > cannot just be resolved by a single lookup.
>
> You could improve the two Bin vs Tip cases. When you are in that case,
> call a function 'lookupTip' which will lookup a corresponding Tip in the
> larger tree (or returns Nil) and then do the .&.. Currently you are
> doing unnecessary pattern matchings each time you call intersection.
> (This is actually how the Data.IntSet.intersection works -- the Bin
> vs Tip case calls member, which patter matches only the bigger tree.)
>
> When you mimic the Data.IntSet.intersection more closely, you should get
> nearly same complexity.
Changing the number of specializations did not help, so I implemented
the loookupTip (here called intersectBM, for consistency with similar
helpers). If you compare comparison-3-spec-patterns.pdf (the base case)
with comparison-intersectBM.pdf, it shows that it is even slower than my
code, as follows:

-- The intersection of one tip with a map
intersectBM :: Prefix -> BitMap -> IntSet -> IntSet
intersectBM kx bm (Bin p2 m2 l2 r2)
  | nomatch kx p2 m2 = Nil
  | zero kx m2       = intersectBM kx bm l2
  | otherwise        = intersectBM kx bm r2
intersectBM kx bm (Tip kx' bm')
  | kx == kx' = tip kx (bm .&. bm')
  | otherwise  = Nil
intersectBM kx bm Nil = Nil


Inlining did not help much.

Floating the constant parameters and having an explicit "go" function
die not change anything, I guess I can rely on ghc to do that for me.

Adding strictness helps a bit, so that for up to medium sized maps
performance is equal, even for sparse maps. Not sure what goes wrong for
large sets.

BTW, note the great performance boost for very dense maps when it comes
to intersection. It’s barely readable any more :-)


I have attached the full statistics for the current version
(comparison.pdf). The solid line is IntMap, the long dashed line is
DenseIntMap and the short dashed line (left axis) is the speed-up in
percent, so lower is better.

Greetings,
Joachim

--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

comparison-3-spec-patterns.pdf (17K) Download Attachment
comparison-intersectBM.pdf (17K) Download Attachment
comparison-intersectBM-strict.pdf (17K) Download Attachment
comparison.pdf (31K) Download Attachment
signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
In reply to this post by Edward A Kmett
Hi,

Am Sonntag, den 18.09.2011, 18:46 -0400 schrieb Edward Kmett:
> You can use a smaller DeBruijn multiplication table to handle 32 bits,
> but the one that I gave puts the answer in the most significant 6 bits
> of a 64 bit word, so if you use it with a 32 bit word, they'll be
> setting bits you don't have and all answers will be 0.

I’m not sure how to cleanly select the right implementation inside
Data.IntMap. Also, we have architectures with 31 bit words (s390, at
least). If anything, then the functions ought to move into the Bits type
class and into GHC.Word. There it makes also more sense to select
architecture-specific implementations, e.g. the primitive operation on
x86.

> Also, if you don't care about the particular order in which you visit
> the bits you could store the bits in a different order in the Word(64)
> than you would if you just set them directly by preshuffling them.

The order is unfortunately relevant, and I am not sure I have a use for
an arbitrarily ordered fold.

Greetings,
Joachim


--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

signature.asc (205 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Gregory Collins-3
In reply to this post by Edward A Kmett
On Sun, Sep 18, 2011 at 11:07 PM, Edward Kmett <[hidden email]> wrote:
> There are some neat tricks you can use using deBruijn multiplication to
> optimize finding the least significant set bit.
> My geometric coalgebra code
> in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs#L72
> uses the following:

I use this in hashtables also :)

https://github.com/gregorycollins/hashtables/blob/master/src/Data/HashTable/Internal/CacheLine.hs#L273
--
Gregory Collins <[hidden email]>

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: Faster IntSet by using BitMaps in the lower branches

Joachim Breitner-2
Hi,

Am Montag, den 19.09.2011, 16:03 +0200 schrieb Gregory Collins:

> On Sun, Sep 18, 2011 at 11:07 PM, Edward Kmett <[hidden email]> wrote:
> > There are some neat tricks you can use using deBruijn multiplication to
> > optimize finding the least significant set bit.
> > My geometric coalgebra code
> > in https://github.com/ekmett/algebra/blob/master/Numeric/Coalgebra/Geometric.hs#L72
> > uses the following:
>
> I use this in hashtables also :)
>
> https://github.com/gregorycollins/hashtables/blob/master/src/Data/HashTable/Internal/CacheLine.hs#L273
hmm...

-- only works with 32-bit values -- ok for us here
firstBitSet# :: Int# -> Int#

That’s why I feel uneasy about this method. But now we have already
three users of a firstBitSet function, all of which had to re-implement
the wheel. This is really a strong argument for putting this as a method
into Data.Bits and the code into GHC.Word, where machine-specific
instructions can be used if found to be faster.


Greetings,
Joachim

--
Joachim "nomeata" Breitner
  [hidden email]  |  [hidden email]  |  GPG: 0x4743206C
  xmpp: [hidden email] | http://www.joachim-breitner.de/


_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

signature.asc (205 bytes) Download Attachment
12
Loading...