Faster Array#/MutableArray# copies


Johan Tibell-2
Hi,

Daniel Peebles has done a great job adding new primops for copying
arrays (both Array# and MutableArray#). This gave me a nice speedup
for my hash-array mapped trie (HAMT) data structure, which is now
about as fast as Data.Map for inserts but about 5.5x faster for
lookups. (It should also have a lower per key/value overhead.)

However, the HAMT is still more than twice as slow for inserts compared
to IntMap, and copying arrays is still the bottleneck.

C compilers, like gcc, go to great lengths making memcpy fast and I
was thinking that we might be able to steal a trick or two from them.
I'd like some feedback on these ideas:

 * Could we try to make the array primops inline? If so, would that
mean that if e.g. the number of elements to copy is known at the
(Haskell) call site it would also be known in C-- and we could get the
benefits of knowing that?

 * Could we use built-in compiler rules to catch array copies of known
length and replace them with e.g. unrolled loops? My particular use
case involves copying small arrays (size: 1-32). Ideally this should
be as fast as copying a tuple of the corresponding size but I'm pretty
sure we're far off that goal.

 * Can we use SSE instructions?

 * Can we get the C memcpy code inlined into the C-- source (crazy, I
know). If so we could perhaps benefit directly from optimizations in
libc.
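To make the small-copy pattern concrete, here is a rough sketch (the function names are mine, not a proposed API) of the two copyArray# calls a HAMT node insert performs, going through a list only so the example is self-contained:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts
import GHC.ST (ST(..), runST)

-- Insert x at index i by allocating an array one slot larger and
-- copying the two halves around the insertion point with copyArray#.
insertViaArray :: a -> Int -> [a] -> [a]
insertViaArray x i xs = runST (ST go)
  where
    go s0 = case (length xs, i) of
      (I# n#, I# i#) ->
        case newArray# n# x s0 of                    -- source array, seeded with x
          (# s1, src #) ->
            case fill src 0 xs s1 of                 -- fill it from the input list
              s2 -> case unsafeFreezeArray# src s2 of
                (# s3, frozen #) ->
                  case newArray# (n# +# 1#) x s3 of  -- destination, one slot larger
                    (# s4, dst #) ->
                      case copyArray# frozen 0# dst 0# i# s4 of      -- prefix
                        s5 ->
                          case copyArray# frozen i# dst (i# +# 1#)   -- shifted suffix
                                          (n# -# i#) s5 of
                            s6 -> case unsafeFreezeArray# dst s6 of
                              (# s7, out #) ->
                                (# s7, toList out (I# (n# +# 1#)) 0 #)

    fill ma (I# j#) (y:ys) s =
      case writeArray# ma j# y s of s' -> fill ma (I# (j# +# 1#)) ys s'
    fill _ _ [] s = s

    toList arr m j
      | j >= m    = []
      | otherwise = case j of
          I# j# -> case indexArray# arr j# of
            (# y #) -> y : toList arr m (j + 1)
```

For example, insertViaArray 99 2 [1,2,3,4] yields [1,2,99,3,4]; the point is that both copyArray# calls here have small, often statically-knowable lengths, which is exactly where an unrolled copy should beat a memcpy call.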

Johan

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Re: Faster Array#/MutableArray# copies

Max Bolingbroke-2
On 18 February 2011 01:18, Johan Tibell <[hidden email]> wrote:
> C compilers, like gcc, go to great lengths making memcpy fast and I
> was thinking that we might be able to steal a trick or two from them.
> I'd like some feedback on these ideas:

It seems like a sufficient solution for your needs would be for us to
use the LTO support in LLVM to inline across module boundaries - in
particular to inline primop implementations into their call sites.
LLVM would then probably deal with unrolling small loops with
statically known bounds.

I don't think this would require a major change to GHC, though LTO
would only work with the Gold linker (which only supports ELF) at the
moment :-(

Cheers,
Max


Re: Faster Array#/MutableArray# copies

Roman Leshchinskiy
Max Bolingbroke wrote:
> On 18 February 2011 01:18, Johan Tibell <[hidden email]> wrote:
>
> It seems like a sufficient solution for your needs would be for us to
> use the LTO support in LLVM to inline across module boundaries - in
> particular to inline primop implementations into their call sites. LLVM
> would then probably deal with unrolling small loops with statically known
> bounds.

Could we simply use this?

http://llvm.org/docs/LangRef.html#int_memcpy

Roman





Re: Faster Array#/MutableArray# copies

Roman Leshchinskiy
In reply to this post by Johan Tibell-2
Johan Tibell wrote:
>
> * Could we use built-in compiler rules to catch array copies of known
> length and replace them with e.g. unrolled loops? My particular use case
> involves copying small arrays (size: 1-32). Ideally this should be as fast
> as copying a tuple of the corresponding size but I'm pretty sure we're far
> off that goal.

Out of idle curiosity, couldn't you use tuples instead of arrays?

FWIW, I agree that doing something cleverer than just calling memcpy could
be very worthwhile. As Max points out, you could perhaps try to do
something with the LLVM backend.

Roman





Re: Faster Array#/MutableArray# copies

Nathan Howell-2
In reply to this post by Roman Leshchinskiy
On Fri, Feb 18, 2011 at 12:54 AM, Roman Leshchinskiy <[hidden email]> wrote:
> Max Bolingbroke wrote:
>> On 18 February 2011 01:18, Johan Tibell <[hidden email]> wrote:
>>
>> It seems like a sufficient solution for your needs would be for us to
>> use the LTO support in LLVM to inline across module boundaries - in
>> particular to inline primop implementations into their call sites. LLVM
>> would then probably deal with unrolling small loops with statically known
>> bounds.
>
> Could we simply use this?
>
> http://llvm.org/docs/LangRef.html#int_memcpy

Might be easier to implement a PrimOp inlining pass, and to run it before LLVM's built-in MemCpyOptimization pass [0]. This wouldn't generally be as good as LTO but would work without gold.

[0] http://llvm.org/doxygen/MemCpyOptimizer_8cpp_source.html


Re: Faster Array#/MutableArray# copies

Tyson Whitehead
In reply to this post by Johan Tibell-2
On February 17, 2011 20:18:11 Johan Tibell wrote:
>  * Can we use SSE instructions?
>
>  * Can we get the C memcpy code inlined into the C-- source (crazy, I
> know). If so we could perhaps benefit directly from optimizations in
> libc.

From the standpoint of numerical code, it would be very nice to get some
vector support.  Perhaps this would also give a natural expression of memcpy.

In the spirit of the common "infinite" register file assumption, I've always
imagined idealized arbitrary-length vector types/operations at the C-- level
(mapped to reality via chunking it over fixed length vector instructions).

I see that LLVM supports arbitrary-length vectors as a first-class type:

http://llvm.org/docs/LangRef.html#t_vector
http://llvm.org/docs/LangRef.html#i_add

Looking at the C-- specification (section 2.4 -- page 10)

http://www.cminusminus.org/extern/man2.pdf

it seems this may not be such a good fit as it considers everything as
fixed-size bit collections (bits8, bits16, etc.) and booleans (bool).
Presumably memory-oriented primitive instructions are also out as they
break the strict load/store architecture.  Has anyone thought of how to
do this?

Some half-baked suggestions:

 - perhaps it should be fixed-size bit collections with repetition, or
 - built-in primitives for instruction composition (i.e., a map operation)?
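The chunking idea can be sketched at the Haskell level; addChunk4 below is only a stand-in for a hypothetical 4-wide SIMD primitive, not a real GHC or C-- operation:

```haskell
-- Lower an arbitrary-length vector add onto a fixed-width primitive.
addChunk4 :: [Double] -> [Double] -> [Double]
addChunk4 = zipWith (+)   -- stand-in for one 4-wide SIMD add

vecAdd :: [Double] -> [Double] -> [Double]
vecAdd [] _  = []
vecAdd _  [] = []
vecAdd xs ys =
  addChunk4 (take 4 xs) (take 4 ys) ++ vecAdd (drop 4 xs) (drop 4 ys)
```

The compiler's job would be the inverse direction: take the idealized arbitrary-length operation and emit the chunked loop over fixed-width instructions.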

Cheers!  -Tyson


Re: Faster Array#/MutableArray# copies

Simon Marlow-7
In reply to this post by Nathan Howell-2
On 18/02/2011 19:42, Nathan Howell wrote:

> Might be easier to implement a PrimOp inlining pass, and to run it
> before LLVM's built-in MemCpyOptimization pass [0]. This wouldn't
> generally be as good as LTO but would work without gold.
>
> [0] http://llvm.org/doxygen/MemCpyOptimizer_8cpp_source.html

Ideally you'd want the heap check in the primop to be aggregated into
the calling function's heap check, and the primop should allocate
directly from the heap instead of calling out to the RTS allocate().
All this is a bit much to expect LLVM to do, but we could do it in the
Glorious New Code Generator...

For small arrays like this maybe we should have a new array type that
leaves out all the card-marking stuff too (or just use tuples, as Roman
suggested).

Cheers,
        Simon

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: Faster Array#/MutableArray# copies

Johan Tibell-2
On Mon, Feb 28, 2011 at 9:01 AM, Simon Marlow <[hidden email]> wrote:

> Ideally you'd want the heap check in the primop to be aggregated into the
> calling function's heap check, and the primop should allocate directly from
> the heap instead of calling out to the RTS allocate(). All this is a bit
> much to expect LLVM to do, but we could do it in the Glorious New Code
> Generator...
>
> For small arrays like this maybe we should have a new array type that leaves
> out all the card-marking stuff too (or just use tuples, as Roman suggested).

I might try to use tuples directly. It would be very ugly though, as I
would need a sum type of 32 different tuple sizes.
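For concreteness, a hypothetical sketch of the sum type described above, truncated to four of the 32 arities, which makes the ugliness plain:

```haskell
-- Small "arrays" as a sum of tuple-like constructors,
-- truncated here to arity 4 of the 32 needed.
data Small a
  = S1 a
  | S2 a a
  | S3 a a a
  | S4 a a a a
  -- ... and so on up to S32

-- Even asking for the length needs one case per constructor:
size :: Small a -> Int
size S1{} = 1
size S2{} = 2
size S3{} = 3
size S4{} = 4
```

Every other operation (lookup, insert, copy) would need the same 32-way treatment.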

Johan

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: Faster Array#/MutableArray# copies

Gábor Lehel
On Mon, Feb 28, 2011 at 6:29 PM, Johan Tibell <[hidden email]> wrote:

> I might try to use tuples directly. It would be very ugly though as I
> would need a sum type of 32 different tuple sizes.

You can't random-access the elements of a tuple based on an Int
though, can you? You'd have to do a switch statement (pattern match)
every time you wanted to access a specific element. Unless I'm missing
something.
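A quick sketch of this point: with tuples, indexing by a runtime Int degenerates into an explicit switch, and a separate one would be needed for every arity:

```haskell
-- Indexing a 4-tuple by a runtime Int: there is no O(1) analogue of
-- indexArray#, only a pattern match per position.
index4 :: (a, a, a, a) -> Int -> a
index4 (x, _, _, _) 0 = x
index4 (_, x, _, _) 1 = x
index4 (_, _, x, _) 2 = x
index4 (_, _, _, x) 3 = x
index4 _ _ = error "index4: index out of range"
```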




--
Work is punishment for failing to procrastinate effectively.


Re: Faster Array#/MutableArray# copies

Roman Leshchinskiy
In reply to this post by Simon Marlow-7
Simon Marlow wrote:
>
> For small arrays like this maybe we should have a new array type that
> leaves out all the card-marking stuff too (or just use tuples, as Roman
> suggested).

Would it, in theory, be possible to have an "unpacked" array type? That
is, could we have constructors for which the length of the closure is
determined dynamically at runtime?

Roman





Re: Faster Array#/MutableArray# copies

Simon Marlow-7
On 01/03/2011 11:55, Roman Leshchinskiy wrote:
> Simon Marlow wrote:
>>
>> For small arrays like this maybe we should have a new array type that
>> leaves out all the card-marking stuff too (or just use tuples, as Roman
>> suggested).
>
> Would it, in theory, be possible to have an "unpacked" array type? That
> is, could we have constructors for which the length of the closure is
> determined dynamically at runtime?

Certainly, but the amount of effort to implement depends on what you
want to support. e.g. do you want to support {-# UNPACK #-} on primitive
array types in a constructor field?  That's probably quite hard.  I
believe Duncan Coutts has been thinking along similar lines, we talked
about it once.

Or were you thinking of something more restricted?

Cheers,
        Simon


Re: Faster Array#/MutableArray# copies

Roman Leshchinskiy
Simon Marlow wrote:

> On 01/03/2011 11:55, Roman Leshchinskiy wrote:
>
>>
>> Would it, in theory, be possible to have an "unpacked" array type? That
>>  is, could we have constructors for which the length of the closure is
>> determined dynamically at runtime?
>
> Certainly, but the amount of effort to implement depends on what you
> want to support. e.g. do you want to support {-# UNPACK #-} on primitive
> array types in a constructor field?  That's probably quite hard.  I
> believe Duncan Coutts has been thinking along similar lines, we talked
> about it once.

I can see that supporting this would be rather hard:

data T a = T {-# UNPACK #-} (Array# a)

We would have to allow Array# to point to the middle of a closure and it's
far from obvious how to initialise this since we don't have Array#
literals.

> Or were you thinking of something more restricted?

Yes, I was thinking of some special syntax. Something along these lines:

data T a = T {a}

f x = T {x x x}
g (T {x y z}) = x
h (T xs) = xs{0}

I'm not seriously suggesting this syntax, this is just to demonstrate the
general idea. In the last function, it shouldn't be possible to do
anything with xs except indexing and taking the length.

This would be much easier, right?

Of course, we would also want this for byte arrays...

Roman





Re: Faster Array#/MutableArray# copies

Nathan Howell-2
In reply to this post by Simon Marlow-7
On Mon, Feb 28, 2011 at 9:01 AM, Simon Marlow <[hidden email]> wrote:

> Ideally you'd want the heap check in the primop to be aggregated into the
> calling function's heap check, and the primop should allocate directly from
> the heap instead of calling out to the RTS allocate(). All this is a bit
> much to expect LLVM to do, but we could do it in the Glorious New Code
> Generator...

It's probably (certainly) better to do it in the code generator, but it is rather easy to perform in LLVM and might be a worthwhile optimization during LTO/LTCG.
