Performance of `count`ing a char in a ByteString

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Performance of `count`ing a char in a ByteString

Georg Rudoy
Hi there,

tl;dr: I've noticed ByteString's `count`'s performance can be improved
by a factor between 2.5 and 8 by explicitly using certain SIMD
extensions, so what's the policy on those?

Results of a small C benchmark (since the corresponding BS routine is
implemented in C anyway) are available here (
https://github.com/0xd34df00d/counting-chars ), as well as some sample
code to give the idea of what's happening (which might be
incorrect/suboptimal, my SSE/AVX is rusty). I'll also put a sample
input file if I manage to get git-lfs going.

This change only really makes sense for large strings, but that's a
somewhat common use case at least in my applications, so I wanted to
gather some initial feedback, as well as ask a couple of questions and
emphasize a couple of points.

1. How are SIMD extensions generally controlled?

2. Is depending on gcc OK? If so, we could leverage gcc's support of
automatically generating a dispatching function that'll check cpuid
for the supported extensions and call the right implementation.

3. Does BS' allocator give any guarantees on the alignment? It'd be
nice to get rid of that "prefix" part that makes sure the SIMD code
churns through 64-byte-aligned data.

4. Note that compiling with `-O3 -march=native` makes code effectively
non-portable (at least, to CPUs lacking extensions the compiler
decides to use), while the code with `-O2`, explicit SIMD versions and
a dispatching function will get the best performance using the SIMD
extensions available, degrading gracefully on older hardware. So,
personally, when I look at that table with the results I compare the
SSE4.2/AVX2 implementations against the `-O2` column.

5. Thanks Gregory Collins for mentioning PCMPESTRM somewhere — that
remark sparked the desire to do some intrinsics stuff.

--
  Georg Rudoy
_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Performance of `count`ing a char in a ByteString

Vanessa McHale-2
clang is the default C compiler on Mac. It would be sensible to at least support clang as well as gcc

Cheers,
Vanessa McHale

> On Feb 8, 2020, at 4:34 PM, Georg Rudoy <[hidden email]> wrote:
>
> Hi there,
>
> tl;dr: I've noticed ByteString's `count`'s performance can be improved
> by a factor between 2.5 and 8 by explicitly using certain SIMD
> extensions, so what's the policy on those?
>
> Results of a small C benchmark (since the corresponding BS routine is
> implemented in C anyway) are available here (
> https://github.com/0xd34df00d/counting-chars ), as well as some sample
> code to give the idea of what's happening (which might be
> incorrect/suboptimal, my SSE/AVX is rusty). I'll also put a sample
> input file if I manage to get git-lfs going.
>
> This change only really makes sense for large strings, but that's a
> somewhat common use case at least in my applications, so I wanted to
> gather some initial feedback, as well as ask a couple of questions and
> emphasize a couple of points.
>
> 1. How are SIMD extensions generally controlled?
>
> 2. Is depending on gcc OK? If so, we could leverage gcc's support of
> automatically generating a dispatching function that'll check cpuid
> for the supported extensions and call the right implementation.
>
> 3. Does BS' allocator give any guarantees on the alignment? It'd be
> nice to get rid of that "prefix" part that makes sure the SIMD code
> churns through 64-byte-aligned data.
>
> 4. Note that compiling with `-O3 -march=native` makes code effectively
> non-portable (at least, to CPUs lacking extensions the compiler
> decides to use), while the code with `-O2`, explicit SIMD versions and
> a dispatching function will get the best performance using the SIMD
> extensions available, degrading gracefully on older hardware. So,
> personally, when I look at that table with the results I compare the
> SSE4.2/AVX2 implementations against the `-O2` column.
>
> 5. Thanks Gregory Collins for mentioning PCMPESTRM somewhere — that
> remark sparked the desire to do some intrinsics stuff.
>
> --
>  Georg Rudoy
> _______________________________________________
> Libraries mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Performance of `count`ing a char in a ByteString

Georg Rudoy
Totally forgot about that, thanks for pointing out! Manual dispatch it
is then — a bit less pleasant, but not too bad.

сб, 8 февр. 2020 г., 18:09 Vanessa McHale <[hidden email]>:

>
> clang is the default C compiler on Mac. It would be sensible to at least support clang as well as gcc
>
> Cheers,
> Vanessa McHale
>
> > On Feb 8, 2020, at 4:34 PM, Georg Rudoy <[hidden email]> wrote:
> >
> > Hi there,
> >
> > tl;dr: I've noticed ByteString's `count`'s performance can be improved
> > by a factor between 2.5 and 8 by explicitly using certain SIMD
> > extensions, so what's the policy on those?
> >
> > Results of a small C benchmark (since the corresponding BS routine is
> > implemented in C anyway) are available here (
> > https://github.com/0xd34df00d/counting-chars ), as well as some sample
> > code to give the idea of what's happening (which might be
> > incorrect/suboptimal, my SSE/AVX is rusty). I'll also put a sample
> > input file if I manage to get git-lfs going.
> >
> > This change only really makes sense for large strings, but that's a
> > somewhat common use case at least in my applications, so I wanted to
> > gather some initial feedback, as well as ask a couple of questions and
> > emphasize a couple of points.
> >
> > 1. How are SIMD extensions generally controlled?
> >
> > 2. Is depending on gcc OK? If so, we could leverage gcc's support of
> > automatically generating a dispatching function that'll check cpuid
> > for the supported extensions and call the right implementation.
> >
> > 3. Does BS' allocator give any guarantees on the alignment? It'd be
> > nice to get rid of that "prefix" part that makes sure the SIMD code
> > churns through 64-byte-aligned data.
> >
> > 4. Note that compiling with `-O3 -march=native` makes code effectively
> > non-portable (at least, to CPUs lacking extensions the compiler
> > decides to use), while the code with `-O2`, explicit SIMD versions and
> > a dispatching function will get the best performance using the SIMD
> > extensions available, degrading gracefully on older hardware. So,
> > personally, when I look at that table with the results I compare the
> > SSE4.2/AVX2 implementations against the `-O2` column.
> >
> > 5. Thanks Gregory Collins for mentioning PCMPESTRM somewhere — that
> > remark sparked the desire to do some intrinsics stuff.
> >
> > --
> >  Georg Rudoy
> > _______________________________________________
> > Libraries mailing list
> > [hidden email]
> > http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>
_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries