Haskell performance when it comes to regex?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

Haskell performance when it comes to regex?

Bram Neijt
Dear reader,

I decided to do a little project which is a simple search and replace
program for large text files.

Written in Haskell, it does a few different regex matches on each line
and stores them in a leveldb key-value store to create a
consistent/reviewable search-replace index. It should provide for some
simple/brute-force anonymization of data and therefore I called it
hanon (sorry, could not think of a better name).

https://github.com/BigDataRepublic/hanon

The code works, but I've done some benchmarking to compare it with
Python and the code is about 80x slower then doing the same thing in
Python, making it useless for larger data files.

I'm obviously doing something wrong.

Could you give me tips on improving the performance of this code?
Probably mainly looking at

https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs

where the regex code lives?

Greetings,

Bram
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Станислав Черничкин
Try to use Text or ByteString instead of strings. Try to use compile and execute methods (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html), make sure regex get compiled once.

2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
Dear reader,

I decided to do a little project which is a simple search and replace
program for large text files.

Written in Haskell, it does a few different regex matches on each line
and stores them in a leveldb key-value store to create a
consistent/reviewable search-replace index. It should provide for some
simple/brute-force anonymization of data and therefore I called it
hanon (sorry, could not think of a better name).

https://github.com/BigDataRepublic/hanon

The code works, but I've done some benchmarking to compare it with
Python and the code is about 80x slower then doing the same thing in
Python, making it useless for larger data files.

I'm obviously doing something wrong.

Could you give me tips on improving the performance of this code?
Probably mainly looking at

https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs

where the regex code lives?

Greetings,

Bram
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.



--
Sincerely, Stanislav Chernichkin.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Bram Neijt
Thank you!

I already changed to Text instead, but I thought the regex was already
memoized by GHC, so that should not be a problem.

I'm trying regex-applicative now, maybe that will help, but it takes
some time to figure out the syntax. I'll also try to see if
precompilation helps.

Greetings,

Bram



On Fri, May 19, 2017 at 1:17 PM, Станислав Черничкин
<[hidden email]> wrote:

> Try to use Text or ByteString instead of strings. Try to use compile and
> execute methods
> (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html),
> make sure regex get compiled once.
>
> 2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
>>
>> Dear reader,
>>
>> I decided to do a little project which is a simple search and replace
>> program for large text files.
>>
>> Written in Haskell, it does a few different regex matches on each line
>> and stores them in a leveldb key-value store to create a
>> consistent/reviewable search-replace index. It should provide for some
>> simple/brute-force anonymization of data and therefore I called it
>> hanon (sorry, could not think of a better name).
>>
>> https://github.com/BigDataRepublic/hanon
>>
>> The code works, but I've done some benchmarking to compare it with
>> Python and the code is about 80x slower then doing the same thing in
>> Python, making it useless for larger data files.
>>
>> I'm obviously doing something wrong.
>>
>> Could you give me tips on improving the performance of this code?
>> Probably mainly looking at
>>
>> https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs
>>
>> where the regex code lives?
>>
>> Greetings,
>>
>> Bram
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>
>
>
>
> --
> Sincerely, Stanislav Chernichkin.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Alfredo Di Napoli
Hi Bram,

you might be interested in the “regex” package from my colleague Chris Dornan:


I know some proper performance work still needs to be done, but I would be curious to hear your experience report ;)

Alfredo

On 19 May 2017 at 18:52, Bram Neijt <[hidden email]> wrote:
Thank you!

I already changed to Text instead, but I thought the regex was already
memoized by GHC, so that should not be a problem.

I'm trying regex-applicative now, maybe that will help, but it takes
some time to figure out the syntax. I'll also try to see if
precompilation helps.

Greetings,

Bram



On Fri, May 19, 2017 at 1:17 PM, Станислав Черничкин
<[hidden email]> wrote:
> Try to use Text or ByteString instead of strings. Try to use compile and
> execute methods
> (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html),
> make sure regex get compiled once.
>
> 2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
>>
>> Dear reader,
>>
>> I decided to do a little project which is a simple search and replace
>> program for large text files.
>>
>> Written in Haskell, it does a few different regex matches on each line
>> and stores them in a leveldb key-value store to create a
>> consistent/reviewable search-replace index. It should provide for some
>> simple/brute-force anonymization of data and therefore I called it
>> hanon (sorry, could not think of a better name).
>>
>> https://github.com/BigDataRepublic/hanon
>>
>> The code works, but I've done some benchmarking to compare it with
>> Python and the code is about 80x slower then doing the same thing in
>> Python, making it useless for larger data files.
>>
>> I'm obviously doing something wrong.
>>
>> Could you give me tips on improving the performance of this code?
>> Probably mainly looking at
>>
>> https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs
>>
>> where the regex code lives?
>>
>> Greetings,
>>
>> Bram
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>
>
>
>
> --
> Sincerely, Stanislav Chernichkin.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

David Fox-12
In reply to this post by Станислав Черничкин
I have been surprised at how rarely switching to Text or ByteString makes things significantly faster.  If you do this you should look at Data.ByteString.Builder or Data.Text.Lazy.Builder.

On Fri, May 19, 2017 at 4:17 AM, Станислав Черничкин <[hidden email]> wrote:
Try to use Text or ByteString instead of strings. Try to use compile and execute methods (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html), make sure regex get compiled once.

2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
Dear reader,

I decided to do a little project which is a simple search and replace
program for large text files.

Written in Haskell, it does a few different regex matches on each line
and stores them in a leveldb key-value store to create a
consistent/reviewable search-replace index. It should provide for some
simple/brute-force anonymization of data and therefore I called it
hanon (sorry, could not think of a better name).

https://github.com/BigDataRepublic/hanon

The code works, but I've done some benchmarking to compare it with
Python and the code is about 80x slower then doing the same thing in
Python, making it useless for larger data files.

I'm obviously doing something wrong.

Could you give me tips on improving the performance of this code?
Probably mainly looking at

https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs

where the regex code lives?

Greetings,

Bram
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.



--
Sincerely, Stanislav Chernichkin.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Rein Henrichs
I recommend benchmarking with criterion and GHC profiling so you know where the slow actually is before trying to optimize anything.

On Tue, May 23, 2017 at 9:26 AM David Fox <[hidden email]> wrote:
I have been surprised at how rarely switching to Text or ByteString makes things significantly faster.  If you do this you should look at Data.ByteString.Builder or Data.Text.Lazy.Builder.

On Fri, May 19, 2017 at 4:17 AM, Станислав Черничкин <[hidden email]> wrote:
Try to use Text or ByteString instead of strings. Try to use compile and execute methods (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html), make sure regex get compiled once.

2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
Dear reader,

I decided to do a little project which is a simple search and replace
program for large text files.

Written in Haskell, it does a few different regex matches on each line
and stores them in a leveldb key-value store to create a
consistent/reviewable search-replace index. It should provide for some
simple/brute-force anonymization of data and therefore I called it
hanon (sorry, could not think of a better name).

https://github.com/BigDataRepublic/hanon

The code works, but I've done some benchmarking to compare it with
Python and the code is about 80x slower then doing the same thing in
Python, making it useless for larger data files.

I'm obviously doing something wrong.

Could you give me tips on improving the performance of this code?
Probably mainly looking at

https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs

where the regex code lives?

Greetings,

Bram
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.



--
Sincerely, Stanislav Chernichkin.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Chris Dornan
In reply to this post by Bram Neijt

Hi Bram,

 

Sorry for being a bit late to this -- I have been on the road.

 

I have switched over you example to pre-compile the REs and use ByteString and can see 13x speedup on scan and a 9x speedup on mapping. Curiously, nearly all of that speedup seems to be gained by lifting the RE compilation out of the loop but I am pretty sure there are gains to be had from re-writing the loops.

 

Do you have the Python code that was performing 80x better?

 

Chris

 

 

From: Alfredo Di Napoli <[hidden email]>
Date: Monday, 22 May 2017 at 08:48
To: Bram Neijt <[hidden email]>
Cc: Станислав Черничкин <[hidden email]>, haskell-cafe <[hidden email]>, Chris Dornan <[hidden email]>
Subject: Re: [Haskell-cafe] Haskell performance when it comes to regex?

 

Hi Bram,

 

you might be interested in the “regex” package from my colleague Chris Dornan:

 

 

I know some proper performance work still needs to be done, but I would be curious to hear your experience report ;)

 

Alfredo

 

On 19 May 2017 at 18:52, Bram Neijt <[hidden email]> wrote:

Thank you!

I already changed to Text instead, but I thought the regex was already
memoized by GHC, so that should not be a problem.

I'm trying regex-applicative now, maybe that will help, but it takes
some time to figure out the syntax. I'll also try to see if
precompilation helps.

Greetings,

Bram




On Fri, May 19, 2017 at 1:17 PM, Станислав Черничкин
<[hidden email]> wrote:


> Try to use Text or ByteString instead of strings. Try to use compile and
> execute methods
> (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html),
> make sure regex get compiled once.
>
> 2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
>>
>> Dear reader,
>>
>> I decided to do a little project which is a simple search and replace
>> program for large text files.
>>
>> Written in Haskell, it does a few different regex matches on each line
>> and stores them in a leveldb key-value store to create a
>> consistent/reviewable search-replace index. It should provide for some
>> simple/brute-force anonymization of data and therefore I called it
>> hanon (sorry, could not think of a better name).
>>
>> https://github.com/BigDataRepublic/hanon
>>
>> The code works, but I've done some benchmarking to compare it with
>> Python and the code is about 80x slower then doing the same thing in
>> Python, making it useless for larger data files.
>>
>> I'm obviously doing something wrong.
>>
>> Could you give me tips on improving the performance of this code?
>> Probably mainly looking at
>>
>> https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs
>>
>> where the regex code lives?
>>
>> Greetings,
>>
>> Bram
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>
>
>
>
> --
> Sincerely, Stanislav Chernichkin.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

 


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Bram Neijt
Hi Chris,

Thank you for looking into this and thank you for your pull-request.

I moved the "=~" outside of the map and that makes the whole thing a
huge amount faster.

Seems my assumption that =~ would memoise the regex creation (I read
that in a post on regex in Haskell[1])

The 80% diff is now gone, the Python code was everything without the
leveldb stuff (but still, compiling the regexes every time, so it
seemed like a valid comparison at the time), see attachment for code.

Thank you all for your help!

Bram

[1] http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/

On Sun, May 28, 2017 at 2:22 PM, Chris Dornan <[hidden email]> wrote:

> Hi Bram,
>
>
>
> Sorry for being a bit late to this -- I have been on the road.
>
>
>
> I have switched over you example to pre-compile the REs and use ByteString
> and can see 13x speedup on scan and a 9x speedup on mapping. Curiously,
> nearly all of that speedup seems to be gained by lifting the RE compilation
> out of the loop but I am pretty sure there are gains to be had from
> re-writing the loops.
>
>
>
> Do you have the Python code that was performing 80x better?
>
>
>
> Chris
>
>
>
>
>
> From: Alfredo Di Napoli <[hidden email]>
> Date: Monday, 22 May 2017 at 08:48
> To: Bram Neijt <[hidden email]>
> Cc: Станислав Черничкин <[hidden email]>, haskell-cafe
> <[hidden email]>, Chris Dornan <[hidden email]>
> Subject: Re: [Haskell-cafe] Haskell performance when it comes to regex?
>
>
>
> Hi Bram,
>
>
>
> you might be interested in the “regex” package from my colleague Chris
> Dornan:
>
>
>
> http://regex.uk/
>
>
>
> I know some proper performance work still needs to be done, but I would be
> curious to hear your experience report ;)
>
>
>
> Alfredo
>
>
>
> On 19 May 2017 at 18:52, Bram Neijt <[hidden email]> wrote:
>
> Thank you!
>
> I already changed to Text instead, but I thought the regex was already
> memoized by GHC, so that should not be a problem.
>
> I'm trying regex-applicative now, maybe that will help, but it takes
> some time to figure out the syntax. I'll also try to see if
> precompilation helps.
>
> Greetings,
>
> Bram
>
>
>
>
> On Fri, May 19, 2017 at 1:17 PM, Станислав Черничкин
> <[hidden email]> wrote:
>> Try to use Text or ByteString instead of strings. Try to use compile and
>> execute methods
>>
>> (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html),
>> make sure regex get compiled once.
>>
>> 2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
>>>
>>> Dear reader,
>>>
>>> I decided to do a little project which is a simple search and replace
>>> program for large text files.
>>>
>>> Written in Haskell, it does a few different regex matches on each line
>>> and stores them in a leveldb key-value store to create a
>>> consistent/reviewable search-replace index. It should provide for some
>>> simple/brute-force anonymization of data and therefore I called it
>>> hanon (sorry, could not think of a better name).
>>>
>>> https://github.com/BigDataRepublic/hanon
>>>
>>> The code works, but I've done some benchmarking to compare it with
>>> Python and the code is about 80x slower then doing the same thing in
>>> Python, making it useless for larger data files.
>>>
>>> I'm obviously doing something wrong.
>>>
>>> Could you give me tips on improving the performance of this code?
>>> Probably mainly looking at
>>>
>>> https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs
>>>
>>> where the regex code lives?
>>>
>>> Greetings,
>>>
>>> Bram
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> To (un)subscribe, modify options or view archives go to:
>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>> Only members subscribed via the mailman list are allowed to post.
>>
>>
>>
>>
>> --
>> Sincerely, Stanislav Chernichkin.
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
>
>

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

hanon.py (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Chris Dornan
Thanks Bram,

There is correction in the comments to that Bryan O’Sullivan regex blog
post by Chris Kuklewicz, the author of the regex packages:

> You wrote:
> > There’s normally no need to compile regular expression
> > patterns. A pattern will be compiled the first time it’s used,
> > and your Haskell runtime should memoise the compiled
> > representation for you.
>
> Which is wrong. If you use =~ then the regexp is compiled with makeRegex and probably not cached. If you need to cache the compiled form then you need to use makeRegex and hold > onto the result and then use match separately (and matchM for =~~).

(http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/comment-page-1/#comment-22241)

Pure library function are rarely memoised in Haskell because memorisation
requires side effects to achieve transparently.

I made four main things to optimise ‘hanon’ in my own ‘annie’ experimental
Fork after profiling it:

  * I pre-compiled the REs (this accounted for most of the improvement);
  * as working with Strings are generally slow and ByteStrings just as
    convenient here, I converted everything to work with ByteString
    (though given your UTF8, requirements, Text might
    be more appropriate);
  * I switched from Text.Regex.TDFA to Text.Regex.PCRE (as I think you
    observed, judging by your commit messages), it seems to be somewhat
    faster, at lease for these relatively small data sets;
  * I combined all of the highlighter REs into a single RE of
    alternatives (using a function, leaving the ‘inputPaths’ organisation
    intact) as the multi-pass approach looked like it would be slower but
    also because it seemed to avoid some problems with overlapping REs.

With these changes I saw a ~20x speedup over the ‘hanon’ source code
(committed to master on 2017-05-16). The python code is ~4x faster than this
on your sample ‘hello’ data set (which lines up with your reported 80x differential),
though it only appears to identify the things to be anonymised in the input file,
without carrying out any anonymisation.

You can see my WIP here: https://github.com/cdornan/annie. (A warning, the
main application wit all of the above changes has been rewritten, though
there are variants of the original with the REs being precompiled and
the ability to switch between String and ByteString for comparison
purposes.  It hasn’t been properly documented yet and remains WIP.)

Chris


On 29/05/2017, 15:40, "Bram Neijt" <[hidden email]> wrote:

    Hi Chris,
   
    Thank you for looking into this and thank you for your pull-request.
   
    I moved the "=~" outside of the map and that makes the whole thing a
    huge amount faster.
   
    Seems my assumption that =~ would memoise the regex creation (I read
    that in a post on regex in Haskell[1])
   
    The 80% diff is now gone, the Python code was everything without the
    leveldb stuff (but still, compiling the regexes every time, so it
    seemed like a valid comparison at the time), see attachment for code.
   
    Thank you all for your help!
   
    Bram
   
    [1] http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/
   
    On Sun, May 28, 2017 at 2:22 PM, Chris Dornan <[hidden email]> wrote:
    > Hi Bram,
    >
    >
    >
    > Sorry for being a bit late to this -- I have been on the road.
    >
    >
    >
    > I have switched over you example to pre-compile the REs and use ByteString
    > and can see 13x speedup on scan and a 9x speedup on mapping. Curiously,
    > nearly all of that speedup seems to be gained by lifting the RE compilation
    > out of the loop but I am pretty sure there are gains to be had from
    > re-writing the loops.
    >
    >
    >
    > Do you have the Python code that was performing 80x better?
    >
    >
    >
    > Chris
    >
    >
    >
    >
    >
    > From: Alfredo Di Napoli <[hidden email]>
    > Date: Monday, 22 May 2017 at 08:48
    > To: Bram Neijt <[hidden email]>
    > Cc: Станислав Черничкин <[hidden email]>, haskell-cafe
    > <[hidden email]>, Chris Dornan <[hidden email]>
    > Subject: Re: [Haskell-cafe] Haskell performance when it comes to regex?
    >
    >
    >
    > Hi Bram,
    >
    >
    >
    > you might be interested in the “regex” package from my colleague Chris
    > Dornan:
    >
    >
    >
    > http://regex.uk/
    >
    >
    >
    > I know some proper performance work still needs to be done, but I would be
    > curious to hear your experience report ;)
    >
    >
    >
    > Alfredo
    >
    >
    >
    > On 19 May 2017 at 18:52, Bram Neijt <[hidden email]> wrote:
    >
    > Thank you!
    >
    > I already changed to Text instead, but I thought the regex was already
    > memoized by GHC, so that should not be a problem.
    >
    > I'm trying regex-applicative now, maybe that will help, but it takes
    > some time to figure out the syntax. I'll also try to see if
    > precompilation helps.
    >
    > Greetings,
    >
    > Bram
    >
    >
    >
    >
    > On Fri, May 19, 2017 at 1:17 PM, Станислав Черничкин
    > <[hidden email]> wrote:
    >> Try to use Text or ByteString instead of strings. Try to use compile and
    >> execute methods
    >>
    >> (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html),
    >> make sure regex get compiled once.
    >>
    >> 2017-05-16 12:12 GMT+03:00 Bram Neijt <[hidden email]>:
    >>>
    >>> Dear reader,
    >>>
    >>> I decided to do a little project which is a simple search and replace
    >>> program for large text files.
    >>>
    >>> Written in Haskell, it does a few different regex matches on each line
    >>> and stores them in a leveldb key-value store to create a
    >>> consistent/reviewable search-replace index. It should provide for some
    >>> simple/brute-force anonymization of data and therefore I called it
    >>> hanon (sorry, could not think of a better name).
    >>>
    >>> https://github.com/BigDataRepublic/hanon
    >>>
    >>> The code works, but I've done some benchmarking to compare it with
    >>> Python and the code is about 80x slower then doing the same thing in
    >>> Python, making it useless for larger data files.
    >>>
    >>> I'm obviously doing something wrong.
    >>>
    >>> Could you give me tips on improving the performance of this code?
    >>> Probably mainly looking at
    >>>
    >>> https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs
    >>>
    >>> where the regex code lives?
    >>>
    >>> Greetings,
    >>>
    >>> Bram
    >>> _______________________________________________
    >>> Haskell-Cafe mailing list
    >>> To (un)subscribe, modify options or view archives go to:
    >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
    >>> Only members subscribed via the mailman list are allowed to post.
    >>
    >>
    >>
    >>
    >> --
    >> Sincerely, Stanislav Chernichkin.
    > _______________________________________________
    > Haskell-Cafe mailing list
    > To (un)subscribe, modify options or view archives go to:
    > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
    > Only members subscribed via the mailman list are allowed to post.
    >
    >
   


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Brandon Allbery

On Mon, May 29, 2017 at 3:53 PM, Chris Dornan <[hidden email]> wrote:
  * I switched from Text.Regex.TDFA to Text.Regex.PCRE (as I think you
    observed, judging by your commit messages), it seems to be somewhat
    faster, at lease for these relatively small data sets;

TDFA is pure Haskell and mostly exists for when you can't be certain that a faster C binding will support UTF8 (most POSIX regex implementations do not, and PCRE only does so if someone built it with UTF8 support). When it's usable, the C bindings will almost always be faster.

--
brandon s allbery kf8nh                               sine nomine associates
[hidden email]                                  [hidden email]
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Saurabh Nanda
Sorry for jumping in. IIUC the largest speedup was delivered by moving a reger compilation expression out of a loop. If the regex compilation is a pure function, being passed the exact same arguments during every invocation of the loop, shouldn't the compiler be smart enough to optimise it? Isnt this the basic premise of pure FP? 

On 30-May-2017 2:20 AM, "Brandon Allbery" <[hidden email]> wrote:

On Mon, May 29, 2017 at 3:53 PM, Chris Dornan <[hidden email]> wrote:
  * I switched from Text.Regex.TDFA to Text.Regex.PCRE (as I think you
    observed, judging by your commit messages), it seems to be somewhat
    faster, at lease for these relatively small data sets;

TDFA is pure Haskell and mostly exists for when you can't be certain that a faster C binding will support UTF8 (most POSIX regex implementations do not, and PCRE only does so if someone built it with UTF8 support). When it's usable, the C bindings will almost always be faster.

--
brandon s allbery kf8nh                               sine nomine associates
[hidden email]                                  [hidden email]
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Evan Laforge
In reply to this post by Bram Neijt
On Mon, May 29, 2017 at 7:40 AM, Bram Neijt <[hidden email]> wrote:
> The 80% diff is now gone, the Python code was everything without the
> leveldb stuff (but still, compiling the regexes every time, so it
> seemed like a valid comparison at the time), see attachment for code.

Python's re module has an internal cache, so it probably wasn't
compiling every time.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Chris Dornan
In reply to this post by Chris Dornan
Brandon Allbery [hidden email]:
> TDFA is pure Haskell and mostly exists for when you can't be certain that a faster C binding will support UTF8
> (most POSIX regex implementations do not, and PCRE only does so if someone built it with UTF8 support).
> When it's usable, the C bindings will almost always be faster.

I think we are in agreement, though I would generally recommend TDFA, unless you need the highest performance
(as in this case) or Perl-style REs. But then again I like my REs plain and simple (and Posix) and packages with
as few external dependencies as possible.

Chris



_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Haskell performance when it comes to regex?

Joachim Durchholz
Am 30.05.2017 um 12:08 schrieb Chris Dornan:
> I would generally recommend TDFA, unless you need the highest performance
> (as in this case) or Perl-style REs.

Just for the record: Be aware the PCRE isn't the fastest regex engine
out there. It's generally good, but (a) it's doing backtracking which
can make it exponentially slow, and (b) since Perl regexes have so many
features, the PCRE engine cannot apply all optimizations that a highly
optimized RE engine could.

As always, if speed is an issue, assumptions and word of mouth need to
be verified using benchmarks.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.