Haskell performance when it comes to regex?

Bram Neijt
Dear reader,

I decided to do a little project which is a simple search and replace
program for large text files.

Written in Haskell, it does a few different regex matches on each line
and stores them in a leveldb key-value store to create a
consistent/reviewable search-replace index. It should provide for some
simple/brute-force anonymization of data and therefore I called it
hanon (sorry, could not think of a better name).

https://github.com/BigDataRepublic/hanon

The code works, but I've done some benchmarking to compare it with
Python, and the code is about 80x slower than doing the same thing in
Python, making it useless for larger data files.

I'm obviously doing something wrong.

Could you give me tips on improving the performance of this code?
Probably mainly looking at

https://github.com/BigDataRepublic/hanon/blob/master/src/Mapper.hs

where the regex code lives?

Greetings,

Bram
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

Re: Haskell performance when it comes to regex?

Станислав Черничкин
Try using Text or ByteString instead of String. Try the compile and execute functions (http://hackage.haskell.org/package/regex-tdfa-1.2.1/docs/Text-Regex-TDFA-ByteString.html), and make sure each regex is compiled only once.
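
A minimal sketch of the compile-once advice, assuming the regex-tdfa package and ByteString input. The pattern and the names emailPattern, compileOrDie and countMatchingLines are hypothetical and are not taken from hanon's Mapper.hs; the point is only that the Regex value is built a single time and then passed to the per-line code.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B
import Text.Regex.TDFA (Regex, defaultCompOpt, defaultExecOpt, matchTest)
import Text.Regex.TDFA.ByteString (compile)

-- Hypothetical pattern, for illustration only.
emailPattern :: B.ByteString
emailPattern = "[a-z0-9.]+@[a-z0-9.]+"

-- Compile a pattern, failing loudly if it is malformed.
compileOrDie :: B.ByteString -> Regex
compileOrDie pat = either error id (compile defaultCompOpt defaultExecOpt pat)

-- Count the lines that contain a match, reusing one compiled Regex
-- for every line instead of rebuilding it from the pattern each time.
countMatchingLines :: Regex -> [B.ByteString] -> Int
countMatchingLines re = length . filter (matchTest re)

main :: IO ()
main = do
  let re = compileOrDie emailPattern  -- compiled once, outside the loop
  input <- B.getContents
  print (countMatchingLines re (B.lines input))
```

Passing the compiled Regex as an argument makes the sharing explicit, so it does not depend on what the optimizer decides to float out.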

--
Sincerely, Stanislav Chernichkin.


Re: Haskell performance when it comes to regex?

Bram Neijt
Thank you!

I already switched to Text, but I thought GHC already memoized the
regex, so that should not be a problem.

I'm trying regex-applicative now, maybe that will help, but it takes
some time to figure out the syntax. I'll also try to see if
precompilation helps.
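
For readers unfamiliar with the regex-applicative syntax, here is a small sketch of a search-and-replace written with it. It assumes the regex-applicative package; the names digitsToPlaceholder and anonymizeLine and the "<num>" placeholder are hypothetical. Note that the library matches over lists of symbols, so Text input would have to be unpacked to String first.

```haskell
import Control.Applicative (some)
import Data.Char (isDigit)
import Text.Regex.Applicative (RE, psym, replace)

-- Matches a run of one or more digits; the value of the match is the
-- replacement text rather than the digits themselves.
digitsToPlaceholder :: RE Char String
digitsToPlaceholder = "<num>" <$ some (psym isDigit)

-- Replace every digit run in a line with the placeholder.
anonymizeLine :: String -> String
anonymizeLine = replace digitsToPlaceholder

main :: IO ()
main = interact (unlines . map anonymizeLine . lines)
```

Because the expression is an ordinary Haskell value, there is no separate pattern string to parse at run time.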

Greetings,

Bram




Re: Haskell performance when it comes to regex?

Alfredo Di Napoli
Hi Bram,

you might be interested in the “regex” package from my colleague Chris Dornan:


I know some proper performance work still needs to be done, but I would be curious to hear your experience report ;)

Alfredo


Re: Haskell performance when it comes to regex?

David Fox-12
In reply to this post by Станислав Черничкин
I have been surprised at how rarely switching to Text or ByteString makes things significantly faster. If you do switch, look at Data.ByteString.Builder or Data.Text.Lazy.Builder for assembling the output.
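
A minimal sketch of what using Data.ByteString.Builder might look like, assuming each output line is assembled from several pieces (for example, the text between matches and the replacement tokens). The names renderLine and renderLines and the sample data are hypothetical.

```haskell
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

-- One output line built from its pieces and terminated by a newline.
-- Builders are combined cheaply; no intermediate strict ByteString is
-- allocated for each concatenation.
renderLine :: [B.ByteString] -> BB.Builder
renderLine pieces = mconcat (map BB.byteString pieces) `mappend` BB.char7 '\n'

-- Render all lines into a single lazy ByteString.
renderLines :: [[B.ByteString]] -> BL.ByteString
renderLines = BB.toLazyByteString . foldMap renderLine

main :: IO ()
main =
  BL.putStr
    (renderLines
       [ map B.pack ["user ", "<email-1>", " logged in"]
       , [B.pack "done"]
       ])
```

Data.Text.Lazy.Builder follows the same pattern for Text, with fromText and toLazyText in place of byteString and toLazyByteString.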


Re: Haskell performance when it comes to regex?

Rein Henrichs
I recommend benchmarking with criterion and using GHC profiling so you know where the slowness actually is before trying to optimize anything.
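
A minimal criterion sketch, assuming the criterion package. The function maskDigits is a hypothetical stand-in for the real per-line work and sampleLines is made-up input; in practice the benchmark would call the actual matching code on representative data.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, nf)
import Data.Char (isDigit)

-- Hypothetical stand-in for the per-line processing being measured.
maskDigits :: String -> String
maskDigits = map (\c -> if isDigit c then '#' else c)

-- Made-up input: 1000 copies of one log-like line.
sampleLines :: [String]
sampleLines = replicate 1000 "user 12345 logged in from 10.0.0.1"

main :: IO ()
main =
  defaultMain
    [ bgroup
        "per-line work"
        -- nf forces the whole result, so laziness cannot hide the cost.
        [bench "maskDigits, 1000 lines" (nf (map maskDigits) sampleLines)]
    ]
```

For profiling, building with GHC's -prof and -fprof-auto options and running the program with +RTS -p writes a .prof report with per-cost-centre time and allocation figures.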
