JRegex on "large" input sizes

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

JRegex on "large" input sizes

David House
Hi all. I need a decent regex library and JRegex seems the perfect
choice: simple API, yet well-featured, as well as PCRE support. I want
to use it on a simple project which involves input files a little
larger than typical -- between 100KB and 500KB -- but still small
enough so as to not present a problem.

However, and I'm fairly sure JRegex is at fault here, my program
segfaults on an input of ~230KB. Has anyone used JRegex successfully
in this way before? If so, what tactics did you use?

Thanks in advance.

--
-David House, [hidden email]
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: JRegex on "large" input sizes

Donald Bruce Stewart
dmhouse:

> Hi all. I need a decent regex library and JRegex seems the perfect
> choice: simple API, yet well-featured, as well as PCRE support. I want
> to use it on a simple project which involves input files a little
> larger than typical -- between 100KB and 500KB -- but still small
> enough so as to not present a problem.
>
> However, and I'm fairly sure JRegex is at fault here, my program
> segfaults on an input of ~230KB. Has anyone used JRegex successfully
> in this way before? If so, what tactics did you use?
>
> Thanks in advance.
>

Perhaps try Text.Regex.Lazy? We used it with success on 5-10M files in
the shootout.

-- Don
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: JRegex on "large" input sizes

haskell-2
In reply to this post by David House
David House wrote:
 > Hi all. I need a decent regex library and JRegex seems the perfect
 > choice: simple API, yet well-featured, as well as PCRE support.

I "maintain" Text.Regex.Lazy ( http://sourceforge.net/projects/lazy-regex ) so I
would mention it does not have full PCRE support.  The module's documentation (
summarize here http://sourceforge.net/forum/forum.php?forum_id=554104 ) explains
what it does have.  In summary of summary:

For simple Regex usage (with capture) the Text.Regex.Lazy.Compat module replaces
Text.Regex with a better implementation.

For simple expressions where a DFA works, the CompatDFA is fastest.

For fancier Regexes (such as using lazy pattern with ?? *? and +?) the
Text.Regex.Lazy.Full extends Text.Regex.Lazy.Compat.

For much fancier regular expressions (e.g. PCRE) you would need to add two
hopefully simple pieces:
(1) Extend the parsec code used to comprehend the meaning of the regex string.
(2) Extend the code that produces the Parsec parser that implements the desired
matching semantics.
(3) Test cases for the expanded syntax and semantics.

Note that Text.Regex.Lazy is an all Haskell solution.  There are other haskell
projects that wrap the standard regex/pcre libraries.  The problem is that
marshaling [Char] to c-strings is quite slow and cannot be lazy, so you may want
to use the new Fast Packed String (now ByteString) library with foreign
functions to call the pcre c-library.

 > I want
 > to use it on a simple project which involves input files a little
 > larger than typical -- between 100KB and 500KB -- but still small
 > enough so as to not present a problem.
 >
 > However, and I'm fairly sure JRegex is at fault here, my program
 > segfaults on an input of ~230KB. Has anyone used JRegex successfully
 > in this way before? If so, what tactics did you use?
 >
 > Thanks in advance.
 >


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: JRegex on "large" input sizes

David House
On 01/07/06, Chris Kuklewicz <[hidden email]> wrote:
> For fancier Regexes (such as using lazy pattern with ?? *? and +?) the
> Text.Regex.Lazy.Full extends Text.Regex.Lazy.Compat.

The non-greedy modifier is really what I need, so I'll check it out. Thanks.

--
-David House, [hidden email]
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: JRegex on "large" input sizes

Gregory Woodhouse
In reply to this post by David House

On Jul 1, 2006, at 6:27 AM, David House wrote:

Hi all. I need a decent regex library and JRegex seems the perfect

choice: simple API, yet well-featured, as well as PCRE support. I want

to use it on a simple project which involves input files a little

larger than typical -- between 100KB and 500KB -- but still small

enough so as to not present a problem.


However, and I'm fairly sure JRegex is at fault here, my program

segfaults on an input of ~230KB. Has anyone used JRegex successfully

in this way before? If so, what tactics did you use?


Thanks in advance.


I have no idea how JRegex works, but regular expressions are normally implemented using finite automata, so the input size should be immaterial.

Gregory Woodhouse

"Judge a man by his questions not 
his answers."   --Voltaire




_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: JRegex on "large" input sizes

haskell-2
Gregory Woodhouse wrote:

>
> On Jul 1, 2006, at 6:27 AM, David House wrote:
>
>> Hi all. I need a decent regex library and JRegex seems the perfect
>>
>> choice: simple API, yet well-featured, as well as PCRE support. I want
>>
>> to use it on a simple project which involves input files a little
>>
>> larger than typical -- between 100KB and 500KB -- but still small
>>
>> enough so as to not present a problem.
>>
>>
>> However, and I'm fairly sure JRegex is at fault here, my program
>>
>> segfaults on an input of ~230KB. Has anyone used JRegex successfully
>>
>> in this way before? If so, what tactics did you use?
>>
>>
>> Thanks in advance.
>>
>
> I have no idea how JRegex works, but regular expressions are normally
> implemented using finite automata, so the input size should be immaterial.
>
> Gregory Woodhouse
> [hidden email] <mailto:[hidden email]>
>
> "Judge a man by his questions not
> his answers."   --Voltaire
>

It depends on the advanced features that you need.  If you use sub-expression
capturing then it often changes to a backtracking algorithm. This use of
backtracking is true of the Text.Regex.Lazy.Full code, which creates a Parsec
parser instead of a DFA.  The CompatDFA module uses an automata (derived from
the CTKLight code, http://www.cse.unsw.edu.au/~chak/haskell/ctk/ ), and as a
bonus, the DFA itself is build in a Lazy fashion, so unexplored parts of the
automata don't get built.

I found creating a Parsec parser to do regex matching was possible, but a bit
odd in places.  Like CTK, it is more aimed at writing a Lexer/Parser than a
simple matcher.

At some point, the FastPackedString version of Text.Regex.Lazy should be changed
to the new ByteString library.  Perhaps there is even room to build in on (or
present it as) a type class.

--
Chris

"I'm here tonight to present the award everyone's been waiting for: best movie.
This award holds a special place in my heart because next year I'll be winning
it for Snakes on a Plane." -- June 3, 2006, Samuel L. Jackson
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: JRegex on "large" input sizes

John Meacham
In reply to this post by David House
On Sat, Jul 01, 2006 at 02:27:17PM +0100, David House wrote:
> Hi all. I need a decent regex library and JRegex seems the perfect
> choice: simple API, yet well-featured, as well as PCRE support. I want
> to use it on a simple project which involves input files a little
> larger than typical -- between 100KB and 500KB -- but still small
> enough so as to not present a problem.
>
> However, and I'm fairly sure JRegex is at fault here, my program
> segfaults on an input of ~230KB. Has anyone used JRegex successfully
> in this way before? If so, what tactics did you use?

This is due to the fact it tries to build the whole string in memory
before running the regex on it. Originally, this limitation was due to
the fact the API provided by the regex binding in fptools was
insufficient to allow chunking of data properly. now that JRegex uses
its own FFI binding, there is no reason this limitation should still
exist. I will happily accept any patches fixing this. There was some
talk of integrating JRegex into the main tree.

It should be noted that there are 2 somewhat independent things included
in JRegex, a better FFI binding to posix expressions as well as the PCRE
library. and the general framework for providing the =~ and =~~
operators. All that is needed is a smarter FFI binding, or even just
using another one that exists like Text.Regex.Lazy. it is
straightforward to add the instances needed to use the JRegex fancy
syntax.

        John

--
John Meacham - ⑆repetae.net⑆john⑈
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe