Annoucing alternative to Text.Regex

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Annoucing alternative to Text.Regex

haskell-2
This message is cross-posted to [hidden email] and [hidden email]

Hi,

  Working on the shootout (specifically
http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all ), it
was impossible to use Text.Regex since it was too slow (it would reach the
timeout limit, such as an hour, where TCL takes 3.21 seconds ).

  So we had to work around this by implementing our own regex engine.  I have
tooled up an all-parsec version that I think can be a nearly drop-in-alternative
to Text.Regex.  I will describe it a bit and then ask for advice.

It takes the string form of the regular expression and uses Parsec to create a
(data Pattern) tree structure.  This recognizes all of "man re_format" including
back references, missing only the [[:>]] or [[<:]] extension (word boundaries).
 It also accepts new modifiers '?' and '+' to be used like "a*?" and "a*+" for
lazy and possessive matching.

The Pattern form is amenable to simplification, and some easy simplification are
performed.

Then the Pattern is transformed in a Parsec parser that will match the input
(from the beginning only) and return all the captured sub-expressions.  It also
supports (get|set|update)UserState functions.  It should return the longest
match when all the pattern elements are greedy, i.e. no lazy or possessive
modifiers are used.  With lazy or possessive modifiers the longest alternative
is used, but this may not be the longest overall match.  (Parsec keeps track of
line and column position, but not the number of characters, which had to be added.)

Note that the pattern "(.)Q\1" can be used to match strings like "aQa", and
"(a|b\1)+" matches all of "ababba".

The Text.Regex compatibility code embeds the generated Parsec parser into more
complicated parsers and runs them to emulate the different functions in the
Text.Regex interface.

It is possible to extend all of this in several ways:
  1) Turn the Pattern into simpler parsec, e.g. if the expression captures are
not needed
  2) When possible, turn the Pattern into a faster DFA parser instead of Parsec
  3) Create parsers that with different semantics than "longest match", e.g.
left-biased
  4) Creates parsers that return all matches
  5) Employ smart optimizations for special forms of Pattern
  6) With an earlier version, I made a proof of principle String -> ExpQ
transformer that created Parsec at compile time using Template Haskell.  This
could be updated.

Not done: testing.  (If you have clever things to throw at it, pass them along).
 Not done: benchmarking (though it should usually beat Text.Regex).

And now I want to ask for advice.  If there is interest I could post the code
somewhere, but I don't own a stable place to put it at the moment.  Is there a
canonical place on the net to put small Haskell modules (28 kB uncompressed) ?
Is there some darcs-repository I could insert it into ?

Also, is there a library style guide somewhere with best practices?
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Annoucing alternative to Text.Regex

Bulat Ziganshin-2
Hello Chris,

Wednesday, March 8, 2006, 1:26:37 AM, you wrote:

CK> It takes the string form of the regular expression and uses Parsec to create a

he-he, i written the same thing (but very simple) 2 years ago :)  i
planned to submit it to the Parsec developers as an example of
double-Parsec usage :)

i think that it is a great lib, but not sure that it should completely
replace current lib. old lib is more appropriate for packed string,
new lib work directly with Haskell strings

one more interesting thing - generation of faster and simpler parsers
for simple regexps. just as example, code from my own program, that
parse filename wildcards. it translates simple patterns directly to
the "String->Bool" functions and use Regex library for more complex
patterns

-- |Compiled regexpr representation          EXAMPLE
data RegExpr = RE_End                     -- ""
             | RE_Anything                -- "*"
             | RE_FromEnd RegExpr         -- '*':"bc"
             | RE_AnyChar RegExpr         -- '?':"bc"
             | RE_Char    Char RegExpr    -- 'a':"bc"
             | RE_FullRE  Regex           -- "r[0-9][0-9]"

is_wildcard s  =  s `contains_one_of` "?*["

translate_RE re = "^"++ (replaceAll "*" ".*"
                        .replaceAll "?" "."
                        .replaceAll "$" "\\$"
                        .replaceAll "[[[" "[^"  
                        .replaceAll "^" "\\^"
                        .replaceAll "[^" "[[["  
                        .replaceAll "+" "\\+"
                        .replaceAll "." "\\.") re ++"$"

compile_RE s  =  case s of
  ""     -> RE_End
  "*"    -> RE_Anything
  '*':cs -> if ('*' `elem` cs) || ('[' `elem` cs)
              then RE_FullRE  (mkRegex$ translate_RE$ s)
              else RE_FromEnd (compile_RE$ reverse$   s)
  '[':cs -> RE_FullRE   (mkRegex$ translate_RE$ s)
  '?':cs -> RE_AnyChar  (compile_RE cs)
  c  :cs -> RE_Char   c (compile_RE cs)

match_RE re s  =  case re of
  RE_End        -> null s
  RE_Anything   -> True
  RE_FullRE   r -> isJust (matchRegex r s)
  RE_FromEnd  r -> match_RE r (reverse s)
  RE_AnyChar  r -> case s of
                     ""   -> False
                     _:xs -> match_RE r xs
  RE_Char   c r -> case s of
                     ""   -> False
                     x:xs -> x==c && match_RE r xs

match re {-s-}  =  match_RE (compile_RE re) {-s-}

{-# NOINLINE translate_RE #-}



--
Best regards,
 Bulat                            mailto:[hidden email]

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Annoucing alternative to Text.Regex

haskell-2
Bulat Ziganshin wrote:
> Hello Chris,
>
> Wednesday, March 8, 2006, 1:26:37 AM, you wrote:
>
> CK> It takes the string form of the regular expression and uses Parsec to create a
>
> he-he, i written the same thing (but very simple) 2 years ago :)  i
> planned to submit it to the Parsec developers as an example of
> double-Parsec usage :)

The proposed shootout version on the wiki (
http://haskell.org/hawiki/RegexDna#head-ac7a5b838757d66780247397221f3b4f1ace9051
) uses p_regexp :: CharParser () (CharParser st () -> CharParser st ()) which is
even more bizarre at first glance.  But this is not what my full library uses.

> i think that it is a great lib, but not sure that it should completely
> replace current lib. old lib is more appropriate for packed string,
> new lib work directly with Haskell strings

Exactly.  If you have a packed ascii (or unicode?) string you should call c via
regex.h or pcre.h to do the matching.  But doing this with length 10^6 [Char]
via Text.Regex is next to impossible. Thus the niche for the [Char] version I
have created.

Another nice thing is that the Parsec versions of matchRegex / matchRegexAll /
subRegex / splitRegex are lazy, so you could substitute or split an infinite string.

Last night "Igloo" on the #haskell shared a HUnit test suite he used for his
personal version of basic and extended regular expression matching.  This
located two bugs and one specification error in my code (all now fixed).

After some more testing I will be looking for a place to post it.  Is there
somewhere on www.haskell.org that would work?

> one more interesting thing - generation of faster and simpler parsers
> for simple regexps. just as example, code from my own program, that
> parse filename wildcards. it translates simple patterns directly to
> the "String->Bool" functions and use Regex library for more complex
> patterns

Hmmm...Yes.  Another String->Pattern parser (probably in Parsec) could transform
filename wildcards.  But that would lose information on the simplicity.  I have
not created the infrastructure for such alternatives or meta-data (such as
"anchored only at start of string" or "only uses greedy operators" or "does not
need back-references" or "can be reduced to a DFA").

>
> -- |Compiled regexpr representation          EXAMPLE
> data RegExpr = RE_End                     -- ""
>              | RE_Anything                -- "*"
>              | RE_FromEnd RegExpr         -- '*':"bc"
>              | RE_AnyChar RegExpr         -- '?':"bc"
>              | RE_Char    Char RegExpr    -- 'a':"bc"
>              | RE_FullRE  Regex           -- "r[0-9][0-9]"
>

My parsed form of the string Regex is the Pattern data type.  It is not used to
actually do matching (though that would be possible), but to later compile a
Parsec parser.

data Pattern = PEmpty | PCarat | PDollar | PFail String
             | PGroup  PatternIndex Pattern
             | POr     [Pattern]
             | PConcat [Pattern]
             | PQuest  Pattern    -- ?
             | PPlus   Pattern    -- +
             | PStar   Pattern    -- *
             | PBound  Int (Maybe Int) Pattern  -- {3} or {3,5} or {3,}
-- PLazy indicates the pattern should try the shortest matches first
             | PLazy   Pattern    -- non-greedy wrapper (?+*{} followed by ?)
-- PPossessive indicates the pattern can only fit the longest match
             | PPossessive Pattern -- possessive modifier (?+*{} followed by +)
             | PDot               -- Any character (newline?) at all
             | PAny    PatternSet -- Square bracketed things
             | PAnyNot PatternSet -- Inverted square bracketed things
             | PEscape Char       -- Backslashed Character
             | PChar   Char       -- Specific Character
-- After "simplify" adjacent PChar are merge'd into PString
             | PString String
               deriving (Eq,Show)

Where PatternSet is usually just a (Set Char) and Set of [:alpha:] character
classes.  It also holds parsed [.ch.] and [=x=] expressions, but these are not
really implemented in the matching.

PGroup is an empty or non-empty group "()" or "(foo)" with the back reference
index.  Also note that Pattern does not differentiate different types of escaped
characters (e.g. \a \* \4 are all PEscape patterns).

The meaning of PDot varies depending on the options.  Multiline expressions do
not match PDot with '\n' and to agree with Text.Regex '\NUL' characters are
disallowed in the string regex and not matched by PDot.  PCarat and PDollar
actions also depend on multiline, and case sensitivity affects the
character/string matching.

I ought to make another option to allow '\NUL' to be treated as a regular character.

--
Chris Kuklewicz
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe