Incremental Regex Parsing with Fixed Strings

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Incremental Regex Parsing with Fixed Strings

Andrew Martin
Greetings,

In Dan Piponi's blog post Fast Incremental Regular Expression Matching with Monoids [1], he outlines a way to take advantage of fingertrees to perform increment regex matching. In his article, the regular expression (more accurately, its equivalent DFA) is fixed, and the strings we are matching on are modified incrementally. What I'm interested in is the opposite: What if the regular expression is modified incrementally and the string we are testing is static?

Thinking about this, it seems to run into problems immediately. Typically, to evaluate a regular expression, I would convert it to a NFA (and then maybe to a DFA). But there is no way to do incremental graph modification on the resulting graphs. (Or is there?). And even if there were a way, it seems unclear how one would go about exploiting its incremental nature. To provide an example, consider the following regex and sample string:

Regex: Lost (dog|cat) loo for home
Sample: Lost dog looking for home

In the regex, I'm currently in the middle of typing "looking", so it doesn't match the sample string. But large parts of it do. So when I update the regex by typing "k", I'd like to be able to not redo this work. Maybe there's a certain subset of regex that's amenable to this kind of incremental evaluation. Maybe not. If anyone has any additional insights or can point me to any research, I would appreciate it greatly. Thanks.


_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Incremental Regex Parsing with Fixed Strings

David Feuer
I'm not too optimistic in general about editing regex machines interactively. I'd expect you to do a lot of work to save a little. But I'd expect you could win a good bit (at least in some common cases) by using something like a suffix tree of your text. That'll let you jump quickly to the end of "Lost ", for example, and may offer other interesting opportunities.

On Thu, Feb 28, 2019, 1:43 PM Andrew Martin <[hidden email]> wrote:
Greetings,

In Dan Piponi's blog post Fast Incremental Regular Expression Matching with Monoids [1], he outlines a way to take advantage of fingertrees to perform increment regex matching. In his article, the regular expression (more accurately, its equivalent DFA) is fixed, and the strings we are matching on are modified incrementally. What I'm interested in is the opposite: What if the regular expression is modified incrementally and the string we are testing is static?

Thinking about this, it seems to run into problems immediately. Typically, to evaluate a regular expression, I would convert it to a NFA (and then maybe to a DFA). But there is no way to do incremental graph modification on the resulting graphs. (Or is there?). And even if there were a way, it seems unclear how one would go about exploiting its incremental nature. To provide an example, consider the following regex and sample string:

Regex: Lost (dog|cat) loo for home
Sample: Lost dog looking for home

In the regex, I'm currently in the middle of typing "looking", so it doesn't match the sample string. But large parts of it do. So when I update the regex by typing "k", I'd like to be able to not redo this work. Maybe there's a certain subset of regex that's amenable to this kind of incremental evaluation. Maybe not. If anyone has any additional insights or can point me to any research, I would appreciate it greatly. Thanks.

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Incremental Regex Parsing with Fixed Strings

Andrew Martin
Let me make sure I'm interpretting "a suffix tree of your text" correctly. You mean just stripping the common prefix "Lost" from both the regex (the texual representation of it) and from the sample text? If so, it should be possible to do both prefixes and suffixes. But it's still not the win I'm hoping for.

On Thu, Feb 28, 2019 at 4:47 PM David Feuer <[hidden email]> wrote:
I'm not too optimistic in general about editing regex machines interactively. I'd expect you to do a lot of work to save a little. But I'd expect you could win a good bit (at least in some common cases) by using something like a suffix tree of your text. That'll let you jump quickly to the end of "Lost ", for example, and may offer other interesting opportunities.

On Thu, Feb 28, 2019, 1:43 PM Andrew Martin <[hidden email]> wrote:
Greetings,

In Dan Piponi's blog post Fast Incremental Regular Expression Matching with Monoids [1], he outlines a way to take advantage of fingertrees to perform increment regex matching. In his article, the regular expression (more accurately, its equivalent DFA) is fixed, and the strings we are matching on are modified incrementally. What I'm interested in is the opposite: What if the regular expression is modified incrementally and the string we are testing is static?

Thinking about this, it seems to run into problems immediately. Typically, to evaluate a regular expression, I would convert it to a NFA (and then maybe to a DFA). But there is no way to do incremental graph modification on the resulting graphs. (Or is there?). And even if there were a way, it seems unclear how one would go about exploiting its incremental nature. To provide an example, consider the following regex and sample string:

Regex: Lost (dog|cat) loo for home
Sample: Lost dog looking for home

In the regex, I'm currently in the middle of typing "looking", so it doesn't match the sample string. But large parts of it do. So when I update the regex by typing "k", I'd like to be able to not redo this work. Maybe there's a certain subset of regex that's amenable to this kind of incremental evaluation. Maybe not. If anyone has any additional insights or can point me to any research, I would appreciate it greatly. Thanks.

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries


--
-Andrew Thaddeus Martin

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Incremental Regex Parsing with Fixed Strings

Henning Thielemann

On Thu, 28 Feb 2019, Andrew Martin wrote:

> Let me make sure I'm interpretting "a suffix tree of your text"
> correctly.

https://en.wikipedia.org/wiki/Suffix_tree
_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Incremental Regex Parsing with Fixed Strings

Sergey Vinokurov
In reply to this post by Andrew Martin
Modifying regular expressions reminds me of the great functional pearl
called 'A Play on Regular Expressions'. Perhaps this is the approach you
were thinking about?

Sergey

On 28/02/2019 18:43, Andrew Martin wrote:

> Greetings,
>
> In Dan Piponi's blog post Fast Incremental Regular Expression Matching
> with Monoids [1], he outlines a way to take advantage of fingertrees to
> perform increment regex matching. In his article, the regular expression
> (more accurately, its equivalent DFA) is fixed, and the strings we are
> matching on are modified incrementally. What I'm interested in is the
> opposite: What if the regular expression is modified incrementally and
> the string we are testing is static?
>
> Thinking about this, it seems to run into problems immediately.
> Typically, to evaluate a regular expression, I would convert it to a NFA
> (and then maybe to a DFA). But there is no way to do incremental graph
> modification on the resulting graphs. (Or is there?). And even if there
> were a way, it seems unclear how one would go about exploiting its
> incremental nature. To provide an example, consider the following regex
> and sample string:
>
> Regex: Lost (dog|cat) loo for home
> Sample: Lost dog looking for home
>
> In the regex, I'm currently in the middle of typing "looking", so it
> doesn't match the sample string. But large parts of it do. So when I
> update the regex by typing "k", I'd like to be able to not redo this
> work. Maybe there's a certain subset of regex that's amenable to this
> kind of incremental evaluation. Maybe not. If anyone has any additional
> insights or can point me to any research, I would appreciate it greatly.
> Thanks.
>
> [1] http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html
>
>
> --
> -Andrew Thaddeus Martin
>
> _______________________________________________
> Libraries mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Incremental Regex Parsing with Fixed Strings

Andrew Martin
Thank you so much for pointing me to this paper. What an absolute gem it is! Efficiently evaluating regex with a recursive ADT instead of a graph puts me one step closer to a solution. In my problem domain, regex frequently are mostly sequences at the top level. That is, these are common:

    The number [0-9]* is a good number
    I have [0-9]* pupp(ies|y).

This are not:

    (Hello|from|the|other|side)*

Using the paper's terminology, I think I should be able to have a `FingerTree X REG` (for some X I'm not sure of), where every top-level sequence is represented as adjacent elements in the finger tree. I'm not sure if this will actually work, but it seems plausible that it will.

On Fri, Mar 1, 2019 at 3:55 AM Sergey Vinokurov <[hidden email]> wrote:
Modifying regular expressions reminds me of the great functional pearl
called 'A Play on Regular Expressions'. Perhaps this is the approach you
were thinking about?

Sergey

On 28/02/2019 18:43, Andrew Martin wrote:
> Greetings,
>
> In Dan Piponi's blog post Fast Incremental Regular Expression Matching
> with Monoids [1], he outlines a way to take advantage of fingertrees to
> perform increment regex matching. In his article, the regular expression
> (more accurately, its equivalent DFA) is fixed, and the strings we are
> matching on are modified incrementally. What I'm interested in is the
> opposite: What if the regular expression is modified incrementally and
> the string we are testing is static?
>
> Thinking about this, it seems to run into problems immediately.
> Typically, to evaluate a regular expression, I would convert it to a NFA
> (and then maybe to a DFA). But there is no way to do incremental graph
> modification on the resulting graphs. (Or is there?). And even if there
> were a way, it seems unclear how one would go about exploiting its
> incremental nature. To provide an example, consider the following regex
> and sample string:
>
> Regex: Lost (dog|cat) loo for home
> Sample: Lost dog looking for home
>
> In the regex, I'm currently in the middle of typing "looking", so it
> doesn't match the sample string. But large parts of it do. So when I
> update the regex by typing "k", I'd like to be able to not redo this
> work. Maybe there's a certain subset of regex that's amenable to this
> kind of incremental evaluation. Maybe not. If anyone has any additional
> insights or can point me to any research, I would appreciate it greatly.
> Thanks.
>
> [1] http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html
>
>
> --
> -Andrew Thaddeus Martin
>
> _______________________________________________
> Libraries mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>



--
-Andrew Thaddeus Martin

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: Incremental Regex Parsing with Fixed Strings

Andreas Abel
Andrew,

The translation Regex -> NFA is compositional, thus, you could cache the
NFAs computed for subregexs, only needing to recompute the NFA along the
path to the changes subregex.  Compilation to DFAs is not needed as you
can use NFAs directly to recognize strings.

--Andreas

On 2019-03-01 17:06, Andrew Martin wrote:

> Thank you so much for pointing me to this paper. What an absolute gem it
> is! Efficiently evaluating regex with a recursive ADT instead of a graph
> puts me one step closer to a solution. In my problem domain, regex
> frequently are mostly sequences at the top level. That is, these are common:
>
>      The number [0-9]* is a good number
>      I have [0-9]* pupp(ies|y).
>
> This are not:
>
>      (Hello|from|the|other|side)*
>
> Using the paper's terminology, I think I should be able to have a
> `FingerTree X REG` (for some X I'm not sure of), where every top-level
> sequence is represented as adjacent elements in the finger tree. I'm not
> sure if this will actually work, but it seems plausible that it will.
>
> On Fri, Mar 1, 2019 at 3:55 AM Sergey Vinokurov <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Modifying regular expressions reminds me of the great functional pearl
>     called 'A Play on Regular Expressions'. Perhaps this is the approach you
>     were thinking about?
>
>     Sergey
>
>     On 28/02/2019 18:43, Andrew Martin wrote:
>      > Greetings,
>      >
>      > In Dan Piponi's blog post Fast Incremental Regular Expression
>     Matching
>      > with Monoids [1], he outlines a way to take advantage of
>     fingertrees to
>      > perform increment regex matching. In his article, the regular
>     expression
>      > (more accurately, its equivalent DFA) is fixed, and the strings
>     we are
>      > matching on are modified incrementally. What I'm interested in is the
>      > opposite: What if the regular expression is modified
>     incrementally and
>      > the string we are testing is static?
>      >
>      > Thinking about this, it seems to run into problems immediately.
>      > Typically, to evaluate a regular expression, I would convert it
>     to a NFA
>      > (and then maybe to a DFA). But there is no way to do incremental
>     graph
>      > modification on the resulting graphs. (Or is there?). And even if
>     there
>      > were a way, it seems unclear how one would go about exploiting its
>      > incremental nature. To provide an example, consider the following
>     regex
>      > and sample string:
>      >
>      > Regex: Lost (dog|cat) loo for home
>      > Sample: Lost dog looking for home
>      >
>      > In the regex, I'm currently in the middle of typing "looking", so it
>      > doesn't match the sample string. But large parts of it do. So when I
>      > update the regex by typing "k", I'd like to be able to not redo this
>      > work. Maybe there's a certain subset of regex that's amenable to this
>      > kind of incremental evaluation. Maybe not. If anyone has any
>     additional
>      > insights or can point me to any research, I would appreciate it
>     greatly.
>      > Thanks.
>      >
>      > [1]
>     http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html
>      >
>      >
>      > --
>      > -Andrew Thaddeus Martin
>      >
>      > _______________________________________________
>      > Libraries mailing list
>      > [hidden email] <mailto:[hidden email]>
>      > http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>      >
>
>
>
> --
> -Andrew Thaddeus Martin
>
> _______________________________________________
> Libraries mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>
_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries