Quantcast

Parsing a bunch of strings without backtracking

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Parsing a bunch of strings without backtracking

Javran Cheng
Hi all,

I come across an idea about parsing a bunch of strings,
just want to share this idea and looking for some comments and maybe pointing me to some related articles.

The idea is when you have:

1 <$ string "aabb" <|> 2 <$ string "aacc"

you can rewrite it as:

string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")

so that you don't have to deal with the common prefix "aa" twice.

Then I further extend this idea: if we have a set of strings (s1, s2, ...), whose element is not a prefix of any other set member,
why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...) into a "tree-structure" so we don't need to deal with
any common prefix more than once?

So this motivates my little toy implementation at https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813

basically given a list [(s1,v1),(s2,v2)...], we want to generate a parser equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
but perhaps with less backtracking. My code does this by turning the set of strings into a tree structure (see function "compact") and
using this "Compact" representation as a driver to generate the parser.
I used ReadP to have a concrete parser for testing, but this should work with any parsing library supporting MonadPlus.

I think they are bunch of things that can be improved:

- for now every intermediate node of "Compact" contains just a char, so this will end up generating a parser that has branch `char 'f' >> char 'o' >> 'o'` in it
    instead of the slightly more efficient version `string "foo"`
- we could further attach a weight to each string, this allows rearranging (<|>) chains so that a more frequent string parser appears earlier.
- if the input [(s1,v1),(s2,v2)...] are all known constants at compile time, probably TemplateHaskell can be used for generating the parser at compile time (as a Haskell expression).
    I'm not sure how difficult this would be though, as I have very limited understanding of TH.

but before I explore further, I want to collect some comments as I said in the beginning.

Best,
--
Javran (Fang) Cheng

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Parsing a bunch of strings without backtracking

Jaro
Have you seen uu-parsinglib (http://foswiki.cs.uu.nl/foswiki/HUT/ParserCombinators)? It implements a non-backtracking parser that works similarly to what you describe.

On 9 March 2017 01:21:09 GMT+01:00, Javran Cheng <[hidden email]> wrote:
Hi all,

I come across an idea about parsing a bunch of strings,
just want to share this idea and looking for some comments and maybe pointing me to some related articles.

The idea is when you have:

1 <$ string "aabb" <|> 2 <$ string "aacc"

you can rewrite it as:

string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")

so that you don't have to deal with the common prefix "aa" twice.

Then I further extend this idea: if we have a set of strings (s1, s2, ...), whose element is not a prefix of any other set member,
why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...) into a "tree-structure" so we don't need to deal with
any common prefix more than once?

So this motivates my little toy implementation at https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813

basically given a list [(s1,v1),(s2,v2)...], we want to generate a parser equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
but perhaps with less backtracking. My code does this by turning the set of strings into a tree structure (see function "compact") and
using this "Compact" representation as a driver to generate the parser.
I used ReadP to have a concrete parser for testing, but this should work with any parsing library supporting MonadPlus.

I think they are bunch of things that can be improved:

- for now every intermediate node of "Compact" contains just a char, so this will end up generating a parser that has branch `char 'f' >> char 'o' >> 'o'` in it
    instead of the slightly more efficient version `string "foo"`
- we could further attach a weight to each string, this allows rearranging (<|>) chains so that a more frequent string parser appears earlier.
- if the input [(s1,v1),(s2,v2)...] are all known constants at compile time, probably TemplateHaskell can be used for generating the parser at compile time (as a Haskell expression).
    I'm not sure how difficult this would be though, as I have very limited understanding of TH.

but before I explore further, I want to collect some comments as I said in the beginning.

Best,
--
Javran (Fang) Cheng

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Parsing a bunch of strings without backtracking

Javran Cheng
Hi Jaro, thanks for the link, having worked with few parsing libs but uu-parsinglib sounds new to me and I'm definitely thinking about having a try.
I'm taking a look but haven't yet found any part of code that attempts to factor common prefixes out.

On Wed, Mar 8, 2017 at 7:48 PM, Jaro <[hidden email]> wrote:
Have you seen uu-parsinglib (http://foswiki.cs.uu.nl/foswiki/HUT/ParserCombinators)? It implements a non-backtracking parser that works similarly to what you describe.


On 9 March 2017 01:21:09 GMT+01:00, Javran Cheng <[hidden email]> wrote:
Hi all,

I come across an idea about parsing a bunch of strings,
just want to share this idea and looking for some comments and maybe pointing me to some related articles.

The idea is when you have:

1 <$ string "aabb" <|> 2 <$ string "aacc"

you can rewrite it as:

string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")

so that you don't have to deal with the common prefix "aa" twice.

Then I further extend this idea: if we have a set of strings (s1, s2, ...), whose element is not a prefix of any other set member,
why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...) into a "tree-structure" so we don't need to deal with
any common prefix more than once?

So this motivates my little toy implementation at https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813

basically given a list [(s1,v1),(s2,v2)...], we want to generate a parser equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
but perhaps with less backtracking. My code does this by turning the set of strings into a tree structure (see function "compact") and
using this "Compact" representation as a driver to generate the parser.
I used ReadP to have a concrete parser for testing, but this should work with any parsing library supporting MonadPlus.

I think they are bunch of things that can be improved:

- for now every intermediate node of "Compact" contains just a char, so this will end up generating a parser that has branch `char 'f' >> char 'o' >> 'o'` in it
    instead of the slightly more efficient version `string "foo"`
- we could further attach a weight to each string, this allows rearranging (<|>) chains so that a more frequent string parser appears earlier.
- if the input [(s1,v1),(s2,v2)...] are all known constants at compile time, probably TemplateHaskell can be used for generating the parser at compile time (as a Haskell expression).
    I'm not sure how difficult this would be though, as I have very limited understanding of TH.

but before I explore further, I want to collect some comments as I said in the beginning.

Best,
--
Javran (Fang) Cheng

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



--
Javran (Fang) Cheng

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Parsing a bunch of strings without backtracking

Jaro
After thinking about it more (I still don't completely understand uu-parsinglib myself). I think achieves non-backtracking by running the parsers in parallel and then returning the first one that succeeds. So I guess it doesn't really optimize away prefixes. The 'best' function is the function that merges two "sequences" of the intermediate type 'Steps a' and that causes the parsers to be run in parallel. The source is here: https://hackage.haskell.org/package/uu-parsinglib-2.9.1.1/docs/src/Text-ParserCombinators-UU-Core.html#best.

On 9 March 2017 03:41:30 GMT+01:00, Javran Cheng <[hidden email]> wrote:
Hi Jaro, thanks for the link, having worked with few parsing libs but uu-parsinglib sounds new to me and I'm definitely thinking about having a try.
I'm taking a look but haven't yet found any part of code that attempts to factor common prefixes out.

On Wed, Mar 8, 2017 at 7:48 PM, Jaro <[hidden email]> wrote:
Have you seen uu-parsinglib (http://foswiki.cs.uu.nl/foswiki/HUT/ParserCombinators)? It implements a non-backtracking parser that works similarly to what you describe.


On 9 March 2017 01:21:09 GMT+01:00, Javran Cheng <[hidden email]> wrote:
Hi all,

I come across an idea about parsing a bunch of strings,
just want to share this idea and looking for some comments and maybe pointing me to some related articles.

The idea is when you have:

1 <$ string "aabb" <|> 2 <$ string "aacc"

you can rewrite it as:

string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")

so that you don't have to deal with the common prefix "aa" twice.

Then I further extend this idea: if we have a set of strings (s1, s2, ...), whose element is not a prefix of any other set member,
why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...) into a "tree-structure" so we don't need to deal with
any common prefix more than once?

So this motivates my little toy implementation at https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813

basically given a list [(s1,v1),(s2,v2)...], we want to generate a parser equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
but perhaps with less backtracking. My code does this by turning the set of strings into a tree structure (see function "compact") and
using this "Compact" representation as a driver to generate the parser.
I used ReadP to have a concrete parser for testing, but this should work with any parsing library supporting MonadPlus.

I think they are bunch of things that can be improved:

- for now every intermediate node of "Compact" contains just a char, so this will end up generating a parser that has branch `char 'f' >> char 'o' >> 'o'` in it
    instead of the slightly more efficient version `string "foo"`
- we could further attach a weight to each string, this allows rearranging (<|>) chains so that a more frequent string parser appears earlier.
- if the input [(s1,v1),(s2,v2)...] are all known constants at compile time, probably TemplateHaskell can be used for generating the parser at compile time (as a Haskell expression).
    I'm not sure how difficult this would be though, as I have very limited understanding of TH.

but before I explore further, I want to collect some comments as I said in the beginning.

Best,
--
Javran (Fang) Cheng

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Parsing a bunch of strings without backtracking

Ben Franksen
In reply to this post by Javran Cheng
Am 09.03.2017 um 01:21 schrieb Javran Cheng:

> I come across an idea about parsing a bunch of strings,
> just want to share this idea and looking for some comments and maybe
> pointing me to some related articles.
>
> The idea is when you have:
>
> 1 <$ string "aabb" <|> 2 <$ string "aacc"
>
> you can rewrite it as:
>
> string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")
>
> so that you don't have to deal with the common prefix "aa" twice.

In the parsing literature this is called "left factoring (a grammar)".
Left factoring is known to improve efficiency and there are published
algorithms to do it automatically; but -- as you observed -- to perform
it you need an explicit representation of the grammar, so you can
transform it before turning it into a parser; some would call this a
"deep embedding".

> Then I further extend this idea: if we have a set of strings (s1, s2, ...),
> whose element is not a prefix of any other set member,
> why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...)
> into a "tree-structure" so we don't need to deal with
> any common prefix more than once?

The problem is how to represent the grammar in a statically typed way
and then to transform it preserving the well-typing. This can be done,
but the price is high. Look for packages (and papers) named
'christmastree' and 'tttas' to get an idea. Even the authors admit that
their approach is extremely involved and that offline methods (such as
TH which you mentioned) may be more practical.

Cheers
Ben

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Loading...