uu-parsinglib - Greedy Parser

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

uu-parsinglib - Greedy Parser

Marco Vassena
In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.

Consider for instance this html parser.

pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent

pContent = Content <$> some (satisfy (/= '<'))
pHtml = some pTag

This parser will fail on "<a>123</a>", because this may be interpreted as:
[ [Open a , Content "123", Close a],
 [Open a, Content "12", Content "3", Close a],
 [Open a, Content "1", Content "2", Content "3", Close a] ... ]

In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
any progress, so that some is greedy and Content "123" would be parsed.
The operator <<|> in uu-parsinglib has the same greedy behaviour.

I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether
there is another way to do this, without using such operator <<|>.

Any help is appreciated.

All the best,
Marco
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: uu-parsinglib - Greedy Parser

Peter Simons
Hi Marco,

 > I suppose that this may be done using <<|> to define a greedySome to be
 > used in in pContent, however I am wondering whether
 > there is another way to do this, without using such operator <<|>.

the 'micro' function attaches a cost a branch of the parse tree, and the
library generally prefers the branch with the lowest cost. Maybe this is
what you're looking for?

Best regards,
Peter

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: uu-parsinglib - Greedy Parser

S. Doaitse Swierstra
In reply to this post by Marco Vassena

> On 13 Jan 2015, at 11:57 , Marco Vassena <[hidden email]> wrote:
>
> In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
> If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
>
> Consider for instance this html parser.
>
> pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
>
> pContent = Content <$> some (satisfy (/= '<'))
> pHtml = some pTag
>
> This parser will fail on "<a>123</a>", because this may be interpreted as:
> [ [Open a , Content "123", Close a],
> [Open a, Content "12", Content "3", Close a],
> [Open a, Content "1", Content "2", Content "3", Close a] ... ]
>
> In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
> any progress, so that some is greedy and Content "123" would be parsed.
> The operator <<|> in uu-parsinglib has the same greedy behaviour.
>
> I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
> I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether
> there is another way to do this, without using such operator <<|>.

The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):

pContent = Content <$> pList (satisfy (/= '<'))

An even better approach would be to reflect the structure of your HTML in your parser:

pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag

In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):

pTagged = do t <- pOpentag
             c <- pContents
             pClosetag t
             return (Tagged t c)

(of course with appropriate definitions for pOpentag and pCloseTag.

Finally the module BasicInstances contains a combinator pMunch which you could have used:

pContent = Content <$>  pMunch (/= '<')

This is far more efficient since it operates directly at the state.

   Doaitse




>
> Any help is appreciated.
>
> All the best,
> Marco
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: uu-parsinglib - Greedy Parser

Marco Vassena
> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
>
> pContent = Content <$> pList (satisfy (/= '<'))

Checking the source code pList is defined in terms of pFoldr, which uses opt, which eventually is defined in terms  of <<|>.

> An even better approach would be to reflect the structure of your HTML in your parser:
>
> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag

> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
>
> pTagged = do t <- pOpentag
>             c <- pContents
>             pClosetag t
>             return (Tagged t c)
>
> (of course with appropriate definitions for pOpentag and pCloseTag.

Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>:
<h1> Line break tags are <br> not closed </h1>

The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to
have a rather abstract interface. In my set of core constructs there are:
        <$> : (a -> b) -> f a -> f b
  <*> : f (a -> b) -> f a -> f b
        <|> : f a -> f a -> f a  -- (symmetric choice)
        pure : a -> f a
        fail : f a
        pToken : f Char

Is it possible to define a parser that applies the longest matching rule using these constructs only?
Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ?
(Note that f is abstract and it is not necessarily uu-parsinglib parsers).

Marco

On 13/gen/2015, at 12.22, S. Doaitse Swierstra wrote:

>
>> On 13 Jan 2015, at 11:57 , Marco Vassena <[hidden email]> wrote:
>>
>> In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
>> If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
>>
>> Consider for instance this html parser.
>>
>> pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
>>
>> pContent = Content <$> some (satisfy (/= '<'))
>> pHtml = some pTag
>>
>> This parser will fail on "<a>123</a>", because this may be interpreted as:
>> [ [Open a , Content "123", Close a],
>> [Open a, Content "12", Content "3", Close a],
>> [Open a, Content "1", Content "2", Content "3", Close a] ... ]
>>
>> In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
>> any progress, so that some is greedy and Content "123" would be parsed.
>> The operator <<|> in uu-parsinglib has the same greedy behaviour.
>>
>> I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
>> I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether
>> there is another way to do this, without using such operator <<|>.
>
> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
>
> pContent = Content <$> pList (satisfy (/= '<'))
>
> An even better approach would be to reflect the structure of your HTML in your parser:
>
> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
>
> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
>
> pTagged = do t <- pOpentag
>             c <- pContents
>             pClosetag t
>             return (Tagged t c)
>
> (of course with appropriate definitions for pOpentag and pCloseTag.
>
> Finally the module BasicInstances contains a combinator pMunch which you could have used:
>
> pContent = Content <$>  pMunch (/= '<')
>
> This is far more efficient since it operates directly at the state.
>
>   Doaitse
>
>
>
>
>>
>> Any help is appreciated.
>>
>> All the best,
>> Marco
>> _______________________________________________
>> Haskell-Cafe mailing list
>> [hidden email]
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: uu-parsinglib - Greedy Parser

S. Doaitse Swierstra

> On 13 Jan 2015, at 14:25 , Marco Vassena <[hidden email]> wrote:
>
>> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
>>
>> pContent = Content <$> pList (satisfy (/= '<'))
>
> Checking the source code pList is defined in terms of pFoldr, which uses opt, which eventually is defined in terms  of <<|>.
>
>> An even better approach would be to reflect the structure of your HTML in your parser:
>>
>> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
>
>> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
>>
>> pTagged = do t <- pOpentag
>>            c <- pContents
>>            pClosetag t
>>            return (Tagged t c)
>>
>> (of course with appropriate definitions for pOpentag and pCloseTag.
>
> Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>:
> <h1> Line break tags are <br> not closed </h1>
>
> The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to
> have a rather abstract interface. In my set of core constructs there are:
> <$> : (a -> b) -> f a -> f b
> <*> : f (a -> b) -> f a -> f b
> <|> : f a -> f a -> f a  -- (symmetric choice)
> pure : a -> f a
> fail : f a
> pToken : f Char
>
> Is it possible to define a parser that applies the longest matching rule using these constructs only?
> Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ?
> (Note that f is abstract and it is not necessarily uu-parsinglib parsers).
>
> Marco

I do not think so. As long as your grammar is essentially ambiguous you will get all parses. Using the amb combinator you can capture this, and avoid the runtime error. Then you can lateron decide which tree too take. I am afraid however that this will may lead to an exponential blowup in your case.

If the set og <br>like tags is statically known they might be taken care of in a special way? by the parser?

 Doaitse

>
> On 13/gen/2015, at 12.22, S. Doaitse Swierstra wrote:
>
>>
>>> On 13 Jan 2015, at 11:57 , Marco Vassena <[hidden email]> wrote:
>>>
>>> In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
>>> If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
>>>
>>> Consider for instance this html parser.
>>>
>>> pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
>>>
>>> pContent = Content <$> some (satisfy (/= '<'))
>>> pHtml = some pTag
>>>
>>> This parser will fail on "<a>123</a>", because this may be interpreted as:
>>> [ [Open a , Content "123", Close a],
>>> [Open a, Content "12", Content "3", Close a],
>>> [Open a, Content "1", Content "2", Content "3", Close a] ... ]
>>>
>>> In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
>>> any progress, so that some is greedy and Content "123" would be parsed.
>>> The operator <<|> in uu-parsinglib has the same greedy behaviour.
>>>
>>> I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
>>> I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether
>>> there is another way to do this, without using such operator <<|>.
>>
>> The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
>>
>> pContent = Content <$> pList (satisfy (/= '<'))
>>
>> An even better approach would be to reflect the structure of your HTML in your parser:
>>
>> pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
>>
>> In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
>>
>> pTagged = do t <- pOpentag
>>            c <- pContents
>>            pClosetag t
>>            return (Tagged t c)
>>
>> (of course with appropriate definitions for pOpentag and pCloseTag.
>>
>> Finally the module BasicInstances contains a combinator pMunch which you could have used:
>>
>> pContent = Content <$>  pMunch (/= '<')
>>
>> This is far more efficient since it operates directly at the state.
>>
>>  Doaitse
>>
>>
>>
>>
>>>
>>> Any help is appreciated.
>>>
>>> All the best,
>>> Marco
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> [hidden email]
>>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: uu-parsinglib - Greedy Parser

Kyle Marek-Spartz
In reply to this post by Marco Vassena
This is the difference between Parsing Expression Grammars and more
general Context-Free Grammars.

https://en.wikipedia.org/wiki/Parsing_expression_grammar

You'll have to re-write your grammar to be unambiguous if you want to
use a library based on CFGs.


Marco Vassena writes:

> In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative.
> If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
>
> Consider for instance this html parser.
>
> pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
>
> pContent = Content <$> some (satisfy (/= '<'))
> pHtml = some pTag
>
> This parser will fail on "<a>123</a>", because this may be interpreted as:
> [ [Open a , Content "123", Close a],
>  [Open a, Content "12", Content "3", Close a],
>  [Open a, Content "1", Content "2", Content "3", Close a] ... ]
>
> In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes
> any progress, so that some is greedy and Content "123" would be parsed.
> The operator <<|> in uu-parsinglib has the same greedy behaviour.
>
> I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule).
> I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether
> there is another way to do this, without using such operator <<|>.
>
> Any help is appreciated.
>
> All the best,
> Marco
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe

--
Kyle Marek-Spartz
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: uu-parsinglib - Greedy Parser

Mario Blažević
In reply to this post by Marco Vassena
On 15-01-13 08:25 AM, Marco Vassena wrote:
> Unfortunately in html there are also empty tags, which don't need to
> be closed. For instance the line-break tag <br>: <h1> Line break tags
> are <br> not closed </h1>
>
> The bigger picture is that I am trying to figure out what are the
> core constructs needed to define a parser, therefore I want to have a
> rather abstract interface. In my set of core constructs there are:
> <$> : (a -> b) -> f a -> f b
 > <*> : f (a -> b) -> f a -> f b
 > <|> : f a -> f a -> f a  -- (symmetric choice)
 > pure : a -> f a
 > fail : f a
> pToken : f Char
>
> Is it possible to define a parser that applies the longest matching
> rule using these constructs only? Or is it necessary to extend it
> with another primitive, for instance greedy choice <<|> ? (Note that
> f is abstract and it is not necessarily uu-parsinglib parsers).

        You can parse HTML with no ambiguous results if you allow monadic bind
(>>=) as well:

pTag = pElement <|> pCommentTag <|> pContent
pElement = do elemName <- pOpenTag
               elemContent <- pTag `manyTill` endElement elemName
endElement elemName = string "</" *> string elemName *> string ">"
                       <|> lookahead (string "</"
                                      *> some (satisfy (/= '>'))
                                      *> string ">")
pContent = Content <$> some (satisfy (/= '<'))
pHtml = some pTag

        Mind you, this code would not give you exactly the same parse tree as
an HTML 5 browser would. That spec is a nightmare.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: uu-parsinglib - Greedy Parser

oleg-30
In reply to this post by S. Doaitse Swierstra

Marco Vassena wrote:

> The bigger picture is that I am trying to figure out what
> are the core constructs needed to define a parser, therefore I want to
> have a rather abstract interface. In my set of core constructs there are:
> <$> : (a -> b) -> f a -> f b
> <*> : f (a -> b) -> f a -> f b
> <|> : f a -> f a -> f a  -- (symmetric choice)
> pure : a -> f a
> fail : f a
> pToken : f Char
>
> Is it possible to define a parser that applies the longest
> matching rule using these constructs only?

No, these are not sufficient. Longest-matching rule (aka maximal
munch) is defined as applying a parser for as long as it
succeeds. Therefore, we need a way to determine if a parser succeeded:
we need what is called a (monadic) reflection. In simpler words, we
need an operation like the soft-cut in Prolog. It can be defined in
terms of a primitive called `msplit', described in the LogicT paper:
        http://okmij.org/ftp/Computation/monads.html#LogicT

The following article talks about maximal munch in more detail, in
terms of logic programming:
        http://okmij.org/ftp/kakuritu/logic-programming.html#many


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe