Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann

I want to parse and process HTML lazily. I use HXT because the HTML parser
is very liberal. However it uses Parsec and is thus strict. HaXML has a
so called lazy parser, but it is not what I consider lazy:

*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
*** Exception: Prelude.undefined
*Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
*** Exception: Expected "</" but found &
  at file text  at line 1 col 26

If it would be lazy, it would return some HTML code before the error.
HaXML uses the Polyparse package for parsing which contains a so called
lazy parser. However it has return type (Either String a). That is, for
the decision whether the parse was successful, the document has to be
parsed completely.

*Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined)
("*** Exception: Parse.satisfy: failed

If it would have return type (String, a) it could return both a partial
value of type 'a' and the error message as String. It would be even better
if it has some handling for incorrect input texts, and returns ([String],
a), where [String] is the type of a list of warnings and error messages
and 'a' is the type of a total value of parse output.

Is there some parser of this type? Unfortunately
 http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools
   does not compare the laziness of the mentioned parsers.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Neil Mitchell
Hi

Depending on exactly what you want, TagSoup may be of interest to you.
It is lazy, but it doesn't return a tree. It is very tollerant of
errors, and will simply never "fail to parse" something.

http://www-users.cs.york.ac.uk/~ndm/tagsoup/

Thanks

Neil

On 5/11/07, Henning Thielemann <[hidden email]> wrote:

>
> I want to parse and process HTML lazily. I use HXT because the HTML parser
> is very liberal. However it uses Parsec and is thus strict. HaXML has a
> so called lazy parser, but it is not what I consider lazy:
>
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
> *** Exception: Prelude.undefined
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
> *** Exception: Expected "</" but found &
>   at file text  at line 1 col 26
>
> If it would be lazy, it would return some HTML code before the error.
> HaXML uses the Polyparse package for parsing which contains a so called
> lazy parser. However it has return type (Either String a). That is, for
> the decision whether the parse was successful, the document has to be
> parsed completely.
>
> *Text.ParserCombinators.PolyLazy> runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined)
> ("*** Exception: Parse.satisfy: failed
>
> If it would have return type (String, a) it could return both a partial
> value of type 'a' and the error message as String. It would be even better
> if it has some handling for incorrect input texts, and returns ([String],
> a), where [String] is the type of a list of warnings and error messages
> and 'a' is the type of a total value of parse output.
>
> Is there some parser of this type? Unfortunately
>  http://www.haskell.org/haskellwiki/Applications_and_libraries/Compiler_tools
>    does not compare the laziness of the mentioned parsers.
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Jules Bean
In reply to this post by Henning Thielemann
Henning Thielemann wrote:

> I want to parse and process HTML lazily. I use HXT because the HTML parser
> is very liberal. However it uses Parsec and is thus strict. HaXML has a
> so called lazy parser, but it is not what I consider lazy:
>
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
> *** Exception: Prelude.undefined
> *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
> *** Exception: Expected "</" but found &
>   at file text  at line 1 col 26
>
> If it would be lazy, it would return some HTML code before the error.
>  

Are you sure that it is the parser, that is not lazy, and it isn't that
the pretty printer is overly strict?

 From the evidence above the parser could be returning some results
before the error, and the pretty printer strictly slurping it all up to
the error and then dying.

Jules

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann
In reply to this post by Neil Mitchell

On Fri, 11 May 2007, Neil Mitchell wrote:

> Depending on exactly what you want, TagSoup may be of interest to you.
> It is lazy, but it doesn't return a tree. It is very tollerant of
> errors, and will simply never "fail to parse" something.
>
> http://www-users.cs.york.ac.uk/~ndm/tagsoup/

That's an interesting option. It could be used as a lexer for a full-blown
HTML parser. Sometimes I need the tree structure. But why does this simple
piece of code needs -fglasgow-exts?


Thanks for the package and the hint!
 Henning
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Neil Mitchell
Hi

> That's an interesting option. It could be used as a lexer for a full-blown
> HTML parser. Sometimes I need the tree structure. But why does this simple
> piece of code needs -fglasgow-exts?

It doesn't. The released version 0.1 doesn't require extensions, and
the next 0.2 won't either. In the meantime I accepted a patch from a
user that added a new feature and required the flag. I'm going to
rework it shortly, make a few tweaks, and remove that flag.

Thanks

Neil
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann
In reply to this post by Jules Bean

On Fri, 11 May 2007, Jules Bean wrote:

> Henning Thielemann wrote:
> > I want to parse and process HTML lazily. I use HXT because the HTML parser
> > is very liberal. However it uses Parsec and is thus strict. HaXML has a
> > so called lazy parser, but it is not what I consider lazy:
> >
> > *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>"++undefined++"</body></html>"
> > *** Exception: Prelude.undefined
> > *Text.XML.HaXml.Html.ParseLazy> Text.XML.HaXml.Pretty.document $ htmlParse "text" $ "<html><head></head><body>&</body></html>"
> > *** Exception: Expected "</" but found &
> >   at file text  at line 1 col 26
> >
> > If it would be lazy, it would return some HTML code before the error.
>
> Are you sure that it is the parser, that is not lazy, and it isn't that
> the pretty printer is overly strict?
>
>  From the evidence above the parser could be returning some results
> before the error, and the pretty printer strictly slurping it all up to
> the error and then dying.

I know, but the type of the Polyparse parser prohibits lazy parsing.
Unfortunately there is no Show instance for HaXML trees, so one cannot
easily see whether laziness gets lost in the parser or in the pretty
printer.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Malcolm Wallace
In reply to this post by Henning Thielemann
Henning Thielemann <[hidden email]> wrote:

> HaXml has a so called lazy parser, but it is not what I consider lazy:

Lazy parsing is rather subtle, and it is easy to write a too-strict
parser when one intended to be more lazy.  Equally, it can be easy to
imagine that the parser is too strict, when in fact it is the usage
context that is wrong.

You have indeed found some bugs in HaXml's lazy HTML parser, but you
have also partly misunderstood what lazy parsing means.

>  Text.XML.HaXml.Pretty.document $ htmlParse "text" $
>      "<html><head></head><body>"++undefined++"</body></html>"
> *** Exception: Prelude.undefined

The problem here is not that the parser is too strict, but that the
pretty-printer is.  The pretty-printer is demanding an undefined portion
of the value before it produces any output.

> If it would be lazy, it would return some HTML code before the error.

It can do, but only if you consume the part without errors first!  For
instance, it would be safe to extract just the <head> tag, because that
is complete:

    import Text.XML.HaXml
    import Text.XML.HaXml.Posn
    import Text.XML.HaXml.Html.ParseLazy as Lazy
    import Text.XML.HaXml.Pretty as PP
    import Text.PrettyPrint.HughesPJ

    main = putStrLn $ render $ fsep PP.content $
           -- the following line extracts just the first child tag of <html>
           (\(Document _ _ e _)-> (position 0 children) (CElem e nopos)) $
           Lazy.htmlParse "text" $
           "<html><head></head><body>"++undefined++"</body></html>"

Unfortunately, this program currently does throw an "undefined"
exception, even though it should not.  The lazy HTML parser contains a
couple of tricky corners that _probably_ stop it from being lazy.
  (1) The element parser does not immediately return an element after
      seeing its start tag, because it also has to return a stack of
      improperly terminated elements inside this one (so they can be
      repaired).
  (2) After parsing, we simplify the tree structure, which of course
      traverses it, and may again force too much evaluation.
In any case, I will need to investigate further, and hopefully soon push
a patch to fix the problem.

> HaXML uses the Polyparse package for parsing which contains a so
> called lazy parser. However it has return type (Either String a). That
> is, for the decision whether the parse was successful, the document
> has to be parsed completely.

Not true.  PolyLazy.runParser has the signature
    runParser :: Parser t a -> [t] -> (a, [t])
that is, it returns the partially parsed value (which may contain
bottoms), and the remaining unparsed token-stream.  (Examining either of
these return values may cause sufficient evaluation to be forced to lead
to a runtime exception.)  There is no 'Either' type at the user level.
(Although an Either is used internally, see below, it does not do what
you think).

> *Text.ParserCombinators.PolyLazy>
>       runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined)
> ("*** Exception: Parse.satisfy: failed

This output is exactly correct.  You asked for the first four characters
provided that they were alphabetic, but in fact only the first three
were alphabetic.  Hence, 'satisfy' failed and threw an exception.  If
you ask for only the first three characters, then the parse succeeds:

  > fst $ runParser (exactly 3 (satisfy Char.isAlpha))
                    ("abc104"++undefined)
  "abc"

The purpose of the internal Either type that you mentioned, is to permit
backtracking within the parse, not to force complete evaluation.  Thus,
you can equally ask for the first four characters provided they are
alphanumeric, where alphanumeric is decided by a combination of
alternate parsers:

  > fst $ runParser (exactly 4 (satisfy Char.isAlpha `onFail`
                                    satisfy Char.isDigit))
                    ("abc104"++undefined)
  "abc1"

This example illustrates that a parse failure is still recoverable when
parsing lazily (but only by another parser, not once the failure has
escaped the parsing world).

Regards,
    Malcolm
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Heinrich Apfelmus
In reply to this post by Henning Thielemann
Henning Thielemann wrote:
> I want to parse and process HTML lazily. I use HXT because the HTML parser
> is very liberal. However it uses Parsec and is thus strict. HaXML has a
> so called lazy parser, but it is not what I consider lazy:
> [...]

Note that lazy parsing is inherently difficult and most often more or
less impossible. The fundamental problem is that one cannot really
return a result before deciding whether the given input is syntactically
correct or not. In other words, take the malformed HTML

   <html>
   <head></head>
     <body>
       <h1>Hello Lazy Parser
     </body>
   </html>

What results should a lazy parser return before emitting ⊥? At the time
you read the <html>-tag, you cannot know whether a syntax error far down
in the file makes it invalid. Thus, you may not return the top-most
<html>-tag until you see the closing </html>.

Furthermore, the fact that data appears in sequence disturbs on-demand
processing very much. For instance, take the the following XML-Tree

   <Node>
     <Node>
       <Leaf>1</Leaf>
       <Leaf>2</Leaf>
     </Node>
     <Leaf>3</Leaf>
   </Node>

modeled after the definition

   data Tree a = Leaf a | Node (Tree a) (Tree a)

Assume that processing only needs the left branch of the tree. In this
case, we do not need to parse (Leaf 3). But assume that we only need the
right branch. Much to our horror, we have to completely parse the left
branch (Node (Leaf 1) (Leaf 2)) as well, simply because it appears
before the right branch!


I think that at least the syntax-correctness problem can be solved by a
two separate parse passes: first check that the input is syntactically
correct, then lazily build the values. However, monadic parser
combinators are unsuitable for that because they allow the syntax
correctness to depend on the built values. You need f.i. applicative
parser combinators if you want to derive both passes from one parser
description.

Regards,
apfelmus

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Novák Zoltán
Hi,

> What results should a lazy parser return before emitting ⊥? At the time
> you read the <html>-tag, you cannot know whether a syntax error far down
> in the file makes it invalid. Thus, you may not return the top-most
> <html>-tag until you see the closing </html>.

But to return the top most <html> you don't have to parse the data until the
</html> tag, it is really enough to see it, of course you need to read the
whole file for that,  but the parsing can be lazy.

                                Zoli
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Novák Zoltán
Hi,

> Hi,
>
> > What results should a lazy parser return before emitting ⊥? At the time
> > you read the <html>-tag, you cannot know whether a syntax error far down
> > in the file makes it invalid. Thus, you may not return the top-most
> > <html>-tag until you see the closing </html>.
>
> But to return the top most <html> you don't have to parse the data until
> the </html> tag, it is really enough to see it, of course you need to read
> the whole file for that,  but the parsing can be lazy.
>
I've just found somebody wrote an article on a similar idea.
http://citeseer.ist.psu.edu/199634.html

Real lazy evaluation fans should code html in a Breadth-First way, with
forward pointers. :)

                        Zoli
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann
In reply to this post by Malcolm Wallace

On Fri, 11 May 2007, Malcolm Wallace wrote:

> > *Text.ParserCombinators.PolyLazy>
> >       runParser (exactly 4 (satisfy Char.isAlpha)) ("abc104"++undefined)
> > ("*** Exception: Parse.satisfy: failed
>
> This output is exactly correct.  You asked for the first four characters
> provided that they were alphabetic, but in fact only the first three
> were alphabetic.  Hence, 'satisfy' failed and threw an exception.  If
> you ask for only the first three characters, then the parse succeeds:

The problem is obviously that a later wrong character can make the whole
parse fail. Thus successful generated data is not returned until the whole
input is parsed and checked. How can I suppress checking the whole input?
How can I tell the parser that everything it parsed so far will not be
invalidated by further input? How can I rewrite the above example that it
returns
  ("abc*** Exception: Parse.satisfy: failed
?

I wondered whether 'commit' helps, but it didn't. (I thought it would
convert a global 'fail' to a local 'error'.)

*Text.ParserCombinators.PolyLazy>
    runParser (exactly 4 (commit (satisfy Char.isAlpha))) ("abc104"++undefined)
*** Exception: Parse.satisfy: failed
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Malcolm Wallace
Henning Thielemann <[hidden email]> wrote:

> > > *Text.ParserCombinators.PolyLazy>
> > >       runParser (exactly 4 (satisfy Char.isAlpha))
> > >       ("abc104"++undefined)
> > > ("*** Exception: Parse.satisfy: failed
>
> How can I rewrite the above example that it returns
>   ("abc*** Exception: Parse.satisfy: failed

The problem in your example is that the 'exactly' combinator forces
parsing of 'n' items before returning any of them.  If you roll your
own, then you can return partial results:

    > let one = return (:) `apply` satisfy (Char.isAlpha)
      in runParser (one `apply` (one `apply`
                   (one `apply` (one `apply` return []))))
             ("abc104"++undefined)
    ("abc*** Exception: Parse.satisfy: failed

Equivalently:

    > let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f
      in runParser (one (one (one (one (return []))))) ("abc104"++undefined)
    ("abc*** Exception: Parse.satisfy: failed

Perhaps I should just rewrite the 'exactly' combinator to have the
behaviour you desire?  Its current definition is:

    exactly 0 p = return []
    exactly n p = do x <- p
                     xs <- exactly (n-1) p
                     return (x:xs)

and a lazier definition would be:

    exactly 0 p = return []
    exactly n p = return (:) `apply` p `apply` exactly (n-1) p

> How can I tell the parser that everything it parsed so
> far will not be invalidated by further input?

Essentially, you need to return a constructor as soon as you know that
the initial portion of parsed data is correct.  Often the only sensible
way to do that is to use the 'apply' combinator (as shown in the
examples above), returning a constructor _function_ which is lazily
applied to the remainder of the parsing task.

> I wondered whether 'commit' helps, but it didn't. (I thought it would
> convert a global 'fail' to a local 'error'.)

The 'commit' combinator is intended for early abortion of a parse
attempt that it is known can no longer succeed.  That's the opposite of
what you want.  By contrast, the 'apply' combinator causes a parse
attempt to succeed early, even though it may turn out to fail later.

Regards,
    Malcolm
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann

On Mon, 14 May 2007, Malcolm Wallace wrote:

> Perhaps I should just rewrite the 'exactly' combinator to have the
> behaviour you desire?  Its current definition is:
>
>     exactly 0 p = return []
>     exactly n p = do x <- p
>                      xs <- exactly (n-1) p
>                      return (x:xs)

Is there a difference between 'exactly' and 'replicateM' ?
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann
In reply to this post by Malcolm Wallace

On Mon, 14 May 2007, Malcolm Wallace wrote:

> Essentially, you need to return a constructor as soon as you know that
> the initial portion of parsed data is correct.  Often the only sensible
> way to do that is to use the 'apply' combinator (as shown in the
> examples above), returning a constructor _function_ which is lazily
> applied to the remainder of the parsing task.

Great, 'apply' is the solution! I admit that I couldn't derive its power
from its documentation which simply states
 "Apply a parsed function to a parsed value."

:-)
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Malcolm Wallace
In reply to this post by Henning Thielemann
Henning Thielemann <[hidden email]> wrote:

> >     exactly 0 p = return []
> >     exactly n p = do x <- p
> >                      xs <- exactly (n-1) p
> >                      return (x:xs)
>
> Is there a difference between 'exactly' and 'replicateM' ?

With this definition, clearly not.  But when rewritten to use lazy
application, there is certainly a pragmatic difference in where the
bottoms (if any) are located in the result.

Regards,
    Malcolm
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Lazy HTML parsing with HXT, HaXML/polyparse, what else?

Henning Thielemann
In reply to this post by Malcolm Wallace

On Mon, 14 May 2007, Malcolm Wallace wrote:

> Henning Thielemann <[hidden email]> wrote:
>
> > > > *Text.ParserCombinators.PolyLazy>
> > > >       runParser (exactly 4 (satisfy Char.isAlpha))
> > > >       ("abc104"++undefined)
> > > > ("*** Exception: Parse.satisfy: failed
> >
> > How can I rewrite the above example that it returns
> >   ("abc*** Exception: Parse.satisfy: failed
>
> The problem in your example is that the 'exactly' combinator forces
> parsing of 'n' items before returning any of them.  If you roll your
> own, then you can return partial results:
>
>     > let one = return (:) `apply` satisfy (Char.isAlpha)
>       in runParser (one `apply` (one `apply`
>                    (one `apply` (one `apply` return []))))
>              ("abc104"++undefined)
>     ("abc*** Exception: Parse.satisfy: failed
>
> Equivalently:
>
>     > let one f = ((return (:)) `apply` satisfy (Char.isAlpha)) `apply` f
>       in runParser (one (one (one (one (return []))))) ("abc104"++undefined)
>     ("abc*** Exception: Parse.satisfy: failed

 I wonder whether 'apply' merges two separate ideas: Applying a generated
function to some parser generated value and forcing some parser to always
succeed. From the documentation of 'apply' I assumed that 'apply f x'
fails if 'f' or 'x' fails. In contrast to that it seems to succeed if only
'f' succeeds. Wouldn't it be better to have an explicit 'force' which
declares a parser to never fail - and to return 'undefined' if this
assumption is wrong.
 I have seen this 'force' in the MIDI loader of Haskore:
  http://darcs.haskell.org/haskore/src/Haskore/General/Parser.hs

 It would hold:
   apply f x  ==  do g <- f; fmap g (force x)
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe