Is XHT a good tool for parsing web pages?


John Creighton

Subject: Is XHT a good tool for parsing web pages?
I looked a little bit at XHT and it seems very elegant for writing
concise definitions of parsers by forms, but I read that it fails if
the XML isn't strict, and I know a lot of web pages don't use strict
XHTML. Therefore I wonder whether it is an appropriate tool for web pages.




_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Is XHT a good tool for parsing web pages?

Peter Robinson-10
On 27 April 2010 16:22, John Creighton <[hidden email]> wrote:
>> Subject: Is XHT a good tool for parsing web pages?
>> I looked a little bit at XHT and it seems very elegant for writing
>> concise definitions of parsers by forms but I read that it fails if
>> the XML isn't strict and I know a lot of web pages don't use strict
>> XHTML. Therefore I wonder if it is an appropriate tool for web pages.

I don't know about XHT but tagsoup [1] does a pretty good job parsing web pages.

  Peter

[1] http://hackage.haskell.org/package/tagsoup
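
As an illustration of the tagsoup suggestion above, here is a minimal sketch that pulls link targets out of markup that is not well-formed. It assumes the tagsoup package is installed and uses only `parseTags` and the `TagOpen` constructor from `Text.HTML.TagSoup`; the helper name `linkTargets` and the sample markup are made up for this example.

```haskell
import Text.HTML.TagSoup (Tag (TagOpen), parseTags)

-- Hypothetical helper: collect the href attribute of every <a> tag,
-- in document order. parseTags only tokenizes the input into a flat
-- list of tags, so unclosed elements are not a problem here.
linkTargets :: String -> [String]
linkTargets html =
  [ href
  | TagOpen "a" attrs <- parseTags html
  , Just href <- [lookup "href" attrs]
  ]

main :: IO ()
main =
  -- Sample input with an unclosed <p>, unclosed <a> tags and a bare <br>.
  mapM_ putStrLn $
    linkTargets
      "<p>Unclosed paragraph <a href=\"http://www.haskell.org/\">Haskell<br><a href=\"about.html\">About"
```

Because the result is a flat tag list rather than a tree, this style suits scraping tasks where nesting does not matter much.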

Re: Is XHT a good tool for parsing web pages?

Malcolm Wallace
In reply to this post by John Creighton
> Is XHT a good tool for parsing web pages?
> I read that it fails if the XML isn't strict and I know a lot of web  
> pages don't use strict XHTML.

Do you mean HXT rather than XHT?

I know that the HaXml library has a separate error-correcting HTML  
parser that works around most of the common non-well-formedness bugs  
in HTML:
     Text.XML.HaXml.Html.Parse

I believe HXT has a similar parser:
     Text.XML.HXT.Parser.HtmlParsec

Indeed, some of the similarities suggest this parser was originally  
lifted directly out of HaXml (as permitted by HaXml's licence),  
although the two modules have now diverged significantly.

Regards,
     Malcolm


Re: Is XHT a good tool for parsing web pages?

Uwe Schmidt-2
Hi John and Malcolm,

> I know that the HaXml library has a separate error-correcting HTML
> parser that works around most of the common non-well-formedness bugs
> in HTML:
>      Text.XML.HaXml.Html.Parse
>
> I believe HXT has a similar parser:
>      Text.XML.HXT.Parser.HtmlParsec
>
> Indeed, some of the similarities suggest this parser was originally
> lifted directly out of HaXml (as permitted by HaXml's licence),
> although the two modules have now diverged significantly.

The HTML parser in HXT is based on tagsoup. It's a lazy parser
(it does not use parsec) and it tries to parse everything as HTML.
But garbage in, garbage out, there is no approach to repair illegal HTML
as e.g. the Tidy parsers do. The parser uses tagsoup as a scanner.

The table-driven approach for inserting missing closing tags is indeed taken
from HaXml. Malcolm, I hope you don't have a patent on this algorithm.

Regards,

   Uwe
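
To make the idea of table-driven insertion of missing closing tags concrete, here is a small self-contained sketch. It is not the code from HaXml or HXT: the `Token` type, the contents of the `closedBy` table, the `voidElements` list and the function `balance` are all assumptions chosen for illustration, and a real table would be much larger.

```haskell
-- A simplified token stream, as a scanner such as tagsoup might deliver it.
data Token = Open String | Close String | Text String
  deriving (Eq, Show)

-- Hypothetical table: for each element, the opening tags that
-- implicitly end it when it is still open.
closedBy :: [(String, [String])]
closedBy =
  [ ("p",  ["p", "div", "ul", "ol", "table", "h1", "h2"])
  , ("li", ["li"])
  , ("td", ["td", "th", "tr"])
  , ("th", ["td", "th", "tr"])
  , ("tr", ["tr"])
  ]

-- Elements that never take a closing tag (illustrative subset).
voidElements :: [String]
voidElements = ["br", "hr", "img", "input", "meta", "link"]

-- Does opening the element 'new' implicitly close the open element 'open'?
implicitlyCloses :: String -> String -> Bool
implicitlyCloses new open = maybe False (new `elem`) (lookup open closedBy)

-- Insert the closing tags that the input left out.
balance :: [Token] -> [Token]
balance = go []
  where
    -- The first argument is the stack of open elements, innermost first.
    go stack [] = map Close stack            -- close whatever is still open
    go stack (Text s : ts) = Text s : go stack ts
    go stack (Open t : ts)
      | t `elem` voidElements = Open t : go stack ts
      | otherwise =
          let (ended, rest) = span (implicitlyCloses t) stack
          in map Close ended ++ Open t : go (t : rest) ts
    go stack (Close t : ts)
      | t `elem` stack =
          -- Close everything nested inside t, then t itself.
          let (inner, rest) = break (== t) stack
          in map Close inner ++ Close t : go (drop 1 rest) ts
      | otherwise = go stack ts              -- stray closing tag: drop it

main :: IO ()
main = print (balance [Open "ul", Open "li", Text "a", Open "li", Text "b", Close "ul"])
```

As Uwe notes, this kind of table only supplies closing tags; it does not attempt the broader repairs that Tidy-style tools perform.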




Re: Is XHT a good tool for parsing web pages?

Ivan Lazar Miljenovic
Uwe Schmidt <[hidden email]> writes:
> The HTML parser in HXT is based on tagsoup. It's a lazy parser
> (it does not use parsec) and it tries to parse everything as HTML.
> But garbage in, garbage out, there is no approach to repair illegal HTML
> as e.g. the Tidy parsers do. The parser uses tagsoup as a scanner.

So what is parsec used for in HXT then?

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Is XHT a good tool for parsing web pages?

Uwe Schmidt-2
Hi Ivan,

> Uwe Schmidt <[hidden email]> writes:
> > The HTML parser in HXT is based on tagsoup. It's a lazy parser
> > (it does not use parsec) and it tries to parse everything as HTML.
> > But garbage in, garbage out, there is no approach to repair illegal HTML
> > as e.g. the Tidy parsers do. The parser uses tagsoup as a scanner.
>
> So what is parsec used for in HXT then?

Parsec is used for the XML parser, which also deals with DTDs. This parser only
accepts well-formed XML; everything else gives an error (not just a warning,
as with the HTML parser). tagsoup and the HTML parser do not deal with DTDs,
so they can't be used for a full (validating) XML parser.

Regards,

   Uwe
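
The difference between the two parsers can be tried out with a sketch like the following. It assumes the later HXT 9 interface (`Text.XML.HXT.Core` with `readString`, `withParseHTML`, `withWarnings` and `withValidate`), which postdates this thread, so the option names are an assumption relative to the version discussed here. The function names `hrefsLenient` and `hrefsStrict` and the sample inputs are invented for the example.

```haskell
import Text.XML.HXT.Core

-- Query shared by both variants: the href of every <a> element.
-- (An explicit type is given to avoid monomorphism-restriction surprises.)
hrefs :: IOSArrow XmlTree String
hrefs = deep (isElem >>> hasName "a") >>> getAttrValue "href"

-- Lenient variant: parse the input with the HTML parser and
-- suppress the warnings it issues for questionable markup.
hrefsLenient :: String -> IO [String]
hrefsLenient src =
  runX (readString [withParseHTML True, withWarnings False] src >>> hrefs)

-- Strict variant: parse the input with the XML parser. Validation is
-- switched off because the sample input carries no DTD.
hrefsStrict :: String -> IO [String]
hrefsStrict src =
  runX (readString [withValidate False] src >>> hrefs)

main :: IO ()
main = do
  -- Tag soup goes through the HTML parser.
  hrefsLenient "<p>unclosed <a href=\"a.html\">first<br><a href=\"b.html\">second" >>= print
  -- Well-formed input goes through the XML parser.
  hrefsStrict "<p><a href=\"a.html\">first</a></p>" >>= print
```

Per the description above, passing the first (non-well-formed) string to `hrefsStrict` is expected to produce a parse error rather than a list of links.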
