parsing machine-generated natural text

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

parsing machine-generated natural text

Evan Martin
For a toy project I want to parse the output of a program.  The
program runs on someone else's machine and mails me the results, so I
only have access to the output it generates,

Unfortunately, the output is intended to be human-readable, and this
makes parsing it a bit of a pain.  Here are some sample lines from its
output:

France: Army Marseilles SUPPORT Army Paris -> Burgundy.
Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia.
England:     4 Supply centers,  3 Units:  Builds   1 unit.
The next phase of 'dip' will be Movement for Fall of 1901.

I've been using Parsec and it's felt rather complicated.  For example,
a "location" is a series of words and possibly parenthesis, except if
the word is SUPPORT.  And that "Supply centers" line ends up being
code filled with stuff lie "char ':'; skipMany space".

I actually have a separate parser that's Javascript with a bunch of
regular expressions and it's far shorter than my Haskell one, which
makes sense as munging this sort of text feels to me more like a
regexp job than a careful parsing job.

I'm considering writing a preprocessing stage in Ruby or Perl that
munges those output lines into something a bit more
"machine-readable", but before I did that I thought I'd ask here if
anyone had any pointers, hints, or better ideas.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Bulat Ziganshin-2
Hello Evan,

Saturday, May 20, 2006, 5:35:15 AM, you wrote:

> France: Army Marseilles SUPPORT Army Paris -> Burgundy.
> Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia.
> England:     4 Supply centers,  3 Units:  Builds   1 unit.
> The next phase of 'dip' will be Movement for Fall of 1901.

> I've been using Parsec and it's felt rather complicated.  For example,

i have an experience of parsing such human-readable, imprecise texts
and should say that regexps was developed just to do such jibs. ghc
and hugs already contains regex library in module Text.Regex.Posix
(it's available on all systems, including Windows). this lib has
rather dumb interface, i recommend you to install JRegex lib by Johc
Meacham that supports familiar =~ operators. there is also
Text.Regex.Lazy module:

Text.Regex.Lazy (0.33). Chris Kuklewicz [6]announced the release
       of [7]Text.Regex.Lazy. This is an alternative to Text.Regex along
       with some enhancements. GHC's Text.Regex marshals the data back
       and forth to C arrays, to call libc. This is far too slow (and
       strict). This module understands regular expression Strings via a
       Parsec parser and creates an internal data structure
       (Text.Regex.Lazy.Pattern). This is then transformed into a Parsec
       parser to process the input String, or into a DFA table for
       matching against the input String or FastPackedString. The input
       string is consumed lazily, so it may be an arbitrarily long or
       infinite source.

   6. http://article.gmane.org/gmane.comp.lang.haskell.libraries/4464
   7. http://sourceforge.net/projects/lazy-regex

--
Best regards,
 Bulat                            mailto:[hidden email]

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Udo Stenzel
In reply to this post by Evan Martin
Evan Martin wrote:
> Unfortunately, the output is intended to be human-readable, and this
> makes parsing it a bit of a pain.  Here are some sample lines from its
> output:
>
> France: Army Marseilles SUPPORT Army Paris -> Burgundy.
> Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia.
> England:     4 Supply centers,  3 Units:  Builds   1 unit.
> The next phase of 'dip' will be Movement for Fall of 1901.

What's the difficulty?  "SUPPORT" and "CONVOY" are simply keywords, as
are "Army" and "Fleet", only other words are identifiers for locations.
Parsec supports this out of the box; have a look at the Language and
Token modules .  Note that CONVOY orders can get complex, so a true
parser is probably the right tool.


> And that "Supply centers" line ends up being
> code filled with stuff lie "char ':'; skipMany space".

do power
   colon
   integer
   reserved "Supply centers,"
   integer
   reserved "Units:"
   ((reserved "Builds" >> return id) <|>
    (reserved "Disbands" >> return negate))
        `ap` integer
   reserved "units." <|> reserved "unit."


Come on, it isn't nearly as bad as you make it sound.  Use the
combinators, they are far more powerful than ugly never-quite-correct
regexes.

Oh, and drop me a line when your Diplomacy bot is finished.


Udo.
--
Jeder echte Wettbewerb ist ruinös. Darum beruht jede funktionierende
Wirtschaft auf Schiebung.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Evan Martin
On 5/21/06, Udo Stenzel <[hidden email]> wrote:

> do power
>    colon
>    integer
>    reserved "Supply centers,"
>    integer
>    reserved "Units:"
>    ((reserved "Builds" >> return id) <|>
>         (reserved "Disbands" >> return negate))
>         `ap` integer
>    reserved "units." <|> reserved "unit."
>
> Come on, it isn't nearly as bad as you make it sound.  Use the
> combinators, they are far more powerful than ugly never-quite-correct
> regexes.

Thanks!  I had looked at using the lexeme parser before but it didn't
seem like you can make newlines significant.  Here's the beginning of
the file, where it's not obvious to me how to distinguish elements in
the "::" section from the rest of the file.
  :: Judge: USDP  Game: dip  Variant: standard
  :: Deadline: F1901M Mon 20 Feb 2006 20:00 PST
  :: URL: http://www.diplom.org/dpjudge?game=dip

  Movement results for Fall of 1901.  (dip.F1901M)
I guess I could make "Movement" a reserved word?

> Oh, and drop me a line when your Diplomacy bot is finished.

:)

It's actually just for rendering nicer maps of the game state.
http://neugierig.org/software/hsdip/mapview.html
(It's draggable, too.)

I was trying to do it with Firefox's SVG+XUL but it's terribly slow,
XUL isn't quite there yet, and doing a large app with JavaScript is
painful.
http://neugierig.org/software/darcs/xuldip/dip.xul  (no install
necessary; only works in Firefox)
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Evan Martin
On 5/21/06, Evan Martin <[hidden email]> wrote:
> Thanks!  I had looked at using the lexeme parser before but it didn't
> seem like you can make newlines significant.

Upon further consideration I realized that you can mix lexeme-based
parsers with "plain" parsers.  I think I've mostly figured this out.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Bjorn Bringert-2
In reply to this post by Evan Martin
On May 19, 2006, at 6:35 PM, Evan Martin wrote:

> For a toy project I want to parse the output of a program.  The
> program runs on someone else's machine and mails me the results, so I
> only have access to the output it generates,
>
> Unfortunately, the output is intended to be human-readable, and this
> makes parsing it a bit of a pain.  Here are some sample lines from its
> output:
>
> France: Army Marseilles SUPPORT Army Paris -> Burgundy.
> Russia: Fleet St Petersburg (south coast) -> Gulf of Bothnia.
> England:     4 Supply centers,  3 Units:  Builds   1 unit.
> The next phase of 'dip' will be Movement for Fall of 1901.
>
> I've been using Parsec and it's felt rather complicated.  For example,
> a "location" is a series of words and possibly parenthesis, except if
> the word is SUPPORT.  And that "Supply centers" line ends up being
> code filled with stuff lie "char ':'; skipMany space".
>
> I actually have a separate parser that's Javascript with a bunch of
> regular expressions and it's far shorter than my Haskell one, which
> makes sense as munging this sort of text feels to me more like a
> regexp job than a careful parsing job.
>
> I'm considering writing a preprocessing stage in Ruby or Perl that
> munges those output lines into something a bit more
> "machine-readable", but before I did that I thought I'd ask here if
> anyone had any pointers, hints, or better ideas.

Hi Evan,

if the text you want to parse is actually similar to natural language  
(some posters have suggested that it is much simpler), you may want  
to have a look at grammar formalisms designed for natural languages.  
Grammatical Framework (GF) [1] is such a formalism, where the  
grammars are functional programs. The GF implementation is written in  
Haskell, and it has an interactive mode and a Haskell API.

[Disclaimer: I participate in the development of GF]

/Björn

[1] http://www.cs.chalmers.se/~aarne/GF/

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Udo Stenzel
In reply to this post by Evan Martin
Evan Martin wrote:
> Here's the beginning of
> the file, where it's not obvious to me how to distinguish elements in
> the "::" section from the rest of the file.
>  :: Judge: USDP  Game: dip  Variant: standard
>  :: Deadline: F1901M Mon 20 Feb 2006 20:00 PST
>  :: URL: http://www.diplom.org/dpjudge?game=dip

You could make "::" the start-of-one-line-comment sequence or you could
just parse these three lines.  You can also throw them away like this:

string " ::" >> many (satisfy (/='\n')) >> newline

 
> Movement results for Fall of 1901.  (dip.F1901M)
> I guess I could make "Movement" a reserved word?

Or simply treat it as such.  'reserved "Movement"' expands to 'lexeme
(string "Movement")', so unless "Movement" appears as keyword where some
other identifier could also occur, you don't need to treat it specially.

 
> It's actually just for rendering nicer maps of the game state.
> http://neugierig.org/software/hsdip/mapview.html
> (It's draggable, too.)

That's nice, too.  There's not much Diplomacy software out there that
runs under Linux, and most that does is painfully slow.


Udo.
--
"Hey, Perl-Scripte sind wie Klobürsten.  
Da ist man nicht stolz drauf! Und man
gibt sie auch nicht weiter.  Schon gar
nicht in benutzter Form."  -- F. v. Leitner

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Jason Dagit
In reply to this post by Udo Stenzel
On 5/20/06, Udo Stenzel <[hidden email]> wrote:

>
> do power
>    colon
>    integer
>    reserved "Supply centers,"
>    integer
>    reserved "Units:"
>    ((reserved "Builds" >> return id) <|>
>         (reserved "Disbands" >> return negate))
>         `ap` integer
>    reserved "units." <|> reserved "unit."
>

I always struggle with when I need to use 'try' with parsec.  I tend
to over use it because I've had unexpected results when I leave it
out.  In your other email you say that 'reserved "Movement"' expands
to 'lexeme
(string "Movement")'.  So I would think that,

reserved "units." <|> reserved "unit."

would need to be wrapped in a 'try'.  Something like,

try (reserved "units.") <|> reserved "unit."

My understanding is that if 'unit.' appears in the input the first
parser will parse up to the '.' and then fail and consume the input up
to that point, leaving the alternative with only the period as input
so it will also fail.

So I'm wondering if someone could explain to me what is wrong with my
understanding of parsec or point me to a resource that explains this
(probably common) misunderstanding.  Specifically, I would like to
develop a better understanding of when 'try' is needed, and when it is
not needed.

Thanks,
Jason
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: parsing machine-generated natural text

Udo Stenzel
Jason Dagit wrote:

> >   reserved "units." <|> reserved "unit."
>
> I always struggle with when I need to use 'try' with parsec.
>
> My understanding is that if 'unit.' appears in the input the first
> parser will parse up to the '.' and then fail and consume the input up
> to that point, leaving the alternative with only the period as input
> so it will also fail.
>
> So I'm wondering if someone could explain to me what is wrong with my
> understanding of parsec
Nothing at all.  'reserved' actually contains the necessary 'try', a
'notFollowedBy' in order not to swallow parts of longer identifiers and
a useful error message.  I tend to forget 'try', too, or add it in odd
places.  Actually, if ReadP hat better error reporting, I'd recommend
that over Parsec.


Udo.
--
Time is an illusion. Lunchtime doubly so. -- Douglas Adams

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment