Parsing Revisited

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Parsing Revisited

Jeffrey Drake-2
I believe it might be beneficial for me to ask a more general question
than I asked in 'Parsing Custom TeX':

Given a set of combinators that parse specific parts of a document, how
can you string them together so that you can get the whole document
parsed?

Within this, a consideration for extraneous whitespace should be
considered - where you might have specific rules for when to drop
whitespace.

Thank you again.
Jeff.

Reply | Threaded
Open this post in threaded view
|

Parsing Revisited

Tillmann Rendel-2
Jeffrey Drake wrote:
> Given a set of combinators that parse specific parts of a document, how
> can you string them together so that you can get the whole document
> parsed?

The general idea is to build parser for complex formats out of parsers
for simple formats. The structure of the parser often more or less
follows the structure of the data.

For example, the following data type could be a first approach to
capture the lexical structure of TeX:

   data TeX
     = Letter Char             -- for example: A
     | Command String          -- for example: \begin
     | Group [TeX]             -- for example: {abc\something{...}}

The idea is to parse "a\test {bc}" into the following list of TeX values:

   [Letter 'a', Command "test", Group [Letter 'c', Letter 'd']]

Note how the use of lists of TeX values allows to actually represent
whole documents; and how the Group data constructor allows to capture
the recursive structure of TeX programs.

Let start by writing the parser for a single TeX value. The datatype
definition shows that a such a value can be a letter, a command or a
list of TeX values enclosed in braces. We can capture the fact that we
have three choices directly in parsers:

   tex :: Parser TeX
   tex = texLetter <|> texCommand <|> texGroup

Note how the combinator <|> corresponds to the | syntax in the datatype
declaration.

Given this parser for TeX values, we can write the parser for a list of
such values using the many combinator:

   texList :: Parser [TeX]
   texList = many tex

Note how the many combinator corresponds to the list type constructor.

Now we have to define the parser for the three data constructors.
texLetter is easy:

   texLetter :: Parser TeX
   texLetter = do l <- letter
                  return (Letter l)

Note how the fact that texLetter just wraps letter corresponds to the
fact that Letter just wraps Char.

Commands are more interesting, because they eat all spaces after the
name of the control sequence.

   texCommand :: Parser TeX
   texCommand = do char '\\'
                   name <- many letter
                   many (char ' ')
                   return (Command name)

By implementing the space eating feature of commands as part of the
texCommand parser, we can be sure that spaces not following commands
will not be eaten.

Finally, I would consider the parser for groups the most interesting.
The inside of a group looks looks just like the whole TeX document
itself. Fortunately, we have already implemented a parser for whole TeX
documents, namely texList, which we use for the texGroup parser as follows:

   texGroup :: Parser TeX
   texGroup = do char '{'
                 content <- texList
                 char '}'

Note how the mutual recursion between texList and texGroup corresponds
to the recursion in the TeX data type.

Of course, the examples in this messages are not meant to be production
code. Actually, they are not tested at all. But I hope that they help
you get started with Parsec.

   Tillmann
Reply | Threaded
Open this post in threaded view
|

Parsing Revisited

Jeffrey Drake-2
This helps a lot, and I can go over this in the morning. The only final
question I have is what you would use to apply all this to an arbitrary
string.

For example, in the following partially fictitious code: (based on
something I saw)

main = getContents >>= ...

What would ... be so that you can turn the [Char] into [TeX]? Or can I
specifically use the combinators only?

- Jeff

On Mon, 2008-11-10 at 04:33 +0100, Tillmann Rendel wrote:

> Jeffrey Drake wrote:
> > Given a set of combinators that parse specific parts of a document, how
> > can you string them together so that you can get the whole document
> > parsed?
>
> The general idea is to build parser for complex formats out of parsers
> for simple formats. The structure of the parser often more or less
> follows the structure of the data.
>
> For example, the following data type could be a first approach to
> capture the lexical structure of TeX:
>
>    data TeX
>      = Letter Char             -- for example: A
>      | Command String          -- for example: \begin
>      | Group [TeX]             -- for example: {abc\something{...}}
>
> The idea is to parse "a\test {bc}" into the following list of TeX values:
>
>    [Letter 'a', Command "test", Group [Letter 'c', Letter 'd']]
>
> Note how the use of lists of TeX values allows to actually represent
> whole documents; and how the Group data constructor allows to capture
> the recursive structure of TeX programs.
>
> Let start by writing the parser for a single TeX value. The datatype
> definition shows that a such a value can be a letter, a command or a
> list of TeX values enclosed in braces. We can capture the fact that we
> have three choices directly in parsers:
>
>    tex :: Parser TeX
>    tex = texLetter <|> texCommand <|> texGroup
>
> Note how the combinator <|> corresponds to the | syntax in the datatype
> declaration.
>
> Given this parser for TeX values, we can write the parser for a list of
> such values using the many combinator:
>
>    texList :: Parser [TeX]
>    texList = many tex
>
> Note how the many combinator corresponds to the list type constructor.
>
> Now we have to define the parser for the three data constructors.
> texLetter is easy:
>
>    texLetter :: Parser TeX
>    texLetter = do l <- letter
>                   return (Letter l)
>
> Note how the fact that texLetter just wraps letter corresponds to the
> fact that Letter just wraps Char.
>
> Commands are more interesting, because they eat all spaces after the
> name of the control sequence.
>
>    texCommand :: Parser TeX
>    texCommand = do char '\\'
>                    name <- many letter
>                    many (char ' ')
>                    return (Command name)
>
> By implementing the space eating feature of commands as part of the
> texCommand parser, we can be sure that spaces not following commands
> will not be eaten.
>
> Finally, I would consider the parser for groups the most interesting.
> The inside of a group looks looks just like the whole TeX document
> itself. Fortunately, we have already implemented a parser for whole TeX
> documents, namely texList, which we use for the texGroup parser as follows:
>
>    texGroup :: Parser TeX
>    texGroup = do char '{'
>                  content <- texList
>                  char '}'
>
> Note how the mutual recursion between texList and texGroup corresponds
> to the recursion in the TeX data type.
>
> Of course, the examples in this messages are not meant to be production
> code. Actually, they are not tested at all. But I hope that they help
> you get started with Parsec.
>
>    Tillmann

Reply | Threaded
Open this post in threaded view
|

Parsing Revisited

Tillmann Rendel-2
Jeffrey Drake wrote:
> This helps a lot, and I can go over this in the morning. The only final
> question I have is what you would use to apply all this to an arbitrary
> string.

You can use parseTest for testing, e.g. in ghci.

   parseTest texList "hello\world {example}"

That will either print the resulting {TeX], or a parser error message.

For normal processing, use parse.

   case parse texList "<no source>" "hello\world {example}" of
     Left problem -> error (show problem)
     Right texList -> convertToHTML texList

   Tillmann
Reply | Threaded
Open this post in threaded view
|

Parsing Revisited

Jeffrey Drake-2
I have decided to go in another direction, so the parser here is going
to be different, but it works without error correction. I am using a
wiki style right here, where I am testing headers level 1 to 5.

I believe I need to essentially throw an error here, normally I would
think <?> might be appropriate, but it doesn't seem to apply in this
case.

The problem is that a1, a2 must satisfy two conditions:

        a1 == a2
        1 <= a1 <= 5

If it doesn't, then an error must be output.

It appears that <?> is just an operator for 'label'. Which goes to
labels, which does eventually go to construct Error, but not sure in the
correct context.

Again, any help is appreciated, I am getting a feeling for this I think.
- Jeff.

Code is below:


module Main where

import Control.Monad

import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Char
import Text.ParserCombinators.Parsec.Combinator


data Wiki
    =   Heading Int String
    |   Other String
    deriving Show


{-
    a1/a2 heading level
    2       5
    3       4
    4       3
    5       2
    6       1
    :. 7 - a1      
    -}
heading :: Parser Wiki
heading = do    (a1, s, a2) <- within
                        (length `liftM` many1 (char '='))
                        (length `liftM` many1 (char '='))
                        (many1 (alphaNum <|> space))
                return $ Heading (7 - a1) s
          where within open close p = do
                                        a1 <- open
                                        x <- p
                                        a2 <- close
                                        return (a1, x, a2)


On Mon, 2008-11-10 at 06:10 +0100, Tillmann Rendel wrote:

> Jeffrey Drake wrote:
> > This helps a lot, and I can go over this in the morning. The only final
> > question I have is what you would use to apply all this to an arbitrary
> > string.
>
> You can use parseTest for testing, e.g. in ghci.
>
>    parseTest texList "hello\world {example}"
>
> That will either print the resulting {TeX], or a parser error message.
>
> For normal processing, use parse.
>
>    case parse texList "<no source>" "hello\world {example}" of
>      Left problem -> error (show problem)
>      Right texList -> convertToHTML texList
>
>    Tillmann

Reply | Threaded
Open this post in threaded view
|

Parsing Revisited

Daniel Fischer-4
Am Dienstag, 11. November 2008 07:20 schrieb Jeffrey Drake:

> I have decided to go in another direction, so the parser here is going
> to be different, but it works without error correction. I am using a
> wiki style right here, where I am testing headers level 1 to 5.
>
> I believe I need to essentially throw an error here, normally I would
> think <?> might be appropriate, but it doesn't seem to apply in this
> case.
>
> The problem is that a1, a2 must satisfy two conditions:
>
> a1 == a2
> 1 <= a1 <= 5
>
> If it doesn't, then an error must be output.
>
> It appears that <?> is just an operator for 'label'. Which goes to
> labels, which does eventually go to construct Error, but not sure in the
> correct context.

<?> gives you the opportunity to construct a helpful error message if the
parser fails. You provide information what the parser expects and if it fails
you get the message
"unexpected whatever, expecting x, y, z or w"

>
> Again, any help is appreciated, I am getting a feeling for this I think.
> - Jeff.
>
> Code is below:
>
>
> module Main where
>
> import Control.Monad
>
> import Text.ParserCombinators.Parsec
> import Text.ParserCombinators.Parsec.Char
> import Text.ParserCombinators.Parsec.Combinator
>
>
> data Wiki
>     =   Heading Int String
>
>     |   Other String
>
>     deriving Show
>
>
> {-
>     a1/a2 heading level
>     2       5
>     3       4
>     4       3
>     5       2
>     6       1
>
>     :. 7 - a1
>
>     -}
> heading :: Parser Wiki
> heading = do    (a1, s, a2) <- within
>                         (length `liftM` many1 (char '='))
>                         (length `liftM` many1 (char '='))
>                         (many1 (alphaNum <|> space))
>                 return $ Heading (7 - a1) s
>           where within open close p = do
>                                         a1 <- open
>                                         x <- p
>                                         a2 <- close
>                                         return (a1, x, a2)

Perhaps

heading :: Parser Wiki
heading = do
        a1 <- length `liftM` many1 (char '=')
        title <- many1 (alphaNum <|> space)
        count a1 (char '=')
        notFollowedBy (char '=')
        return $ Heading (7-a1) title

? Although that doesn't check that there are at most 5 '=', so you might use
the combinators

upTo 0 p = do
        notFollowedBy p
  return []
upTo k p = (do
        a <- p
        as <- upTo (k-1) p
        return (a:as)) <|> return []

upTo1 k p = do
        a <- p
        as <- upTo (k-1) p
        return (a:as)

exactly k p = do
        as <- count k p
        notFollowedBy p
        return as

as in

heading = do
        a1 <- length `liftM` upTo1 5 (char '=') <?> ("1 to 5 '='s")
        title <- many1 (alphaNum <|> space)
        exactly a1 (char '=') <?> (show a1 ++ " '='s")
        return $ Heading (7-a1) title

>
> On Mon, 2008-11-10 at 06:10 +0100, Tillmann Rendel wrote:
> > Jeffrey Drake wrote:
> > > This helps a lot, and I can go over this in the morning. The only final
> > > question I have is what you would use to apply all this to an arbitrary
> > > string.
> >
> > You can use parseTest for testing, e.g. in ghci.
> >
> >    parseTest texList "hello\world {example}"
> >
> > That will either print the resulting {TeX], or a parser error message.
> >
> > For normal processing, use parse.
> >
> >    case parse texList "<no source>" "hello\world {example}" of
> >      Left problem -> error (show problem)
> >      Right texList -> convertToHTML texList
> >
> >    Tillmann
>
> _______________________________________________
> Beginners mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/beginners