# Parsing Revisited

6 messages
Open this post in threaded view
|

## Parsing Revisited

 I believe it might be beneficial for me to ask a more general question than I asked in 'Parsing Custom TeX': Given a set of combinators that parse specific parts of a document, how can you string them together so that you can get the whole document parsed? Within this, a consideration for extraneous whitespace should be considered - where you might have specific rules for when to drop whitespace. Thank you again. Jeff.
Open this post in threaded view
|

## Parsing Revisited

 Jeffrey Drake wrote: > Given a set of combinators that parse specific parts of a document, how > can you string them together so that you can get the whole document > parsed? The general idea is to build parser for complex formats out of parsers for simple formats. The structure of the parser often more or less follows the structure of the data. For example, the following data type could be a first approach to capture the lexical structure of TeX:    data TeX      = Letter Char             -- for example: A      | Command String          -- for example: \begin      | Group [TeX]             -- for example: {abc\something{...}} The idea is to parse "a\test {bc}" into the following list of TeX values:    [Letter 'a', Command "test", Group [Letter 'c', Letter 'd']] Note how the use of lists of TeX values allows to actually represent whole documents; and how the Group data constructor allows to capture the recursive structure of TeX programs. Let start by writing the parser for a single TeX value. The datatype definition shows that a such a value can be a letter, a command or a list of TeX values enclosed in braces. We can capture the fact that we have three choices directly in parsers:    tex :: Parser TeX    tex = texLetter <|> texCommand <|> texGroup Note how the combinator <|> corresponds to the | syntax in the datatype declaration. Given this parser for TeX values, we can write the parser for a list of such values using the many combinator:    texList :: Parser [TeX]    texList = many tex Note how the many combinator corresponds to the list type constructor. Now we have to define the parser for the three data constructors. texLetter is easy:    texLetter :: Parser TeX    texLetter = do l <- letter                   return (Letter l) Note how the fact that texLetter just wraps letter corresponds to the fact that Letter just wraps Char. Commands are more interesting, because they eat all spaces after the name of the control sequence.    texCommand :: Parser TeX    texCommand = do char '\\'                    name <- many letter                    many (char ' ')                    return (Command name) By implementing the space eating feature of commands as part of the texCommand parser, we can be sure that spaces not following commands will not be eaten. Finally, I would consider the parser for groups the most interesting. The inside of a group looks looks just like the whole TeX document itself. Fortunately, we have already implemented a parser for whole TeX documents, namely texList, which we use for the texGroup parser as follows:    texGroup :: Parser TeX    texGroup = do char '{'                  content <- texList                  char '}' Note how the mutual recursion between texList and texGroup corresponds to the recursion in the TeX data type. Of course, the examples in this messages are not meant to be production code. Actually, they are not tested at all. But I hope that they help you get started with Parsec.    Tillmann
Open this post in threaded view
|

## Parsing Revisited

 This helps a lot, and I can go over this in the morning. The only final question I have is what you would use to apply all this to an arbitrary string. For example, in the following partially fictitious code: (based on something I saw) main = getContents >>= ... What would ... be so that you can turn the [Char] into [TeX]? Or can I specifically use the combinators only? - Jeff On Mon, 2008-11-10 at 04:33 +0100, Tillmann Rendel wrote: > Jeffrey Drake wrote: > > Given a set of combinators that parse specific parts of a document, how > > can you string them together so that you can get the whole document > > parsed? > > The general idea is to build parser for complex formats out of parsers > for simple formats. The structure of the parser often more or less > follows the structure of the data. > > For example, the following data type could be a first approach to > capture the lexical structure of TeX: > >    data TeX >      = Letter Char             -- for example: A >      | Command String          -- for example: \begin >      | Group [TeX]             -- for example: {abc\something{...}} > > The idea is to parse "a\test {bc}" into the following list of TeX values: > >    [Letter 'a', Command "test", Group [Letter 'c', Letter 'd']] > > Note how the use of lists of TeX values allows to actually represent > whole documents; and how the Group data constructor allows to capture > the recursive structure of TeX programs. > > Let start by writing the parser for a single TeX value. The datatype > definition shows that a such a value can be a letter, a command or a > list of TeX values enclosed in braces. We can capture the fact that we > have three choices directly in parsers: > >    tex :: Parser TeX >    tex = texLetter <|> texCommand <|> texGroup > > Note how the combinator <|> corresponds to the | syntax in the datatype > declaration. > > Given this parser for TeX values, we can write the parser for a list of > such values using the many combinator: > >    texList :: Parser [TeX] >    texList = many tex > > Note how the many combinator corresponds to the list type constructor. > > Now we have to define the parser for the three data constructors. > texLetter is easy: > >    texLetter :: Parser TeX >    texLetter = do l <- letter >                   return (Letter l) > > Note how the fact that texLetter just wraps letter corresponds to the > fact that Letter just wraps Char. > > Commands are more interesting, because they eat all spaces after the > name of the control sequence. > >    texCommand :: Parser TeX >    texCommand = do char '\\' >                    name <- many letter >                    many (char ' ') >                    return (Command name) > > By implementing the space eating feature of commands as part of the > texCommand parser, we can be sure that spaces not following commands > will not be eaten. > > Finally, I would consider the parser for groups the most interesting. > The inside of a group looks looks just like the whole TeX document > itself. Fortunately, we have already implemented a parser for whole TeX > documents, namely texList, which we use for the texGroup parser as follows: > >    texGroup :: Parser TeX >    texGroup = do char '{' >                  content <- texList >                  char '}' > > Note how the mutual recursion between texList and texGroup corresponds > to the recursion in the TeX data type. > > Of course, the examples in this messages are not meant to be production > code. Actually, they are not tested at all. But I hope that they help > you get started with Parsec. > >    Tillmann