Parsing Custom TeX

Jeffrey Drake-2
I am getting frustrated when trying to implement a parser for a minimal
TeX subset. I am essentially trying to implement something that
recognizes \commands (no parameters yet), paragraphs (two newlines), and
text (almost anything else).

The code I have so far is:

module UnTeX where

import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Prim
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as T

data UnTeX = Command String
            | Text String
            | Paragraph
        deriving Show


command :: Parser UnTeX
command = do    char '\\'
                cmd <- many1 (letter <|> digit) <?> "command"
                return $ Command cmd

paragraph :: Parser UnTeX
paragraph = do  newline
                newline
                return $ Paragraph

text :: Parser UnTeX
text = do       txt <- many1 (alphaNum <|> space)
                return $ Text txt


I realize this is probably far from what I actually need to be doing,
but I am finding the documentation on all this to be spotty. The
'parsec.pdf' is 7 years old, and its code doesn't always work.

Can anyone point me in the right direction? I just want something to be
able to do this simply. After an initial version works, I would like to
have commands with parameters, something like
\command[param][param]{body}, but I need to have something working first.

Thank you for any help you can provide.
Jeffrey.

Jason Dusek
  Each one of the little combinators seems to work as
  advertized. Are you having trouble fitting them together?

--
_jsn


module UnTeX where

import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Prim
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as T

data UnTeX
  = Command String [String] String
  | Text String
  | Paragraph
 deriving Show


 -- I don't remember TeX very well, so I'm not sure this is right.
command                     ::  Parser UnTeX
command                      =  do
  char '\\'
  cmd                       <-  ident
  p                         <-  orNot params []
  b                         <-  orNot body ""
  return $ Command cmd p b
 where
  params                     =  many1 $ between (char '[') (char ']') ident
  body                       =  do
    char '{'
    text                    <-  many1 $ noneOf "}"
    char '}'
    return text
  ident                      =  many1 $ letter <|> digit
  orNot p n                  =  choice [try p , return n]


paragraph                   ::  Parser UnTeX
paragraph                    =  do
  newline
  newline
  return $ Paragraph

text                        ::  Parser UnTeX
text                         =  do
  txt                       <-  many1 (alphaNum <|> space)
  return $ Text txt

Brent Yorgey-2
In reply to this post by Jeffrey Drake-2
What you have looks good.  Can you specifically describe the problems
you are having?

-Brent

Jeffrey Drake-2
In reply to this post by Jason Dusek
*This message was sent in reply to Jason Dusek but the reply went to him, not the list.

I plan to have Command String, CommandParams [String],
CommandParamsWithArgs [String] [String] or something to that effect. If
each of these is a combinator, then I imagine you can use <|> with them.
But realistically, I have no idea how to get this to take a string input
(or something like getContents) and have [UnTeX] come out the other end.
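
One way the "string in, [UnTeX] out" step could look is sketched below. It is only a sketch under some assumptions that are not in the original code: the names element, document and parseUnTeX are made up for illustration, text is redefined as "anything except a backslash or a paragraph break", and paragraph is wrapped in try so that a lone newline does not fail after consuming input.

```haskell
import Text.ParserCombinators.Parsec

data UnTeX
  = Command String
  | Text String
  | Paragraph
  deriving (Show, Eq)

-- A backslash followed by letters or digits, as in the original post.
command :: Parser UnTeX
command = do
  _ <- char '\\'
  cmd <- many1 (letter <|> digit) <?> "command"
  return (Command cmd)

-- Two consecutive newlines. 'try' lets a single newline backtrack
-- instead of failing halfway through.
paragraph :: Parser UnTeX
paragraph = try (newline >> newline) >> return Paragraph

-- Assumption: text is any run of characters other than a backslash,
-- where a newline only counts as text if no second newline follows it.
text :: Parser UnTeX
text = Text <$> many1 textChar
  where
    textChar = noneOf "\\\n" <|> try (newline <* notFollowedBy newline)

-- One element of the document (hypothetical name).
element :: Parser UnTeX
element = paragraph <|> command <|> text

-- The whole input: zero or more elements, then end of input.
document :: Parser [UnTeX]
document = many element <* eof

-- Hypothetical entry point: String in, [UnTeX] (or a parse error) out.
parseUnTeX :: String -> Either ParseError [UnTeX]
parseUnTeX = parse document "(input)"
```

With these definitions, something like getContents >>= print . parseUnTeX would read standard input and print either the parse error or the list of elements.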

The other problem I have is that while Command* can appear anywhere in
the text, everything else falls back to Text unless it is a Paragraph.
The spaces after a Command* up to the next letter have to be ignored, and
superfluous spaces within Text itself should be ignored too. I also can't
have two Paragraphs right beside each other, because that makes little
sense. So, from what I can guess, I need a lexer; I think it was called a
lexeme lexer.
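
The whitespace rules described above might be handled without a separate lexer pass, using a small lexeme-style helper. The following is a rough sketch, not a definitive answer. The helper names (softNewline, whitespace, lexeme, document) are hypothetical, and it assumes that single newlines count as ordinary whitespace, that text runs are normalized with words/unwords, and that a paragraph break swallows all whitespace after it so two Paragraphs never end up adjacent.

```haskell
import Text.ParserCombinators.Parsec

data UnTeX
  = Command String
  | Text String
  | Paragraph
  deriving (Show, Eq)

-- A newline that does not start a paragraph break.
softNewline :: Parser Char
softNewline = try (newline <* notFollowedBy newline)

-- Insignificant whitespace: spaces, tabs, and single newlines.
whitespace :: Parser ()
whitespace = skipMany (oneOf " \t" <|> softNewline)

-- Run a parser, then discard the insignificant whitespace after it.
lexeme :: Parser a -> Parser a
lexeme p = p <* whitespace

-- Spaces after the command name are dropped by 'lexeme'.
command :: Parser UnTeX
command = lexeme (Command <$> (char '\\' *> many1 (letter <|> digit)))

-- Two newlines start a break; any further whitespace (including more
-- newlines) is folded into the same break.
paragraph :: Parser UnTeX
paragraph = Paragraph <$ (try (newline *> newline) *> skipMany (oneOf " \t\n"))

-- words/unwords collapses runs of whitespace into single spaces and
-- trims both ends of the run.
text :: Parser UnTeX
text = Text . unwords . words <$> many1 (noneOf "\\\n" <|> softNewline)

-- Leading whitespace is skipped so no Text element is whitespace-only.
document :: Parser [UnTeX]
document =
  skipMany (oneOf " \t\n") *> many (paragraph <|> command <|> text) <* eof
```

Trimming both ends of every text run is a design choice made here for simplicity. A backend that needs the space between a word and a following command would have to reinsert it or use a different normalization.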

The syntax for Command* is like this:

\command
\command[arg]
\command[arg][arg][...]
\command[...]{body}
\command[...]{body1}{body2}{...}

This is necessary because you could have something like \frac{a}{b}. I
am trying to be more consistent in my use of this than LaTeX/TeX is.
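
The syntax listed above (a name, then any number of bracketed args, then any number of braced bodies) could be parsed roughly as follows. This is a sketch under assumptions: the three-field Command type is a hypothetical stand-in for the planned constructors, args and bodies are taken as raw strings, and nested braces or brackets are not handled.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical representation: name, [arg] list, {body} list.
data Command = Command String [String] [String]
  deriving (Show, Eq)

-- Command names: letters or digits, as in the earlier code.
ident :: Parser String
ident = many1 (letter <|> digit)

-- One bracketed argument, e.g. [arg]. Assumption: no nested brackets.
arg :: Parser String
arg = between (char '[') (char ']') (many (noneOf "]"))

-- One braced body, e.g. {body}. Assumption: no nested braces.
body :: Parser String
body = between (char '{') (char '}') (many (noneOf "}"))

-- \name, then zero or more args, then zero or more bodies.
command :: Parser Command
command = do
  _ <- char '\\'
  name <- ident
  args <- many arg
  bodies <- many body
  return (Command name args bodies)
```

Because many accepts zero occurrences, a bare \command, \command[arg][arg] and \frac{a}{b} all go through the same parser, so the separate constructors may not be needed.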

I might need to implement something like a table generator eventually,
but hopefully that would belong in the backend, because I would like to
translate this stuff into HTML and other outputs eventually.

Thank you for your help,
Jeffrey.

On Sat, 2008-11-08 at 01:00 -0800, Jason Dusek wrote:

> Each one of the little combinators seems to work as
>   advertized. Are you having trouble fitting them together?
>
> --
> _jsn