a better way to do this

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

a better way to do this

Michael Mossey
This is an early attempt to create some kind of parser, for text that is
xml-like but not actually xml. This is probably a disaster by Haskell
standards... If someone could point me in the direction of a better way of doing
things, that would be great. I don't want to use the existing parser library,
not at first, because I want to learn more from first principles (for now).

The input looks something like:

<entry>
   <field1> thingy </field1>
   <field2> other thingy </field2>
</entry>
<entry>
   ...
</entry>
...

where there are any number of entries. Each entry consists of a varied number of
named fields. For now, the named fields can be anything--any names, in any
order. Later I'll do sanity checking to ensure the right fields are there, to
provide default values, etc.

This uses Data.Bytestring.Char8 for efficiency processing large files.

The output types are as follows:

type Bs = B.ByteString  -- alias
type Component = ( Bs, Bs ) -- one named field
type Entry = [ Component ] -- all named fields in one entry
type Doc = [ Entry ] -- all entries in the input document

The basic strategy is to create parsing functions, which take in a string
(actually ByteString), and return an object, the remainder of the string, and an
index. (The index indicates the position of the first character in the remainder
of the string, which is useful for giving error messages.)

Top level function is called parseReqs ( "parse requirements" -- this is
actually going to be used for a software requirements management project).

Here's the rest of the code:

-- types for regular expression matching
type Re3 = ( Bs, Bs, Bs )
type Re4 = ( Bs, Bs, Bs, [ Bs ] )

parseReqs :: Bs -> ( Doc, Bs, Int )
parseReqs buf = parseReqs' buf 0
parseReqs' :: Bs -> Int -> ( Doc, Bs, Int )
parseReqs' buf idx
   | B.null buf = ( [], buf, idx )
   | otherwise = case parseEntry buf idx  of
     (Just e, rem, remIdx)   ->
       let ( doc, rem', remIdx' ) = parseReqs' rem remIdx
       in ( e : doc, rem', remIdx' )
     (Nothing, rem, remIdx ) -> ( [], rem, remIdx )

parseEntry :: Bs -> Int -> ( Maybe Entry, Bs, Int )
parseEntry buf idx =
   let ( before, match, after ) = buf =~ "<entry>" :: Re3
       idx' = idx + B.length before + B.length match
   in if B.null match
        then ( Nothing, after, idx' )
        else let ( e, after', idx'' ) = parseEntryBody after idx'
             in ( Just e, after', idx'' )

parseEntryBody :: Bs -> Int -> ( Entry, Bs, Int )
parseEntryBody buf idx =
   let ( before, match, after ) = buf =~ "</entry>" :: Re3
       idx' = idx + B.length before + B.length match
   in if B.null match
        then error "Missing </entry>"
        else
         -- Note: index passed to parseEntryComponents is same as one passed
         -- into this function, because we pass 'before' to
         -- parseEntryComponents. Index returned from from this function is
         -- the one calculated above to occur at the start of 'after'
         ( parseEntryComponents before idx, after, idx' )

parseEntryComponents :: Bs -> Int -> Entry
parseEntryComponents buf idx =
   let ( before, match, after, groups ) = buf =~ B.pack "<([^>]+)>" :: Re4
       idx' = idx + B.length before + B.length match
   in if B.null match
        then []
        else
          let ( component, buf', idx'' ) =
                parseCompBody (head groups) after idx'
              components = parseEntryComponents buf' idx''
          in component : components

parseCompBody :: Bs -> Bs -> Int -> ( Component, Bs, Int )
parseCompBody compName buf idx =
   let ( before, match, after ) = buf =~
         (B.pack "</" `mappend` compName `mappend` B.pack ">") :: Re3
       idx' = idx + B.length before + B.length match
   in if B.null match
      then error ("No ending to component " ++ B.unpack compName)
      else ( ( compName, before ), after, idx' )
Reply | Threaded
Open this post in threaded view
|

a better way to do this

Michael Mossey
Michael P Mossey wrote:
> This is an early attempt to create some kind of parser, for text that is
> xml-like but not actually xml. This is probably a disaster by Haskell
> standards... If someone could point me in the direction of a better way
> of doing things, that would be great. I don't want to use the existing
> parser library, not at first, because I want to learn more from first
> principles (for now).
>

To follow up my own post, I'm working through chapter 10 of "Real World Haskell"
right now, and then I'll look at chapter 14 (Monads). I think this is a better
way of creating parsers. Maybe I'll look at Parsec eventually too, although I
want to do a lot of stuff myself for the learning exprience.

Thanks,
Mike
Reply | Threaded
Open this post in threaded view
|

Re: a better way to do this

Heinrich Apfelmus
Michael P Mossey wrote:

> Michael P Mossey wrote:
>> This is an early attempt to create some kind of parser, for text that
>> is xml-like but not actually xml. This is probably a disaster by
>> Haskell standards... If someone could point me in the direction of a
>> better way of doing things, that would be great. I don't want to use
>> the existing parser library, not at first, because I want to learn
>> more from first principles (for now).
>>
>
> To follow up my own post, I'm working through chapter 10 of "Real World
> Haskell" right now, and then I'll look at chapter 14 (Monads). I think
> this is a better way of creating parsers. Maybe I'll look at Parsec
> eventually too, although I want to do a lot of stuff myself for the
> learning exprience.

I don't think that doing it "the hard way by hand" is very enlightening,
especially since you are using regular expressions anyway. (By the way,
the latter are a niche in Haskell; any serious and also most non-serious
parsing tasks are best done with parser combinators. The popularity of
regular expressions in languages like Python or Perl is mainly due to
the lack of more general parsing abstractions in these languages.)

A much more rewarding introduction to parsing and parser combinators is
probably to understand and experiment with their implementations, for
instance by starting with

    Graham Hutton, Erik Meijer.
    Monadic Parser Combinators.
    http://www.cs.nott.ac.uk/~gmh/monparsing.ps




Regards,
apfelmus

--
http://apfelmus.nfshost.com