Space leak when returning pairs?

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Space leak when returning pairs?

Shin-Cheng Mu
Dear members,

I am experiencing a space leak, which I suspect to be
an instance of the problem addressed by Wadler before.
I'd appreciate if someone here would take a look.

Given the following datatype:

  data XMLEvent = StartEvent String
                | EndEvent   String
                | TextEvent  String  deriving Show

where the three constructors represent the start tag
(<a> where a is a string), end tag (</a>), and text data,
respectively, of an XML stream. (They are actually from
the library HXML). The following function
simply returns the same stream while doing a minimal
amount of validation (ignoring the closing tag).

   idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
   idX [] = ([], [])
   idX (StartEvent a : strm) =
     let (ts, strm') = idX strm
         (us, strm'') = idX strm'
     in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
   idX (EndEvent _: strm) = ([], strm)
   idX (TextEvent s : strm) =
     let (ts, strm') = idX strm
     in (TextEvent s : ts, strm')

The function idX returns a pair, where the first component
is the processed stream, while the second component is the
rest of the input. The intention is to thread the input
and release processed events.

If the function is used in a pipelined manner:

    print . fst . idX . parseInstance $ input

where parseInstance :: String -> [XMLEvent] is a
lexical analyser, I would expect that the input stream will
be consumed one by one as the output is produced, and the
program would use a constant amount of heap space (while
keeping a program stack proportional to the depth of the
XML tree). This was not the case, however. For some reason
the heap grew linearly. The GHC profiler showed that most
of the thunks are created by parseInstance, which implies
that the entire input stream somehow still resides in memory.

I was wondering where the space leak came from and suspected
that it's the leak described in one of Philip Wadler's
early paper Fixing Some Space Leaks With a Garbage Collector (1987).
Consider

     idX (StartEvent a : strm) =
       let (ts, strm') = idX strm
           (us, strm'') = idX strm'
       in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

The body of the let clause might have actually been treated as

      (StartEvent a [] : ts ++
         EndEvent a : fst (idX (snd strm)),
       snd (idX (snd strm)))

Therefore strm will not be freed until the whole evaluation
finishes.

But since Wadler has addressed this problem a long time ago,
I think the fix he proposed should have been implemented in
GHC already. Was that really the source of the space leak?
If so is there a way to fix it? Or is there another reason
for the leak?

    *   *   *

The function idX above actually resulted from fusing two functions:
buildF parses a stream into a forest, while idF flattens the
tree.

    data ETree = Element String [ETree]
               | Text String

    buildF :: [XMLEvent] -> ([ETree], [XMLEvent])
    buildF [] = ([],[])
    buildF (StartEvent a : strm) =
      let (ts, strm') = buildF strm
          (us, strm'') = buildF strm'
      in (Element a ts : us, strm'')
    buildF (EndEvent _ : strm) = ([], strm)
    buildF (TextEvent s : strm) =
      let (ts, strm') = buildF strm
      in (Text s : ts, strm')

    idF :: [ETree] -> [XMLEvent]
    idF [] = []
    idF (Element a ts : us) =
      StartEvent a : idF ts ++ EndEvent a : idF us
    idF (Text s : ts) = TextEvent s : idF ts

My original program was like

     print . idF . fst . buildF . parseInstance $ input

and it had the same space leak.

By accident (assuming that the input is always a single tree),
I discovered that if I added a "wrap . head" into the pipe:

     print . idF . wrap . head . fst . buildF . parseInstance $ input

where wrap a = [a], the heap residency would reduce by half!
The output remains the same.

My explanation is that applying a "head" means that
(in the definition of buildF):

     buildF (StartEvent a : strm) =
      let (ts, strm') = buildF strm
          (us, strm'') = buildF strm'
      in (Element a ts : us, strm'')

the "us" part , which contains a reference to strm, need not
be kept. Thus the reduced heap residency.

This seems to support that the space leak resulted from
the same problem Wadler addressed before. But isn't
that solved already in GHC?

    *   *   *

I'd appreciate if someone could look into it. The actual program
is available at

    http://www.psdlab.org/~scm/hx.tar.gz

It actually uses HXML, a library by Joe (Joe English?) to
do the parsing. The main program is hxpc.hs. There is a
1 MB sized sample input file size1m.xml. There are two
simple scripts "recompile" and "runhxpc" for compiling
and running the program.

Thank you!

sincerely,
Shin-Cheng Mu


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Space leak when returning pairs?

Henning Thielemann

On Fri, 19 May 2006, Shin-Cheng Mu wrote:

>    idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
>    idX [] = ([], [])
>    idX (StartEvent a : strm) =
>      let (ts, strm') = idX strm
>          (us, strm'') = idX strm'
>      in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
>    idX (EndEvent _: strm) = ([], strm)
>    idX (TextEvent s : strm) =
>      let (ts, strm') = idX strm
>      in (TextEvent s : ts, strm')


let ~(ts, strm') = idX strm
    ~(us, strm'') = idX strm'

?

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Space leak when returning pairs?

Shin-Cheng Mu
Dear Henning,

On May 19, 2006, at 6:16 PM, Henning Thielemann wrote:
> On Fri, 19 May 2006, Shin-Cheng Mu wrote:
>>    idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
>>    idX (StartEvent a : strm) =
>>      let (ts, strm') = idX strm
>>          (us, strm'') = idX strm'
>>      in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
> let ~(ts, strm') = idX strm
>     ~(us, strm'') = idX strm'

Oh, I just tried using lazy patterns for the case for StartEvent
and TextEvent. But the space behaviour remained the same. :(

sincerely,
Shin-Cheng Mu
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Space leak when returning pairs?

Malcolm Wallace
In reply to this post by Henning Thielemann
Henning Thielemann <[hidden email]> wrote:

> let ~(ts, strm') = idX strm
>     ~(us, strm'') = idX strm'

Let-bindings are already lazy, so the ~ here is redundant.

Regards,
    Malcolm
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Space leak when returning pairs?

Shin-Cheng Mu
In reply to this post by Henning Thielemann

On May 19, 2006, at 6:16 PM, Henning Thielemann wrote:
> let ~(ts, strm') = idX strm
>     ~(us, strm'') = idX strm'

I seem to have found a partial solution to the problem.
It's rather ugly, however, and I think there should be
a better way.

The original definition for one of the clauses was like
this:

   idX (StartEvent a : strm) =
       let (ts, strm') = idX strm
           (us, strm'') = idX strm'
       in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

The intention was to thread strm through the recursive calls
and free the items in strm as soon as they are processed. However,
with this definition I needed about 8Mb of heap for a 1Mb input
file. Since I suspected that delayed evaluation of us and strm''
keeps the pointer to strm around for too long, the natural solution
seems to be forcing their evaluation. So I tried:

   idX (StartEvent a : strm) =
       case idX strm of
          (ts, strm') ->
              case idX strm' of
                (us, strm'') ->
                  (StartEvent a [] : ts ++ EndEvent a : us, strm'')

which made things worse. The heap size increased a bit more
and I get a typical "bell" shaped heap profile suggesting idX
cannot output anything before processing the whole input.

So I have to allow idX to output something while consuming
the input. What eventually worked was this messy mixture
of let and case:

   idX (StartEvent a : strm) =
     let (xs, rest) =
          case idX strm of
            (ts, strm') ->
               let (ws, rest) =
                     case idX strm' of
                       (us, strm'') -> (us, strm'')
               in (ts ++ EndEvent a : ws, rest)
     in (StartEvent a [] : xs, rest)

The heap residency reduced from about 8MB to around 80Kb.

This is rather tricky, however. The following "nicer looking"
version has a heap profile as bad as the worse case.

    idX (StartEvent a : strm) =
      let (ts,us,strm'') =
          case idXsp strm of
            (ts, strm') ->
                case idXsp strm' of
                  (us, strm'') -> (ts, us, strm'')
      in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

And I cannot quite explain why (Anyone?).

Is there a more structured solution? Can I leave the hard
work to the compiler?

sincerely,
Shin-Cheng Mu
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Space leak when returning pairs?

haskell-2
In reply to this post by Shin-Cheng Mu
Shin-Cheng Mu wrote:

> Dear members,
>
> I am experiencing a space leak, which I suspect to be
> an instance of the problem addressed by Wadler before.
> I'd appreciate if someone here would take a look.
>
> Given the following datatype:
>
>  data XMLEvent = StartEvent String
>                | EndEvent   String
>                | TextEvent  String  deriving Show
>
> where the three constructors represent the start tag
> (<a> where a is a string), end tag (</a>), and text data,
> respectively, of an XML stream. (They are actually from
> the library HXML). The following function
> simply returns the same stream while doing a minimal
> amount of validation (ignoring the closing tag).
>
>   idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
>   idX [] = ([], [])
>   idX (StartEvent a : strm) =
>     let (ts, strm') = idX strm
>         (us, strm'') = idX strm'
>     in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
>   idX (EndEvent _: strm) = ([], strm)
>   idX (TextEvent s : strm) =
>     let (ts, strm') = idX strm
>     in (TextEvent s : ts, strm')
>
> The function idX returns a pair, where the first component
> is the processed stream, while the second component is the
> rest of the input. The intention is to thread the input
> and release processed events.
>
> If the function is used in a pipelined manner:
>
>    print . fst . idX . parseInstance $ input
>

I would not have written idX in this manner.  My version is

> data XMLEvent = StartEvent String
>               | EndEvent   String
>               | TextEvent  String  deriving Show
>
> idX' :: [XMLEvent] -> [XMLEvent]
> idX' = untilTags []
>
> untilTags :: [String] -> [XMLEvent] -> [XMLEvent]
> untilTags [] [] = []
> untilTags tags [] = error ("Did not find closing EndEvents " ++ show tags)
> untilTags tags (x:xs) =
>   case x of
>     StartEvent tag' -> x : untilTags (tag':tags) xs
>     EndEvent tag' -> if null tags
>                        then error ("Unexpected EndEvent with tag " ++ show tag')
>                        else let (tag:tags') = tags
>                             in if tag == tag'
>                                  then x : untilTags tags' xs
>                                  else error ("Expected EndEvents " ++ show tags ++ " but found EndEvent " ++ show tag')
>     TextEvent _ -> x : untilTags tags xs

Here the flow of the input "x : ..." to the output "x : ..." is more obvious,
and the open tags are kept in a stack and used to check the closing tags.

The memory usage will be only slightly higher than your idX due to keeping the
list of open tag names.  If you want to remove that overhead and assume the
ending closing tags have correct strings, then you can keep a mere count of open
tags:

> countTags :: Int -> [XMLEvent] -> [XMLEvent]
> countTags 0 [] = []
> countTags n [] = error ("Did not find the closing EndEvents " ++ show n)
> countTags n (x:xs) =
>   case x of
>     StartEvent tag' -> x : countTags (succ n) xs
>     EndEvent tag' -> if 0 == n
>                        then error ("Unexpected EndEvent with tag " ++ show tag')
>                        else x : countTags (pred n) xs
>     TextEvent _ -> x : countTags n xs
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Space leak when returning pairs?

Malcolm Wallace
In reply to this post by Shin-Cheng Mu
Shin-Cheng Mu <[hidden email]> wrote:

> I was wondering where the space leak came from and suspected
> that it's the leak described in one of Philip Wadler's
> early paper Fixing Some Space Leaks With a Garbage Collector (1987).

> But since Wadler has addressed this problem a long time ago,
> I think the fix he proposed should have been implemented in
> GHC already. Was that really the source of the space leak?
> If so is there a way to fix it? Or is there another reason
> for the leak?

I compiled and profiled your code using nhc98 instead of ghc.  There was
no space leak visible.  Since nhc98 does implement the Wadler GC fix
(with refinements by Sparud), I can only assume that this does indeed
account for the space improvement over ghc.

Regards,
    Malcolm
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe