Splitting large files

Colin Woodbury
Hi all, I've hit a problem that feels like it has a straightforward answer.

I have a large XML file that I'd like to split up into subfiles of roughly equal size. My first pass looks like:

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE QuasiQuotes  #-}

import qualified Data.Text as T
import           Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Prelude.Text as PT
import           Pipes.Safe
import           Text.Printf.TH

-- | Streams lines from the source file.
-- This drops the first three lines, which are not Elements.
xml :: MonadSafe m => FilePath -> Producer T.Text m ()
xml fp = PT.readFileLn fp >-> P.drop 3

-- | Stream an entire OSM Element with all its children (tags, etc).
-- Note: OSM XML is rather flat - children never have children.
element :: Monad m => Pipe T.Text [T.Text] m ()
element = undefined  -- `await` lines until you find a closing tag, then pack as a List and `yield`?

-- | Writes one legal @<osm> ... </osm>@ block.
osm :: Int -> Consumer [T.Text] (SafeT IO) ()
osm !fpn = do
  let fp = [s|catalog/out-%d.osm|] fpn
  P.take 1000 >-> P.concat >-> PT.writeFileLn fp
  osm $ fpn + 1

splitAll :: Effect (SafeT IO) ()
splitAll = xml "somefile" >-> element >-> osm 0

My intent is to stream groups of 1000 XML elements (and their children) to separate files. Luckily the XML in question is only ever one layer deep, like:

<foo>
  <bar/>
  <bar/>
  <bar/>
</foo>

So this `foo` group here would count as 1 written element, not 5.
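
The counting rule can be pinned down with a small sketch over plain lists, before any streaming library is involved. It assumes one tag per line and that `<foo` is the only top-level tag; `startsGroup`, `groupElements` and `chunksOfGroups` are hypothetical names made up for illustration, not part of any library.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf)

-- | Assumed rule: a new group starts at every top-level opening tag.
-- Here the only top-level tag is taken to be @<foo@.
startsGroup :: String -> Bool
startsGroup line = "<foo" `isPrefixOf` dropWhile isSpace line

-- | Split lines into groups. Each group runs from one line up to, but not
-- including, the next line accepted by 'startsGroup'.
groupElements :: [String] -> [[String]]
groupElements [] = []
groupElements (l : ls) = (l : children) : groupElements rest
  where
    (children, rest) = break startsGroup ls

-- | Bundle every @n@ groups into one output chunk (@n@ must be positive).
chunksOfGroups :: Int -> [[String]] -> [[String]]
chunksOfGroups _ [] = []
chunksOfGroups n gs = concat now : chunksOfGroups n later
  where
    (now, later) = splitAt n gs
```

With this rule the five-line `foo` block above is a single group, and `chunksOfGroups 1000 . groupElements` would yield the contents of one output file per chunk.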

What stands out right away is the type signature of `element`. The `[T.Text]` feels very unidiomatic, but I couldn't think of another way to group all the parent and child nodes together such that `osm` would know it had processed 1000 such groups.
I read the `pipes-group` tutorial, but it wasn't immediately clear to me whether that's what I needed. I do know that any given `<foo>` parent can have at most a few hundred `<bar>` children, but that still breaks the output streaming while I wait for the list to populate.

Question: How can I structure things such that `osm` knows when to start writing to a new file?

Thanks

PS. As currently defined, `osm` probably also ends up in an infinite loop.

--
Re: Splitting large files

Gabriel Gonzalez

Yes, the `pipes-group` library is what you need.  I recommend reading this if you haven't done so already:

http://www.haskellforall.com/2013/09/perfect-streaming-using-pipes-bytestring.html

... so the type of `element` would become:

    element :: Monad m => Producer Text m r -> FreeT (Producer Text m) m r

The result is a "stream of streams" that preserves the lazy streaming behavior of each sub-stream so that you don't have to wait to collect all children before processing them.
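
To make the "stream of streams" shape concrete, here is a minimal hand-rolled sketch of a function with that type, built only on `next` from `pipes` and `FreeT` from the `free` package. It is not the `pipes-group` implementation (that library provides ready-made lenses such as `groupsBy`). The names `startsElement`, `restOfElement` and `splitGroups` are hypothetical, and the tag list assumes the top-level OSM elements are `node`, `way` and `relation`, one tag per line.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad.Trans.Free (FreeF (..), FreeT (..))
import qualified Data.Text as T
import           Pipes

-- | Assumption: a line opens a new top-level element if it starts with one
-- of these tags (leading whitespace ignored).
startsElement :: T.Text -> Bool
startsElement l = any (`T.isPrefixOf` T.stripStart l) ["<node", "<way", "<relation"]

-- | Each layer of the result streams one element with its children, then
-- returns the rest of the stream. No lines are buffered.
element :: Monad m => Producer T.Text m r -> FreeT (Producer T.Text m) m r
element p = FreeT $ do
  x <- next p
  return $ case x of
    Left r          -> Pure r
    Right (l, rest) -> Free (fmap element (yield l >> restOfElement rest))

-- | Yield lines until the next top-level element begins, then return the
-- remaining producer with that opening line pushed back in front.
restOfElement :: Monad m => Producer T.Text m r -> Producer T.Text m (Producer T.Text m r)
restOfElement p = do
  x <- lift (next p)
  case x of
    Left r -> return (return r)
    Right (l, rest)
      | startsElement l -> return (yield l >> rest)
      | otherwise       -> yield l >> restOfElement rest

-- | Stream the lines of the first @n@ groups and return the remaining groups.
splitGroups :: Monad m => Int -> FreeT (Producer a m) m r -> Producer a m (FreeT (Producer a m) m r)
splitGroups n f
  | n <= 0    = return f
  | otherwise = do
      x <- lift (runFreeT f)
      case x of
        Pure r -> return (return r)
        Free p -> p >>= splitGroups (n - 1)
```

A writer loop could then run `splitGroups 1000` into a fresh file, take the returned remainder, and repeat until `runFreeT` on the remainder gives `Pure`. That is the point at which a consumer knows to move on to the next file.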

You should also check out the `streaming` library, which is more specialized to this particular use case:

https://hackage.haskell.org/package/streaming
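
For comparison, a rough sketch of the same split with `streaming` might look like the following. It is only an illustration under the same assumptions (one tag per line; top-level OSM elements are `node`, `way` and `relation`) and not the solution reported later in this thread. `startsElement`, `elements`, `fileChunks` and `writeFiles` are hypothetical names. The sketch does not add the surrounding `<osm> ... </osm>` wrapper.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import           Streaming
import qualified Streaming.Prelude as S
import           System.IO (IOMode (WriteMode), withFile)

-- | Assumption: these tags open a top-level element.
startsElement :: T.Text -> Bool
startsElement l = any (`T.isPrefixOf` T.stripStart l) ["<node", "<way", "<relation"]

-- | One inner stream per element (parent line plus its children).
elements :: Monad m => Stream (Of T.Text) m r -> Stream (Stream (Of T.Text) m) m r
elements str = effect $ do
  e <- S.next str
  return $ case e of
    Left r          -> return r
    Right (l, rest) ->
      wrap (fmap elements (S.yield l >> S.span (not . startsElement) rest))

-- | One inner stream per output file: the lines of @n@ consecutive elements.
fileChunks :: Monad m => Int -> Stream (Of T.Text) m r -> Stream (Stream (Of T.Text) m) m r
fileChunks n = maps concats . chunksOf n . elements

-- | Write each chunk to its own file, numbering files from 0.
writeFiles :: (Int -> FilePath) -> Stream (Stream (Of T.Text) IO) IO r -> IO r
writeFiles name = go 0
  where
    go i s = do
      e <- inspect s
      case e of
        Left r     -> return r
        Right file -> do
          -- Lines are written as they arrive; the result is the remaining chunks.
          rest <- withFile (name i) WriteMode $ \h -> S.mapM_ (TIO.hPutStrLn h) file
          go (i + 1) rest
```

Here `chunksOf` counts functor layers rather than lines, so each chunk holds 1000 whole elements regardless of how many children they have.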


On 01/06/2017 05:00 PM, Colin Woodbury wrote:
> [...]

--
Re: Splitting large files

Colin Woodbury
Many thanks Gabriel, `streaming` was definitely the way to go. I've landed on a nice solution that parses 3 GB of XML in about 20 s.
