ANNOUNCE: streaming-conduit

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

ANNOUNCE: streaming-conduit

Ivan Lazar Miljenovic
I've recently found myself really enjoying using Michael Thompson's
[streaming] library for, well, streaming data.

However, a lot of packages that I want to use already use conduit for
all their streaming needs.  As such, I've just written the
[streaming-conduit] library to convert between the two (rather than
just switching this project to conduit entirely as I couldn't find a
simple way to just stream data from a PostgreSQL database).

I make no guarantees at this stage of performance, and it's quite
possible that - especially for the asStream and asConduit functions -
that they may not interoperate cleanly with monadic operations.

[streaming]: https://hackage.haskell.org/package/streaming
[streaming-conduit]: https://hackage.haskell.org/package/streaming-conduit

--
Ivan Lazar Miljenovic
[hidden email]
http://IvanMiljenovic.wordpress.com
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: ANNOUNCE: streaming-conduit

Olaf Klinke
Dear Ivan,

I'm confused. The package documentation states that the typical problem to avoid is extracting a long list from IO, which is bound to allocate too much memory. However, if I do

preprocess :: IO [a]
preprocess = fmap (map (parse :: String -> a) . lines) $ readFile filename

then this may create a huge pile of thunks if the rest of the program is not written in a streaming style. But if done right, lazy IO will make sure only the necessary part of the list is kept im memory. For example, a strict function like

postprocess :: [a] -> IO ()
postprocess = mapM (print . f)

with some strict f would fit the bill. Did I get the concept of lazy IO wrong? What is the operational semantics of the following fragment?

x:xs <- someIOaction :: IO [a]
Control.DeepSeq.force x

I used to think that xs is now a thunk which, when evalated further, may trigger more (read) IO actions. Hence, would behave as if someIOaction was one of your Streams. I do acknowledge, though, that the style above is easy to go wrong [*], whereas explicitly wrapping each list element in its own monadic action makes the intentions more verbose.

Cheers,
Olaf

[*] If someIOaction needs to touch every list element in order to ensure the result has the shape _:_ then we will indeed have a huge list of thunks.
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: ANNOUNCE: streaming-conduit

Ivan Lazar Miljenovic
On 12 June 2017 at 06:55, Olaf Klinke <[hidden email]> wrote:
> Dear Ivan,
>
> I'm confused. The package documentation states that the typical problem to avoid is extracting a long list from IO, which is bound to allocate too much memory.

Well, this is the whole point of libraries like streaming, conduit,
pipes, etc. but not specifically of streaming-conduit (which is just
to help two of these libraries interoperate).

> However, if I do
>
> preprocess :: IO [a]
> preprocess = fmap (map (parse :: String -> a) . lines) $ readFile filename
>
> then this may create a huge pile of thunks if the rest of the program is not written in a streaming style. But if done right, lazy IO will make sure only the necessary part of the list is kept im memory.

Lazy I/O works really well if you have just a single input stream that
you can read in and process elements of using an existing lazy I/O
function such as readFile then write out lazily as well.  However, it
does require the unsafeInterleaveIO hack in readFile to work properly.
I have done stuff like this in production software where great care
was taken to ensure it remained lazy and was able to make it stream
(even though we had existing streaming use in other parts of the
system).

In particular, the issue arises in that you have to be able to somehow
get the entire result of IO, treat it as a pure value, then do IO with
it again... but doing so lazily, which doesn't always work too well.

What these libraries do is make that streaming style explicit, so at a
low-level you specifically request the next chunk.  Taking your
preprocess example, the equivalent would be that:

1. You read in a file as a ByteString
2. This is converted to String a chunk at a time
3. This implicit stream of Strings (or possibly just Chars, depending
on the implementation) is converted into an explicit Stream of
line-based Strings (though 2 and 3 may be switched around)
4. You map the parsing function on each element of the stream.

However, at a higher level there are various combinators to take most
of this into account for you.  Streaming in particular tries to
emulate a list-based style, so that `IO [a]` is equivalent to `Stream
(Of a) IO r`.  The API lets you act like you are operating over a list
(and still doing it with function composition, which makes it
different from most of the other similar libraries) whilst taking care
of the IO for you.

This does mean you can't do as much of the "do IO to get value,
operate purely on the value then do IO to put it back out", but if you
write combinators to do the streaming without specifying which monad
it's operating in (and thus being unable to do any IO).

See writings like
https://stackoverflow.com/questions/5892653/whats-so-bad-about-lazy-i-o
and https://www.reddit.com/r/haskell/comments/380kmq/illustrating_the_problem_with_lazy_io/
for more

TL;DR: these kinds of libraries allow you to explicitly state how
values are streamed throughout your code rather than relying upon the
lazy I/O hacks (which if they work for you, go ahead; but if your code
gets too complex they start to fail).

> For example, a strict function like
>
> postprocess :: [a] -> IO ()
> postprocess = mapM (print . f)
>
> with some strict f would fit the bill. Did I get the concept of lazy IO wrong? What is the operational semantics of the following fragment?
>
> x:xs <- someIOaction :: IO [a]
> Control.DeepSeq.force x
>
> I used to think that xs is now a thunk which, when evalated further, may trigger more (read) IO actions. Hence, would behave as if someIOaction was one of your Streams. I do acknowledge, though, that the style above is easy to go wrong [*], whereas explicitly wrapping each list element in its own monadic action makes the intentions more verbose.
>
> Cheers,
> Olaf
>
> [*] If someIOaction needs to touch every list element in order to ensure the result has the shape _:_ then we will indeed have a huge list of thunks.



--
Ivan Lazar Miljenovic
[hidden email]
http://IvanMiljenovic.wordpress.com
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Loading...