Fusion & Laziness

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Fusion & Laziness

Hilco Wijbenga
Hi all,

I'd like to have a better understanding of fusion and (maybe?) laziness.

Let's say I have an (exported) type "data EndResult = ... !(Vector
Thing) ..." and an intermediate, unexported type "data
IntermediateResult = ... !(Vector Thing) ...". (I suppose it doesn't
need to be Vector, it could be Set, or some other data structure that
(I imagine) is relatively expensive to map over, unlike, say [].)

[To start off, I want to state my, possibly incorrect, understanding
of fusion and how it does not (in my expectation) apply to Vector,
Set, et cetera. A List can essentially disappear as it's replaced by a
loop but a Vector would not: the executing code would create and
garbage collect multiple intermediate Vectors before finally returning
the end result.]

Assume that I need my algorithm to go from initial input to
IntermediateResult to subsequent IntermediateResult (a few times) to
EndResult. In my case, each subsequent IntermediateResult is a bit
smaller than the previous one but that's probably irrelevant.

Should I prefer IntermediateResult to be lazy? Should I use [] instead
of Vector in the IntermediateResult? What about the functions that
actually operate on IntermediateResult, should I prefer to use [] or
Vector there? I'm currently able to use Data.Vector.concatMap in some
places, is that just as optimized?

I realize that the correct answer more than likely is "don't worry
about it". And that I'm being very vague. :-) I'm not looking for
anything definitive, I'm just hoping to improve my understanding and
intuition.

What should I consider when thinking about these types of things? If I
don't want to create two separate implementations and profile them,
are there "obvious" signs one way or another?

Cheers,
Hilco
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Fusion & Laziness

Alexis King
Disclaimer: I am not an expert on fusion.

I think it would help you to understand a little bit about how fusion works, since that makes it easier to develop an intuition for what will/won’t get fused. Here’s a crash course: fusion is implemented via rewrite rules, which are user-defined code transformation rules applied by the GHC optimizer. The essence of list fusion is a single rewrite rule defined in the standard library, which rewrites all expressions of the shape foldr k z (build g) to g k z, where build is a function exported by GHC.Exts with the following simple definition:

build g = g (:) []

How do you get fusion from that? Here’s the trick: fusable functions that consume lists are implemented using foldr, while ones that build lists are implemented using build. For example, sum could be implemented using foldr like this:

sum = foldr (+) 0

…and map (which both consumes and builds) could be expressed like this:

map f xs = build (\cons nil -> foldr (\x ys -> f x `cons` ys) nil xs)

Now if you write sum (map f xs), and sum and map are both inlined, you’ll get

foldr (+) 0 (build (\cons nil -> foldr (\x ys -> f x `cons` ys) nil xs))

which matches the rewrite rule from above and gets rewritten into this:

foldr (\x ys -> f x + ys) 0 xs

Like magic, the intermediate list has disappeared!

The details are a little more complicated than that in practice, but that’s the core idea. Keeping that context in mind, let me answer some of your questions directly.

On Sep 16, 2019, at 21:44, Hilco Wijbenga <[hidden email]> wrote:
[To start off, I want to state my, possibly incorrect, understanding
of fusion and how it does not (in my expectation) apply to Vector,
Set, et cetera. A List can essentially disappear as it's replaced by a
loop but a Vector would not: the executing code would create and
garbage collect multiple intermediate Vectors before finally returning
the end result.]

Since fusion is implemented via rewrite rules, which can be defined by anyone, operations on library-defined datatypes can be fused if the library author provides the necessary rewrite rules. Other datatypes use different fusion strategies than the foldr/build fusion mentioned above, but the core ideas are similar.

For the specific two types you mentioned, Vector operations do happen to be fused, while Set operations are not. Set operations are hard to fuse because duplicates need to be removed from the stream, and determining if an element is already in the set fundamentally requires a data structure. You could, however, always convert a set to a list, transform the list with a sequence of fusable operations, and turn it back into a set.

Assume that I need my algorithm to go from initial input to
IntermediateResult to subsequent IntermediateResult (a few times) to
EndResult. In my case, each subsequent IntermediateResult is a bit
smaller than the previous one but that's probably irrelevant.
 
Should I prefer IntermediateResult to be lazy? Should I use [] instead
of Vector in the IntermediateResult? What about the functions that
actually operate on IntermediateResult, should I prefer to use [] or
Vector there? I'm currently able to use Data.Vector.concatMap in some
places, is that just as optimized?

Remember that fusion is based on rewrite rules, which are fundamentally transformations on the code. Therefore, what matters most is the structure of the code you have written, not what values are flowing around your program at runtime. If a list produced by build ends up being passed to foldr, but GHC couldn’t inline definitions enough to see the syntactic pattern foldr k z (build g), then fusion can’t happen.

Therefore, if you want fusion to happen, make sure that you use standard list operations to construct or consume lists (e.g. anything in Prelude or Data.List), not manually-written recursive functions that pattern-match on lists (GHC doesn’t know how to fuse those). Make sure GHC is able to inline things enough it will be able to discover chains of fusable operations. If you’re really worried about it, learn to read the GHC Core that can be dumped by the optimizer; this blog post is a good overview of all of these concepts. But your gut is right: you probably just shouldn’t worry about it.

Alexis


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Fusion & Laziness

Shao Cheng
In reply to this post by Hilco Wijbenga
Hi Hilco,

To guarantee no intermediate Vectors are ever allocated, you can use the "Bundle" type in https://hackage.haskell.org/package/vector-0.12.0.3/docs/Data-Vector-Fusion-Bundle.html for intermediate values, and only convert the Bundle to a list or vector in the final step. In fact, most interfaces in the vector package are wrapped with "stream" and "unstream" calls under the hood, using Bundle to do the actual transformation, but still exposing the Vector type as input/output, and the composed stream/unstream calls are eliminated by the ghc rewrite rules.

Cheers,
Cheng

On Tue, Sep 17, 2019 at 10:45 AM Hilco Wijbenga <[hidden email]> wrote:
Hi all,

I'd like to have a better understanding of fusion and (maybe?) laziness.

Let's say I have an (exported) type "data EndResult = ... !(Vector
Thing) ..." and an intermediate, unexported type "data
IntermediateResult = ... !(Vector Thing) ...". (I suppose it doesn't
need to be Vector, it could be Set, or some other data structure that
(I imagine) is relatively expensive to map over, unlike, say [].)

[To start off, I want to state my, possibly incorrect, understanding
of fusion and how it does not (in my expectation) apply to Vector,
Set, et cetera. A List can essentially disappear as it's replaced by a
loop but a Vector would not: the executing code would create and
garbage collect multiple intermediate Vectors before finally returning
the end result.]

Assume that I need my algorithm to go from initial input to
IntermediateResult to subsequent IntermediateResult (a few times) to
EndResult. In my case, each subsequent IntermediateResult is a bit
smaller than the previous one but that's probably irrelevant.

Should I prefer IntermediateResult to be lazy? Should I use [] instead
of Vector in the IntermediateResult? What about the functions that
actually operate on IntermediateResult, should I prefer to use [] or
Vector there? I'm currently able to use Data.Vector.concatMap in some
places, is that just as optimized?

I realize that the correct answer more than likely is "don't worry
about it". And that I'm being very vague. :-) I'm not looking for
anything definitive, I'm just hoping to improve my understanding and
intuition.

What should I consider when thinking about these types of things? If I
don't want to create two separate implementations and profile them,
are there "obvious" signs one way or another?

Cheers,
Hilco
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.