I'm trying to squeeze more performance out of Haskell, and I'm hoping
someone can help me understand why the following two pieces of code have
different performance. In particular, I find that the apparently 'more
parallel' version of the code performs worse.
Quick description of the code: basically it solves the following problem
[https://www.spoj.com/problems/PALIN/]. The input is a set of huge
integer numbers. For each number, print the smallest 'palindrome' that
is larger. A palindrome is a number for which when you reverse its
digits you end up with the same number.
The code uses 'parBuffer' to do the calculations in parallel. Below is
high-level description of the steps of each version:
Version 1 [palin.parfromstring.hs]:
1. Read the input numbers as strings into a list
2. Perform the following calculation for each number (the steps for this
calculation run in sequence, the entire calculation runs in parallel
w.r.t. different numbers):
2.1 Convert from string to array of digits
2.2 Calculate the palindrome in-place
The result of step 2 is a list of digit-arrays. That concludes the
3. For each palindrome, convert it to a string and print it.
Version 2 [palin.parfromstringtostring.hs]:
1. Same as above
2.1 Same as above
2.2 Same as above
2.3 Convert the palindrome to a string
3. For each string, print it.
I'd expect that version 2 is faster than version 1 because it's doing
the conversion back to string also in parallel, but on a set of 50
numbers it's surprisingly ~750ms slower on my computer. I see that the
GC is doing a lot more copying for version 2 (1GB more than version 1),
but I don't really understand what I'm looking it. :)
How can one best think about getting more performance out of parallelism
in Haskell? As in, what are some good rules of thumb by which one can
decide 'parallelize this, don't parallelize that'?
Attached you can find the event logs for ThreadScope and full code for
each version. The code is almost identical, so attached is also the diff
between the two.