Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?


Compl Yue
Hi Devs & Cafe,

I'd like to report back my progress on this. I've reached a rough conclusion; TL;DR:

> For data-intensive workloads, the per-chip cache of x86_64 CPUs is a hardware bottleneck: it is very hard to scale up by adding cores, so long as those cores share the cache on a single chip.

For the details -

I developed a minimal script interpreter for diagnostic purposes, depending only on libraries bundled with GHC. The source repository is at: https://github.com/complyue/txs

I benchmarked it on my machine with a single 6-core Xeon E5 CPU, measuring contention-free read/write performance scaling, and got these numbers: https://github.com/complyue/txs/blob/master/results/baseline.csv


populate:

  conc   thread avg tps   scale   eff
  1      1741             1.00    1.00
  2      1285             1.48    0.74
  3      1028             1.77    0.59
  4      843              1.94    0.48
  5      696              2.00    0.40
  6      600              2.07    0.34

scan:

  conc   thread avg tps   scale   eff
  1      1565             1.00    1.00
  2      1285             1.64    0.82
  3      1018             1.95    0.65
  4      843              2.15    0.54
  5      696              2.22    0.44
  6      586              2.25    0.37

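For reference, the scale and eff columns are derivable from the per-thread avg tps: scale is the aggregate speedup over one thread, and eff is that speedup per core. A minimal sketch of the arithmetic, using the populate figures above (nothing else is assumed):

```haskell
-- Derive "scale" and "eff" from per-thread avg tps:
--   scale n = n * tps_n / tps_1   (aggregate speedup over 1 thread)
--   eff   n = scale n / n         (per-core efficiency)
import Text.Printf (printf)

scaleEff :: Double -> (Int, Double) -> (Double, Double)
scaleEff tps1 (n, tpsN) =
  let s = fromIntegral n * tpsN / tps1
  in (s, s / fromIntegral n)

main :: IO ()
main = do
  let populate = [(1, 1741), (2, 1285), (3, 1028), (4, 843), (5, 696), (6, 600)]
                   :: [(Int, Double)]
      tps1 = snd (head populate)
  mapM_
    (\(n, tps) ->
       let (s, e) = scaleEff tps1 (n, tps)
       in printf "%d  %.0f  %.2f  %.2f\n" n tps s e)
    populate
```

Running it reproduces the populate rows of the table, e.g. `6  600  2.07  0.34`.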
ghc --make -Wall -threaded -rtsopts -prof -o txs -outputdir . -stubdir . \
    -i../src ../src/Main.hs && (
  ./txs +RTS -N10 -A32m -H256m -qg -I0 -M5g -T -s <../scripts/"${SCRIPT}".txs
)
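The hot-TVar contention this benchmark exercises can be demonstrated in miniature with nothing beyond GHC-bundled libraries. This is a generic sketch (thread and iteration counts are arbitrary), not the txs benchmark itself:

```haskell
-- Minimal hot-TVar contention probe: every worker thread atomically
-- increments the same shared TVar, so transactions conflict constantly.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forM, forM_, replicateM_)

hammer :: Int -> Int -> IO Int
hammer nThreads nIters = do
  counter <- newTVarIO (0 :: Int)
  dones <- forM [1 .. nThreads] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      replicateM_ nIters $
        atomically $ modifyTVar' counter (+ 1)  -- strict update on the hot TVar
      putMVar done ()
    return done
  forM_ dones takeMVar  -- wait for all workers to finish
  readTVarIO counter

main :: IO ()
main = hammer 4 10000 >>= print  -- STM loses no updates: prints 40000
```

Compiled with -threaded and run with +RTS -N, timing this under increasing thread counts shows the same sub-linear scaling pattern as the table above.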

I intended to use a single Haskell-based process, acting as a centralized graph database, to handle metadata about the many ndarrays being crunched. As it turned out, many clients queued to query/insert metadata against a single database node create more data throughput than a few CPU chips can handle well. We didn't expect this, but apparently we'll have to deploy more machines for such a database instance, with data partitioned and distributed across more nodes for load balancing. (A single machine with many CPU sockets, and thus many NUMA nodes, is not an option for us either.) Meanwhile, the flexibility a central graph database would provide is not currently a crucial requirement of our business, so we are not interested in developing this database system further.
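For the partitioning direction mentioned above, a minimal hash-based routing sketch; the key format and node count are made up for illustration, not our actual schema:

```haskell
-- Route each metadata key to one of nNodes database nodes by hashing.
-- A trivial polynomial string hash stands in for a real hash function.
partitionKey :: Int -> String -> Int
partitionKey nNodes key = h `mod` nNodes  -- `mod` keeps the result in [0, nNodes)
  where
    h = foldl (\acc c -> acc * 31 + fromEnum c) 5381 key

main :: IO ()
main =
  mapM_
    (\k -> putStrLn (k ++ " -> node " ++ show (partitionKey 4 k)))
    ["ndarray/prices", "ndarray/volumes", "ndarray/returns"]
```

Each client would route a request to the node owning the key's partition, spreading the write load that a single node couldn't absorb.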

We currently have CPU-intensive workloads handled by a cluster of machines running Python processes (crunching numbers with NumPy and C++ tensors), while some Haskell-based number-crunching software is still under development. It may turn out, some day in the future, that heavier computation becomes bound to the db access, effectively turning the database functionality into a CPU-intensive workload; then we'll have the opportunity to dive deeper into the database implementation. And if more flexibility is required in the near future, I think I'll tend toward embedding database instances in those worker processes, rather than running centralized db servers.

I wonder whether ARM servers would make scaling up data-intensive workloads easier, though that isn't a feasible option for us in the near term either.

Thanks to everyone who has been helpful!

Best regards,
Compl


On 2020-07-31, at 22:35, YueCompl via Haskell-Cafe <[hidden email]> wrote:

Hi Ben,

Thanks as always for your great support! At the moment I'm working on a minimal working example to reproduce the symptoms. I intend to produce a program that depends only on libraries bundled with GHC, so it can be diagnosed easily without my complex environment, but so far no repro yet. I'll post some code once it reproduces something.

Thanks in advance.

Sincerely,
Compl


On 2020-07-31, at 21:36, Ben Gamari <[hidden email]> wrote:

Simon Peyton Jones via Haskell-Cafe <[hidden email]> writes:

Compl’s problem is (apparently) that execution becomes dominated by
GC. That doesn’t sound like a constant-factor overhead from TVars, no
matter how efficient (or otherwise) they are. It sounds more like a
space leak to me; perhaps you need some strict evaluation or
something.
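A generic sketch of the lazy-update pattern described above (not the actual code in question): modifyTVar leaves an unevaluated thunk chain in the TVar, while modifyTVar' forces each new value at commit time, so the two differ only in heap behavior:

```haskell
import Control.Concurrent.STM
import Control.Monad (replicateM_)

main :: IO ()
main = do
  tv <- newTVarIO (0 :: Int)
  -- Lazy: each update stores another (+1) thunk; the chain is only
  -- forced when the value is finally demanded, so the heap grows.
  replicateM_ 100000 $ atomically $ modifyTVar tv (+ 1)
  -- Strict: modifyTVar' evaluates the new value before committing,
  -- keeping the TVar's content a single evaluated Int.
  replicateM_ 100000 $ atomically $ modifyTVar' tv (+ 1)
  readTVarIO tv >>= print  -- 200000 either way; only heap residency differs
```

Heap profiling (+RTS -hc with a -prof build) makes the difference between the two halves visible.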

My point is only: before re-engineering STM it would make sense to get
a much more detailed insight into what is actually happening, and
where the space and time is going. We have tools to do this (heap
profiling, Threadscope, …) but I know they need some skill and insight
to use well. But we don’t have nearly enough insight to draw
meaningful conclusions yet.

Maybe someone with experience of performance debugging might feel able
to help Compl?

Compl,

If you want to discuss the issue feel free to get in touch on IRC. I
would be happy to help.

It would be great if we had something of a decision tree for performance
tuning of Haskell code in the users guide or Wiki. We have so many tools
yet there isn't a comprehensive overview of

1. what factors might affect which runtime characteristics of your
  program
2. which tools can be used to measure which factors
3. how these factors can be improved

Cheers,

- Ben
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.



_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs