Regular, shape-polymorphic, parallel arrays in Haskell


Don Stewart summarised the state of play of parallel programming in Haskell at "ACM Reflections | Projections 2009". He covers strategies, Concurrent Haskell, STM, and Data Parallel Haskell: http://donsbot.wordpress.com/2009/10/17/multicore-haskell-now-acm-reflections-projections-2009/
Here is a video of Roman Leshchinskiy's talk on our work on implementing Data Parallel Haskell (DPH), which he presented at the Haskell Implementors' Workshop (co-located with ICFP'09):

These graphs summarise the performance of Data Parallel Haskell for three simple benchmarks on two different architectures, both of which have 8 cores. At the top, we have 8 cores with one hardware thread each, in the form of two Quad-Core 3GHz Xeon processors. At the bottom, we have 8 cores with 8 hardware threads each (for a total of 64 hardware threads), in the form of a single 1.4GHz UltraSPARC T2.
The benchmarks are the following: (1) sumsq computes the parallel sum of the squares from 1 to 10 million; (2) dotp computes the dot product of two dense vectors of 100 million double-precision floating-point numbers; and (3) smvm multiplies a sparse matrix with 10 million non-zero double-precision floating-point elements with a dense vector.
The graphs show the speedup of Data Parallel Haskell with respect to a sequential C implementation of each benchmark – whenever, a curve climbs above 1 on the y-axis, the parallel Haskell program beats the sequential C program in absolute runtime.
The scalability of these three programs in Data Parallel Haskell is very good, although dotp is limited by memory bandwidth on the Quad-Core Xeon processors, as discussed in my previous post. The grey curve is a parallel C implementation of dotp that is bandwidth limited in the same manner.

This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. The performance is memory bound. The DPH code is
dotp' :: [:Double:] -> [:Double:] -> Double dotp' v w = D.sumP (zipWithP (*) v w)
Interesting is that despite the much lower single-thread performance, the T2 makes it all up in excellent multi-threading performance. Interesting to Haskell folks is that the corresponding multi-threaded C code using pthreads is much harder to write and barely any faster when using all parallelism (the Sun C-Compiler still manages to reduce the runtime by about 30% with 64 threads, but on the Xeons, there is no significant difference).
NB: This uses the latest development version of GHC and DPH. Thanks to Ben Lippmeier for his nice work on the SPARC backend of GHC.
http://www.cse.unsw.edu.au/~pls/damp09/programme.html
John Reppy talked about The Manticore Project and Ulf Wiger talked about Erlang Programming for Multi-core.