Just Testing RSS

Random ramblings by me.

Archive

Jul
3rd
Fri
permalink
May
30th
Sat
permalink
May
11th
Mon
permalink
Mar
17th
Tue
permalink
Mar
16th
Mon
permalink
Mar
10th
Tue
permalink
These graphs summarise the performance of Data Parallel Haskell for three simple benchmarks on two different architectures, both of which have 8 cores. At the top, we have 8 cores with one hardware thread each, in the form of two Quad-Core 3GHz Xeon processors. At the bottom, we have 8 cores with 8 hardware threads each (for a total of 64 hardware threads), in the form of a single 1.4GHz UltraSPARC T2.
The benchmarks are the following: (1) sumsq computes the parallel sum of the squares from 1 to 10 million; (2) dotp computes the dot product of two dense vectors of 100 million double-precision floating-point numbers; and (3) smvm multiplies a sparse matrix with 10 million non-zero double-precision floating-point elements with a dense vector.
The graphs show the speedup of Data Parallel Haskell with respect to a sequential C implementation of each benchmark – whenever, a curve climbs above 1 on the y-axis, the parallel Haskell program beats the sequential C program in absolute runtime.
The scalability of these three programs in Data Parallel Haskell is very good, although dotp is limited by memory bandwidth on the Quad-Core Xeon processors, as discussed in my previous post. The grey curve is a parallel C implementation of dotp that is bandwidth limited in the same manner.

These graphs summarise the performance of Data Parallel Haskell for three simple benchmarks on two different architectures, both of which have 8 cores. At the top, we have 8 cores with one hardware thread each, in the form of two Quad-Core 3GHz Xeon processors. At the bottom, we have 8 cores with 8 hardware threads each (for a total of 64 hardware threads), in the form of a single 1.4GHz UltraSPARC T2.

The benchmarks are the following: (1) sumsq computes the parallel sum of the squares from 1 to 10 million; (2) dotp computes the dot product of two dense vectors of 100 million double-precision floating-point numbers; and (3) smvm multiplies a sparse matrix with 10 million non-zero double-precision floating-point elements with a dense vector.

The graphs show the speedup of Data Parallel Haskell with respect to a sequential C implementation of each benchmark – whenever, a curve climbs above 1 on the y-axis, the parallel Haskell program beats the sequential C program in absolute runtime.

The scalability of these three programs in Data Parallel Haskell is very good, although dotp is limited by memory bandwidth on the Quad-Core Xeon processors, as discussed in my previous post. The grey curve is a parallel C implementation of dotp that is bandwidth limited in the same manner.

Mar
4th
Wed
permalink
Mar
3rd
Tue
permalink
This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. The performance is memory bound. The DPH code is
dotp' :: [:Double:] -> [:Double:] -> Double
dotp' v w = D.sumP (zipWithP (*) v w)
Interesting is that despite the much lower single-thread performance, the T2 makes it all up in excellent multi-threading performance. Interesting to Haskell folks is that the corresponding multi-threaded C code using pthreads is much harder to write and barely any faster when using all parallelism (the Sun C-Compiler still manages to reduce the runtime by about 30% with 64 threads, but on the Xeons, there is no significant difference).
NB: This uses the latest development version of GHC and DPH.  Thanks to Ben Lippmeier for his nice work on the SPARC backend of GHC.

This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. The performance is memory bound. The DPH code is

dotp' :: [:Double:] -> [:Double:] -> Double
dotp' v w = D.sumP (zipWithP (*) v w)

Interesting is that despite the much lower single-thread performance, the T2 makes it all up in excellent multi-threading performance. Interesting to Haskell folks is that the corresponding multi-threaded C code using pthreads is much harder to write and barely any faster when using all parallelism (the Sun C-Compiler still manages to reduce the runtime by about 30% with 64 threads, but on the Xeons, there is no significant difference).

NB: This uses the latest development version of GHC and DPH.  Thanks to Ben Lippmeier for his nice work on the SPARC backend of GHC.

permalink
permalink