More Fun Tracing Parallel Haskell Programs

I recently wrote about profiling sparse-matrix vector multiplication implemented with Data Parallel Haskell.  Let's take this a step further and look at a program parallelised with package parallel, instead of Data Parallel Haskell.  Specifically, we look at the dense matrix-matrix multiplication benchmark of the parallel section of the nofib benchmark suite.  This benchmark uses a linewise version of the torus-based Gentleman algorithm implemented with vanilla Haskell lists.  As we see in the following profile, the algorithm parallelises very well on two cores.  It also has a low allocation rate; hence, there is little garbage collection.


In the above profile and that of the sparse-matrix vector multiplication that I showed in my previous post, the width of the time slices of the mutator threads and of garbage collection are determined by the wall clock time of the run-thread/stop-thread and gc-start/gc-end events, respectively.  These events are generated by the Haskell thread scheduler and do not —they cannot— take the OS scheduling of the Haskell Execution Contexts (HECs) running Haskell threads into account.  In other words, a Haskell thread may be scheduled on a HEC for a long time, but if the HEC itself (i.e., the operating thread on which it executes) is descheduled by the operating system, then the Haskell thread still won't make any progress.  To visualise this effect, the custom instruments used here also support graphing the virtual time that expired in a time slice.  The following profile is a slightly magnified version of the initial portion of the previous profile, where the height of a mutator time slice indicates its virtual extent.  The screenshot also shows the configuration pane for one mutator track.  The virtual extent of time slices in the garbage collection pane can also be displayed in this alternative form, although that is not shown here.

The custom instruments for GHC provide addition information that is by default not displayed in the track pane, including the wall clock percentage during which the runtime performs garbage collection and the percentage of utilisation of the mutators threads.  However, all information is accessible via Instrument's tabular detail view.  In particular, the library function GHC.Exts.traceEvent can be used to inject user-defined messages into the event stream.  In the DPH benchmarking framework, we use this for example to delimit the start and end of the benchmark payload, as opposed to benchmark set up, data generation, and tear down code.  The following snapshot shows part of the detail view for the sparse-matrix vector multiplication benchmark, where the event indicating the start of benchmarking is highlighted.

By clicking on the user message events in the detail view, we can easily mark the relevant section of the profile — it is the section highlighted in blue.  Below, we see that the sequential part of the profile is only part of the set up, but not of the benchmarked code.

If you like to try for yourself, download the trace template GHC HEC 2-core and use it in Instruments.

A current caveat: Instruments supports zooming the track views horizontally (on the time axis) with a slider.  Unfortunately, in version 2.0.1 of Instruments, the maximum resolution is not sufficient to see the exact relative timing of mutator and garbage collection time slices if garbage collection is very frequent.  Moreover, when using the Block Graph style, blocks that are very closely spaced may be coloured very lightly or appear transparent — if you are suspicious of a section that seems to show no activity in a profile, use the Peak Graph style to double check.  Instruments collects all the data to provide a better resolution; it's just a matter of supporting a wider range in the zoom slider.  (If anybody knows a work around for that, please post it in the comments.)

Loading mentions Retweet
Filed under  //  dph   dtrace   ghc   haskell   instruments  
Comments (2)
Posted 20 days ago

Tracing Sparse Matrix-Vector Multiplication with DTrace

The following is an example trace of a Data Parallel Haskell program multiplying a sparse matrix (10k x 10k elements with 10% non-zero elements) with a dense vector.  The program runs on both cores of an Intel Core 2 Duo processor and uses GHC's new DTrace support on Mac OS X to gather the trace data, which I visualised in Instruments.  Of the three tracks, the topmost shows garbage collection activity in blue.  The other two show the activity of the two HECs (Haskell Execution Contexts) running the application code on two cores in green.  The program starts by loading the matrix from disk and generating a random vector — this is where only HEC #1 does any work.  Then, during the actual matrix multiplication, both cores are utilised almost evenly.

There is little garbage collection during the parallel matrix multiplication as GHC successfully unboxes the numeric code resulting in little heap allocation.

Loading mentions Retweet
Filed under  //  dph   dtrace   ghc   haskell   instruments  
Comments (0)
Posted 25 days ago

Status Update of the Glasgow Haskell Compiler, October 2009

If you are interested in Haskell generally or the Glasgow Haskell Compiler (GHC) in particular, you may want to have a look at the latest issue of the biannual GHC status updates.  It includes a summary of our recent progress in the Data Parallel Haskell project.

Loading mentions Retweet
Filed under  //  dph   ghc   haskell  
Comments (2)
Posted 4 months ago

First benchmark of regular, multidimensional arrays in DPH

(download)

Loading mentions Retweet
Filed under  //  dph  
Comments (0)
Posted 4 months ago

Data-oriented programming and the vectorisation transformation

I just realised that the data layout and the resulting code organisation in data-oriented programming, as proposed for games design, in many aspects resembles the data layout and code organisation favoured by the vectorisation transformation that is at the core of Data Parallel Haskell.  This should not be surprising as both have similar goals, namely to maximise data throughput, to minimise stalls by utilising the memory hierarchy, and to maximise parallelism.

However, the parallels are interesting given the very different origins of both approaches.  The big difference is of course that data-oriented programming is a design methodology for programmers, whereas the vectorisation transformation is a program transformation automatically applied by a compiler.  Nevertheless, we may regard the vectorisation transformation as a program transformation that turns purely functional code into code that is structured in a data-oriented manner, where the layout of bulk data shapes the organisation of the code.

This raises the question of how useful nested data parallel programming and the vectorisation transformation may be to games programming.

Loading mentions Retweet
Filed under  //  data-orientation   dph   vectorisation  
Comments (5)
Posted 4 months ago

Multicore Haskell Now!

Don Stewart summarised the state of play of parallel programming in Haskell at "ACM Reflections | Projections 2009". He covers strategies, Concurrent Haskell, STM, and Data Parallel Haskell: http://donsbot.wordpress.com/2009/10/17/multicore-haskell-now-acm-reflections-projections-2009/

Loading mentions Retweet
Filed under  //  dph   haskell   multicore   parallelism   stm  
Comments (0)
Posted 4 months ago

Implementing Data Parallel Haskell

Here is a video of Roman Leshchinskiy's talk on our work on implementing Data Parallel Haskell (DPH), which he presented at the Haskell Implementors' Workshop (co-located with ICFP'09):

 

Loading mentions Retweet
Filed under  //  dph   haskell   multicore   parallelism  
Comments (0)
Posted 5 months ago

These graphs summarise the performance of Data Parallel Haskell for three simple benchmarks on two different architectures, both of which have 8 cores. At the top, we have 8 cores with one hardware thread each, in the form of two Quad-Core 3GHz Xeon processors. At the bottom, we have 8 cores with 8 hardware threads each (for a total of 64 hardware threads), in the form of a single 1.4GHz UltraSPARC T2.

The benchmarks are the following: (1) sumsq computes the parallel sum of the squares from 1 to 10 million; (2) dotp computes the dot product of two dense vectors of 100 million double-precision floating-point numbers; and (3) smvm multiplies a sparse matrix with 10 million non-zero double-precision floating-point elements with a dense vector.

The graphs show the speedup of Data Parallel Haskell with respect to a sequential C implementation of each benchmark – whenever, a curve climbs above 1 on the y-axis, the parallel Haskell program beats the sequential C program in absolute runtime.

The scalability of these three programs in Data Parallel Haskell is very good, although dotp is limited by memory bandwidth on the Quad-Core Xeon processors, as discussed in my previous post. The grey curve is a parallel C implementation of dotp that is bandwidth limited in the same manner.

Loading mentions Retweet
Filed under  //  dph   haskell   multicore   parallelism  
Comments (0)
Posted 1 year ago

This is the performance of a dot product of two vectors of 10 million doubles each using Data Parallel Haskell. Both machines have 8 cores. Each core of the T2 has 8 hardware thread contexts. The performance is memory bound. The DPH code is

dotp' :: [:Double:] -> [:Double:] -> Double
dotp' v w = D.sumP (zipWithP (*) v w)

Interesting is that despite the much lower single-thread performance, the T2 makes it all up in excellent multi-threading performance. Interesting to Haskell folks is that the corresponding multi-threaded C code using pthreads is much harder to write and barely any faster when using all parallelism (the Sun C-Compiler still manages to reduce the runtime by about 30% with 64 threads, but on the Xeons, there is no significant difference).

NB: This uses the latest development version of GHC and DPH.  Thanks to Ben Lippmeier for his nice work on the SPARC backend of GHC.

Loading mentions Retweet
Filed under  //  dph   haskell   multicore   parallelism  
Comments (0)
Posted 1 year ago