More Fun Tracing Parallel Haskell Programs

I recently wrote about profiling sparse-matrix vector multiplication implemented with Data Parallel Haskell.  Let's take this a step further and look at a program parallelised with package parallel, instead of Data Parallel Haskell.  Specifically, we look at the dense matrix-matrix multiplication benchmark of the parallel section of the nofib benchmark suite.  This benchmark uses a linewise version of the torus-based Gentleman algorithm implemented with vanilla Haskell lists.  As we see in the following profile, the algorithm parallelises very well on two cores.  It also has a low allocation rate; hence, there is little garbage collection.


Matmult-600-1-10

In the above profile and that of the sparse-matrix vector multiplication that I showed in my previous post, the width of the time slices of the mutator threads and of garbage collection are determined by the wall clock time of the run-thread/stop-thread and gc-start/gc-end events, respectively.  These events are generated by the Haskell thread scheduler and do not —they cannot— take the OS scheduling of the Haskell Execution Contexts (HECs) running Haskell threads into account.  In other words, a Haskell thread may be scheduled on a HEC for a long time, but if the HEC itself (i.e., the operating thread on which it executes) is descheduled by the operating system, then the Haskell thread still won't make any progress.  To visualise this effect, the custom instruments used here also support graphing the virtual time that expired in a time slice.  The following profile is a slightly magnified version of the initial portion of the previous profile, where the height of a mutator time slice indicates its virtual extent.  The screenshot also shows the configuration pane for one mutator track.  The virtual extent of time slices in the garbage collection pane can also be displayed in this alternative form, although that is not shown here.

Matmult-time-slices

The custom instruments for GHC provide addition information that is by default not displayed in the track pane, including the wall clock percentage during which the runtime performs garbage collection and the percentage of utilisation of the mutators threads.  However, all information is accessible via Instrument's tabular detail view.  In particular, the library function GHC.Exts.traceEvent can be used to inject user-defined messages into the event stream.  In the DPH benchmarking framework, we use this for example to delimit the start and end of the benchmark payload, as opposed to benchmark set up, data generation, and tear down code.  The following snapshot shows part of the detail view for the sparse-matrix vector multiplication benchmark, where the event indicating the start of benchmarking is highlighted.

Smvm-detailed-marked

By clicking on the user message events in the detail view, we can easily mark the relevant section of the profile — it is the section highlighted in blue.  Below, we see that the sequential part of the profile is only part of the set up, but not of the benchmarked code.

Smvm-marked

If you like to try for yourself, download the trace template GHC HEC 2-core and use it in Instruments.

A current caveat: Instruments supports zooming the track views horizontally (on the time axis) with a slider.  Unfortunately, in version 2.0.1 of Instruments, the maximum resolution is not sufficient to see the exact relative timing of mutator and garbage collection time slices if garbage collection is very frequent.  Moreover, when using the Block Graph style, blocks that are very closely spaced may be coloured very lightly or appear transparent — if you are suspicious of a section that seems to show no activity in a profile, use the Peak Graph style to double check.  Instruments collects all the data to provide a better resolution; it's just a matter of supporting a wider range in the zoom slider.  (If anybody knows a work around for that, please post it in the comments.)

Posted