More Fun Tracing Parallel Haskell Programs

I recently wrote about profiling sparse-matrix vector multiplication implemented with Data Parallel Haskell.  Let's take this a step further and look at a program parallelised with package parallel, instead of Data Parallel Haskell.  Specifically, we look at the dense matrix-matrix multiplication benchmark of the parallel section of the nofib benchmark suite.  This benchmark uses a linewise version of the torus-based Gentleman algorithm implemented with vanilla Haskell lists.  As we see in the following profile, the algorithm parallelises very well on two cores.  It also has a low allocation rate; hence, there is little garbage collection.


In the above profile and that of the sparse-matrix vector multiplication that I showed in my previous post, the width of the time slices of the mutator threads and of garbage collection are determined by the wall clock time of the run-thread/stop-thread and gc-start/gc-end events, respectively.  These events are generated by the Haskell thread scheduler and do not —they cannot— take the OS scheduling of the Haskell Execution Contexts (HECs) running Haskell threads into account.  In other words, a Haskell thread may be scheduled on a HEC for a long time, but if the HEC itself (i.e., the operating thread on which it executes) is descheduled by the operating system, then the Haskell thread still won't make any progress.  To visualise this effect, the custom instruments used here also support graphing the virtual time that expired in a time slice.  The following profile is a slightly magnified version of the initial portion of the previous profile, where the height of a mutator time slice indicates its virtual extent.  The screenshot also shows the configuration pane for one mutator track.  The virtual extent of time slices in the garbage collection pane can also be displayed in this alternative form, although that is not shown here.

The custom instruments for GHC provide addition information that is by default not displayed in the track pane, including the wall clock percentage during which the runtime performs garbage collection and the percentage of utilisation of the mutators threads.  However, all information is accessible via Instrument's tabular detail view.  In particular, the library function GHC.Exts.traceEvent can be used to inject user-defined messages into the event stream.  In the DPH benchmarking framework, we use this for example to delimit the start and end of the benchmark payload, as opposed to benchmark set up, data generation, and tear down code.  The following snapshot shows part of the detail view for the sparse-matrix vector multiplication benchmark, where the event indicating the start of benchmarking is highlighted.

By clicking on the user message events in the detail view, we can easily mark the relevant section of the profile — it is the section highlighted in blue.  Below, we see that the sequential part of the profile is only part of the set up, but not of the benchmarked code.

If you like to try for yourself, download the trace template GHC HEC 2-core and use it in Instruments.

A current caveat: Instruments supports zooming the track views horizontally (on the time axis) with a slider.  Unfortunately, in version 2.0.1 of Instruments, the maximum resolution is not sufficient to see the exact relative timing of mutator and garbage collection time slices if garbage collection is very frequent.  Moreover, when using the Block Graph style, blocks that are very closely spaced may be coloured very lightly or appear transparent — if you are suspicious of a section that seems to show no activity in a profile, use the Peak Graph style to double check.  Instruments collects all the data to provide a better resolution; it's just a matter of supporting a wider range in the zoom slider.  (If anybody knows a work around for that, please post it in the comments.)

Loading mentions Retweet
Filed under  //  dph   dtrace   ghc   haskell   instruments  
Comments (2)
Posted 20 days ago

Tracing Sparse Matrix-Vector Multiplication with DTrace

The following is an example trace of a Data Parallel Haskell program multiplying a sparse matrix (10k x 10k elements with 10% non-zero elements) with a dense vector.  The program runs on both cores of an Intel Core 2 Duo processor and uses GHC's new DTrace support on Mac OS X to gather the trace data, which I visualised in Instruments.  Of the three tracks, the topmost shows garbage collection activity in blue.  The other two show the activity of the two HECs (Haskell Execution Contexts) running the application code on two cores in green.  The program starts by loading the matrix from disk and generating a random vector — this is where only HEC #1 does any work.  Then, during the actual matrix multiplication, both cores are utilised almost evenly.

There is little garbage collection during the parallel matrix multiplication as GHC successfully unboxes the numeric code resulting in little heap allocation.

Loading mentions Retweet
Filed under  //  dph   dtrace   ghc   haskell   instruments  
Comments (0)
Posted 25 days ago

Profiling Garbage Collection in Haskell with DTrace & Instruments

I recently added DTrace probes to the GHC runtime system.  In the following, I will explore the use of these probes for profiling the garbage collection behaviour of Haskell programs.  Mac OS X comes with a graphical tool, called Instruments, that supports visualising data gathered with DTrace.  Let's start by having a look at visualising the DTrace events gc-start and gc-end for the standard nfib function (compiled with GHC 6.13 and -O).

The two most important parts of the Instruments window above are the track pane, which graphs numerical values computed by the DTrace probes, and the detail pane, which displays all probe data in a tabular view.  Here, the track pane contains graphs for 

  1. time slices of the mutator,
  2. time slices spent on garbage collection,
  3. the running sum of the mutator time slices,
  4. the running sum of garbage collection time slices, and
  5. the running percentage of the runtime spent on garbage collection.

The height of each spike in the graphs is proportional to the length of the time slice (or the running sum), measuring virtual time (i.e., process time determined using DTrace's vtimestamp).  It is important to note that the x-axis of the Instruments track pane uses wall clock time instead.  (That is crucial when using Instruments to correlate the behaviour of two or more processes or when inspecting overall system behaviour.)  This explains the gap about one third into the above run.  During that time, the process executing our Haskell program simply was not scheduled by the operating system.

The detail pane in the above screenshot displays the numeric values corresponding to the DTrace events represented as spikes in the track pane.  In addition, it also displays the event names.

Overall, our nfib implementation spends very little time on garbage collection, only about 1%.  This becomes very clear in the following alternative few of the track pane, where we overlay the time slices for the mutator and garbage collection. Black spikes are mutator time slices, whereas red ones are garbage collection slices.

Let's have a look at a look at a more interesting program, namely Andrew Tolmach and Thomas Nordin's constraint solver from the GC portion of the nofib benchmark suite.  We compare executing the program with two different heap-size settings, namely with a 10MB versus a 100MB heap (the program was again compiled with GHC 6.13 and -O, and we used an input value of 10).  With Instruments, we can directly compare the 10MB and the 100MB run.

Remember that the x-axis of the track pane is wall clock time — i.e., the constraint solver runs approximately twice as fast with a 100MB than with a 10MB heap.  The longer 10MB run shows evenly spaced major garbage collections after an initial startup phase (the red peaks in the second graph from the top).  As GHC by default uses a generational garbage collector, we see many small red events in between the major collections — these correspond to the much less time consuming minor collections.  Except in the start up phase, the mutator graph has only very short peaks, and overall garbage collection accounts for more than half of the overall runtime.

The 100MB run presents a dramatically different picture.  Here, we only have three time consuming garbage collection events during the whole program execution.  Other than that, there are only (barely visible in the screenshot) minor collections that briefly interrupt the mutator.  Overall, the program only spends 14% of its runtime on garbage collection.  The graph of the 100MB run clearly illustrates that we graph events (as peaks) and the gaps between events are event free.  The height of a peak is the virtual duration of the corresponding mutator or garbage-collection time-slice and it will only correspond to the gap between events if the process was continuously running.  In addition to graphing peaks, Instruments supports block diagrams, too — here, the same two program runs using a block diagram.

The height of a block still corresponds to a virtual time slice and its width is wall clock time.  In the same manner as we overlayed the peak diagrams of the mutator and garbage-collection time-slices in the nfib example, we can do so with block diagrams.

For simple diagrams, this may seem to be the most easily understood graph.  However, its interpretation does require some care.  The human mind likes to assign meaning to the area of the blocks in a block diagram, but that leads to the wrong intuition here.  In the above figure, the 10MB run appears to be entirely dominated by the red garbage collection blocks.  Although garbage collection accounts for a hefty 65% of the execution time, the diagram suggests that the mutator makes nearly no progress, which is an incorrect conclusion.  To avoid such misinterpretation, peak diagrams are the default.

You may like to argue that Instruments should arrange for block diagrams to use the area of blocks in such a manner that it is consistent with an intuitive interpretation.  However, this runs contrary to using wall clock time for the x-axsis and the event value for the y-axsis (that is the height of a peak or block), which in turn is important to understand overall system behaviour and to correlate multiple processes.  Let's look at a simple example, where wall clock time matters.  The following is a profile of GHC itself when compiling a small Haskell program — in fact, it is compiling the previous constraint-solver benchmark.

We notice two points: firstly, GHC spends quite a bit of time on garbage collection —45% of its runtime, to be precise— and secondly, there is a big gap in the end, where nothing appears to happen.  Indeed, we can verify that the process executing GHC is not active by adding another instrument that profiles process activity (the Instruments application comes with a whole library of pre-defined instruments).

GHC invokes the system assembler and linker to produce an executable.  This obviously accounts for a large fraction of the overall execution time.

If you like to try this for yourself, you need to build the current GHC HEAD (6.13) and get this trace template for Instruments: GHC GC single.  This trace template is designed for single-threaded programs — i.e., programs linked against the vanilla GHC runtime without the -threaded link-time option.  DTrace support for GHC is currently only implemented for Mac OS X.  However, it shouldn't be hard to port to another DTrace implementation (contact me if you need advise).  I don't know whether graphical DTrace frontends are available on other platforms.  To profile GHC's garbage collection and multi-threading behaviour, you can also use ThreadScope (which I found difficult to build on Mac OS).

Loading mentions Retweet
Filed under  //  dtrace   ghc   haskell   instruments  
Comments (0)
Posted 1 month ago

Using DTrace to track scheduler events of GHC's runtime

I finally got around to committing my instrumentation of the GHC runtime system with DTrace probes (commit message).  The current set of probes tracks all the events of the eventlog framework used by ThreadScope.  A detailed description of the probes is on the new Using DTrace with GHC page of the GHC developer wiki.

Loading mentions Retweet
Filed under  //  dtrace   ghc   haskell   profiling  
Comments (0)
Posted 2 months ago