I recently added DTrace probes to the GHC runtime system. In the following, I will explore the use of these probes for profiling the garbage collection behaviour of Haskell programs. Mac OS X comes with a graphical tool, called Instruments, that supports visualising data gathered with DTrace. Let's start by having a look at visualising the DTrace events gc-start and gc-end for the standard nfib function (compiled with GHC 6.13 and -O). The two most important parts of the Instruments window above are the track pane, which graphs numerical values computed by the DTrace probes, and the detail pane, which displays all probe data in a tabular view. Here, the track pane contains graphs for
- time slices of the mutator,
- time slices spent on garbage collection,
- the running sum of the mutator time slices,
- the running sum of garbage collection time slices, and
- the running percentage of the runtime spent on garbage collection.
The height of each spike in the graphs is proportional to the length of the time slice (or the running sum), measuring virtual time (i.e., process time determined using DTrace's vtimestamp). It is important to note that the x-axis of the Instruments track pane uses wall clock time instead. (That is crucial when using Instruments to correlate the behaviour of two or more processes or when inspecting overall system behaviour.) This explains the gap about one third into the above run. During that time, the process executing our Haskell program simply was not scheduled by the operating system.
The detail pane in the above screenshot displays the numeric values corresponding to the DTrace events represented as spikes in the track pane. In addition, it also displays the event names.
Overall, our nfib implementation spends very little time on garbage collection, only about 1%. This becomes very clear in the following alternative few of the track pane, where we overlay the time slices for the mutator and garbage collection. Black spikes are mutator time slices, whereas red ones are garbage collection slices.
Let's have a look at a look at a more interesting program, namely Andrew Tolmach and Thomas Nordin's
constraint solver from the GC portion of the nofib benchmark suite. We compare executing the program with two different heap-size settings, namely with a 10MB versus a 100MB heap (the program was again compiled with GHC 6.13
and -O, and we used an input value of 10). With Instruments, we can directly compare the 10MB and the 100MB run.Remember that the x-axis of the track pane is wall clock time — i.e., the constraint solver runs approximately twice as fast with a 100MB than with a 10MB heap. The longer 10MB run shows evenly spaced major garbage collections after an initial startup phase (the red peaks in the second graph from the top). As GHC by default uses a generational garbage collector, we see many small red events in between the major collections — these correspond to the much less time consuming minor collections. Except in the start up phase, the mutator graph has only very short peaks, and overall garbage collection accounts for more than half of the overall runtime.
The 100MB run presents a dramatically different picture. Here, we only have three time consuming garbage collection events during the whole program execution. Other than that, there are only (barely visible in the screenshot) minor collections that briefly interrupt the mutator. Overall, the program only spends 14% of its runtime on garbage collection. The graph of the 100MB run clearly illustrates that we graph events (as peaks) and the gaps between events are event free. The height of a peak is the virtual duration of the corresponding mutator or garbage-collection time-slice and it will only correspond to the gap between events if the process was continuously running. In addition to graphing peaks, Instruments supports block diagrams, too — here, the same two program runs using a block diagram.
The height of a block still corresponds to a virtual time slice and its width is wall clock time. In the same manner as we overlayed the peak diagrams of the mutator and garbage-collection time-slices in the nfib example, we can do so with block diagrams.
For simple diagrams, this may seem to be the most easily understood graph. However, its interpretation does require some care. The human mind likes to assign meaning to the area of the blocks in a block diagram, but that leads to the wrong intuition here. In the above figure, the 10MB run appears to be entirely dominated by the red garbage collection blocks. Although garbage collection accounts for a hefty 65% of the execution time, the diagram suggests that the mutator makes nearly no progress, which is an incorrect conclusion. To avoid such misinterpretation, peak diagrams are the default.
You may like to argue that Instruments should arrange for block diagrams to use the area of blocks in such a manner that it is consistent with an intuitive interpretation. However, this runs contrary to using wall clock time for the x-axsis and the event value for the y-axsis (that is the height of a peak or block), which in turn is important to understand overall system behaviour and to correlate multiple processes. Let's look at a simple example, where wall clock time matters. The following is a profile of GHC itself when compiling a small Haskell program — in fact, it is compiling the previous constraint-solver benchmark.
We notice two points: firstly, GHC spends quite a bit of time on garbage collection —45% of its runtime, to be precise— and secondly, there is a big gap in the end, where nothing appears to happen. Indeed, we can verify that the process executing GHC is not active by adding another instrument that profiles process activity (the Instruments application comes with a whole library of pre-defined instruments).
GHC invokes the system assembler and linker to produce an executable. This obviously accounts for a large fraction of the overall execution time.
If you like to try this for yourself, you need to
build the current GHC HEAD (6.13) and get this trace template for Instruments:
GHC GC single. This trace template is designed for single-threaded programs — i.e., programs linked against the vanilla GHC runtime without the
-threaded link-time option. DTrace support for GHC is currently only implemented for Mac OS X. However, it shouldn't be hard to port to another DTrace implementation (contact me if you need advise). I don't know whether graphical DTrace frontends are available on other platforms. To profile GHC's garbage collection and multi-threading behaviour, you can also use
ThreadScope (which I found difficult to build on Mac OS).