Just Testing

  • Archive
  • RSS

Tracing Sparse Matrix-Vector Multiplication with DTrace

The following is an example trace of a Data Parallel Haskell program multiplying a sparse matrix (10k x 10k elements with 10% non-zero elements) with a dense vector.  The program runs on both cores of an Intel Core 2 Duo processor and uses GHC’s new DTrace support on Mac OS X to gather the trace data, which I visualised in Instruments.  Of the three tracks, the topmost shows garbage collection activity in blue.  The other two show the activity of the two HECs (Haskell Execution Contexts) running the application code on two cores in green.  The program starts by loading the matrix from disk and generating a random vector — this is where only HEC #1 does any work.  Then, during the actual matrix multiplication, both cores are utilised almost evenly.

There is little garbage collection during the parallel matrix multiplication as GHC successfully unboxes the numeric code resulting in little heap allocation.

Posted via email from Just Testing | Comment »

    • #dph
    • #dtrace
    • #ghc
    • #haskell
    • #instruments
  • 3 years ago
  • 1
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Profiling Garbage Collection in Haskell with DTrace & Instruments

I recently added DTrace probes to the GHC runtime system.  In the following, I will explore the use of these probes for profiling the garbage collection behaviour of Haskell programs.  Mac OS X comes with a graphical tool, called Instruments, that supports visualising data gathered with DTrace.  Let’s start by having a look at visualising the DTrace events gc-start and gc-end for the standard nfib function (compiled with GHC 6.13 and -O).

The two most important parts of the Instruments window above are the track pane, which graphs numerical values computed by the DTrace probes, and the detail pane, which displays all probe data in a tabular view.  Here, the track pane contains graphs for 
  1. time slices of the mutator,
  2. time slices spent on garbage collection,
  3. the running sum of the mutator time slices,
  4. the running sum of garbage collection time slices, and
  5. the running percentage of the runtime spent on garbage collection.

The height of each spike in the graphs is proportional to the length of the time slice (or the running sum), measuring virtual time (i.e., process time determined using DTrace’s vtimestamp).  It is important to note that the x-axis of the Instruments track pane uses wall clock time instead.  (That is crucial when using Instruments to correlate the behaviour of two or more processes or when inspecting overall system behaviour.)  This explains the gap about one third into the above run.  During that time, the process executing our Haskell program simply was not scheduled by the operating system.
The detail pane in the above screenshot displays the numeric values corresponding to the DTrace events represented as spikes in the track pane.  In addition, it also displays the event names.

Overall, our nfib implementation spends very little time on garbage collection, only about 1%.  This becomes very clear in the following alternative few of the track pane, where we overlay the time slices for the mutator and garbage collection. Black spikes are mutator time slices, whereas red ones are garbage collection slices.

Let’s have a look at a look at a more interesting program, namely Andrew Tolmach and Thomas Nordin’s constraint solver from the GC portion of the nofib benchmark suite.  We compare executing the program with two different heap-size settings, namely with a 10MB versus a 100MB heap (the program was again compiled with GHC 6.13 and -O, and we used an input value of 10).  With Instruments, we can directly compare the 10MB and the 100MB run.

Remember that the x-axis of the track pane is wall clock time — i.e., the constraint solver runs approximately twice as fast with a 100MB than with a 10MB heap.  The longer 10MB run shows evenly spaced major garbage collections after an initial startup phase (the red peaks in the second graph from the top).  As GHC by default uses a generational garbage collector, we see many small red events in between the major collections — these correspond to the much less time consuming minor collections.  Except in the start up phase, the mutator graph has only very short peaks, and overall garbage collection accounts for more than half of the overall runtime.
The 100MB run presents a dramatically different picture.  Here, we only have three time consuming garbage collection events during the whole program execution.  Other than that, there are only (barely visible in the screenshot) minor collections that briefly interrupt the mutator.  Overall, the program only spends 14% of its runtime on garbage collection.  The graph of the 100MB run clearly illustrates that we graph events (as peaks) and the gaps between events are event free.  The height of a peak is the virtual duration of the corresponding mutator or garbage-collection time-slice and it will only correspond to the gap between events if the process was continuously running.  In addition to graphing peaks, Instruments supports block diagrams, too — here, the same two program runs using a block diagram.

The height of a block still corresponds to a virtual time slice and its width is wall clock time.  In the same manner as we overlayed the peak diagrams of the mutator and garbage-collection time-slices in the nfib example, we can do so with block diagrams.

For simple diagrams, this may seem to be the most easily understood graph.  However, its interpretation does require some care.  The human mind likes to assign meaning to the area of the blocks in a block diagram, but that leads to the wrong intuition here.  In the above figure, the 10MB run appears to be entirely dominated by the red garbage collection blocks.  Although garbage collection accounts for a hefty 65% of the execution time, the diagram suggests that the mutator makes nearly no progress, which is an incorrect conclusion.  To avoid such misinterpretation, peak diagrams are the default.

You may like to argue that Instruments should arrange for block diagrams to use the area of blocks in such a manner that it is consistent with an intuitive interpretation.  However, this runs contrary to using wall clock time for the x-axsis and the event value for the y-axsis (that is the height of a peak or block), which in turn is important to understand overall system behaviour and to correlate multiple processes.  Let’s look at a simple example, where wall clock time matters.  The following is a profile of GHC itself when compiling a small Haskell program — in fact, it is compiling the previous constraint-solver benchmark.

We notice two points: firstly, GHC spends quite a bit of time on garbage collection —45% of its runtime, to be precise— and secondly, there is a big gap in the end, where nothing appears to happen.  Indeed, we can verify that the process executing GHC is not active by adding another instrument that profiles process activity (the Instruments application comes with a whole library of pre-defined instruments).

GHC invokes the system assembler and linker to produce an executable.  This obviously accounts for a large fraction of the overall execution time.
If you like to try this for yourself, you need to build the current GHC HEAD (6.13) and get this trace template for Instruments: GHC GC single.  This trace template is designed for single-threaded programs — i.e., programs linked against the vanilla GHC runtime without the -threaded link-time option.  DTrace support for GHC is currently only implemented for Mac OS X.  However, it shouldn’t be hard to port to another DTrace implementation (contact me if you need advise).  I don’t know whether graphical DTrace frontends are available on other platforms.  To profile GHC’s garbage collection and multi-threading behaviour, you can also use ThreadScope (which I found difficult to build on Mac OS).

Posted via email from Just Testing | Comment »

    • #dtrace
    • #ghc
    • #haskell
    • #instruments
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Using DTrace to track scheduler events of GHC’s runtime

I finally got around to committing my instrumentation of the GHC runtime system with DTrace probes (commit message).  The current set of probes tracks all the events of the eventlog framework used by ThreadScope.  A detailed description of the probes is on the new Using DTrace with GHC page of the GHC developer wiki.

Posted via web from Just Testing | Comment »

    • #dtrace
    • #ghc
    • #haskell
    • #profiling
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Haskell 2010

Four years ago the Haskell’ Committee was assembled, charged with the task to produce a new revision of the Haskell language standard (aka Haskell 98). There was a lot of enthusiasm and many wild ideas in the beginning, but it turned out that it was hard to agree on the scope of changes and that it was very hard to find a person willing to shoulder the rather large editorial burden that a major revision of Haskell entails.

The Committee finally decided that our only realistic chance of making progress was to break the mammoth task into smaller chunks. More concretely, we decided to evolve the language in small increments with a release every year, while rotating the committee on an annual basis to avoid burn out.[1] Whether this fine-grained evolutionary approach to language design will be successful in the long run remains to be seen. Nevertheless, the new approach helped us to make progress and to agree on a revision for this year: behold Haskell 2010!

[1] Details about The Haskell Prime Process

Posted via web from Just Testing | Comment »

    • #haskell
    • #haskell2010
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Status Update of the Glasgow Haskell Compiler, October 2009

If you are interested in Haskell generally or the Glasgow Haskell Compiler (GHC) in particular, you may want to have a look at the latest issue of the biannual GHC status updates.  It includes a summary of our recent progress in the Data Parallel Haskell project.

Posted via web from Just Testing | Comment »

    • #dph
    • #ghc
    • #haskell
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Finally found the ghci bug on Snow Leopard

After much digging through otool output and tracing Haskell code with dtrace, log messages, and gdb, I finally found the cause of the SIGBUS errors with ghci on Snow Leopard. At the end, only a tiny change was required[1]: the homegrown dynamic linker of ghci had simply ignored a relocation type that ld on Snow Leopard chose to use. What this really shows again is that re-implementing parts of basic infrastructure in an ad hoc manner always comes back to bite you. Such infrastructure is usually full of low-level, platform-dependent details, which you are bound to get wrong once in a while and then it can be rather involved to find the problem. In this case, the relocation of the memory access of a bit of C code in GMP, reading a global variable, went wrong. As a result, the wrong memory location was accessed, which happened to contain a NULL pointer. Upon dereferencing the NULL, the program crashed.

All of this was further obscured by the fact that the homegrown linker approach of ghci implies that we have two copies of the basic libraries in memory and in use at the same time. One copy has been statically linked with the ghc executable and the second copy is dynamically loaded by ghci. The relocation in the statically linked copy was perfectly fine, but the relocation in the dynamically loaded one was broken. [1] http://www.haskell.org/pipermail/cvs-ghc/2009-October/050863.html

Posted via email from Just Testing | Comment »

    • #ghc
    • #haskell
    • #snowleopard
  • 3 years ago
  • 1
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Data-oriented programming and the vectorisation transformation

I just realised that the data layout and the resulting code organisation in data-oriented programming, as proposed for games design, in many aspects resembles the data layout and code organisation favoured by the vectorisation transformation that is at the core of Data Parallel Haskell.  This should not be surprising as both have similar goals, namely to maximise data throughput, to minimise stalls by utilising the memory hierarchy, and to maximise parallelism.

However, the parallels are interesting given the very different origins of both approaches.  The big difference is of course that data-oriented programming is a design methodology for programmers, whereas the vectorisation transformation is a program transformation automatically applied by a compiler.  Nevertheless, we may regard the vectorisation transformation as a program transformation that turns purely functional code into code that is structured in a data-oriented manner, where the layout of bulk data shapes the organisation of the code.

This raises the question of how useful nested data parallel programming and the vectorisation transformation may be to games programming.

Posted via web from Just Testing | Comment »

    • #dataorientation
    • #dph
    • #vectorisation
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Fun with higher-order functions in C and Objective-C

The latest release of Mac OS X (Snow Leopard) came with an upgrade to the C and Objective-C languages, adding lambda abstractions — which they call blocks. Matt Gallagher has a nice blog post that describes some of the trickier aspects of blocks: http://cocoawithlove.com/2009/10/ugly-side-of-blocks-explicit.html Interesting is the function ‘newDoubleToIntComparison’ (in Section “Declaring a block that returns a block”), which is a curried[1] version of ‘compareDoubleToInt’. Together with the explicit memory management for returned blocks and the need for involved type annotations and casts, it becomes clear why functional languages like Haskell make curried functions the default as well as support garbage collection and type inference out of the box.

[1] http://en.wikipedia.org/wiki/Currying

Posted via email from Just Testing | Comment »

    • #blocks
    • #objectivec
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Multicore Haskell Now!

Don Stewart summarised the state of play of parallel programming in Haskell at “ACM Reflections | Projections 2009”. He covers strategies, Concurrent Haskell, STM, and Data Parallel Haskell: http://donsbot.wordpress.com/2009/10/17/multicore-haskell-now-acm-reflections-projections-2009/

Posted via email from Just Testing | Comment »

    • #dph
    • #haskell
    • #multicore
    • #parallelism
    • #stm
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+

Don Stewart’s talk on Domain Specific Languages and Haskell

Don argues in favour of domain-specific languages for high-performance computing. Not surprisingly, he suggests that Haskell is well suited as a host for realising such domain-specific languages as embedded languages: http://donsbot.wordpress.com/2009/10/16/lacss-2009-domain-specific-languages-and-haskell/

Posted via email from Just Testing | Comment »

    • #edsl
    • #haskell
    • #parallelism
  • 3 years ago
  • Permalink
Share

Short URL

TwitterFacebookPinterestGoogle+
Page 5 of 14
← Newer • Older →

About

Avatar

I research and teach programming languages, compilers, and their applications at the University of New South Wales (UNSW), Sydney. My main interest is in functional and parallel programming. Most of my code is in Haskell.

You can find me on App.net (preferred) and Twitter, or on GitHub. Check out my UNSW website.

  • RSS
  • Random
  • Archive
  • Mobile
Effector Theme by Pixel Union