In An LLVM Backend For GHC, we describe the design and implementation of an LLVM backend for the Glasgow Haskell Compiler (GHC). We outline the conceptual obstacles that we had to overcome and give a detailed performance analysis of the new backend compared to the existing C backend and native code generator.
When I decided to by an iPad, I had one killer application in mind: reading. I was hoping the device would also be useful otherwise, but cutting down on printing papers and the like was my main aim. As a researcher, I inevitably read a lot: research papers, books, reports, manuals, thesis drafts, the drafts of my own work-in-progress papers, and so on. I have got reading material on my desk at home, in my office, in my bag…it’s a mess. The prospect of consolidating it all into one device, which would hopefully also cover most of my note taking and sketching out of ideas, was enough to sell me on the iPad.
So, how did it work out? So far, very well — but slightly more than a week of use makes it hard to draw any final conclusions. The printed area of a paper is about the same as the size of the iPad screen, which combined with the high quality of the display makes for a good reading experience. Instead of printing a paper or other document I like to read, I now sync it to my iPad —usually using Dropbox— and read it with GoodReader for iPad. If I expect that I may want to refer to a paper at a later time, I’m adding it to my mobile library kept in Papers. If I need to annotate a document —for example, for a review— I currently use iAnnotate PDF. I like Dropbox, GoodReader, and Papers, but I’m only using iAnnotate as there doesn’t seem to be an alternative at the moment — the user interface is pretty rough.
In addition, I found over the last week that iPad is also the perfect Twitter device (I’m currently using Twitterific for iPad), and great for reading blogs and online articles, which I often file away for later reading on my iPad using Instapaper.
Simon Peyton Jones recently got stranded in Boston due to Eyjafjallajokul. He took the opportunity to give a talk about our joint Data Parallel Haskell project at Northeastern University. The Northeastern folks kindly recorded and published the talk. The talk provides a nice summary of our approach.
The paper Regular, shape-polymorphic, parallel arrays in Haskell describes a new Haskell library, named Repa, for multi-dimensional, regular arrays based on our work on Data Parallel Haskell and a delayed array representation avoiding unnecessary data structures. As the following table shows, the sequential performance already gets close to the performance of C programs. More importantly, the Haskell code transparently makes use of multicore hardware — something that is much more difficult in C.
The Haskell code is purely functional, using collective array operations. Here is matrix-matrix multiplication:
mmMult :: (Num e, Elt e)
=> Array DIM2 e -> Array DIM2 e -> Array DIM2 e
mmMult arr brr = sum (zipWith (*) arrRepl brrRepl)
trr = force (transpose2D brr)
arrRepl = replicate (Z :.All :.colsB :.All) arr
brrRepl = replicate (Z :.All :.All :.rowsA) trr
(Z:. colsA:. rowsA) = extent arr
(Z:. colsB:. rowsB) = extent brr
In addition to good performance, the library facilitates reuse through the use of shape polymorphism.
GHC’s new LLVM backend, developed by David Terei as part of his Honours thesis at PLS, is being made ready for inclusion into the main GHC repository. The backend requires a new calling convention to be added to LLVM. David got a patch including the new calling convention accepted into the main LLVM distribution, just in time to become part of the forthcoming LLVM 2.7 release — the inclusion into 2.7 has been confirmed.
The next step is to incorporate a number of suggestions by Simon Marlow on how to improve the LLVM backend code. After that, the patch should be ready to go into the tree!
David will be on an internship at GHC HQ at Microsoft Research Cambridge in the second quarter of this year, specifically to work on the GHC LLVM backend.
I recently wrote about profiling sparse-matrix vector multiplication implemented with Data Parallel Haskell. Let’s take this a step further and look at a program parallelised with package parallel, instead of Data Parallel Haskell. Specifically, we look at the dense matrix-matrix multiplication benchmark of the parallel section of the nofib benchmark suite. This benchmark uses a linewise version of the torus-based Gentleman algorithm implemented with vanilla Haskell lists. As we see in the following profile, the algorithm parallelises very well on two cores. It also has a low allocation rate; hence, there is little garbage collection.
In the above profile and that of the sparse-matrix vector multiplication that I showed in my previous post, the width of the time slices of the mutator threads and of garbage collection are determined by the wall clock time of the run-thread/stop-thread and gc-start/gc-end events, respectively. These events are generated by the Haskell thread scheduler and do not —they cannot— take the OS scheduling of the Haskell Execution Contexts (HECs) running Haskell threads into account. In other words, a Haskell thread may be scheduled on a HEC for a long time, but if the HEC itself (i.e., the operating thread on which it executes) is descheduled by the operating system, then the Haskell thread still won’t make any progress. To visualise this effect, the custom instruments used here also support graphing the virtual time that expired in a time slice. The following profile is a slightly magnified version of the initial portion of the previous profile, where the height of a mutator time slice indicates its virtual extent. The screenshot also shows the configuration pane for one mutator track. The virtual extent of time slices in the garbage collection pane can also be displayed in this alternative form, although that is not shown here.
The custom instruments for GHC provide addition information that is by default not displayed in the track pane, including the wall clock percentage during which the runtime performs garbage collection and the percentage of utilisation of the mutators threads. However, all information is accessible via Instrument’s tabular detail view. In particular, the library function GHC.Exts.traceEvent can be used to inject user-defined messages into the event stream. In the DPH benchmarking framework, we use this for example to delimit the start and end of the benchmark payload, as opposed to benchmark set up, data generation, and tear down code. The following snapshot shows part of the detail view for the sparse-matrix vector multiplication benchmark, where the event indicating the start of benchmarking is highlighted.
By clicking on the user message events in the detail view, we can easily mark the relevant section of the profile — it is the section highlighted in blue. Below, we see that the sequential part of the profile is only part of the set up, but not of the benchmarked code.
If you like to try for yourself, download the trace template GHC HEC 2-core and use it in Instruments.
A current caveat: Instruments supports zooming the track views horizontally (on the time axis) with a slider. Unfortunately, in version 2.0.1 of Instruments, the maximum resolution is not sufficient to see the exact relative timing of mutator and garbage collection time slices if garbage collection is very frequent. Moreover, when using the Block Graph style, blocks that are very closely spaced may be coloured very lightly or appear transparent — if you are suspicious of a section that seems to show no activity in a profile, use the Peak Graph style to double check. Instruments collects all the data to provide a better resolution; it’s just a matter of supporting a wider range in the zoom slider. (If anybody knows a work around for that, please post it in the comments.)
David Terei is close to merging the LLVM-based code generator that he developed for GHC (Glasgow Haskell Compiler) into the main repository. He developed the new backend as part of his Honours thesis at PLS and just requested to merge the code on the GHC developer list.
Trevor McDonell, who works on nested data parallelism for GPGPU programming at PLS, released package cuda, a brand new Haskell binding for CUDA, NVIDIA’s general-purpose compute language for GPUs. The purpose of the binding is not to write CUDA code in Haskell, but to call and control existing CUDA code from Haskell.
We are using package cuda to execute code generated by our, not yet released, CUDA-backend of the embedded array language accelerate.