Profiling with stream fusion
As we move on to bigger examples in DPH, identifying performance problems just by staring at the Core output becomes rather difficult. We’ve finally reached a point where we actually have to profile DPH programs and identify slow loops. But how? Cost centre profiling isn’t supported by the vectoriser and, in any case, it works on the Haskell source, whereas we are interested in the loops generated by the vectoriser. Ticky-ticky profiling happens a bit too late, after all those loops have been transformed into something mostly unrecognisable. It also seems to have problems with -threaded. Looks like we have to implement it ourselves…
With a bit of thought, this turns out to be surprisingly easy. Since DPH uses stream fusion, (almost) all of those loops operate on streams. The basic idea is to annotate each stream with a string that records how the stream was produced:
    data Stream a = forall s. Stream ... String

    replicateS :: Int -> a -> Stream a
    replicateS n x = Stream ... "replicateS"

    mapS :: (a -> b) -> Stream a -> Stream b
    mapS f (Stream ... c) = Stream ... ("mapS <- " ++ c)
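To make the idea concrete, here is a minimal, self-contained sketch of such producer-annotated streams. The actual DPH Stream type elides its fields with "..." above, so the Step type, the state representation and the producer accessor here are purely illustrative:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: either produce an element and a new state, or stop.
-- (Illustrative only; the real DPH stream type has more constructors.)
data Step s a = Yield a s | Done

-- A stream carries a step function, a state, and the producer annotation.
data Stream a = forall s. Stream (s -> Step s a) s String

replicateS :: Int -> a -> Stream a
replicateS n x = Stream step n "replicateS"
  where
    step 0 = Done
    step k = Yield x (k - 1)

-- A transformer prepends its own name to the producer string it consumes.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream step s c) = Stream step' s ("mapS <- " ++ c)
  where
    step' t = case step t of
                Yield a t' -> Yield (f a) t'
                Done       -> Done

-- Extract the accumulated producer annotation (hypothetical helper).
producer :: Stream a -> String
producer (Stream _ _ c) = c

main :: IO ()
main = putStrLn (producer (mapS (+ 1) (replicateS 3 (0 :: Int))))
```

Running this prints the accumulated annotation `mapS <- replicateS`, which is exactly the shape of the producer strings in the profile further down.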
and so on. Now, when a real loop consumes a stream it simply logs its running time together with the stream’s producer:
    unstreamU :: Stream a -> UArr a
    unstreamU (Stream ... c) =
        runST (traceLoopST ("unstreamU <- " ++ c) $ fill)
      where
        fill = ...
Since streams are purely compile-time structures and no streams should be left in the program if the simplifier is doing its job properly, all those strings simply go away if we compile our library without tracing (i.e., if we make traceLoopST a noop). With tracing on, however, our rather fancy implementation actually produces DTrace events at the beginning and at the end of a loop. This is a very flexible mechanism. In particular, it allows us to produce profiles like this (pardon my formatting):
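The tracing/no-tracing switch might look something like the following sketch. The TRACE_LOOPS flag and the exact type of traceLoopST are my assumptions, and the real implementation emits DTrace events rather than printing; only the noop branch, which lets the simplifier drop the label strings, is the essential part:

```haskell
{-# LANGUAGE CPP #-}

import Control.Monad.ST (ST, runST)
#ifdef TRACE_LOOPS
import Control.Monad.ST.Unsafe (unsafeIOToST)
#endif

traceLoopST :: String -> ST s a -> ST s a
#ifdef TRACE_LOOPS
-- Tracing build: log entry and exit of the loop. (The real library
-- would fire DTrace probes here instead of printing.)
traceLoopST label act = do
  unsafeIOToST (putStrLn ("enter loop: " ++ label))
  r <- act
  unsafeIOToST (putStrLn ("leave loop: " ++ label))
  return r
#else
-- Default build: a noop that inlines away, so the producer strings
-- become dead code and are removed by the simplifier.
traceLoopST _ act = act
{-# INLINE traceLoopST #-}
#endif

main :: IO ()
main = print (runST (traceLoopST "unstreamU <- replicateS" (return (42 :: Int))))
```

Compiled without -DTRACE_LOOPS, the program behaves exactly as if traceLoopST were never there.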
    ...
    unstreamMU <- zipWithS <- (zipWithS <- (zipWithS <- (replicateEachS <- zipWithS <- (streamU, streamU), streamU), zipWithS <- (replicateEachS <- zipWithS <- (streamU, streamU), streamU)), zipWithS <- (zipWithS <- (replicateEachS <- zipWithS <- (streamU, st    12115903
    unstreamMU <- mapS <- mapS <- streamU    21787693
    foldS <- streamU    27096786
    unstreamMU <- zipWithS <- (streamU, mapS <- fold1SS <- (streamU, zipWithS <- (enumFromToEachS <- zipWithS <- (replicateS, mapS <- streamU), streamU)))    60756439
    unstreamMU <- toStream    817568512
This shows the raw running time for each kind of stream loop in the program. Of course, it doesn’t distinguish between different instantiations of the same loop structure (for instance, foldS <- streamU occurs rather frequently), but it’s still very helpful and requires no compiler support whatsoever. Ignoring toStream, which is only called during test data generation, the loop involving fold1SS takes by far the most time. Indeed, fold1SS turns out to be the problem: optimising its implementation speeds up the program by almost a factor of 2.
Implementing all this took less than one day. I guess this shows yet again: don’t mess with the compiler if you can mess with the library instead. Also, use Haskell.