
Sparking imperatives

April 21, 2010

Here is a fun hack I came up with while working on vector and repa. One way to express parallelism in Haskell is to write x `par` y (par comes from the parallel package). This is semantically equivalent to y but hints to the runtime system that it might be a good idea to start evaluating x in parallel with y. For example, this code says that the two expensive computations may be executed concurrently:

let x = expensive computation 1
    y = expensive computation 2
in
(x `par` y) `pseq` (x+y)

The pseq is necessary because we have to tell the compiler to evaluate x `par` y before x+y. This is all explained in Algorithm + Strategy = Parallelism.

Internally, when x `par` y is evaluated the runtime system creates a spark for x. A spark is a bit like a thread but cheaper. The RTS maintains a queue of sparks and evaluates them whenever it has a spare CPU or core. The scheduling algorithm is based on work stealing and is really quite sophisticated. A detailed description of the RTS is given in Runtime Support for Multicore Haskell.
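
To make this concrete, here is a minimal, self-contained example (my sketch, not code from this post; nfib is the usual toy benchmark). Compile it with ghc -O2 -threaded -rtsopts and run with +RTS -N -s to see the spark statistics:

import Control.Parallel (par, pseq)

-- Deliberately naive Fibonacci, just to give the sparks some work
nfib :: Int -> Integer
nfib n
  | n < 2     = 1
  | otherwise = x `par` (y `pseq` x + y)
  where
    x = nfib (n - 1)
    y = nfib (n - 2)

main :: IO ()
main = print (nfib 35)

The -s output reports how many sparks were created and how many were actually converted into parallel work.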

Let’s try to abuse this mechanism a bit by sparking ST computations rather than pure ones. Here is how:

parST :: ST s a -> ST s a
parST m = x `par` return x   -- spark x; return hands back the unevaluated thunk
  where
    -- noDuplicate prevents x (and hence m's side effects) from being
    -- evaluated twice; unsafeCoerce gets m past runST's rank-2 type
    x = runST (unsafeIOToST noDuplicate >> unsafeCoerce m)

First, we create a thunk x which, when evaluated, runs the ST computation m. The parallel RTS will sometimes evaluate thunks twice which is fine for pure computations (it just duplicates work) but could be disastrous for stateful ones since it could duplicate side effects. The call to noDuplicate (from GHC.IO) ensures that this doesn’t happen. Then, we spark x and return it. Note that return is lazy and doesn’t evaluate x. This is absolutely crucial.

It is quite possible to implement parST in terms of forkIO. However, sparks are much cheaper than threads (yes, even than GHC threads) which means that this implementation ought to support much more fine-grained parallelism.
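
For comparison, a thread-based version might look something like this (a sketch in IO; the name forkPar and the MVar protocol are mine, not from the post):

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)

-- Fork a real (if lightweight) GHC thread for m and return an action
-- that blocks until its result is available.
forkPar :: IO a -> IO (IO a)
forkPar m = do
  box <- newEmptyMVar
  _ <- forkIO (m >>= putMVar box)
  return (readMVar box)

Every call pays for a thread and an MVar up front; a spark, by contrast, is little more than a pointer pushed onto a queue.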

So how do we use parST? The basic idea is to spark computations, do something else for a while and then synchronise by demanding the results:

do
  x <- parST foo
  bar
  x `seq` return ()

The last line ensures that foo is executed one way or another: either in parallel with bar, or after bar when its result is demanded by seq. We can capture an instance of this pattern in a combinator:

(|||) :: ST s () -> ST s () -> ST s ()
p ||| q = do
  u <- parST p         -- spark p
  q                    -- run q on the current thread
  u `seq` return ()    -- synchronise: demand p's result

This is quite straightforward to use. Here is a rather simple-minded version of in-place Quicksort:

qsort :: (MVector v a, Ord a) => v s a -> ST s ()
qsort v
  | n < 2     = return ()
  | otherwise = do
      x <- unsafeRead v (n `div` 2)
      -- park the pivot at the end so partitioning cannot lose track of it
      unsafeSwap v (n `div` 2) (n - 1)
      i <- unstablePartition (< x) (unsafeSlice 0 (n - 1) v)
      -- move the pivot into its final position i
      unsafeSwap v i (n - 1)
      qsort (unsafeSlice 0 i v)
        ||| qsort (unsafeSlice (i + 1) (n - i - 1) v)
  where
    n = length v

This looks just like sequential Quicksort except that the two recursive calls are potentially executed in parallel.
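
To actually run it you need a concrete mutable vector. A driver along these lines (my sketch, assuming the boxed vectors from Data.Vector; sortVector is not from the post) thaws an immutable vector, sorts it in place and freezes it again:

import qualified Data.Vector as V
import Control.Monad.ST (runST)

sortVector :: Ord a => V.Vector a -> V.Vector a
sortVector xs = runST $ do
  mv <- V.thaw xs        -- mutable copy; xs itself is untouched
  qsort mv
  V.unsafeFreeze mv      -- safe: mv is not used after this point

The result is deterministic even though qsort runs things in parallel, because the two recursive calls always operate on disjoint slices.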

Interestingly, the programming model that parST gives us is very well known (e.g., as lazy threads). In particular, Cilk, which I rather like, is based on a very similar approach. It is quite amazing that we can get this with basically 2 lines of Haskell code.

This leaves two questions. Firstly: is it safe? I think (but I’m not sure) that the answer is yes for IO (we can define parIO similarly to parST) but it is definitely not safe for ST. Here is an example that produces different results depending on scheduling and compiler optimisations:

let x = runST (do { r <- newSTRef 0; writeSTRef r 1 ||| writeSTRef r 2; readSTRef r })
    y = runST (do { r <- newSTRef 0; writeSTRef r 1 ||| writeSTRef r 2; readSTRef r })
in x == y

It ought to be possible to build a safe library on top of it, though.
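
For what it’s worth, the parIO mentioned above might look like this (again only a sketch, by direct analogy with parST; not code from the post):

import Control.Parallel (par)
import GHC.IO (noDuplicate)
import System.IO.Unsafe (unsafePerformIO)

parIO :: IO a -> IO a
parIO m = x `par` return x
  where
    -- as in parST, noDuplicate guards against m's effects running twice
    x = unsafePerformIO (noDuplicate >> m)

The intuition for why this might be safe is that IO makes no purity claims in the first place, whereas runST does.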

So what about performance? Alas, I don’t have time to pursue this at the moment, so I have absolutely no idea. The only real benchmark I tried was Introsort from Dan Doel’s vector-algorithms, which I parallelised like Quicksort above (that is, I changed exactly one line). The chart below shows the running times for sorting 10M random elements (in seconds) vs. the number of cores on an 8-core XServe. Clearly, it does things in parallel but it doesn’t really scale. I have my suspicions as to why this is the case (I blame noDuplicate) but I need to investigate more. It is quite encouraging, however, that the very naive parallel algorithm is barely slower than the sequential one on 1 core (1.98s vs. 2.08s).

8 Comments
  1. Сергей
    April 21, 2010 21:50

    Plotting 1/time, or using a logarithmic scale, would be more interesting.

  2. April 22, 2010 14:53

    Roman,

    Is GHC smart enough not to create a spark for every recursive call to the sort? Is that handled by the laziness of the threads you’ve described?

    Otherwise, I would suggest that even lightweight thread creation here could significantly cut into the speedup gained through parallelism.

    Best

    John

    • April 22, 2010 18:47

      John,

      This does create a spark for each recursive call but a spark is not a thread! It’s just an entry in a work queue which might or might not get picked up and executed by an RTS thread that runs whenever a core/CPU is idle.

  3. April 23, 2010 01:27

    This came up back in December when Ryan Newton and I investigated using this approach as a scheduler for the Haskell port of the Intel Concurrent Collections library. In a perfect world this would provide a very nice low-cost, task-based parallelism framework.

    Unfortunately there is a rather catastrophic problem that comes up with this.

    Sadly, GHC will silently drop sparks once the spark queues fill up. Simon Marlow pointed out the flaw in our design.

    There is even talk of sparks no longer counting as GC roots, which would make it even worse. Ultimately, you really can’t rely on the spark pool to kick off work that has side effects. =(

    • April 23, 2010 01:44

      Edward, I don’t see why this is a problem. Rather, this is precisely the semantics I’d expect. As soon as you synchronise (i.e., demand the result of parST) you are guaranteed that the side effects happen. If you don’t synchronise, the side effects might never happen, but then you can’t expect them to. This is no different from doing something like forkIO (print "Hello"): if the thread never gets scheduled, your program won’t output "Hello".

  4. April 23, 2010 01:56

    I just realized there is a significant difference between what you are doing and what we were doing. We didn’t have the demand component you do.

    We were using par to kick off a bunch of unsafe(Dupable)PerformIO tasks, but there was no direct handle to the result other than the fact that those tasks internally synchronized with each other and ultimately made a bunch of entries visible in a result structure. Given the way the problem was structured, we didn’t have the ‘serial spine’ for gluing it all back together. These work items had previously been scheduled as threads.

    Good job. =)

    • April 23, 2010 02:08

      Ah, that makes sense. Couldn’t you fork one thread and have it evaluate the sparks?

      • April 23, 2010 03:12

        Quite probably. However, if the goal is to keep, say, 8 processors busy, and I kick off a few hundred potentially very long-running tasks, that model doesn’t seem, at first glance, to work so well at keeping things from starving: that ‘cleanup thread’ may wind up forcing a lot of stuff more or less sequentially.
