NoSlow: Microbenchmarks for Haskell array libraries

2009 November 27
by Roman Leshchinskiy

Over the last couple of days, I have implemented a small benchmark suite which tries to measure the performance of various Haskell array libraries, with particular emphasis on finding out how well they are able to fuse things. It is now on Hackage under the very creative and imaginative name NoSlow (Haskell seems to have gained a tradition of naming benchmark suites nosomething). What it does is compile and run a set of micro-benchmarks using these libraries:

The actual benchmarking is done by Brian O’Sullivan’s wonderful criterion.

NoSlow is a very young project so the output isn’t particularly pretty but it does the job. It is also not very thoroughly tested and only includes a very restricted set of benchmarks so saying “library x is better than library y because it produces better numbers in NoSlow” would be completely and utterly wrong. However, it is quite useful for identifying cases where taking a closer look at the code generated for a particular loop might be a good idea. It already helped me spot one tricky performance regression in the standard list library.

Here is an example of the data it produces.

NoSlow results for GHC 6.13.20091124 (only loops specialised for Double)
dph-prim list storablevector uvector vector
seq Primitive Storable boxed
$a+1 58.38504 us 192.9597 us 58.25747 us 58.47876 us 58.52366 us 58.77233 us 405.5711 us
$a+^1 59.03597 us 356.6624 us 210.4899 us 80.25804 us 58.53107 us 59.19546 us 417.2587 us
$a+x 55.62417 us 236.2620 us 55.63200 us 55.62767 us 55.33446 us 55.96459 us 477.2455 us
$a+^x 56.91372 us 445.1638 us 237.9741 us 75.81757 us 55.79731 us 56.45521 us 491.3216 us
$a+x+x+x+x 96.75086 us 280.3636 us 236.0562 us 96.76510 us 95.07297 us 97.15892 us 519.6680 us
$a+$b 64.82325 us 255.3842 us 91.09551 us 64.82592 us 65.21805 us 65.10213 us 500.7479 us
$a+$b(zip) 64.48503 us 347.4379 us 64.49701 us 499.5277 us
x*$a+$b 75.63250 us 452.0626 us 154.3398 us 75.68617 us 110.5486 us 155.2830 us 649.9281 us
($a+$b)*($c+$d) 125.0923 us 1.080122 ms 288.3781 us 125.0150 us 919.2376 us 1.132326 ms 2.967417 ms
($a+$b)*($c+$d)(zip) 155.3547 us 742.2439 us 155.3554 us 2.509351 ms
(x+$a)*(y+$b) 104.0011 us 678.5383 us 203.9229 us 103.9131 us 89.32382 us 89.98352 us 401.6301 us
(^x+$a)*(^y+$b) 118.5413 us 1.061493 ms 534.0542 us 150.2891 us 822.7933 us 930.1098 us 2.650773 ms
filter(neq0)(map0) 61.45542 us 78.80583 us 104.6501 us 61.35867 us 68.04449 us 61.67384 us 124.5576 us
filter(neq0)(^0) 17.42000 us 23.41616 us 173.0071 us 17.91704 us 23.33224 us 24.97328 us 59.48926 us
filter(eq0)(map0) 69.61809 us 174.2209 us 242.6149 us 69.63563 us 69.34671 us 69.89652 us 331.1557 us
filter(eq0)(^0) 45.96399 us 94.22360 us 308.5451 us 45.75672 us 52.15233 us 45.52413 us 201.7559 us
zip_filter 76.31172 us 466.6432 us 334.0208 us 76.24536 us 99.80086 us 100.4251 us 341.6012 us
filter_zip 81.01137 us 468.8768 us 239.6135 us 79.83034 us 80.92583 us 87.66268 us 255.4129 us
filter_evens 59.13895 us 182.2491 us 58.59793 us 337.2523 us
sum($a*$b) 115.0304 us 1.130831 ms 145.4783 us 115.0499 us 73.86368 us 74.39775 us 99.39568 us
sum($a*$b)(zip) 114.9057 us 1.420020 ms 114.8766 us 99.43429 us
sum[m..n] 53.50359 us 646.0969 us 235.4800 us 53.76784 us 129.2202 us 133.0979 us 134.6730 us
sumsq(map) 67.20919 us 653.9266 us 292.8844 us 67.21005 us 143.1072 us 142.2476 us 143.5639 us
sumsq(zip) 67.24133 us 1.928390 ms 311.0818 us 160.3321 us 369.7680 us 380.4847 us 914.5373 us
sum_evens 120.4421 us 401.7939 us 120.4255 us 150.9723 us

Benchmarks

At the moment, NoSlow only benchmarks a bunch of very small and fairly random loop kernels as I was more concerned with getting the infrastructure in place. Here is an example (called $a+$b in the table):

dotp as bs = sum (zipWith (*) as bs)

When it is built, NoSlow compiles several different versions of this loop. For instance, for the storablevector backend it will generate these two functions:

dotp :: (Storable a, Num a) => StorableVector a -> StorableVector a -> a
dotp :: StorableVector Double -> StorableVector Double -> Double

The first one is overloaded on the element type whereas the second one is specialised to Double. NoSlow does this for all its backends and then runs all generated versions of the benchmark with the same test data. With more testing and more benchmarks, this will give us a rough estimate of the relative performance of the different array libraries. It also shows how much faster specialised loops are compared to their polymorphic versions

Here is the output of a full run. *Double and Double are overloaded and specialised loops, respectively. At the moment, NoSlow only does Double benchmarks but adding more types is easy.

NoSlow results for GHC 6.13.20091124
dph-prim list storablevector uvector vector
seq Primitive Storable boxed
*Double Double *Double Double *Double Double *Double Double *Double Double *Double Double *Double Double
$a+1 388.9189 us 58.38504 us 334.0156 us 192.9597 us 310.7921 us 58.25747 us 410.9401 us 58.47876 us 831.8845 us 58.52366 us 467.6662 us 58.77233 us 563.7031 us 405.5711 us
$a+^1 412.6220 us 59.03597 us 361.8679 us 356.6624 us 808.4398 us 210.4899 us 554.7731 us 80.25804 us 842.5514 us 58.53107 us 469.5773 us 59.19546 us 570.0256 us 417.2587 us
$a+x 398.3495 us 55.62417 us 266.9966 us 236.2620 us 317.0817 us 55.63200 us 416.8037 us 55.62767 us 848.4946 us 55.33446 us 479.7607 us 55.96459 us 576.0136 us 477.2455 us
$a+^x 401.3565 us 56.91372 us 376.1232 us 445.1638 us 821.9274 us 237.9741 us 560.3928 us 75.81757 us 854.3754 us 55.79731 us 483.5405 us 56.45521 us 579.5658 us 491.3216 us
$a+x+x+x+x 1.099485 ms 96.75086 us 989.8561 us 280.3636 us 1.367973 ms 236.0562 us 1.108161 ms 96.76510 us 1.732897 ms 95.07297 us 1.402448 ms 97.15892 us 1.918533 ms 519.6680 us
$a+$b 675.8037 us 64.82325 us 266.1987 us 255.3842 us 695.4756 us 91.09551 us 721.3087 us 64.82592 us 1.238228 ms 65.21805 us 758.5373 us 65.10213 us 579.0221 us 500.7479 us
$a+$b(zip) 674.6217 us 64.48503 us 486.2640 us 347.4379 us 718.1258 us 64.49701 us 579.6009 us 499.5277 us
x*$a+$b 843.1906 us 75.63250 us 576.2166 us 452.0626 us 1.086456 ms 154.3398 us 889.1795 us 75.68617 us 1.494543 ms 110.5486 us 877.4670 us 155.2830 us 1.226473 ms 649.9281 us
($a+$b)*($c+$d) 1.510817 ms 125.0923 us 1.003353 ms 1.080122 ms 1.958700 ms 288.3781 us 1.582407 ms 125.0150 us 3.109755 ms 919.2376 us 1.915603 ms 1.132326 ms 3.199406 ms 2.967417 ms
($a+$b)*($c+$d)(zip) 1.585162 ms 155.3547 us 1.345901 ms 742.2439 us 1.845875 ms 155.3554 us 2.749534 ms 2.509351 ms
(x+$a)*(y+$b) 1.139386 ms 104.0011 us 969.7649 us 678.5383 us 1.451681 ms 203.9229 us 1.174562 ms 103.9131 us 1.950864 ms 89.32382 us 1.404879 ms 89.98352 us 2.030718 ms 401.6301 us
(^x+$a)*(^y+$b) 1.095473 ms 118.5413 us 1.154678 ms 1.061493 ms 2.512015 ms 534.0542 us 1.530284 ms 150.2891 us 2.452523 ms 822.7933 us 1.825248 ms 930.1098 us 3.399711 ms 2.650773 ms
filter(neq0)(map0) 357.0597 us 61.45542 us 320.2454 us 78.80583 us 581.4489 us 104.6501 us 364.2109 us 61.35867 us 609.2637 us 68.04449 us 531.9321 us 61.67384 us 370.0963 us 124.5576 us
filter(neq0)(^0) 19.80998 us 17.42000 us 6.715188 us 23.41616 us 404.8197 us 173.0071 us 20.19333 us 17.91704 us 24.38774 us 23.33224 us 24.77845 us 24.97328 us 59.51524 us 59.48926 us
filter(eq0)(map0) 527.7883 us 69.61809 us 499.8775 us 174.2209 us 839.2031 us 242.6149 us 542.4786 us 69.63563 us 1.015737 ms 69.34671 us 622.5425 us 69.89652 us 619.5380 us 331.1557 us
filter(eq0)(^0) 153.7917 us 45.96399 us 87.08318 us 94.22360 us 648.8350 us 308.5451 us 162.6549 us 45.75672 us 417.1879 us 52.15233 us 126.1947 us 45.52413 us 214.1162 us 201.7559 us
zip_filter 696.4477 us 76.31172 us 631.0826 us 466.6432 us 1.158844 ms 334.0208 us 735.2156 us 76.24536 us 1.042058 ms 99.80086 us 635.7235 us 100.4251 us 588.1413 us 341.6012 us
filter_zip 741.1476 us 81.01137 us 467.9363 us 468.8768 us 1.073220 ms 239.6135 us 789.2609 us 79.83034 us 1.249209 ms 80.92583 us 835.3062 us 87.66268 us 507.6646 us 255.4129 us
filter_evens 315.2792 us 59.13895 us 185.6894 us 182.2491 us 342.7927 us 58.59793 us 321.4253 us 337.2523 us
sum($a*$b) 889.3029 us 115.0304 us 1.113731 ms 1.130831 ms 837.7596 us 145.4783 us 630.7635 us 115.0499 us 1.046405 ms 73.86368 us 956.1445 us 74.39775 us 310.2107 us 99.39568 us
sum($a*$b)(zip) 893.2471 us 114.9057 us 1.890718 ms 1.420020 ms 624.3066 us 114.8766 us 371.0969 us 99.43429 us
sum[m..n] 480.6165 us 53.50359 us 1.006370 ms 646.0969 us 599.5453 us 235.4800 us 257.9740 us 53.76784 us 520.8189 us 129.2202 us 455.0702 us 133.0979 us 450.9119 us 134.6730 us
sumsq(map) 601.0236 us 67.20919 us 1.389623 ms 653.9266 us 916.6944 us 292.8844 us 361.3013 us 67.21005 us 633.0462 us 143.1072 us 632.7607 us 142.2476 us 626.6808 us 143.5639 us
sumsq(zip) 581.3862 us 67.24133 us 2.487876 ms 1.928390 ms 1.391611 ms 311.0818 us 931.6956 us 160.3321 us 2.725828 ms 369.7680 us 1.892799 ms 380.4847 us 1.770848 ms 914.5373 us
sum_evens 522.6244 us 120.4421 us 377.5469 us 401.7939 us 283.7662 us 120.4255 us 293.9641 us 150.9723 us

NoSlow can also be used to see how well different GHC versions optimise array loops. Here is a comparison of the current HEAD to GHC 6.10.4 (2 means 6.13 is 2x slower, 0.5 means it is 2x faster).

NoSlow results for GHC 6.13.20091124 vs GHC 6.10.4
dph-prim list storablevector uvector vector
seq Primitive Storable boxed
*Double Double *Double Double *Double Double *Double Double *Double Double *Double Double *Double Double
$a+1 0.169 0.998 1.821 1.023 0.879 1.000 0.183 1.000 0.109 0.019 0.069 1.000 0.994 0.985
$a+^1 0.178 1.001 1.027 1.029 0.068 0.114 0.130 1.001 0.058 0.014 0.031 0.999 1.011 1.004
$a+x 0.172 1.004 1.138 0.989 0.970 1.006 0.185 1.005 0.121 0.018 0.072 1.002 1.024 0.965
$a+^x 0.173 0.982 1.034 1.023 0.068 0.127 0.132 0.992 0.060 0.013 0.032 1.007 1.012 0.979
$a+x+x+x+x 0.400 1.019 1.082 1.007 1.033 0.274 0.386 1.019 0.059 0.008 0.049 0.089 1.034 0.981
$a+$b 0.214 0.966 1.014 1.010 0.067 1.226 0.223 0.966 0.121 0.014 0.072 1.000 0.974 0.984
$a+$b(zip) 0.213 0.965 1.740 1.358 0.223 0.963 0.984 1.002
x*$a+$b 0.251 0.997 1.000 0.982 0.101 1.065 0.263 0.997 0.085 0.020 0.051 1.292 1.007 1.034
($a+$b)*($c+$d) 0.287 0.199 1.168 1.483 0.064 0.271 0.293 0.199 0.101 0.061 0.060 1.058 0.992 1.299
($a+$b)*($c+$d)(zip) 0.302 1.262 1.301 0.906 0.350 1.257 0.949 1.249
(x+$a)*(y+$b) 0.298 1.119 1.019 0.929 0.138 0.372 0.270 1.119 0.080 0.008 0.059 0.136 0.991 1.001
(^x+$a)*(^y+$b) 0.323 0.166 0.911 1.057 0.074 0.114 0.207 0.199 0.061 0.039 0.043 0.799 1.123 1.200
filter(neq0)(map0) 0.374 0.999 1.275 0.990 0.828 0.519 0.364 1.000 0.067 0.047 0.064 0.999 0.986 0.982
filter(neq0)(^0) 0.191 0.949 0.062 1.000 0.189 0.090 0.180 1.016 0.004 0.036 0.004 1.014 0.423 1.029
filter(eq0)(map0) 0.220 1.000 1.163 1.049 0.334 0.805 0.231 1.005 0.090 0.019 0.057 1.003 0.981 1.014
filter(eq0)(^0) 0.131 1.010 0.435 1.177 0.164 0.154 0.137 1.008 0.050 0.019 0.014 1.003 0.738 1.060
zip_filter 0.327 0.991 1.476 0.997 0.093 0.447 0.339 0.995 0.068 0.014 0.041 0.201 1.277 0.815
filter_zip 0.283 0.990 1.024 1.475 0.077 0.785 0.297 0.982 0.083 0.019 0.054 1.000 1.071 0.987
filter_evens 0.180 0.984 0.985 0.998 0.186 0.548 0.109 0.115
sum($a*$b) 0.167 1.475 0.982 0.978 0.074 1.115 0.141 1.476 0.090 0.941 0.079 0.955 0.120 0.642
sum($a*$b)(zip) 0.167 1.475 1.430 1.344 0.139 1.474 0.143 0.998
sum[m..n] 0.137 0.516 1.042 0.994 0.273 1.014 0.104 0.371 0.063 0.043 0.062 0.065 0.090 0.048
sumsq(map) 0.167 0.566 1.012 0.985 0.364 1.012 0.140 0.404 0.079 0.047 0.082 0.069 0.120 0.051
sumsq(zip) 0.088 0.557 1.009 0.968 0.115 1.002 0.165 1.017 0.153 0.061 0.107 0.180 0.272 0.309
sum_evens 0.190 2.114 0.938 1.041 0.115 1.379 0.078 0.053

Usage

The current interface is rather minimalistic. After building NoSlow, you get two binaries: noslow and noslow-table. The first one runs the actual benchmarks and produces a log file:

noslow -u log

The log can then be fed to noslow-table to generate HTML tables:

noslow-table log > table.html

We can also restrict the output to a particular data type:

noslow-table log --type=Double > table.html

or compare two logs:

noslow-table --diff log1 log2 > table.html

In this case, the table will contain ratios of corresponding values from the two logs, just like in the table above.

Compiling

NoSlow should work out of the box with GHC 6.10. I haven’t tried it with 6.12 because I didn’t have time to install the release candidate and the required packages. Getting it to build with the HEAD (which is what I’m interested in) is a rather annoying process, though, because uvector is missing a DPH patch which would make it work with the new IO system. Fortunately, I could simply transplant some code from the DPH library (the module is dph-base/Data/Array/Parallel/Data/Array/Parallel/Arr/BUArr.hs) and everything worked.

Also, storablevector depends on QuickCheck 1 which doesn’t seem to compile with the HEAD. I just removed all references to QuickCheck from the code but it might be easier to simply not build the storablevector backend (there is a flag for that).

How it works

Benchmarking itself is rather straighforward – all the exciting bits are in criterion. The heart of NoSlow is the specialiser – a Template Haskell function that takes a bunch of abstract loop kernels and specialises them for a particular backend and data type. Here is how it works. The kernels themselves are implemented in a TH quote:

kernels = [d| ...
    dotp :: (Num a, I.Vector v a) => Ty (v a) -> v a -> v a -> a
    dotp _ as bs = named "sum($a*$b)" $ I.sum (I.zipWith (*) as bs)
    ... |]

They refer to functions from module I (short for NoName.Backend.Interface) which are just stubs and do not provide any real functionality. The specialiser traverses the ASTs of the kernels (which it has access to since they are quoted) and replaces references to those stubs with calls to functions from the backend that it is specialising for. For instance, it replaces I.sum by NoSlow.Backend.List.sum (which is just a reexport of the Prelude function) when specialising for lists. It also adjusts signatures and looks for tags like named in the example above which defines the kernel’s name in the HTML output.

This scheme has one big advantage: it allows backends to mark some kernels as unsupported. An example is storablevector which does not support arrays of pairs. The corresponding backend defines zip like this:

zip :: Undefined
zip = Undefined

When replacing I.zip with NoSlow.Backend.StorableVector.zip, the specialiser reifies the latter, notices that its type is Unsupported and omits the kernel that contains the call to zip altogether.

Tricking GHC into evaluating recursive functions at compile time

2009 November 5
by Roman Leshchinskiy

Here is a trick I came up with for a project of mine. Suppose you have a GADT like this very simple one:

data T a where
  TInt  :: Int -> T Int
  TPair :: T a -> T b -> T (a,b)

and a function which does something with it:

sumT :: T a -> Int
sumT (TInt n) = n
sumT (TPair l r) = sumT l + sumT r

Now, let’s use the two:

term = TPair (TPair (TInt 1) (TInt 2)) (TInt 3)
foo = sumT term

Since foo is constant, we would expect GHC to evaluate it at compile time and just bind it to 6 in the compiled code, right?

Wrong! For this to happen, GHC would have to inline sumT. But sumT is a recursive function and GHC never inlines those because it might get into an infinite loop otherwise. This means that it won’t optimise foo at all which was absolutely unacceptable in my program. I spent about two days fiddling with inline pragmas, rewrite rules and other unpleasant things until I found a satisfactory solution.

My first attempt was to only inline sumT if it is applied to a constructor. We could try adding a couple of rewrite rules.

"sumT/TInt"  forall n. sumT (TInt n) = n
"sumT/TPair" forall l r. sumT (TPair l r) = sumT l + sumT r

Alas, this doesn’t work most of time. Basically, trying to match on non-trivial constructors in rewrite rules is never a good idea. We could introduce “virtual” constructors, use them everywhere instead of the real ones and match on them.

tInt :: Int -> T Int
{-# NOINLINE CONLIKE tInt #-}
tInt = TInt

tPair :: T a -> T b -> T (a,b)
{-# NOINLINE CONLIKE tPair #-}
tPair = TPair

"sumT/tInt" forall n. sumT (tInt n) = n
"sumT/tPair" forall l r. sumT (tPair l r) = sumT l + sumT r

This works much better but, unfortunately, still fails in my program. There, I make extensive use of type families so the Core generated by GHC has casts all over the place. Casts make rule matching highly unreliable because rules don’t ignore them (a particularly ugly wart that I keep running into). So what to do?

The solution I came up with requires adding a unit component to every recursive constructor.

data T a where
  TInt  :: Int -> T Int
  TPair :: () -> T a -> T b -> T (a,b)

Where we previously wrote TPair, we will now write TPair (). In fact, let’s provide a convenience function for that:

tPair :: T a -> T b -> T (a,b)
tPair = TPair ()

Now, we define a non-recursive version of sumT which is parametrised with a function it is supposed to apply to the pair components.

sumT_cont :: (forall a. () -> T a -> Int) -> T a -> Int
{-# INLINE sumT_cont #-}
sumT_cont cont (TInt  n) = n
sumT_cont cont (TPair u l r) = cont u l + cont u r

Note that since sumT_cont isn’t recursive it can be freely inlined. Note also that we pass the unit value from the constructor to cont. This is absolutely essential.

The actual recursive sum is defined via sumT_cont. Of course, it has to be parametrised with () (which it ignores).

sumT' :: () -> T a -> Int
sumT' _ = sumT_cont sumT'

sumT :: T a -> Int
{-# INLINE sumT #-}
sumT = sumT' ()

The final missing piece that makes the whole thing work is this simple rewrite rule:

"sumT'"  sumT' () = sumT_cont sumT'

It “inlines” sumT' but only if it is applied to (). Why is this useful? Let’s see what happens if we apply sumT to a term which GHC knows nothing about:

   sumT x
= {inline sumT}
   sumT' () x
= {apply rule "sumT'"}
   sumT_cont sumT' x
= {inline sumT_cont}
    case x of TInt n -> n
              TPair u l r -> sumT' u l + sumT' u r

Rewriting sumT' to sumT_cont sumT’ again would be a disaster as it would put us into an infinite rewriting loop. This is precisely the reason why GHC won't inline recursive functions. But our rule doesn't match here because u is not guaranteed to be ()!

So what happens if we apply sumT to a term that is at least partially static?

   sumT (tPair (tInt 1) y)
= {inline sumT and tPair}
   sumT' () (TPair () (TInt 1) y)
= {apply rule "sumT'"}
   sumT_cont sumT' (TPair () (TInt 1) y)
= {inline sumT_cont}
    case TPair () (TInt 1) y of TInt n -> n
                                TPair u l r -> sumT' u l + sumT' u r
= {eliminate case}
   sumT' () (TInt 1) + sumT' () y
= {apply rule "sumT'" twice}
    sumT_cont sumT' (TInt 1) + sumT_cont sumT' y
= {inline sumT_cont, eliminate case}
    1 + case y of TInt n -> n
                  TPair u l r -> sumT' u l + sumT' u r

This looks good! In effect, GHC executed sumT for the statically known portion of the term at compile time and deferred the rest to run time. This worked because when it eliminated the case on TPair it bound u in the case alternative to (). This allowed it to apply the "sumT'" rule again and thus to get rid of the TInt constructor in the left component. The right component is unknown, though, so rewriting stops there. In general, after "inlining" (via the rewrite rule) sumT' once, GHC will only apply the rule again if it eliminates the case, thus binding u to (). This, in turn, is only possible if the head of the term is a known constructor so GHC will continue rewriting and inlining until it consumes all known constructors but will not get into an infinite loop. For foo from my first example, which is fully constant, it will perform the entire computation at compile time and reduce it to 6.

A word of warning: it is possible to get GHC into an infinite loop with this approach by constructing infinite but statically known terms. For instance, we could apply the same technique to this type.

data U = UInt Int | UPair U U

But now, this term gets us into trouble:

x = UPair (UInt 1) x

This technique works best with GADTs like T that do not admit infinite terms.

Profiling with stream fusion

2009 October 30
by Roman Leshchinskiy

As we move on to bigger examples in DPH, identifying performance problems just by staring at the Core output becomes somewhat difficult. We’ve finally reached a point where we actually have to profile DPH programs and identify slow loops. But how? Cost centre profiling isn’t supported by the vectoriser and in any case, it works on the Haskell source whereas we are interested in loops generated by the vectoriser. Ticky-ticky profiling happens a bit too late, when all those loops have been transformed into something mostly unrecognisable. It also seems to have problems with -threaded. Looks like we have to implement it ourselves…

With a bit of thought, this turns out to be surprisingly easy. Since DPH uses stream fusion, (almost) all those loops operate on streams. Now, the basic idea is to annotate all streams with a string which tells us how the stream has been produced.

data Stream a = forall s. Stream ... String

replicateS :: Int -> a -> Stream a
replicateS n x = Stream ... "replicateS"

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream ... c) = Stream ... ("mapS <- " ++ c)

and so on. Now, when a real loop consumes a stream it simply logs its running time together with the stream’s producer:

unstreamU :: Stream a -> UArr a
unstreamU (Stream ... c) = runST (traceLoopST ("unstreamU <- " ++ c) $ fill)
  where
    fill =

Since streams are purely compile-time structures and no streams should be left in the program if the simplifier is doing its job properly, all those strings simply go away if we compile our library without tracing (i.e., make traceLoopST a noop). With tracing on, however, our rather fancy implementation actually produces dtrace events at the beginning and at the end of a loop. This is a very flexible mechanism. In particular, it allows us to produce profiles like this (pardon my formatting):

  ...
  unstreamMU <- zipWithS <- (zipWithS <- (zipWithS <- (replicateEachS <- zipWithS <- (streamU, streamU), streamU), zipWithS <- (replicateEachS <- zipWithS <- (streamU, streamU), streamU)), zipWithS <- (zipWithS <- (replicateEachS <- zipWithS <- (streamU, st         12115903
  unstreamMU <- mapS <- mapS <- streamU                      21787693
  foldS <- streamU                                           27096786
  unstreamMU <- zipWithS <- (streamU, mapS <- fold1SS <- (streamU, zipWithS <- (enumFromToEachS <- zipWithS <- (replicateS, mapS <- streamU), streamU)))         60756439
  unstreamMU <- toStream                                    817568512

This shows the raw running time for each kind of stream loop in the program. Of course, it doesn’t distinguish between different instantiations of the same loop structure (for instance, foldS <- streamU occurs rather frequently) but it’s still very helpful and doesn’t require any compiler support whatsoever. Ignoring toStream which is only called during test data generation, the loop with fold1SS takes by far the most time. Indeed, it turns out that fold1SS is the problem; optimising its implementation speeds up the program almost by a factor of 2.

Implementing all this took less than one day. I guess this shows yet again: don’t mess with the compiler if you can mess with the library instead. Also, use Haskell.

Talk on “Loop Fusion in Haskell”

2009 October 22
by Roman Leshchinskiy

I gave a talk about loop fusion in Haskell today at FP-Syd, the Sydney Functional Programming group. It covered stream fusion and fusion for distributed types which are two of the optimisations that make Data Parallel Haskell fast.

Talk on “Generics in Data Parallel Haskell”

2009 October 3
by Roman Leshchinskiy

Here are the slides for my talk on the use of generics in the implementation of Data Parallel Haskell at the SAPLING meeting yesterday. This is where Instant Generics come from.

Squinting at fusion

2009 September 29
by Roman Leshchinskiy

This being my first blog post and all, I’ll try to maximise boredom and minimise readability by writing as few lines of text as possible. Here we go…

As we know, recursive data types are fixpoints of non-recursive ones. So, for instance, the standard list data type:

data [a] = [] | a : [a]

is just the fixpoint of this guy:

data PreList a s = Nil | Cons a s

with these simple injection/projection functions:

inject :: PreList a [a] -> [a]
inject Nil         = []
inject (Cons x xs) = x : xs

project :: [a] -> PreList a [a]
project []     = Nil
project (x:xs) = Cons x xs

Of course, PreList is also a functor:

instance Functor (PreList a) where
  fmap f Nil        = Nil
  fmap f (Cons x s) = Cons x (f s)

Now, our goal is to mutilate the standard short cut fusion rules by using PreList as much as possible. Let’s do destroy/unfoldr first:

destroy :: (forall s. (s -> PreList a s) -> s -> t) -> [a] -> t
destroy g = g project

unfoldr :: (s -> PreList a s) -> s -> [a]
unfoldr f = inject . fmap (unfoldr f) . f

The fusion rule is

destroy g (unfoldr f s) = g f s

Of course, unfoldr is just the list anamorphism but what’s so special about destroy? If we squint hard enough at its type we might realise that

forall t. (forall s. (s -> PreList a s) -> s -> t) -> t

is, in fact, isomorphic to

exists s. (s -> PreList a s, s)

This being Haskell, we have to introduce a separate data type for the existential:

data Unfolding a = forall s. Unfolding (s -> PreList a s) s

This makes the signatures of our two functions a lot simpler:

destroy :: [a] -> Unfolding a
destroy xs = Unfolding project xs

unfoldr :: Unfolding a -> [a]
unfoldr (Unfolding f s) = ana s
  where
     ana = inject . fmap ana . f

And the fusion rule is a bit nicer, too:

destroy (unfoldr s) = s

In fact, what we have here is almost but not quite stream fusion: destroy is equivalent to stream, unfoldr to unstream and Unfolding to Stream. The only difference is that PreList (which corresponds to the Step data type in the paper) is missing the Skip constructor which, unfortunately, is crucial for making the whole thing work.

This part was clear to me back when we were doing stream fusion but the next bit I didn’t understand until recently. The question is: can we do something similar with foldr/build? Again, let’s get rid of as many funny types as possible and use PreList instead:

foldr :: (PreList a s -> s) -> [a] -> s
foldr f = f . fmap (foldr f) . project

build :: (forall s. (PreList a s -> s) -> s) -> [a]
build g = g inject

If you are wondering where these signatures come from, here is a little hint:

(a -> s -> s) -> s -> t   ~   ((a,s) -> s) -> (() -> s) -> t    ~   ((a,s)+() -> s) -> t   ~   (PreList a s -> s) -> t

Now, by squinting at the types we might just see that it makes sense to introduce an abstraction:

data Folding a = Folding (forall s. (PreList a s -> s) -> s)

and use it:

foldr :: [a] -> Folding a
foldr xs = Folding (\f -> let cata = f . fmap cata . project in cata xs)

build :: Folding a -> [a]
build (Folding g) = g inject

VoilĂ ! We now have a shiny new foldr/build fusion rule:

foldr (build s) = s

It’s probably worth pointing out that Folding does, in fact, support many useful list operations. Here is an example:

instance Functor Folding where
  fmap f (Folding g) = Folding (\h -> g (h . emap))
    where
      emap Nil        = Nil
      emap (Cons x s) = Cons (f x) s

So is there a point to all this? Well, I find it quite interesting that foldr/build fusion can be rewritten in this way. I’m even more intrigued by what we get if we squint at the list hylomorphism:

refold :: Unfolding a -> Folding a
refold (Unfolding g s) = Folding (\f -> let hylo = f . fmap hylo . g in  hylo s)

Could we have both stream fusion and foldr/build in one framework? Would that be useful? Is there a way of working on fusion without irreparably damaging my eyes?