Over the last couple of days, I have implemented a small benchmark suite which tries to measure the performance of various Haskell array libraries, with particular emphasis on finding out how well they are able to fuse things. It is now on Hackage under the very creative and imaginative name NoSlow (Haskell seems to have gained a tradition of naming benchmark suites nosomething). What it does is compile and run a set of micro-benchmarks using these libraries:
- standard lists
- primitive arrays from the DPH project (package dph-prim-seq)
- uvector (a fork of the above)
- vector (primitive, storable and boxed arrays)
- storablevector
The actual benchmarking is done by Brian O’Sullivan’s wonderful criterion.
NoSlow is a very young project so the output isn’t particularly pretty but it does the job. It is also not very thoroughly tested and only includes a very restricted set of benchmarks so saying “library x is better than library y because it produces better numbers in NoSlow” would be completely and utterly wrong. However, it is quite useful for identifying cases where taking a closer look at the code generated for a particular loop might be a good idea. It already helped me spot one tricky performance regression in the standard list library.
Here is an example of the data it produces.
| dph-prim | list | storablevector | uvector | vector | |||
|---|---|---|---|---|---|---|---|
| seq | Primitive | Storable | boxed | ||||
| $a+1 | 58.38504 us | 192.9597 us | 58.25747 us | 58.47876 us | 58.52366 us | 58.77233 us | 405.5711 us |
| $a+^1 | 59.03597 us | 356.6624 us | 210.4899 us | 80.25804 us | 58.53107 us | 59.19546 us | 417.2587 us |
| $a+x | 55.62417 us | 236.2620 us | 55.63200 us | 55.62767 us | 55.33446 us | 55.96459 us | 477.2455 us |
| $a+^x | 56.91372 us | 445.1638 us | 237.9741 us | 75.81757 us | 55.79731 us | 56.45521 us | 491.3216 us |
| $a+x+x+x+x | 96.75086 us | 280.3636 us | 236.0562 us | 96.76510 us | 95.07297 us | 97.15892 us | 519.6680 us |
| $a+$b | 64.82325 us | 255.3842 us | 91.09551 us | 64.82592 us | 65.21805 us | 65.10213 us | 500.7479 us |
| $a+$b(zip) | 64.48503 us | 347.4379 us | 64.49701 us | 499.5277 us | |||
| x*$a+$b | 75.63250 us | 452.0626 us | 154.3398 us | 75.68617 us | 110.5486 us | 155.2830 us | 649.9281 us |
| ($a+$b)*($c+$d) | 125.0923 us | 1.080122 ms | 288.3781 us | 125.0150 us | 919.2376 us | 1.132326 ms | 2.967417 ms |
| ($a+$b)*($c+$d)(zip) | 155.3547 us | 742.2439 us | 155.3554 us | 2.509351 ms | |||
| (x+$a)*(y+$b) | 104.0011 us | 678.5383 us | 203.9229 us | 103.9131 us | 89.32382 us | 89.98352 us | 401.6301 us |
| (^x+$a)*(^y+$b) | 118.5413 us | 1.061493 ms | 534.0542 us | 150.2891 us | 822.7933 us | 930.1098 us | 2.650773 ms |
| filter(neq0)(map0) | 61.45542 us | 78.80583 us | 104.6501 us | 61.35867 us | 68.04449 us | 61.67384 us | 124.5576 us |
| filter(neq0)(^0) | 17.42000 us | 23.41616 us | 173.0071 us | 17.91704 us | 23.33224 us | 24.97328 us | 59.48926 us |
| filter(eq0)(map0) | 69.61809 us | 174.2209 us | 242.6149 us | 69.63563 us | 69.34671 us | 69.89652 us | 331.1557 us |
| filter(eq0)(^0) | 45.96399 us | 94.22360 us | 308.5451 us | 45.75672 us | 52.15233 us | 45.52413 us | 201.7559 us |
| zip_filter | 76.31172 us | 466.6432 us | 334.0208 us | 76.24536 us | 99.80086 us | 100.4251 us | 341.6012 us |
| filter_zip | 81.01137 us | 468.8768 us | 239.6135 us | 79.83034 us | 80.92583 us | 87.66268 us | 255.4129 us |
| filter_evens | 59.13895 us | 182.2491 us | 58.59793 us | 337.2523 us | |||
| sum($a*$b) | 115.0304 us | 1.130831 ms | 145.4783 us | 115.0499 us | 73.86368 us | 74.39775 us | 99.39568 us |
| sum($a*$b)(zip) | 114.9057 us | 1.420020 ms | 114.8766 us | 99.43429 us | |||
| sum[m..n] | 53.50359 us | 646.0969 us | 235.4800 us | 53.76784 us | 129.2202 us | 133.0979 us | 134.6730 us |
| sumsq(map) | 67.20919 us | 653.9266 us | 292.8844 us | 67.21005 us | 143.1072 us | 142.2476 us | 143.5639 us |
| sumsq(zip) | 67.24133 us | 1.928390 ms | 311.0818 us | 160.3321 us | 369.7680 us | 380.4847 us | 914.5373 us |
| sum_evens | 120.4421 us | 401.7939 us | 120.4255 us | 150.9723 us | |||
Benchmarks
At the moment, NoSlow only benchmarks a bunch of very small and fairly random loop kernels as I was more concerned with getting the infrastructure in place. Here is an example (called $a+$b in the table):
dotp as bs = sum (zipWith (*) as bs)
When it is built, NoSlow compiles several different versions of this loop. For instance, for the storablevector backend it will generate these two functions:
dotp :: (Storable a, Num a) => StorableVector a -> StorableVector a -> a dotp :: StorableVector Double -> StorableVector Double -> Double
The first one is overloaded on the element type whereas the second one is specialised to Double. NoSlow does this for all its backends and then runs all generated versions of the benchmark with the same test data. With more testing and more benchmarks, this will give us a rough estimate of the relative performance of the different array libraries. It also shows how much faster specialised loops are compared to their polymorphic versions
Here is the output of a full run. *Double and Double are overloaded and specialised loops, respectively. At the moment, NoSlow only does Double benchmarks but adding more types is easy.
| dph-prim | list | storablevector | uvector | vector | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq | Primitive | Storable | boxed | |||||||||||
| *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | |
| $a+1 | 388.9189 us | 58.38504 us | 334.0156 us | 192.9597 us | 310.7921 us | 58.25747 us | 410.9401 us | 58.47876 us | 831.8845 us | 58.52366 us | 467.6662 us | 58.77233 us | 563.7031 us | 405.5711 us |
| $a+^1 | 412.6220 us | 59.03597 us | 361.8679 us | 356.6624 us | 808.4398 us | 210.4899 us | 554.7731 us | 80.25804 us | 842.5514 us | 58.53107 us | 469.5773 us | 59.19546 us | 570.0256 us | 417.2587 us |
| $a+x | 398.3495 us | 55.62417 us | 266.9966 us | 236.2620 us | 317.0817 us | 55.63200 us | 416.8037 us | 55.62767 us | 848.4946 us | 55.33446 us | 479.7607 us | 55.96459 us | 576.0136 us | 477.2455 us |
| $a+^x | 401.3565 us | 56.91372 us | 376.1232 us | 445.1638 us | 821.9274 us | 237.9741 us | 560.3928 us | 75.81757 us | 854.3754 us | 55.79731 us | 483.5405 us | 56.45521 us | 579.5658 us | 491.3216 us |
| $a+x+x+x+x | 1.099485 ms | 96.75086 us | 989.8561 us | 280.3636 us | 1.367973 ms | 236.0562 us | 1.108161 ms | 96.76510 us | 1.732897 ms | 95.07297 us | 1.402448 ms | 97.15892 us | 1.918533 ms | 519.6680 us |
| $a+$b | 675.8037 us | 64.82325 us | 266.1987 us | 255.3842 us | 695.4756 us | 91.09551 us | 721.3087 us | 64.82592 us | 1.238228 ms | 65.21805 us | 758.5373 us | 65.10213 us | 579.0221 us | 500.7479 us |
| $a+$b(zip) | 674.6217 us | 64.48503 us | 486.2640 us | 347.4379 us | 718.1258 us | 64.49701 us | 579.6009 us | 499.5277 us | ||||||
| x*$a+$b | 843.1906 us | 75.63250 us | 576.2166 us | 452.0626 us | 1.086456 ms | 154.3398 us | 889.1795 us | 75.68617 us | 1.494543 ms | 110.5486 us | 877.4670 us | 155.2830 us | 1.226473 ms | 649.9281 us |
| ($a+$b)*($c+$d) | 1.510817 ms | 125.0923 us | 1.003353 ms | 1.080122 ms | 1.958700 ms | 288.3781 us | 1.582407 ms | 125.0150 us | 3.109755 ms | 919.2376 us | 1.915603 ms | 1.132326 ms | 3.199406 ms | 2.967417 ms |
| ($a+$b)*($c+$d)(zip) | 1.585162 ms | 155.3547 us | 1.345901 ms | 742.2439 us | 1.845875 ms | 155.3554 us | 2.749534 ms | 2.509351 ms | ||||||
| (x+$a)*(y+$b) | 1.139386 ms | 104.0011 us | 969.7649 us | 678.5383 us | 1.451681 ms | 203.9229 us | 1.174562 ms | 103.9131 us | 1.950864 ms | 89.32382 us | 1.404879 ms | 89.98352 us | 2.030718 ms | 401.6301 us |
| (^x+$a)*(^y+$b) | 1.095473 ms | 118.5413 us | 1.154678 ms | 1.061493 ms | 2.512015 ms | 534.0542 us | 1.530284 ms | 150.2891 us | 2.452523 ms | 822.7933 us | 1.825248 ms | 930.1098 us | 3.399711 ms | 2.650773 ms |
| filter(neq0)(map0) | 357.0597 us | 61.45542 us | 320.2454 us | 78.80583 us | 581.4489 us | 104.6501 us | 364.2109 us | 61.35867 us | 609.2637 us | 68.04449 us | 531.9321 us | 61.67384 us | 370.0963 us | 124.5576 us |
| filter(neq0)(^0) | 19.80998 us | 17.42000 us | 6.715188 us | 23.41616 us | 404.8197 us | 173.0071 us | 20.19333 us | 17.91704 us | 24.38774 us | 23.33224 us | 24.77845 us | 24.97328 us | 59.51524 us | 59.48926 us |
| filter(eq0)(map0) | 527.7883 us | 69.61809 us | 499.8775 us | 174.2209 us | 839.2031 us | 242.6149 us | 542.4786 us | 69.63563 us | 1.015737 ms | 69.34671 us | 622.5425 us | 69.89652 us | 619.5380 us | 331.1557 us |
| filter(eq0)(^0) | 153.7917 us | 45.96399 us | 87.08318 us | 94.22360 us | 648.8350 us | 308.5451 us | 162.6549 us | 45.75672 us | 417.1879 us | 52.15233 us | 126.1947 us | 45.52413 us | 214.1162 us | 201.7559 us |
| zip_filter | 696.4477 us | 76.31172 us | 631.0826 us | 466.6432 us | 1.158844 ms | 334.0208 us | 735.2156 us | 76.24536 us | 1.042058 ms | 99.80086 us | 635.7235 us | 100.4251 us | 588.1413 us | 341.6012 us |
| filter_zip | 741.1476 us | 81.01137 us | 467.9363 us | 468.8768 us | 1.073220 ms | 239.6135 us | 789.2609 us | 79.83034 us | 1.249209 ms | 80.92583 us | 835.3062 us | 87.66268 us | 507.6646 us | 255.4129 us |
| filter_evens | 315.2792 us | 59.13895 us | 185.6894 us | 182.2491 us | 342.7927 us | 58.59793 us | 321.4253 us | 337.2523 us | ||||||
| sum($a*$b) | 889.3029 us | 115.0304 us | 1.113731 ms | 1.130831 ms | 837.7596 us | 145.4783 us | 630.7635 us | 115.0499 us | 1.046405 ms | 73.86368 us | 956.1445 us | 74.39775 us | 310.2107 us | 99.39568 us |
| sum($a*$b)(zip) | 893.2471 us | 114.9057 us | 1.890718 ms | 1.420020 ms | 624.3066 us | 114.8766 us | 371.0969 us | 99.43429 us | ||||||
| sum[m..n] | 480.6165 us | 53.50359 us | 1.006370 ms | 646.0969 us | 599.5453 us | 235.4800 us | 257.9740 us | 53.76784 us | 520.8189 us | 129.2202 us | 455.0702 us | 133.0979 us | 450.9119 us | 134.6730 us |
| sumsq(map) | 601.0236 us | 67.20919 us | 1.389623 ms | 653.9266 us | 916.6944 us | 292.8844 us | 361.3013 us | 67.21005 us | 633.0462 us | 143.1072 us | 632.7607 us | 142.2476 us | 626.6808 us | 143.5639 us |
| sumsq(zip) | 581.3862 us | 67.24133 us | 2.487876 ms | 1.928390 ms | 1.391611 ms | 311.0818 us | 931.6956 us | 160.3321 us | 2.725828 ms | 369.7680 us | 1.892799 ms | 380.4847 us | 1.770848 ms | 914.5373 us |
| sum_evens | 522.6244 us | 120.4421 us | 377.5469 us | 401.7939 us | 283.7662 us | 120.4255 us | 293.9641 us | 150.9723 us | ||||||
NoSlow can also be used to see how well different GHC versions optimise array loops. Here is a comparison of the current HEAD to GHC 6.10.4 (2 means 6.13 is 2x slower, 0.5 means it is 2x faster).
| dph-prim | list | storablevector | uvector | vector | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq | Primitive | Storable | boxed | |||||||||||
| *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | |
| $a+1 | 0.169 | 0.998 | 1.821 | 1.023 | 0.879 | 1.000 | 0.183 | 1.000 | 0.109 | 0.019 | 0.069 | 1.000 | 0.994 | 0.985 |
| $a+^1 | 0.178 | 1.001 | 1.027 | 1.029 | 0.068 | 0.114 | 0.130 | 1.001 | 0.058 | 0.014 | 0.031 | 0.999 | 1.011 | 1.004 |
| $a+x | 0.172 | 1.004 | 1.138 | 0.989 | 0.970 | 1.006 | 0.185 | 1.005 | 0.121 | 0.018 | 0.072 | 1.002 | 1.024 | 0.965 |
| $a+^x | 0.173 | 0.982 | 1.034 | 1.023 | 0.068 | 0.127 | 0.132 | 0.992 | 0.060 | 0.013 | 0.032 | 1.007 | 1.012 | 0.979 |
| $a+x+x+x+x | 0.400 | 1.019 | 1.082 | 1.007 | 1.033 | 0.274 | 0.386 | 1.019 | 0.059 | 0.008 | 0.049 | 0.089 | 1.034 | 0.981 |
| $a+$b | 0.214 | 0.966 | 1.014 | 1.010 | 0.067 | 1.226 | 0.223 | 0.966 | 0.121 | 0.014 | 0.072 | 1.000 | 0.974 | 0.984 |
| $a+$b(zip) | 0.213 | 0.965 | 1.740 | 1.358 | 0.223 | 0.963 | 0.984 | 1.002 | ||||||
| x*$a+$b | 0.251 | 0.997 | 1.000 | 0.982 | 0.101 | 1.065 | 0.263 | 0.997 | 0.085 | 0.020 | 0.051 | 1.292 | 1.007 | 1.034 |
| ($a+$b)*($c+$d) | 0.287 | 0.199 | 1.168 | 1.483 | 0.064 | 0.271 | 0.293 | 0.199 | 0.101 | 0.061 | 0.060 | 1.058 | 0.992 | 1.299 |
| ($a+$b)*($c+$d)(zip) | 0.302 | 1.262 | 1.301 | 0.906 | 0.350 | 1.257 | 0.949 | 1.249 | ||||||
| (x+$a)*(y+$b) | 0.298 | 1.119 | 1.019 | 0.929 | 0.138 | 0.372 | 0.270 | 1.119 | 0.080 | 0.008 | 0.059 | 0.136 | 0.991 | 1.001 |
| (^x+$a)*(^y+$b) | 0.323 | 0.166 | 0.911 | 1.057 | 0.074 | 0.114 | 0.207 | 0.199 | 0.061 | 0.039 | 0.043 | 0.799 | 1.123 | 1.200 |
| filter(neq0)(map0) | 0.374 | 0.999 | 1.275 | 0.990 | 0.828 | 0.519 | 0.364 | 1.000 | 0.067 | 0.047 | 0.064 | 0.999 | 0.986 | 0.982 |
| filter(neq0)(^0) | 0.191 | 0.949 | 0.062 | 1.000 | 0.189 | 0.090 | 0.180 | 1.016 | 0.004 | 0.036 | 0.004 | 1.014 | 0.423 | 1.029 |
| filter(eq0)(map0) | 0.220 | 1.000 | 1.163 | 1.049 | 0.334 | 0.805 | 0.231 | 1.005 | 0.090 | 0.019 | 0.057 | 1.003 | 0.981 | 1.014 |
| filter(eq0)(^0) | 0.131 | 1.010 | 0.435 | 1.177 | 0.164 | 0.154 | 0.137 | 1.008 | 0.050 | 0.019 | 0.014 | 1.003 | 0.738 | 1.060 |
| zip_filter | 0.327 | 0.991 | 1.476 | 0.997 | 0.093 | 0.447 | 0.339 | 0.995 | 0.068 | 0.014 | 0.041 | 0.201 | 1.277 | 0.815 |
| filter_zip | 0.283 | 0.990 | 1.024 | 1.475 | 0.077 | 0.785 | 0.297 | 0.982 | 0.083 | 0.019 | 0.054 | 1.000 | 1.071 | 0.987 |
| filter_evens | 0.180 | 0.984 | 0.985 | 0.998 | 0.186 | 0.548 | 0.109 | 0.115 | ||||||
| sum($a*$b) | 0.167 | 1.475 | 0.982 | 0.978 | 0.074 | 1.115 | 0.141 | 1.476 | 0.090 | 0.941 | 0.079 | 0.955 | 0.120 | 0.642 |
| sum($a*$b)(zip) | 0.167 | 1.475 | 1.430 | 1.344 | 0.139 | 1.474 | 0.143 | 0.998 | ||||||
| sum[m..n] | 0.137 | 0.516 | 1.042 | 0.994 | 0.273 | 1.014 | 0.104 | 0.371 | 0.063 | 0.043 | 0.062 | 0.065 | 0.090 | 0.048 |
| sumsq(map) | 0.167 | 0.566 | 1.012 | 0.985 | 0.364 | 1.012 | 0.140 | 0.404 | 0.079 | 0.047 | 0.082 | 0.069 | 0.120 | 0.051 |
| sumsq(zip) | 0.088 | 0.557 | 1.009 | 0.968 | 0.115 | 1.002 | 0.165 | 1.017 | 0.153 | 0.061 | 0.107 | 0.180 | 0.272 | 0.309 |
| sum_evens | 0.190 | 2.114 | 0.938 | 1.041 | 0.115 | 1.379 | 0.078 | 0.053 | ||||||
Usage
The current interface is rather minimalistic. After building NoSlow, you get two binaries: noslow and noslow-table. The first one runs the actual benchmarks and produces a log file:
noslow -u log
The log can then be fed to noslow-table to generate HTML tables:
noslow-table log > table.html
We can also restrict the output to a particular data type:
noslow-table log --type=Double > table.html
or compare two logs:
noslow-table --diff log1 log2 > table.html
In this case, the table will contain ratios of corresponding values from the two logs, just like in the table above.
Compiling
NoSlow should work out of the box with GHC 6.10. I haven’t tried it with 6.12 because I didn’t have time to install the release candidate and the required packages. Getting it to build with the HEAD (which is what I’m interested in) is a rather annoying process, though, because uvector is missing a DPH patch which would make it work with the new IO system. Fortunately, I could simply transplant some code from the DPH library (the module is dph-base/Data/Array/Parallel/Data/Array/Parallel/Arr/BUArr.hs) and everything worked.
Also, storablevector depends on QuickCheck 1 which doesn’t seem to compile with the HEAD. I just removed all references to QuickCheck from the code but it might be easier to simply not build the storablevector backend (there is a flag for that).
How it works
Benchmarking itself is rather straighforward – all the exciting bits are in criterion. The heart of NoSlow is the specialiser – a Template Haskell function that takes a bunch of abstract loop kernels and specialises them for a particular backend and data type. Here is how it works. The kernels themselves are implemented in a TH quote:
kernels = [d| ...
dotp :: (Num a, I.Vector v a) => Ty (v a) -> v a -> v a -> a
dotp _ as bs = named "sum($a*$b)" $ I.sum (I.zipWith (*) as bs)
... |]
They refer to functions from module I (short for NoName.Backend.Interface) which are just stubs and do not provide any real functionality. The specialiser traverses the ASTs of the kernels (which it has access to since they are quoted) and replaces references to those stubs with calls to functions from the backend that it is specialising for. For instance, it replaces I.sum by NoSlow.Backend.List.sum (which is just a reexport of the Prelude function) when specialising for lists. It also adjusts signatures and looks for tags like named in the example above which defines the kernel’s name in the HTML output.
This scheme has one big advantage: it allows backends to mark some kernels as unsupported. An example is storablevector which does not support arrays of pairs. The corresponding backend defines zip like this:
zip :: Undefined zip = Undefined
When replacing I.zip with NoSlow.Backend.StorableVector.zip, the specialiser reifies the latter, notices that its type is Unsupported and omits the kernel that contains the call to zip altogether.
Here is a trick I came up with for a project of mine. Suppose you have a GADT like this very simple one:
data T a where TInt :: Int -> T Int TPair :: T a -> T b -> T (a,b)
and a function which does something with it:
sumT :: T a -> Int sumT (TInt n) = n sumT (TPair l r) = sumT l + sumT r
Now, let’s use the two:
term = TPair (TPair (TInt 1) (TInt 2)) (TInt 3) foo = sumT term
Since foo is constant, we would expect GHC to evaluate it at compile time and just bind it to 6 in the compiled code, right?
Wrong! For this to happen, GHC would have to inline sumT. But sumT is a recursive function and GHC never inlines those because it might get into an infinite loop otherwise. This means that it won’t optimise foo at all which was absolutely unacceptable in my program. I spent about two days fiddling with inline pragmas, rewrite rules and other unpleasant things until I found a satisfactory solution.
My first attempt was to only inline sumT if it is applied to a constructor. We could try adding a couple of rewrite rules.
"sumT/TInt" forall n. sumT (TInt n) = n "sumT/TPair" forall l r. sumT (TPair l r) = sumT l + sumT r
Alas, this doesn’t work most of time. Basically, trying to match on non-trivial constructors in rewrite rules is never a good idea. We could introduce “virtual” constructors, use them everywhere instead of the real ones and match on them.
tInt :: Int -> T Int
{-# NOINLINE CONLIKE tInt #-}
tInt = TInt
tPair :: T a -> T b -> T (a,b)
{-# NOINLINE CONLIKE tPair #-}
tPair = TPair
"sumT/tInt" forall n. sumT (tInt n) = n
"sumT/tPair" forall l r. sumT (tPair l r) = sumT l + sumT r
This works much better but, unfortunately, still fails in my program. There, I make extensive use of type families so the Core generated by GHC has casts all over the place. Casts make rule matching highly unreliable because rules don’t ignore them (a particularly ugly wart that I keep running into). So what to do?
The solution I came up with requires adding a unit component to every recursive constructor.
data T a where TInt :: Int -> T Int TPair :: () -> T a -> T b -> T (a,b)
Where we previously wrote TPair, we will now write TPair (). In fact, let’s provide a convenience function for that:
tPair :: T a -> T b -> T (a,b) tPair = TPair ()
Now, we define a non-recursive version of sumT which is parametrised with a function it is supposed to apply to the pair components.
sumT_cont :: (forall a. () -> T a -> Int) -> T a -> Int
{-# INLINE sumT_cont #-}
sumT_cont cont (TInt n) = n
sumT_cont cont (TPair u l r) = cont u l + cont u r
Note that since sumT_cont isn’t recursive it can be freely inlined. Note also that we pass the unit value from the constructor to cont. This is absolutely essential.
The actual recursive sum is defined via sumT_cont. Of course, it has to be parametrised with () (which it ignores).
sumT' :: () -> T a -> Int
sumT' _ = sumT_cont sumT'
sumT :: T a -> Int
{-# INLINE sumT #-}
sumT = sumT' ()
The final missing piece that makes the whole thing work is this simple rewrite rule:
"sumT'" sumT' () = sumT_cont sumT'
It “inlines” sumT' but only if it is applied to (). Why is this useful? Let’s see what happens if we apply sumT to a term which GHC knows nothing about:
sumT x
= {inline sumT}
sumT' () x
= {apply rule "sumT'"}
sumT_cont sumT' x
= {inline sumT_cont}
case x of TInt n -> n
TPair u l r -> sumT' u l + sumT' u r
Rewriting sumT' to sumT_cont sumT’ again would be a disaster as it would put us into an infinite rewriting loop. This is precisely the reason why GHC won't inline recursive functions. But our rule doesn't match here because u is not guaranteed to be ()!
So what happens if we apply sumT to a term that is at least partially static?
sumT (tPair (tInt 1) y)
= {inline sumT and tPair}
sumT' () (TPair () (TInt 1) y)
= {apply rule "sumT'"}
sumT_cont sumT' (TPair () (TInt 1) y)
= {inline sumT_cont}
case TPair () (TInt 1) y of TInt n -> n
TPair u l r -> sumT' u l + sumT' u r
= {eliminate case}
sumT' () (TInt 1) + sumT' () y
= {apply rule "sumT'" twice}
sumT_cont sumT' (TInt 1) + sumT_cont sumT' y
= {inline sumT_cont, eliminate case}
1 + case y of TInt n -> n
TPair u l r -> sumT' u l + sumT' u r
This looks good! In effect, GHC executed sumT for the statically known portion of the term at compile time and deferred the rest to run time. This worked because when it eliminated the case on TPair it bound u in the case alternative to (). This allowed it to apply the "sumT'" rule again and thus to get rid of the TInt constructor in the left component. The right component is unknown, though, so rewriting stops there. In general, after "inlining" (via the rewrite rule) sumT' once, GHC will only apply the rule again if it eliminates the case, thus binding u to (). This, in turn, is only possible if the head of the term is a known constructor so GHC will continue rewriting and inlining until it consumes all known constructors but will not get into an infinite loop. For foo from my first example, which is fully constant, it will perform the entire computation at compile time and reduce it to 6.
A word of warning: it is possible to get GHC into an infinite loop with this approach by constructing infinite but statically known terms. For instance, we could apply the same technique to this type.
data U = UInt Int | UPair U U
But now, this term gets us into trouble:
x = UPair (UInt 1) x
This technique works best with GADTs like T that do not admit infinite terms.
As we move on to bigger examples in DPH, identifying performance problems just by staring at the Core output becomes somewhat difficult. We’ve finally reached a point where we actually have to profile DPH programs and identify slow loops. But how? Cost centre profiling isn’t supported by the vectoriser and in any case, it works on the Haskell source whereas we are interested in loops generated by the vectoriser. Ticky-ticky profiling happens a bit too late, when all those loops have been transformed into something mostly unrecognisable. It also seems to have problems with -threaded. Looks like we have to implement it ourselves…
With a bit of thought, this turns out to be surprisingly easy. Since DPH uses stream fusion, (almost) all those loops operate on streams. Now, the basic idea is to annotate all streams with a string which tells us how the stream has been produced.
data Stream a = forall s. Stream ... String
replicateS :: Int -> a -> Stream a
replicateS n x = Stream ... "replicateS"
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream ... c) = Stream ... ("mapS <- " ++ c)
and so on. Now, when a real loop consumes a stream it simply logs its running time together with the stream’s producer:
unstreamU :: Stream a -> UArr a
unstreamU (Stream ... c) = runST (traceLoopST ("unstreamU <- " ++ c) $ fill)
where
fill =
Since streams are purely compile-time structures and no streams should be left in the program if the simplifier is doing its job properly, all those strings simply go away if we compile our library without tracing (i.e., make traceLoopST a noop). With tracing on, however, our rather fancy implementation actually produces dtrace events at the beginning and at the end of a loop. This is a very flexible mechanism. In particular, it allows us to produce profiles like this (pardon my formatting):
... unstreamMU <- zipWithS <- (zipWithS <- (zipWithS <- (replicateEachS <- zipWithS <- (streamU, streamU), streamU), zipWithS <- (replicateEachS <- zipWithS <- (streamU, streamU), streamU)), zipWithS <- (zipWithS <- (replicateEachS <- zipWithS <- (streamU, st 12115903 unstreamMU <- mapS <- mapS <- streamU 21787693 foldS <- streamU 27096786 unstreamMU <- zipWithS <- (streamU, mapS <- fold1SS <- (streamU, zipWithS <- (enumFromToEachS <- zipWithS <- (replicateS, mapS <- streamU), streamU))) 60756439 unstreamMU <- toStream 817568512
This shows the raw running time for each kind of stream loop in the program. Of course, it doesn’t distinguish between different instantiations of the same loop structure (for instance, foldS <- streamU occurs rather frequently) but it’s still very helpful and doesn’t require any compiler support whatsoever. Ignoring toStream which is only called during test data generation, the loop with fold1SS takes by far the most time. Indeed, it turns out that fold1SS is the problem; optimising its implementation speeds up the program almost by a factor of 2.
Implementing all this took less than one day. I guess this shows yet again: don’t mess with the compiler if you can mess with the library instead. Also, use Haskell.
I gave a talk about loop fusion in Haskell today at FP-Syd, the Sydney Functional Programming group. It covered stream fusion and fusion for distributed types which are two of the optimisations that make Data Parallel Haskell fast.
Here are the slides for my talk on the use of generics in the implementation of Data Parallel Haskell at the SAPLING meeting yesterday. This is where Instant Generics come from.
This being my first blog post and all, I’ll try to maximise boredom and minimise readability by writing as few lines of text as possible. Here we go…
As we know, recursive data types are fixpoints of non-recursive ones. So, for instance, the standard list data type:
data [a] = [] | a : [a]
is just the fixpoint of this guy:
data PreList a s = Nil | Cons a s
with these simple injection/projection functions:
inject :: PreList a [a] -> [a] inject Nil = [] inject (Cons x xs) = x : xs project :: [a] -> PreList a [a] project [] = Nil project (x:xs) = Cons x xs
Of course, PreList is also a functor:
instance Functor (PreList a) where fmap f Nil = Nil fmap f (Cons x s) = Cons x (f s)
Now, our goal is to mutilate the standard short cut fusion rules by using PreList as much as possible. Let’s do destroy/unfoldr first:
destroy :: (forall s. (s -> PreList a s) -> s -> t) -> [a] -> t destroy g = g project unfoldr :: (s -> PreList a s) -> s -> [a] unfoldr f = inject . fmap (unfoldr f) . f
The fusion rule is
destroy g (unfoldr f s) = g f s
Of course, unfoldr is just the list anamorphism but what’s so special about destroy? If we squint hard enough at its type we might realise that
forall t. (forall s. (s -> PreList a s) -> s -> t) -> t
is, in fact, isomorphic to
exists s. (s -> PreList a s, s)
This being Haskell, we have to introduce a separate data type for the existential:
data Unfolding a = forall s. Unfolding (s -> PreList a s) s
This makes the signatures of our two functions a lot simpler:
destroy :: [a] -> Unfolding a
destroy xs = Unfolding project xs
unfoldr :: Unfolding a -> [a]
unfoldr (Unfolding f s) = ana s
where
ana = inject . fmap ana . f
And the fusion rule is a bit nicer, too:
destroy (unfoldr s) = s
In fact, what we have here is almost but not quite stream fusion: destroy is equivalent to stream, unfoldr to unstream and Unfolding to Stream. The only difference is that PreList (which corresponds to the Step data type in the paper) is missing the Skip constructor which, unfortunately, is crucial for making the whole thing work.
This part was clear to me back when we were doing stream fusion but the next bit I didn’t understand until recently. The question is: can we do something similar with foldr/build? Again, let’s get rid of as many funny types as possible and use PreList instead:
foldr :: (PreList a s -> s) -> [a] -> s foldr f = f . fmap (foldr f) . project build :: (forall s. (PreList a s -> s) -> s) -> [a] build g = g inject
If you are wondering where these signatures come from, here is a little hint:
(a -> s -> s) -> s -> t ~ ((a,s) -> s) -> (() -> s) -> t ~ ((a,s)+() -> s) -> t ~ (PreList a s -> s) -> t
Now, by squinting at the types we might just see that it makes sense to introduce an abstraction:
data Folding a = Folding (forall s. (PreList a s -> s) -> s)
and use it:
foldr :: [a] -> Folding a foldr xs = Folding (\f -> let cata = f . fmap cata . project in cata xs) build :: Folding a -> [a] build (Folding g) = g inject
VoilĂ ! We now have a shiny new foldr/build fusion rule:
foldr (build s) = s
It’s probably worth pointing out that Folding does, in fact, support many useful list operations. Here is an example:
instance Functor Folding where
fmap f (Folding g) = Folding (\h -> g (h . emap))
where
emap Nil = Nil
emap (Cons x s) = Cons (f x) s
So is there a point to all this? Well, I find it quite interesting that foldr/build fusion can be rewritten in this way. I’m even more intrigued by what we get if we squint at the list hylomorphism:
refold :: Unfolding a -> Folding a refold (Unfolding g s) = Folding (\f -> let hylo = f . fmap hylo . g in hylo s)
Could we have both stream fusion and foldr/build in one framework? Would that be useful? Is there a way of working on fusion without irreparably damaging my eyes?