NoSlow: Microbenchmarks for Haskell array libraries
Over the last couple of days, I have implemented a small benchmark suite which tries to measure the performance of various Haskell array libraries, with particular emphasis on finding out how well they are able to fuse things. It is now on Hackage under the very creative and imaginative name NoSlow (Haskell seems to have gained a tradition of naming benchmark suites nosomething). What it does is compile and run a set of micro-benchmarks using these libraries:
- standard lists
- primitive arrays from the DPH project (package dph-prim-seq)
- uvector (a fork of the above)
- vector (primitive, storable and boxed arrays)
- storablevector
The actual benchmarking is done by Brian O’Sullivan’s wonderful criterion.
NoSlow is a very young project so the output isn’t particularly pretty but it does the job. It is also not very thoroughly tested and only includes a very restricted set of benchmarks so saying “library x is better than library y because it produces better numbers in NoSlow” would be completely and utterly wrong. However, it is quite useful for identifying cases where taking a closer look at the code generated for a particular loop might be a good idea. It already helped me spot one tricky performance regression in the standard list library.
Here is an example of the data it produces.
| dph-prim | list | storablevector | uvector | vector | |||
|---|---|---|---|---|---|---|---|
| seq | Primitive | Storable | boxed | ||||
| $a+1 | 58.38504 us | 192.9597 us | 58.25747 us | 58.47876 us | 58.52366 us | 58.77233 us | 405.5711 us |
| $a+^1 | 59.03597 us | 356.6624 us | 210.4899 us | 80.25804 us | 58.53107 us | 59.19546 us | 417.2587 us |
| $a+x | 55.62417 us | 236.2620 us | 55.63200 us | 55.62767 us | 55.33446 us | 55.96459 us | 477.2455 us |
| $a+^x | 56.91372 us | 445.1638 us | 237.9741 us | 75.81757 us | 55.79731 us | 56.45521 us | 491.3216 us |
| $a+x+x+x+x | 96.75086 us | 280.3636 us | 236.0562 us | 96.76510 us | 95.07297 us | 97.15892 us | 519.6680 us |
| $a+$b | 64.82325 us | 255.3842 us | 91.09551 us | 64.82592 us | 65.21805 us | 65.10213 us | 500.7479 us |
| $a+$b(zip) | 64.48503 us | 347.4379 us | 64.49701 us | 499.5277 us | |||
| x*$a+$b | 75.63250 us | 452.0626 us | 154.3398 us | 75.68617 us | 110.5486 us | 155.2830 us | 649.9281 us |
| ($a+$b)*($c+$d) | 125.0923 us | 1.080122 ms | 288.3781 us | 125.0150 us | 919.2376 us | 1.132326 ms | 2.967417 ms |
| ($a+$b)*($c+$d)(zip) | 155.3547 us | 742.2439 us | 155.3554 us | 2.509351 ms | |||
| (x+$a)*(y+$b) | 104.0011 us | 678.5383 us | 203.9229 us | 103.9131 us | 89.32382 us | 89.98352 us | 401.6301 us |
| (^x+$a)*(^y+$b) | 118.5413 us | 1.061493 ms | 534.0542 us | 150.2891 us | 822.7933 us | 930.1098 us | 2.650773 ms |
| filter(neq0)(map0) | 61.45542 us | 78.80583 us | 104.6501 us | 61.35867 us | 68.04449 us | 61.67384 us | 124.5576 us |
| filter(neq0)(^0) | 17.42000 us | 23.41616 us | 173.0071 us | 17.91704 us | 23.33224 us | 24.97328 us | 59.48926 us |
| filter(eq0)(map0) | 69.61809 us | 174.2209 us | 242.6149 us | 69.63563 us | 69.34671 us | 69.89652 us | 331.1557 us |
| filter(eq0)(^0) | 45.96399 us | 94.22360 us | 308.5451 us | 45.75672 us | 52.15233 us | 45.52413 us | 201.7559 us |
| zip_filter | 76.31172 us | 466.6432 us | 334.0208 us | 76.24536 us | 99.80086 us | 100.4251 us | 341.6012 us |
| filter_zip | 81.01137 us | 468.8768 us | 239.6135 us | 79.83034 us | 80.92583 us | 87.66268 us | 255.4129 us |
| filter_evens | 59.13895 us | 182.2491 us | 58.59793 us | 337.2523 us | |||
| sum($a*$b) | 115.0304 us | 1.130831 ms | 145.4783 us | 115.0499 us | 73.86368 us | 74.39775 us | 99.39568 us |
| sum($a*$b)(zip) | 114.9057 us | 1.420020 ms | 114.8766 us | 99.43429 us | |||
| sum[m..n] | 53.50359 us | 646.0969 us | 235.4800 us | 53.76784 us | 129.2202 us | 133.0979 us | 134.6730 us |
| sumsq(map) | 67.20919 us | 653.9266 us | 292.8844 us | 67.21005 us | 143.1072 us | 142.2476 us | 143.5639 us |
| sumsq(zip) | 67.24133 us | 1.928390 ms | 311.0818 us | 160.3321 us | 369.7680 us | 380.4847 us | 914.5373 us |
| sum_evens | 120.4421 us | 401.7939 us | 120.4255 us | 150.9723 us | |||
Benchmarks
At the moment, NoSlow only benchmarks a bunch of very small and fairly random loop kernels as I was more concerned with getting the infrastructure in place. Here is an example (called $a+$b in the table):
dotp as bs = sum (zipWith (*) as bs)
When it is built, NoSlow compiles several different versions of this loop. For instance, for the storablevector backend it will generate these two functions:
dotp :: (Storable a, Num a) => StorableVector a -> StorableVector a -> a dotp :: StorableVector Double -> StorableVector Double -> Double
The first one is overloaded on the element type whereas the second one is specialised to Double. NoSlow does this for all its backends and then runs all generated versions of the benchmark with the same test data. With more testing and more benchmarks, this will give us a rough estimate of the relative performance of the different array libraries. It also shows how much faster specialised loops are compared to their polymorphic versions
Here is the output of a full run. *Double and Double are overloaded and specialised loops, respectively. At the moment, NoSlow only does Double benchmarks but adding more types is easy.
| dph-prim | list | storablevector | uvector | vector | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq | Primitive | Storable | boxed | |||||||||||
| *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | |
| $a+1 | 388.9189 us | 58.38504 us | 334.0156 us | 192.9597 us | 310.7921 us | 58.25747 us | 410.9401 us | 58.47876 us | 831.8845 us | 58.52366 us | 467.6662 us | 58.77233 us | 563.7031 us | 405.5711 us |
| $a+^1 | 412.6220 us | 59.03597 us | 361.8679 us | 356.6624 us | 808.4398 us | 210.4899 us | 554.7731 us | 80.25804 us | 842.5514 us | 58.53107 us | 469.5773 us | 59.19546 us | 570.0256 us | 417.2587 us |
| $a+x | 398.3495 us | 55.62417 us | 266.9966 us | 236.2620 us | 317.0817 us | 55.63200 us | 416.8037 us | 55.62767 us | 848.4946 us | 55.33446 us | 479.7607 us | 55.96459 us | 576.0136 us | 477.2455 us |
| $a+^x | 401.3565 us | 56.91372 us | 376.1232 us | 445.1638 us | 821.9274 us | 237.9741 us | 560.3928 us | 75.81757 us | 854.3754 us | 55.79731 us | 483.5405 us | 56.45521 us | 579.5658 us | 491.3216 us |
| $a+x+x+x+x | 1.099485 ms | 96.75086 us | 989.8561 us | 280.3636 us | 1.367973 ms | 236.0562 us | 1.108161 ms | 96.76510 us | 1.732897 ms | 95.07297 us | 1.402448 ms | 97.15892 us | 1.918533 ms | 519.6680 us |
| $a+$b | 675.8037 us | 64.82325 us | 266.1987 us | 255.3842 us | 695.4756 us | 91.09551 us | 721.3087 us | 64.82592 us | 1.238228 ms | 65.21805 us | 758.5373 us | 65.10213 us | 579.0221 us | 500.7479 us |
| $a+$b(zip) | 674.6217 us | 64.48503 us | 486.2640 us | 347.4379 us | 718.1258 us | 64.49701 us | 579.6009 us | 499.5277 us | ||||||
| x*$a+$b | 843.1906 us | 75.63250 us | 576.2166 us | 452.0626 us | 1.086456 ms | 154.3398 us | 889.1795 us | 75.68617 us | 1.494543 ms | 110.5486 us | 877.4670 us | 155.2830 us | 1.226473 ms | 649.9281 us |
| ($a+$b)*($c+$d) | 1.510817 ms | 125.0923 us | 1.003353 ms | 1.080122 ms | 1.958700 ms | 288.3781 us | 1.582407 ms | 125.0150 us | 3.109755 ms | 919.2376 us | 1.915603 ms | 1.132326 ms | 3.199406 ms | 2.967417 ms |
| ($a+$b)*($c+$d)(zip) | 1.585162 ms | 155.3547 us | 1.345901 ms | 742.2439 us | 1.845875 ms | 155.3554 us | 2.749534 ms | 2.509351 ms | ||||||
| (x+$a)*(y+$b) | 1.139386 ms | 104.0011 us | 969.7649 us | 678.5383 us | 1.451681 ms | 203.9229 us | 1.174562 ms | 103.9131 us | 1.950864 ms | 89.32382 us | 1.404879 ms | 89.98352 us | 2.030718 ms | 401.6301 us |
| (^x+$a)*(^y+$b) | 1.095473 ms | 118.5413 us | 1.154678 ms | 1.061493 ms | 2.512015 ms | 534.0542 us | 1.530284 ms | 150.2891 us | 2.452523 ms | 822.7933 us | 1.825248 ms | 930.1098 us | 3.399711 ms | 2.650773 ms |
| filter(neq0)(map0) | 357.0597 us | 61.45542 us | 320.2454 us | 78.80583 us | 581.4489 us | 104.6501 us | 364.2109 us | 61.35867 us | 609.2637 us | 68.04449 us | 531.9321 us | 61.67384 us | 370.0963 us | 124.5576 us |
| filter(neq0)(^0) | 19.80998 us | 17.42000 us | 6.715188 us | 23.41616 us | 404.8197 us | 173.0071 us | 20.19333 us | 17.91704 us | 24.38774 us | 23.33224 us | 24.77845 us | 24.97328 us | 59.51524 us | 59.48926 us |
| filter(eq0)(map0) | 527.7883 us | 69.61809 us | 499.8775 us | 174.2209 us | 839.2031 us | 242.6149 us | 542.4786 us | 69.63563 us | 1.015737 ms | 69.34671 us | 622.5425 us | 69.89652 us | 619.5380 us | 331.1557 us |
| filter(eq0)(^0) | 153.7917 us | 45.96399 us | 87.08318 us | 94.22360 us | 648.8350 us | 308.5451 us | 162.6549 us | 45.75672 us | 417.1879 us | 52.15233 us | 126.1947 us | 45.52413 us | 214.1162 us | 201.7559 us |
| zip_filter | 696.4477 us | 76.31172 us | 631.0826 us | 466.6432 us | 1.158844 ms | 334.0208 us | 735.2156 us | 76.24536 us | 1.042058 ms | 99.80086 us | 635.7235 us | 100.4251 us | 588.1413 us | 341.6012 us |
| filter_zip | 741.1476 us | 81.01137 us | 467.9363 us | 468.8768 us | 1.073220 ms | 239.6135 us | 789.2609 us | 79.83034 us | 1.249209 ms | 80.92583 us | 835.3062 us | 87.66268 us | 507.6646 us | 255.4129 us |
| filter_evens | 315.2792 us | 59.13895 us | 185.6894 us | 182.2491 us | 342.7927 us | 58.59793 us | 321.4253 us | 337.2523 us | ||||||
| sum($a*$b) | 889.3029 us | 115.0304 us | 1.113731 ms | 1.130831 ms | 837.7596 us | 145.4783 us | 630.7635 us | 115.0499 us | 1.046405 ms | 73.86368 us | 956.1445 us | 74.39775 us | 310.2107 us | 99.39568 us |
| sum($a*$b)(zip) | 893.2471 us | 114.9057 us | 1.890718 ms | 1.420020 ms | 624.3066 us | 114.8766 us | 371.0969 us | 99.43429 us | ||||||
| sum[m..n] | 480.6165 us | 53.50359 us | 1.006370 ms | 646.0969 us | 599.5453 us | 235.4800 us | 257.9740 us | 53.76784 us | 520.8189 us | 129.2202 us | 455.0702 us | 133.0979 us | 450.9119 us | 134.6730 us |
| sumsq(map) | 601.0236 us | 67.20919 us | 1.389623 ms | 653.9266 us | 916.6944 us | 292.8844 us | 361.3013 us | 67.21005 us | 633.0462 us | 143.1072 us | 632.7607 us | 142.2476 us | 626.6808 us | 143.5639 us |
| sumsq(zip) | 581.3862 us | 67.24133 us | 2.487876 ms | 1.928390 ms | 1.391611 ms | 311.0818 us | 931.6956 us | 160.3321 us | 2.725828 ms | 369.7680 us | 1.892799 ms | 380.4847 us | 1.770848 ms | 914.5373 us |
| sum_evens | 522.6244 us | 120.4421 us | 377.5469 us | 401.7939 us | 283.7662 us | 120.4255 us | 293.9641 us | 150.9723 us | ||||||
NoSlow can also be used to see how well different GHC versions optimise array loops. Here is a comparison of the current HEAD to GHC 6.10.4 (2 means 6.13 is 2x slower, 0.5 means it is 2x faster).
| dph-prim | list | storablevector | uvector | vector | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq | Primitive | Storable | boxed | |||||||||||
| *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | *Double | Double | |
| $a+1 | 0.169 | 0.998 | 1.821 | 1.023 | 0.879 | 1.000 | 0.183 | 1.000 | 0.109 | 0.019 | 0.069 | 1.000 | 0.994 | 0.985 |
| $a+^1 | 0.178 | 1.001 | 1.027 | 1.029 | 0.068 | 0.114 | 0.130 | 1.001 | 0.058 | 0.014 | 0.031 | 0.999 | 1.011 | 1.004 |
| $a+x | 0.172 | 1.004 | 1.138 | 0.989 | 0.970 | 1.006 | 0.185 | 1.005 | 0.121 | 0.018 | 0.072 | 1.002 | 1.024 | 0.965 |
| $a+^x | 0.173 | 0.982 | 1.034 | 1.023 | 0.068 | 0.127 | 0.132 | 0.992 | 0.060 | 0.013 | 0.032 | 1.007 | 1.012 | 0.979 |
| $a+x+x+x+x | 0.400 | 1.019 | 1.082 | 1.007 | 1.033 | 0.274 | 0.386 | 1.019 | 0.059 | 0.008 | 0.049 | 0.089 | 1.034 | 0.981 |
| $a+$b | 0.214 | 0.966 | 1.014 | 1.010 | 0.067 | 1.226 | 0.223 | 0.966 | 0.121 | 0.014 | 0.072 | 1.000 | 0.974 | 0.984 |
| $a+$b(zip) | 0.213 | 0.965 | 1.740 | 1.358 | 0.223 | 0.963 | 0.984 | 1.002 | ||||||
| x*$a+$b | 0.251 | 0.997 | 1.000 | 0.982 | 0.101 | 1.065 | 0.263 | 0.997 | 0.085 | 0.020 | 0.051 | 1.292 | 1.007 | 1.034 |
| ($a+$b)*($c+$d) | 0.287 | 0.199 | 1.168 | 1.483 | 0.064 | 0.271 | 0.293 | 0.199 | 0.101 | 0.061 | 0.060 | 1.058 | 0.992 | 1.299 |
| ($a+$b)*($c+$d)(zip) | 0.302 | 1.262 | 1.301 | 0.906 | 0.350 | 1.257 | 0.949 | 1.249 | ||||||
| (x+$a)*(y+$b) | 0.298 | 1.119 | 1.019 | 0.929 | 0.138 | 0.372 | 0.270 | 1.119 | 0.080 | 0.008 | 0.059 | 0.136 | 0.991 | 1.001 |
| (^x+$a)*(^y+$b) | 0.323 | 0.166 | 0.911 | 1.057 | 0.074 | 0.114 | 0.207 | 0.199 | 0.061 | 0.039 | 0.043 | 0.799 | 1.123 | 1.200 |
| filter(neq0)(map0) | 0.374 | 0.999 | 1.275 | 0.990 | 0.828 | 0.519 | 0.364 | 1.000 | 0.067 | 0.047 | 0.064 | 0.999 | 0.986 | 0.982 |
| filter(neq0)(^0) | 0.191 | 0.949 | 0.062 | 1.000 | 0.189 | 0.090 | 0.180 | 1.016 | 0.004 | 0.036 | 0.004 | 1.014 | 0.423 | 1.029 |
| filter(eq0)(map0) | 0.220 | 1.000 | 1.163 | 1.049 | 0.334 | 0.805 | 0.231 | 1.005 | 0.090 | 0.019 | 0.057 | 1.003 | 0.981 | 1.014 |
| filter(eq0)(^0) | 0.131 | 1.010 | 0.435 | 1.177 | 0.164 | 0.154 | 0.137 | 1.008 | 0.050 | 0.019 | 0.014 | 1.003 | 0.738 | 1.060 |
| zip_filter | 0.327 | 0.991 | 1.476 | 0.997 | 0.093 | 0.447 | 0.339 | 0.995 | 0.068 | 0.014 | 0.041 | 0.201 | 1.277 | 0.815 |
| filter_zip | 0.283 | 0.990 | 1.024 | 1.475 | 0.077 | 0.785 | 0.297 | 0.982 | 0.083 | 0.019 | 0.054 | 1.000 | 1.071 | 0.987 |
| filter_evens | 0.180 | 0.984 | 0.985 | 0.998 | 0.186 | 0.548 | 0.109 | 0.115 | ||||||
| sum($a*$b) | 0.167 | 1.475 | 0.982 | 0.978 | 0.074 | 1.115 | 0.141 | 1.476 | 0.090 | 0.941 | 0.079 | 0.955 | 0.120 | 0.642 |
| sum($a*$b)(zip) | 0.167 | 1.475 | 1.430 | 1.344 | 0.139 | 1.474 | 0.143 | 0.998 | ||||||
| sum[m..n] | 0.137 | 0.516 | 1.042 | 0.994 | 0.273 | 1.014 | 0.104 | 0.371 | 0.063 | 0.043 | 0.062 | 0.065 | 0.090 | 0.048 |
| sumsq(map) | 0.167 | 0.566 | 1.012 | 0.985 | 0.364 | 1.012 | 0.140 | 0.404 | 0.079 | 0.047 | 0.082 | 0.069 | 0.120 | 0.051 |
| sumsq(zip) | 0.088 | 0.557 | 1.009 | 0.968 | 0.115 | 1.002 | 0.165 | 1.017 | 0.153 | 0.061 | 0.107 | 0.180 | 0.272 | 0.309 |
| sum_evens | 0.190 | 2.114 | 0.938 | 1.041 | 0.115 | 1.379 | 0.078 | 0.053 | ||||||
Usage
The current interface is rather minimalistic. After building NoSlow, you get two binaries: noslow and noslow-table. The first one runs the actual benchmarks and produces a log file:
noslow -u log
The log can then be fed to noslow-table to generate HTML tables:
noslow-table log > table.html
We can also restrict the output to a particular data type:
noslow-table log --type=Double > table.html
or compare two logs:
noslow-table --diff log1 log2 > table.html
In this case, the table will contain ratios of corresponding values from the two logs, just like in the table above.
Compiling
NoSlow should work out of the box with GHC 6.10. I haven’t tried it with 6.12 because I didn’t have time to install the release candidate and the required packages. Getting it to build with the HEAD (which is what I’m interested in) is a rather annoying process, though, because uvector is missing a DPH patch which would make it work with the new IO system. Fortunately, I could simply transplant some code from the DPH library (the module is dph-base/Data/Array/Parallel/Data/Array/Parallel/Arr/BUArr.hs) and everything worked.
Also, storablevector depends on QuickCheck 1 which doesn’t seem to compile with the HEAD. I just removed all references to QuickCheck from the code but it might be easier to simply not build the storablevector backend (there is a flag for that).
How it works
Benchmarking itself is rather straighforward – all the exciting bits are in criterion. The heart of NoSlow is the specialiser – a Template Haskell function that takes a bunch of abstract loop kernels and specialises them for a particular backend and data type. Here is how it works. The kernels themselves are implemented in a TH quote:
kernels = [d| ...
dotp :: (Num a, I.Vector v a) => Ty (v a) -> v a -> v a -> a
dotp _ as bs = named "sum($a*$b)" $ I.sum (I.zipWith (*) as bs)
... |]
They refer to functions from module I (short for NoName.Backend.Interface) which are just stubs and do not provide any real functionality. The specialiser traverses the ASTs of the kernels (which it has access to since they are quoted) and replaces references to those stubs with calls to functions from the backend that it is specialising for. For instance, it replaces I.sum by NoSlow.Backend.List.sum (which is just a reexport of the Prelude function) when specialising for lists. It also adjusts signatures and looks for tags like named in the example above which defines the kernel’s name in the HTML output.
This scheme has one big advantage: it allows backends to mark some kernels as unsupported. An example is storablevector which does not support arrays of pairs. The corresponding backend defines zip like this:
zip :: Undefined zip = Undefined
When replacing I.zip with NoSlow.Backend.StorableVector.zip, the specialiser reifies the latter, notices that its type is Unsupported and omits the kernel that contains the call to zip altogether.
Trackbacks