<html>
<head><title>Benchmarking BOF</title></head>
<body>

<h1>Benchmarking BOF</h1>
<h3><i>Kristof Beyls</i></h3>

<p>This is a summary of what was discussed at the Performance Tracking and
Benchmarking Infrastructure BoF session last week at the LLVM dev meeting.</p>

<p>It also contains a proposal for a few next steps to improve the setup and use of
buildbots to track performance changes in the code generated by LLVM.</p>

<p>The buildbots are currently very valuable in detecting correctness regressions
and getting the community to quickly rectify those regressions. However,
performance regressions are hardly noticed, and it seems that as a community we
don't really keep track of them well.</p>

<p>The goal of the BoF was to find a number of actions that could take us closer
to the point where, as a community, we would at least notice some of the
performance regressions and take action to fix them. Given that this has already
been discussed quite a few times at previous BoF sessions at multiple developer
meetings, we thought we should aim for a small, incremental, but sure improvement
over the current status. Initially, we should aim to get to the point where at
least some of the performance regressions are detected and acted upon.</p>

<p>We already have a central database that stores benchmarking numbers, produced
for two boards; see the
<a href="http://llvm.org/perf/db_default/v4/nts/recent_activity#machines">perf
page</a>. However, it seems no one monitors the produced results, nor is it easy
to derive from those numbers whether a particular patch really introduced a
significant regression.</p>

<p>At the BoF, we identified the following issues blocking us from being able to
detect significant regressions more easily:</p>

<ul>
<li>A lot of the Execution Time and Compile Time results are very noisy, because
the individual programs don't run long enough and don't take long enough to
compile (e.g. less than 0.1 seconds).</li>
<li>The proposed actions to improve the execution time measurements are, for the
programs under the Benchmarks sub-directories in the test-suite, to:
<ul>
<li>Increase the run time of the benchmark so it runs long enough to avoid
noisy results. "Long enough" probably means roughly 10 seconds. We'd probably
need a number of different settings, or a parameter that can be set per
program, so that the running time on individual boards can be tuned; e.g.
on a faster board, more iterations would be run than on a slower board
(a rough calibration sketch appears after this list).</li>
<li>Evaluate whether the main running time of the benchmark is spent executing
the compiled code or doing something else, e.g. file IO. Programs dominated by
file IO shouldn't be used to track performance changes over time (see the
IO-versus-CPU-time sketch after this list). The proposal to resolve this is to
create a way to run the test suite in 'benchmark mode', which includes only a
subset of the test suite useful for benchmarking.</li>
</ul>
</li>
<li>The identified action to improve the compile time measurements is to add up
the compilation times of all benchmarks and track that total, instead of the
compile times of the individual benchmarks (a sketch of the aggregation appears
after this list). It seems this could be implemented by simply changing or adding
a view in the web interface, showing the trend of the total compilation time over
time, rather than trend graphs for individual programs.</li>
<li>Furthermore, on each individual board, the noise introduced by the board itself
should be minimized. Each board should have a maintainer, who ensures the board
doesn't produce a significant level of noise. If the board starts producing a
high level of noise and the maintainer doesn't fix it quickly, the performance
numbers coming from the board will be ignored. It's not clear what the best way
would be to mark a board as being ignored. The suggestion was made that board
maintainers could get a script to run before each benchmarking run, to check
whether the board is in a reasonable state, e.g. by checking that the load on the
board is near zero, that "dd" executes as fast as expected, and so on (a sketch of
such a check script appears after this list). It's expected that the checks in the
script might be somewhat dependent on the operating system the board runs.</li>
<li>To reduce the noise levels further, it would be nice if the execution time
of individual benchmarks could be averaged over a number of consecutive runs
(e.g. 5). That way, each individual benchmarking run remains relatively fast,
since each program only has to run once, while at the same time the averaging
should smooth out some of the insignificant noise in the individual runs (see
the averaging sketch after this list).</li>
</ul>
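
<p>To make some of the proposals above a bit more concrete, here are a few rough
sketches. Below is a minimal sketch of how a per-board iteration count could be
derived so that a benchmark runs for roughly 10 seconds. The benchmark binary
name and the way the iteration count is passed to it are made up for illustration;
the actual test-suite mechanism would likely look different.</p>

<pre>
#!/usr/bin/env python
# Minimal sketch: derive a per-board iteration count so that a benchmark
# runs for roughly 10 seconds.  "./some_benchmark" and the way the count
# is passed (a command-line argument) are hypothetical.
import subprocess
import time

TARGET_SECONDS = 10.0

def calibrate(benchmark_cmd):
    """Time a single run and derive an iteration count from it."""
    start = time.time()
    subprocess.check_call(benchmark_cmd)
    single_run = time.time() - start
    # Guard against dividing by (nearly) zero for very short programs.
    single_run = max(single_run, 0.001)
    return max(1, int(round(TARGET_SECONDS / single_run)))

if __name__ == "__main__":
    iterations = calibrate(["./some_benchmark"])
    print("suggested iteration count: %d" % iterations)
    subprocess.check_call(["./some_benchmark", str(iterations)])
</pre>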
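
<p>Along the same lines, here is a sketch of how one might evaluate whether a
program's running time is dominated by executing compiled code or by something
else such as file IO: compare the CPU time consumed by the child process with its
wall-clock time. The 0.9 threshold and the benchmark command are illustrative
only.</p>

<pre>
#!/usr/bin/env python
# Minimal sketch: flag programs whose run time is probably dominated by
# something other than executing compiled code (e.g. file IO), by
# comparing child CPU time to wall-clock time.
import resource
import subprocess
import time

def cpu_fraction(cmd):
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.time()
    subprocess.check_call(cmd)
    wall = time.time() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return cpu / max(wall, 0.001)

if __name__ == "__main__":
    # "./some_benchmark" is a placeholder for a test-suite program.
    fraction = cpu_fraction(["./some_benchmark"])
    if fraction < 0.9:
        print("likely IO-bound, CPU fraction: %.2f" % fraction)
    else:
        print("likely CPU-bound, CPU fraction: %.2f" % fraction)
</pre>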
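
<p>The compile time proposal boils down to summing the per-program compilation
times into a single number per test-suite run. A trivial sketch, assuming an
input of "program seconds" pairs, one per line (the real data would come out of
LNT in a different form):</p>

<pre>
#!/usr/bin/env python
# Minimal sketch: report one aggregate compile time per run instead of
# the noisy per-benchmark compile times.  The input format is invented
# for illustration.
import sys

def total_compile_time(lines):
    total = 0.0
    for line in lines:
        if not line.strip():
            continue
        _name, seconds = line.split()
        total += float(seconds)
    return total

if __name__ == "__main__":
    print("total compile time: %.2f s" % total_compile_time(sys.stdin))
</pre>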
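
<p>The pre-run sanity check a board maintainer could run before each benchmarking
run might look roughly like the sketch below. The thresholds are hypothetical and,
as noted above, the exact checks would be somewhat operating-system dependent; a
fixed busy loop stands in here for the "dd executes as fast as expected" check.</p>

<pre>
#!/usr/bin/env python
# Minimal sketch: refuse to benchmark on a board that doesn't look idle
# or that runs a fixed piece of work unexpectedly slowly.  All thresholds
# are hypothetical and would need tuning per board.
import os
import sys
import time

MAX_LOAD = 0.1          # the board should be essentially idle
EXPECTED_SECONDS = 1.0  # expected duration of the reference work below
TOLERANCE = 1.5         # allow up to 50% slowdown before complaining

def reference_work_seconds():
    start = time.time()
    total = 0
    for i in range(10 * 1000 * 1000):
        total += i
    return time.time() - start

def board_looks_sane():
    load1, _load5, _load15 = os.getloadavg()
    if load1 > MAX_LOAD:
        print("load average too high: %.2f" % load1)
        return False
    elapsed = reference_work_seconds()
    if elapsed > EXPECTED_SECONDS * TOLERANCE:
        print("reference work unexpectedly slow: %.2f s" % elapsed)
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if board_looks_sane() else 1)
</pre>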
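
<p>Finally, here is a sketch of averaging execution times over the last few runs
(5 here) per program, rather than reporting the latest sample alone. The in-memory
history structure and the program names are invented for illustration.</p>

<pre>
#!/usr/bin/env python
# Minimal sketch: smooth run-to-run noise by averaging each program's
# execution time over its last few runs.
RUNS_TO_AVERAGE = 5

def smoothed(history):
    """history maps a program name to its execution times, oldest first."""
    result = {}
    for program, times in history.items():
        recent = times[-RUNS_TO_AVERAGE:]
        result[program] = sum(recent) / len(recent)
    return result

if __name__ == "__main__":
    example = {"oggenc": [10.4, 10.6, 12.1, 10.5, 10.4],
               "sqlite3": [9.8, 9.9, 9.7, 10.0, 9.9]}
    for program, avg in sorted(smoothed(example).items()):
        print("%s: %.2f s (mean of last %d runs)" % (program, avg, RUNS_TO_AVERAGE))
</pre>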

<p>We'd appreciate any feedback on the above proposals. We're also looking for more
volunteers to implement these improvements, so if you're interested in working on
any of them, please let us know.</p>


</body>
</html>