Benchmarking `llvm-libc`'s memory functions

Foreword

Microbenchmarks are valuable tools to assess and compare the performance of isolated pieces of code. However, they don't capture all the interactions of complex systems, so other metrics can be equally important:

code size (to reduce instruction cache pressure),
Profile Guided Optimization friendliness,
hyperthreading / multithreading friendliness.

Rationale

The goal here is to satisfy the Benchmarking Principles.

Relevance: Benchmarks should measure relatively vital features.
Representativeness: Benchmark performance metrics should be broadly accepted by industry and academia.
Equity: All systems should be fairly compared.
Repeatability: Benchmark results can be verified.
Cost-effectiveness: Benchmark tests are economical.
Scalability: Benchmark tests should measure from single server to multiple servers.
Transparency: Benchmark metrics should be easy to understand.

Benchmarking is a subtle art and benchmarking memory functions is no exception. Here we'll dive into peculiarities of designing good microbenchmarks for llvm-libc memory functions.

Challenges

As seen in the README.md, the microbenchmarking facility should focus on measuring low-latency code. If copying a few bytes takes on the order of a few cycles, the benchmark should be able to measure accurately down to the cycle.

Measuring instruments

There are different sources of time in a computer (ordered from high to low resolution):

Performance Counters: used to introspect the internals of the CPU,
High Precision Event Timer: used to trigger short lived actions,
Real-Time Clocks (RTC): used to keep track of the computer's time.

In theory, Performance Counters provide cycle-accurate measurement via the CPU cycles event. But as we'll see, they are not really practical in this context.

Performance counters and modern processor architecture

Modern CPUs are out-of-order and superscalar. As a consequence, it is hard to know what is included when the counter is read: some instructions may still be in flight, and others may be executing speculatively. As a matter of fact, on the same machine, measuring the same piece of code twice will yield different results.

Performance counters semantics inconsistencies and availability

Although they have the same name, the exact semantics of performance counters are micro-architecture dependent: it is generally not possible to compare two micro-architectures exposing the same performance counters.

Each vendor decides which performance counters to implement and their exact meaning. Although we want to benchmark llvm-libc memory functions for all available target triples, there are no guarantees that the counter we're interested in is available.

Additional imprecisions

Reading performance counters is done through kernel system calls. The system call itself is costly (hundreds of cycles) and will perturb the counter's value.
Interrupts can occur during measurement.
If the system is already under monitoring (virtual machines or system-wide profiling), the kernel can decide to multiplex the performance counters, leading to lower precision or even a completely missed measurement.
The kernel can decide to migrate the process to a different core.
Dynamic frequency scaling can kick in during the measurement and change the duration of a cycle. Ultimately we care about the amount of work done over a period of time, which makes measuring cycles less legitimate than measuring raw time.

Cycle accuracy conclusion

We have seen that performance counters are: not widely available, semantically inconsistent across micro-architectures and imprecise on modern CPUs for small snippets of code.

Design decisions

In order to achieve the needed precision, we need to resort to more widely available counters and derive the time from a high number of runs, going from a single deterministic measurement to a probabilistic one.

To get a good signal-to-noise ratio, we need the running time of the piece of code to be orders of magnitude greater than the measurement precision.

For instance, if the measurement precision is 10 cycles, the measured code needs to run for more than 1000 cycles to keep the measurement error below 1% (a signal-to-noise ratio of 100:1).
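The arithmetic above can be written down as a small sketch. The function names and the idea of expressing the target as a runtime-to-precision ratio are illustrative assumptions, not part of the benchmarking framework.

```python
def min_total_runtime(precision_cycles: int, target_ratio: int) -> int:
    """Shortest measured runtime (in cycles) that is target_ratio times the timer precision."""
    if precision_cycles <= 0 or target_ratio <= 0:
        raise ValueError("precision_cycles and target_ratio must be positive")
    return precision_cycles * target_ratio


def min_iterations(cycles_per_call: int, precision_cycles: int, target_ratio: int) -> int:
    """Number of back-to-back calls needed to reach that runtime (ceiling division)."""
    if cycles_per_call <= 0:
        raise ValueError("cycles_per_call must be positive")
    total = min_total_runtime(precision_cycles, target_ratio)
    return -(-total // cycles_per_call)


# Hypothetical numbers: 10 cycles of precision, runtime 100x larger than the precision.
print(min_total_runtime(10, 100))  # 1000
# A call that takes about 5 cycles must then be repeated 200 times per measurement.
print(min_iterations(5, 10, 100))  # 200
```

This is why a function that runs in a handful of cycles cannot be timed in isolation and has to be repeated many times within a single measurement.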

Repeating code N-times until precision is sufficient

The algorithm is as follows:

We measure the time it takes to run the code N times (initially, N is 10 for instance).
We deduce an approximation of the runtime of one iteration (= runtime / N).
We increase N by X% and repeat the measurement (geometric progression).
We keep track of the one-iteration runtime approximation and build a weighted mean of all the samples so far (the weight is proportional to N).
We stop the process when the difference between the weighted mean and the latest estimate is smaller than ε, or when other stopping conditions are met (total runtime, maximum iterations or maximum sample count).

This method allows us to be as precise as needed provided that the measured runtime is proportional to N. Longer run times also smooth out imprecision related to interrupts and context switches.
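The steps above can be sketched as follows. This is a minimal illustration, not the framework's implementation: the function names, the default values, and the choice to compare the weighted mean and the latest estimate as a relative difference are all assumptions.

```python
import time
from typing import Callable


def time_n_calls(fn: Callable[[], object], n: int) -> float:
    """Total wall-clock time, in nanoseconds, of n back-to-back calls to fn."""
    start = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return float(time.perf_counter_ns() - start)


def estimate_iteration_time(
    run_n_times: Callable[[int], float],
    initial_n: int = 10,
    growth: float = 1.4,      # N grows by 40% per sample (the "X%" above)
    epsilon: float = 0.01,    # relative difference accepted as converged
    max_samples: int = 50,    # stopping condition: maximum sample count
    max_n: int = 10_000_000,  # stopping condition: maximum iterations
) -> float:
    """Weighted mean of per-iteration runtimes; each sample is weighted by its N."""
    if initial_n < 1 or max_samples < 1 or growth <= 1.0:
        raise ValueError("need initial_n >= 1, max_samples >= 1 and growth > 1")
    n = initial_n
    weighted_sum = 0.0  # sum of N_i * estimate_i, which is the sum of total runtimes
    total_weight = 0    # sum of N_i
    for _ in range(max_samples):
        total = run_n_times(n)
        estimate = total / n
        weighted_sum += total
        total_weight += n
        mean = weighted_sum / total_weight
        # Require at least two samples before testing for convergence.
        converged = total_weight > n and abs(mean - estimate) <= epsilon * abs(mean)
        if converged or n >= max_n:
            break
        n = max(n + 1, int(n * growth))
    return weighted_sum / total_weight


# Simulated timer: 50 units of fixed overhead plus 3 units per iteration.
print(estimate_iteration_time(lambda n: 50.0 + 3.0 * n))
```

With the simulated timer, the estimate approaches 3 as N grows because the fixed overhead is amortized over more iterations. To time real code, pass something like `lambda n: time_n_calls(my_function, n)`, where `my_function` is a placeholder for the code under test.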

Note: When measuring longer runtimes (e.g. copying several megabytes of data) the above assumption doesn't hold anymore and the ε precision cannot be reached by increasing iterations. The whole benchmarking process becomes prohibitively slow. In this case the algorithm is limited to a single sample and repeated several times to get a decent 95% confidence interval.
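For the single-sample case described in the note, one way to summarize the repeated measurements is a mean with a 95% confidence interval. The sketch below assumes a normal approximation (z = 1.96); the source does not state which estimator is used, and with only a handful of samples a Student's t interval would be wider.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci95(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation 95% confidence interval."""
    if len(samples) < 2:
        raise ValueError("at least two samples are needed to estimate the spread")
    mean = statistics.fmean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(sample count).
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * standard_error


# Hypothetical runtimes (in milliseconds) of five independent single-sample runs.
mean, half_width = mean_with_ci95([10.0, 12.0, 11.0, 13.0, 14.0])
print(f"{mean:.2f} ms +/- {half_width:.2f} ms")
```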

Effect of branch prediction

When measuring code with branches, repeating the same call again and again will allow the processor to learn the branching patterns and perfectly predict all the branches, leading to unrealistic results.

Decision: When benchmarking small buffer sizes, the function parameters should be randomized between calls to prevent perfect branch predictions.
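A minimal sketch of this decision, assuming the randomized parameter is the copy size: the sizes are generated before the timed region so that the cost of the random number generator is not measured. The helper names are hypothetical, and in Python the interpreter overhead dominates, so this only illustrates the structure of such a benchmark loop.

```python
import random
from typing import List


def make_random_sizes(count: int, max_size: int, seed: int = 0) -> List[int]:
    """Pre-generate copy sizes in [0, max_size]; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return [rng.randint(0, max_size) for _ in range(count)]


def run_copies(dst: bytearray, src: bytes, sizes: List[int]) -> None:
    """The region that would be timed: each call sees a different size."""
    for size in sizes:
        dst[:size] = src[:size]


sizes = make_random_sizes(count=1000, max_size=64)
src = bytes(range(64))
dst = bytearray(64)
run_copies(dst, src, sizes)
```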

Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see L1, L2 and L3 data caches.

We may be tempted to randomize data accesses widely to exercise all the caching layers down to RAM but the cost of accessing lower layers of memory completely dominates the runtime for small sizes.

So to respect Equity and Repeatability principles we should make sure we do not depend on the memory subsystem.

Decision: When benchmarking small buffer sizes, the data accessed by the function should stay in L1.
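One way to keep the working set in L1 is to bound the source and destination buffers so that together they fit in the L1 data cache, and to draw offsets only within those buffers. The sketch below assumes a 32 KiB L1 data cache, which is common but not universal; the constant and the function name are illustrative.

```python
import random
from typing import List, Tuple

L1_DATA_CACHE_BYTES = 32 * 1024  # assumption: query the real value on the target machine


def l1_resident_offsets(
    count: int,
    max_size: int,
    cache_bytes: int = L1_DATA_CACHE_BYTES,
    seed: int = 0,
) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (buffer_bytes, [(src_offset, dst_offset), ...]).

    The two buffers take half of the cache each, and every offset leaves room
    for an access of max_size bytes inside its buffer.
    """
    buffer_bytes = cache_bytes // 2
    if max_size > buffer_bytes:
        raise ValueError("max_size does not fit in an L1-resident buffer")
    rng = random.Random(seed)
    limit = buffer_bytes - max_size
    pairs = [(rng.randint(0, limit), rng.randint(0, limit)) for _ in range(count)]
    return buffer_bytes, pairs


buffer_bytes, pairs = l1_resident_offsets(count=1000, max_size=64)
```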

Effect of prefetching

In case of small buffer sizes, prefetching should not kick in but in case of large buffers it may introduce a bias.

Decision: When benchmarking large buffer sizes, the data should be accessed in a random fashion to lower the impact of prefetching between calls.
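A sketch of one possible random access pattern for large buffers: the buffer is split into fixed-size blocks that are visited in shuffled order, so consecutive calls do not touch adjacent memory. The block-based scheme and the names are assumptions made for illustration.

```python
import random
from typing import List


def shuffled_block_offsets(buffer_bytes: int, block_bytes: int, seed: int = 0) -> List[int]:
    """Offsets of every whole block in the buffer, in a repeatable random order."""
    if block_bytes <= 0 or buffer_bytes < block_bytes:
        raise ValueError("need 0 < block_bytes <= buffer_bytes")
    offsets = list(range(0, buffer_bytes - block_bytes + 1, block_bytes))
    random.Random(seed).shuffle(offsets)
    return offsets


# Hypothetical setup: a 64 MiB buffer accessed in 64 KiB blocks.
offsets = shuffled_block_offsets(buffer_bytes=64 * 1024 * 1024, block_bytes=64 * 1024)
```

Each block is still visited exactly once, so the total amount of data touched is the same as with a sequential walk.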

Effect of dynamic frequency scaling

Modern processors implement dynamic frequency scaling. In so-called performance mode, the CPU will increase its frequency and run faster than usual within some limits: “The increased clock rate is limited by the processor's power, current, and thermal limits, the number of cores currently in use, and the maximum frequency of the active cores.”

Decision: When benchmarking, we want to make sure the dynamic frequency scaling is always set to performance. We also want to make sure that the time-based events are not impacted by frequency scaling.

See README.md on how to set this up.
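As a complement to the setup instructions, a benchmark harness can check the setting before running. The sketch below assumes a Linux system that exposes the cpufreq governor under `/sys/devices/system/cpu/cpu<N>/cpufreq/scaling_governor`; the function name is hypothetical, and it only reads the setting without changing it.

```python
from pathlib import Path
from typing import Dict


def cpus_not_in_performance_mode(sysfs_cpu_root: str = "/sys/devices/system/cpu") -> Dict[str, str]:
    """Map each CPU whose governor is not 'performance' to its current governor."""
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for governor_file in sorted(Path(sysfs_cpu_root).glob(pattern)):
        governor = governor_file.read_text().strip()
        if governor != "performance":
            # .../cpu3/cpufreq/scaling_governor -> "cpu3"
            result[governor_file.parent.parent.name] = governor
    return result


offenders = cpus_not_in_performance_mode()
if offenders:
    print("Warning: frequency scaling may skew results:", offenders)
```

An empty result means either that every core uses the performance governor or that the system does not expose cpufreq at this path.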

Reserved and pinned cores

Some operating systems allow core reservation. It removes a set of perturbation sources such as process migration, context switches and interrupts. When a core is hyperthreaded, both of its logical cores should be reserved.
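Pinning the benchmark process to one core is the part that can be done from user space. The sketch below uses `os.sched_setaffinity`, which is available on Linux only; the helper name is hypothetical. Pinning alone does not reserve the core: keeping other processes and interrupts away from it is a separate, system-level configuration (for example the `isolcpus` kernel boot parameter on Linux).

```python
import os
from typing import Set


def pin_to_cpu(cpu: int) -> Set[int]:
    """Pin the calling process to a single CPU and return the resulting affinity set."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not exposed by os on this platform")
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return set(os.sched_getaffinity(0))
```

A harness would call `pin_to_cpu` once at startup, passing the index of the reserved core.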

Microbenchmarks limitations

As stated in the Foreword section, a number of effects do play a role in production but are not directly measurable through microbenchmarks. The code size of the benchmark is (much) smaller than the hot code of real applications and doesn't exhibit as much instruction cache pressure.

iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache. If they are too big, they will prevent other hot code from staying in the cache and incur stalls. So the memory functions should be as small as possible.

iTLB pressure

The same reasoning applies to the instruction Translation Lookaside Buffer (iTLB), where oversized code incurs TLB misses.

FAQ

Why don't you use Google Benchmark directly?
We reuse some parts of Google Benchmark (detection of frequency scaling, CPU cache hierarchy information), but when it comes to measuring memory functions, Google Benchmark has a few issues:
- Google Benchmark favors code-based configuration via macros and builders, typically done statically. In our case, the parameters we need to set up are a mix of what's usually controlled by the framework (number of trials, maximum number of iterations, size ranges) and parameters that are more tied to the function under test (randomization strategies, custom values). Achieving this with Google Benchmark is cumbersome, as it involves templated benchmarks and duplicated code. In the end, the configuration would be spread across command line flags (via the framework's options or custom flags) and code constants.
- The measurements are output through a BenchmarkReporter class, which makes it hard to access the parameters discussed above.
