| # Optimizing Clang : A Practical Example of Applying BOLT | 
 |  | 
 | ## Preface | 
 |  | 
*BOLT* (Binary Optimization and Layout Tool) is designed to improve application
performance by laying out code in a manner that helps the CPU better utilize its caching and
branch-prediction resources.
 |  | 
The most obvious candidates for BOLT optimizations
are programs that suffer from many instruction cache and iTLB misses, such as
large applications measuring hundreds of megabytes in size. However, medium-sized
programs can benefit too. Clang, one of the most popular open-source C/C++ compilers,
is a good example of the latter: its code size can easily be on the order of tens of megabytes.
 | As we will see, the Clang binary suffers from many instruction cache | 
 | misses and can be significantly improved with BOLT, even on top of profile-guided and | 
 | link-time optimizations. | 
 |  | 
In this tutorial we will first build Clang with PGO and LTO, and then show how to
apply BOLT optimizations to make Clang up to 15% faster. We will also analyze where
the compile-time performance gains are coming from, and verify that the speed-ups
hold up while building other applications.
 |  | 
 | ## Building Clang | 
 |  | 
The process of getting Clang sources and performing the build is very similar to the
one described at http://clang.llvm.org/get_started.html. For completeness, we provide detailed steps
for obtaining and building Clang in the [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto) section.
 |  | 
 | The only difference from the standard Clang build is that we require the `-Wl,-q` flag to be present during | 
 | the final link. This option saves relocation metadata in the executable file, but does not affect | 
 | the generated code in any way. | 
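
Once the build completes, you can verify that the relocations made it into the binary with `readelf`; for example (adjust the path to wherever your Clang binary was installed):

```bash
# -Wl,-q (--emit-relocs) keeps static relocation sections such as .rela.text
# in the output; their presence means llvm-bolt can process the binary
$ readelf -S ${TOPLEV}/stage2-prof-use-lto/install/bin/clang-7 | grep '\.rela\.text'
```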
 |  | 
 | ## Optimizing Clang with BOLT | 
 |  | 
 | We will use the setup described in [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto). | 
 | Adjust the steps accordingly if you skipped that section. We will also assume that `llvm-bolt` is present in your `$PATH`. | 
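If the BOLT tools are not installed into a standard location, add their directory to `$PATH` first; a minimal sketch (the directory below is a placeholder for your own BOLT build):

```bash
# Adjust to wherever llvm-bolt and perf2bolt were built or installed
$ export PATH=${TOPLEV}/bolt/install/bin:$PATH
$ llvm-bolt --version
```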
 |  | 
Before we can run BOLT optimizations, we need to collect a profile for Clang, and we will use
the Clang/LLVM sources themselves for that.
Collecting an accurate profile requires running `perf` on hardware that
implements taken branch sampling (the `-b`/`-j` flags). For that reason, it may not be possible to
collect an accurate profile in a virtualized environment, e.g. in the cloud.
We do support regular sampling profiles, but the performance
improvements are then expected to be more modest.
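
Before starting the long training build, it is worth checking that taken branch sampling actually works on your machine. A quick sanity test (any short command can stand in for `ls`):

```bash
# On hardware without taken-branch sampling support this command fails
# with an error instead of producing a usable branch profile
$ perf record -e cycles:u -j any,u -- ls
```

If that works, set up a fresh build directory and collect the profile while building Clang: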
 |  | 
 | ```bash | 
 | $ mkdir ${TOPLEV}/stage3 | 
 | $ cd ${TOPLEV}/stage3 | 
 | $ CPATH=${TOPLEV}/stage2-prof-use-lto/install/bin/ | 
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
 |     -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ | 
 |     -DLLVM_ENABLE_PROJECTS="clang" \ | 
 |     -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3/install | 
 | $ perf record -e cycles:u -j any,u -- ninja clang | 
 | ``` | 
 |  | 
Once the last command finishes, it will have created a `perf.data` file larger than 10 GiB.
We will first convert this profile into a more compact aggregated
form suitable for consumption by BOLT:
 | ```bash | 
$ perf2bolt $CPATH/clang-7 -p perf.data -o clang-7.fdata -w clang-7.yaml
 | ``` | 
Notice that we are passing `clang-7` to `perf2bolt`; it is the real binary that
`clang` and `clang++` are symlinked to. The next step will optimize Clang using
the generated profile:
 | ```bash | 
 | $ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -b clang-7.yaml \ | 
 |     -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions \ | 
 |     -split-all-cold -dyno-stats -icf=1 -use-gnu-stack | 
 | ``` | 
 | The output will look similar to the one below: | 
```text
 | ... | 
 | BOLT-INFO: enabling relocation mode | 
 | BOLT-INFO: 11415 functions out of 104526 simple functions (10.9%) have non-empty execution profile. | 
 | ... | 
 | BOLT-INFO: ICF folded 29144 out of 105177 functions in 8 passes. 82 functions had jump tables. | 
 | BOLT-INFO: Removing all identical functions will save 5466.69 KB of code space. Folded functions were called 2131985 times based on profile. | 
 | BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions | 
 | ... | 
 |            660155947 : executed forward branches (-2.3%) | 
 |             48252553 : taken forward branches (-57.2%) | 
 |            129897961 : executed backward branches (+13.8%) | 
 |             52389551 : taken backward branches (-19.5%) | 
 |             35650038 : executed unconditional branches (-33.2%) | 
 |            128338874 : all function calls (=) | 
 |             19010563 : indirect calls (=) | 
 |              9918250 : PLT calls (=) | 
 |           6113398840 : executed instructions (-0.6%) | 
 |           1519537463 : executed load instructions (=) | 
 |            943321306 : executed store instructions (=) | 
 |             20467109 : taken jump table branches (=) | 
 |            825703946 : total branches (-2.1%) | 
 |            136292142 : taken branches (-41.1%) | 
 |            689411804 : non-taken conditional branches (+12.6%) | 
 |            100642104 : taken conditional branches (-43.4%) | 
 |            790053908 : all conditional branches (=) | 
 | ... | 
 | ``` | 
The statistics in the output are based on the LBR profile collected with `perf`, and since we were using
the `cycles` counter, their accuracy is affected. However, the relative improvement in
`taken conditional branches` is a good indication that BOLT was able to straighten out the code even after PGO.
 |  | 
 | ## Measuring Compile-time Improvement | 
 |  | 
 | `clang-7.bolt` can be used as a replacement for *PGO+LTO* Clang: | 
 | ```bash | 
 | $ mv $CPATH/clang-7 $CPATH/clang-7.org | 
 | $ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7 | 
 | ``` | 
 | Doing a new build of Clang using the new binary shows a significant overall | 
 | build time reduction on a 48-core Haswell system: | 
 | ```bash | 
 | $ ln -fs $CPATH/clang-7.org $CPATH/clang-7 | 
 | $ ninja clean && /bin/time -f %e ninja clang -j48 | 
 | 202.72 | 
 | $ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7 | 
 | $ ninja clean && /bin/time -f %e ninja clang -j48 | 
 | 180.11 | 
 | ``` | 
 | That's 22.61 seconds (or 12%) faster compared to the *PGO+LTO* build. | 
 | Notice that we are measuring an improvement of the total build time, which includes the time spent in the linker. | 
 | Compilation time improvements for individual files differ, and speedups over 15% are not uncommon. | 
If we run BOLT on a Clang binary compiled without *PGO+LTO* (in which case the baseline build finishes in 253.32 seconds),
the gains we see are over 50 seconds (25%),
but, as expected, the result is still slower than the *PGO+LTO+BOLT* build.
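
To gauge the per-file effect, you can time the compilation of a single translation unit with each binary. A minimal sketch, where `big_file.cpp` is a placeholder for any sizeable C++ source from your project:

```bash
# Time one compilation with the original and with the BOLTed compiler
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ /bin/time -f %e $CPATH/clang++ -std=c++11 -O2 -c big_file.cpp -o /dev/null
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ /bin/time -f %e $CPATH/clang++ -std=c++11 -O2 -c big_file.cpp -o /dev/null
```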
 |  | 
 | ## Source of the Wins | 
 |  | 
 | We mentioned that Clang suffers from considerable instruction cache misses. This can be measured with `perf`: | 
 | ```bash | 
 | $ ln -fs $CPATH/clang-7.org $CPATH/clang-7 | 
 | $ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48 | 
 |   ... | 
 |    16,366,101,626,647      instructions | 
 |       359,996,216,537      L1-icache-misses | 
 | ``` | 
 | That's about 22 instruction cache misses per thousand instructions. As a rule of thumb, if the application | 
 | has over 10 misses per thousand instructions, it is a good indication that it will be improved by BOLT. | 
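For reference, the misses-per-thousand-instructions figure quoted above comes straight from the two counters; a quick shell calculation:

```bash
$ awk 'BEGIN { printf "%.1f\n", 359996216537 / 16366101626647 * 1000 }'
22.0
```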
 | Now let's see how many misses are in the BOLTed binary: | 
 | ```bash | 
 | $ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7 | 
 | $ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48 | 
 |   ... | 
 |   16,319,818,488,769      instructions | 
 |      244,888,677,972      L1-icache-misses | 
 | ``` | 
 | The number of misses per thousand instructions went down from 22 to 15, significantly reducing | 
 | the number of stalls in the CPU front-end. | 
Notice how the number of executed instructions stayed roughly the same. That's because we didn't
run any optimizations beyond the ones affecting the code layout. Besides instruction cache misses,
BOLT also reduces branch mispredictions, iTLB misses, and misses in the L2 and L3 caches.
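
These effects can be measured the same way. A sketch of the additional counters to look at (exact event names vary across CPUs and kernel versions; check what `perf list` reports on your machine):

```bash
# Compare these counters between the clang-7.org and clang-7.bolt builds
$ ninja clean && perf stat -e branch-misses,iTLB-load-misses,stalled-cycles-frontend \
    -- ninja clang -j48
```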
 |  | 
 | ## Using Clang for Other Applications | 
 |  | 
We have collected the profile for Clang using its own source code. Would it be enough to speed up
the compilation of other projects? We picked `mysqld`, an open-source database server, to run the test.
 |  | 
On our 48-core Haswell system the build finished in 136.06 seconds with the *PGO+LTO* Clang, and in 126.10 seconds with the *PGO+LTO+BOLT* Clang.
That's a noticeable improvement, but not as significant as the one we saw on Clang itself.
This is partially because the instruction cache miss rate is slightly lower in this scenario: 19 vs. 22 misses per thousand instructions.
 | Another reason is that Clang is run with a different set of options while building `mysqld` compared | 
 | to the training run. | 
 |  | 
 | Different options exercise different code paths, and | 
 | if we trained without a specific option, we may have misplaced parts of the code responsible for handling it. | 
To test this theory, we have collected another `perf` profile while building `mysqld`, and merged it with the existing profile
using the `merge-fdata` utility that comes with BOLT. Optimized with the combined profile, the *PGO+LTO+BOLT* Clang was able
to perform the `mysqld` build in 124.74 seconds, i.e. 11 seconds or 9% faster compared to the *PGO+LTO* Clang.
The merged profile didn't make the original Clang compilation slower either, while the number of profiled functions in Clang increased from 11,415 to 14,025.
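
We did not show the exact merge invocation above; a sketch of how it might look, where `clang-7-mysqld.fdata` is a hypothetical profile collected during the `mysqld` build (`merge-fdata` writes the merged profile to standard output):

```bash
# Merge the two aggregated profiles and re-optimize Clang with the result
$ merge-fdata clang-7.fdata clang-7-mysqld.fdata > clang-7-combined.fdata
$ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -data clang-7-combined.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions \
    -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
```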
 |  | 
Ideally, the profile run should be done with a superset of all commonly used options. However, most of the improvement is expected with just the basic set.
 |  | 
 | ## Summary | 
 |  | 
 | In this tutorial we demonstrated how to use BOLT to improve the | 
 | performance of the Clang compiler. Similarly, BOLT could be used to improve the performance | 
 | of GCC, or any other application suffering from a high number of instruction | 
 | cache misses. | 
 |  | 
 | ---- | 
 | # Appendix | 
 |  | 
 | ## Bootstrapping Clang-7 with PGO and LTO | 
 |  | 
Below we describe the detailed steps to build Clang and make it ready for BOLT
optimizations. If you already have a build set up, you can skip this section,
except for the last step, which adds the `-Wl,-q` linker flag to the final build.
 |  | 
 | ### Getting Clang-7 Sources | 
 |  | 
Set `$TOPLEV` to a directory of your preference where you would like to do the
builds, e.g. `TOPLEV=~/clang-7/`. Then clone the `release/7.x` branch of the
LLVM monorepo:
 | ```bash | 
 | $ mkdir ${TOPLEV} | 
 | $ cd ${TOPLEV} | 
 | $ git clone --branch=release/7.x https://github.com/llvm/llvm-project.git | 
 | ``` | 
 |  | 
 | ### Building Stage 1 Compiler | 
 |  | 
Stage 1 is the first build we are going to do, and we will be using the
default system compiler to build Clang. If your system lacks a compiler, use
your distribution's package manager to install one that supports C++11. In this
example we are going to use GCC. In addition to the compiler, you will need the
`cmake` and `ninja` packages. Note that we disable the build of certain
compiler-rt components that are known to cause build issues at release/7.x.
 | ```bash | 
 | $ mkdir ${TOPLEV}/stage1 | 
 | $ cd ${TOPLEV}/stage1 | 
 | $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ | 
 |       -DCMAKE_BUILD_TYPE=Release \ | 
 |       -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_ASM_COMPILER=gcc \ | 
 |       -DLLVM_ENABLE_PROJECTS="clang;lld" \ | 
 |       -DLLVM_ENABLE_RUNTIMES="compiler-rt" \ | 
 |       -DCOMPILER_RT_BUILD_SANITIZERS=OFF -DCOMPILER_RT_BUILD_XRAY=OFF \ | 
 |       -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \ | 
 |       -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage1/install | 
 | $ ninja install | 
 | ``` | 
 |  | 
 | ### Building Stage 2 Compiler With Instrumentation | 
 |  | 
 | Using the freshly-baked stage 1 Clang compiler, we are going to build Clang with | 
 | profile generation capabilities: | 
 | ```bash | 
 | $ mkdir ${TOPLEV}/stage2-prof-gen | 
 | $ cd ${TOPLEV}/stage2-prof-gen | 
 | $ CPATH=${TOPLEV}/stage1/install/bin/ | 
 | $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ | 
 |     -DCMAKE_BUILD_TYPE=Release \ | 
 |     -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ | 
 |     -DLLVM_ENABLE_PROJECTS="clang;lld" \ | 
 |     -DLLVM_USE_LINKER=lld -DLLVM_BUILD_INSTRUMENTED=ON \ | 
 |     -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-gen/install | 
 | $ ninja install | 
 | ``` | 
 |  | 
 | ### Generating Profile for PGO | 
 |  | 
 | While there are many ways to obtain the profile data, we are going to use the | 
 | source code already at our disposal, i.e. we are going to collect the profile | 
 | while building Clang itself: | 
 | ```bash | 
 | $ mkdir ${TOPLEV}/stage3-train | 
 | $ cd ${TOPLEV}/stage3-train | 
 | $ CPATH=${TOPLEV}/stage2-prof-gen/install/bin | 
 | $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ | 
 |     -DCMAKE_BUILD_TYPE=Release \ | 
 |     -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ | 
 |     -DLLVM_ENABLE_PROJECTS="clang" \ | 
 |     -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3-train/install | 
 | $ ninja clang | 
 | ``` | 
Once the build is complete, the profile files will be saved under
`${TOPLEV}/stage2-prof-gen/profiles`. We need to merge them before they can be
passed to Clang:
 | ```bash | 
 | $ cd ${TOPLEV}/stage2-prof-gen/profiles | 
 | $ ${TOPLEV}/stage1/install/bin/llvm-profdata merge -output=clang.profdata * | 
 | ``` | 
 |  | 
 | ### Building Clang with PGO and LTO | 
 |  | 
Now the profile can be used to guide optimizations and produce better code for
our scenario, i.e. building Clang. We will also enable link-time optimization
to allow cross-module inlining and other whole-program optimizations. Finally, we
add one extra step that is useful for BOLT: a linker flag instructing the linker to
preserve relocations in the output binary. Note that this flag does not affect
the generated code or the data used at runtime; it only writes extra metadata to
the file on disk:
 | ```bash | 
 | $ mkdir ${TOPLEV}/stage2-prof-use-lto | 
 | $ cd ${TOPLEV}/stage2-prof-use-lto | 
 | $ CPATH=${TOPLEV}/stage1/install/bin/ | 
 | $ export LDFLAGS="-Wl,-q" | 
 | $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ | 
 |     -DCMAKE_BUILD_TYPE=Release \ | 
 |     -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ | 
 |     -DLLVM_ENABLE_PROJECTS="clang;lld" \ | 
 |     -DLLVM_ENABLE_LTO=Full \ | 
 |     -DLLVM_PROFDATA_FILE=${TOPLEV}/stage2-prof-gen/profiles/clang.profdata \ | 
 |     -DLLVM_USE_LINKER=lld \ | 
 |     -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-use-lto/install | 
 | $ ninja install | 
 | ``` | 
 | Now we have a Clang compiler that can build itself much faster. As we will see, | 
 | it builds other applications faster as well, and, with BOLT, the compile time | 
 | can be improved even further. |