| # Optimizing Clang: A Practical Example of Applying BOLT |
| |
| ## Preface |
| |
| *BOLT* (Binary Optimization and Layout Tool) is designed to improve application |
| performance by laying out code in a manner that helps the CPU better utilize its caching and |
| branch-prediction resources. |
| |
| The most obvious candidates for BOLT optimizations |
| are programs that suffer from many instruction cache and iTLB misses, such as |
| large applications measuring hundreds of megabytes in size. However, medium-sized |
| programs can benefit too. Clang, one of the most popular open-source C/C++ compilers, |
| is a good example of the latter. Its code size can easily be on the order of tens of megabytes. |
| As we will see, the Clang binary suffers from many instruction cache |
| misses and can be significantly improved with BOLT, even on top of profile-guided and |
| link-time optimizations. |
| |
| In this tutorial we will first build Clang with PGO and LTO, and then show how to |
| apply BOLT optimizations to make Clang up to 15% faster. We will also analyze where |
| the compile-time performance gains come from, and verify that the speed-ups |
| carry over to building other applications. |
| |
| ## Building Clang |
| |
| The process of getting Clang sources and performing the build is very similar to the |
| one described at http://clang.llvm.org/get_started.html. For completeness, we provide detailed steps |
| on how to obtain and build Clang in the [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto) section. |
| |
| The only difference from the standard Clang build is that we require the `-Wl,-q` flag to be present during |
| the final link. This option saves relocation metadata in the executable file, but does not affect |
| the generated code in any way. |
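| |
One way to confirm that the final link really kept its relocations is to look for a `.rela.text` section in the resulting binary. The helper below is a minimal sketch, not part of BOLT: the function name `has_relocations` and the `READELF` override are assumptions made for illustration, and it assumes binutils `readelf` is installed.

```bash
# Return success if the ELF file given as $1 contains a .rela.text section,
# which is what a link with -Wl,-q is expected to leave in the output.
# READELF is a hypothetical override; it defaults to readelf from $PATH.
has_relocations() {
    # -S lists section headers, -W keeps each header on a single line.
    "${READELF:-readelf}" -S -W "$1" | grep -q '\.rela\.text'
}
```

For the build described in this tutorial, the check would be `has_relocations ${TOPLEV}/stage2-prof-use-lto/install/bin/clang-7`. A non-zero exit status suggests the binary was linked without `-Wl,-q`.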
| |
| ## Optimizing Clang with BOLT |
| |
| We will use the setup described in [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto). |
| Adjust the steps accordingly if you skipped that section. We will also assume that `llvm-bolt` is present in your `$PATH`. |
| |
| Before we can run BOLT optimizations, we need to collect a profile for Clang, and we will use |
| Clang/LLVM sources for that. |
| Collecting an accurate profile requires running `perf` on hardware that |
| implements taken-branch sampling (the `-b`/`-j` flags). For that reason, it may not be possible to |
| collect an accurate profile in a virtualized environment, e.g. in the cloud. |
| BOLT does support regular sampling profiles, but the performance |
| improvements are expected to be more modest. |
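| |
Since the profiling build takes a while, it may be worth checking first whether the machine supports taken-branch sampling at all. The sketch below records a trivial command with the same event and branch filter as the real run and reports whether `perf` accepted them. The function name `supports_branch_sampling` and the `PERF` override are assumptions for illustration, and the check relies on `perf record` exiting with a non-zero status when `-j` is unsupported.

```bash
# Return success if perf can sample taken branches on this machine.
# PERF is a hypothetical override; it defaults to perf from $PATH.
supports_branch_sampling() {
    tmp=$(mktemp) || return 1
    # Profile `true` with the event and branch filter used for the real build.
    "${PERF:-perf}" record -o "$tmp" -e cycles:u -j any,u -- true >/dev/null 2>&1
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

If the check fails, a regular sampling profile (`perf record -e cycles:u`) is the fallback. BOLT's documentation describes a `-nl` option to `perf2bolt` for such profiles; verify that your version provides it. The profiling build itself is configured and run as follows: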
| |
| ```bash |
| $ mkdir ${TOPLEV}/stage3 |
| $ cd ${TOPLEV}/stage3 |
| $ CPATH=${TOPLEV}/stage2-prof-use-lto/install/bin/ |
| $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \ |
| -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ |
| -DLLVM_ENABLE_PROJECTS="clang" \ |
| -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3/install |
| $ perf record -e cycles:u -j any,u -- ninja clang |
| ``` |
| |
| Once the last command finishes, it will have created a `perf.data` file larger than 10 GiB. |
| We will first convert this profile into a more compact aggregated |
| form suitable for consumption by BOLT: |
| ```bash |
| $ perf2bolt $CPATH/clang-7 -p perf.data -o clang-7.fdata -w clang-7.yaml |
| ``` |
| Notice that we are passing `clang-7` to `perf2bolt`; it is the real binary that |
| `clang` and `clang++` symlink to. The next step optimizes Clang using |
| the generated profile: |
| ```bash |
| $ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -b clang-7.yaml \ |
| -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions \ |
| -split-all-cold -dyno-stats -icf=1 -use-gnu-stack |
| ``` |
| The output will look similar to the one below: |
| ```t |
| ... |
| BOLT-INFO: enabling relocation mode |
| BOLT-INFO: 11415 functions out of 104526 simple functions (10.9%) have non-empty execution profile. |
| ... |
| BOLT-INFO: ICF folded 29144 out of 105177 functions in 8 passes. 82 functions had jump tables. |
| BOLT-INFO: Removing all identical functions will save 5466.69 KB of code space. Folded functions were called 2131985 times based on profile. |
| BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions |
| ... |
| 660155947 : executed forward branches (-2.3%) |
| 48252553 : taken forward branches (-57.2%) |
| 129897961 : executed backward branches (+13.8%) |
| 52389551 : taken backward branches (-19.5%) |
| 35650038 : executed unconditional branches (-33.2%) |
| 128338874 : all function calls (=) |
| 19010563 : indirect calls (=) |
| 9918250 : PLT calls (=) |
| 6113398840 : executed instructions (-0.6%) |
| 1519537463 : executed load instructions (=) |
| 943321306 : executed store instructions (=) |
| 20467109 : taken jump table branches (=) |
| 825703946 : total branches (-2.1%) |
| 136292142 : taken branches (-41.1%) |
| 689411804 : non-taken conditional branches (+12.6%) |
| 100642104 : taken conditional branches (-43.4%) |
| 790053908 : all conditional branches (=) |
| ... |
| ``` |
| The statistics in the output are based on the LBR profile collected with `perf`, and since we were sampling |
| the `cycles` counter, their accuracy is limited. However, the relative improvement in `taken conditional |
| branches` is a good indication that BOLT was able to straighten out the code even after PGO. |
| |
| ## Measuring Compile-time Improvement |
| |
| `clang-7.bolt` can be used as a replacement for *PGO+LTO* Clang: |
| ```bash |
| $ mv $CPATH/clang-7 $CPATH/clang-7.org |
| $ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7 |
| ``` |
| Doing a new build of Clang using the new binary shows a significant overall |
| build time reduction on a 48-core Haswell system: |
| ```bash |
| $ ln -fs $CPATH/clang-7.org $CPATH/clang-7 |
| $ ninja clean && /bin/time -f %e ninja clang -j48 |
| 202.72 |
| $ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7 |
| $ ninja clean && /bin/time -f %e ninja clang -j48 |
| 180.11 |
| ``` |
| That's 22.61 seconds (or 12%) faster compared to the *PGO+LTO* build. |
| Notice that we are measuring the improvement in total build time, which includes the time spent in the linker. |
| Compilation time improvements for individual files differ, and speedups over 15% are not uncommon. |
| If we run BOLT on a Clang binary compiled without *PGO+LTO* (in which case the build finishes in 253.32 seconds), |
| the gains we see are over 50 seconds (25%), |
| but, as expected, the result is still slower than the *PGO+LTO+BOLT* build. |
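| |
A percentage for a build-time comparison can be computed in two common ways, and they give different numbers. The sketch below shows both for the timings measured above; the function names are made up for illustration. The percentages quoted in this tutorial appear to follow the first convention (old time divided by new time, minus one), which is an inference from the numbers rather than something the text states.

```bash
# Print how much faster the new build is, in percent: (before / after - 1) * 100.
speedup_percent() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (before / after - 1) * 100 }'
}

# Print how much shorter the new build is, in percent: (before - after) / before * 100.
time_saved_percent() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

speedup_percent 202.72 180.11      # PGO+LTO vs. PGO+LTO+BOLT, prints 12.6
time_saved_percent 202.72 180.11   # same pair as a reduction in wall time, prints 11.2
```

When comparing results across machines or write-ups, stating which of the two conventions is used avoids confusion.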
| |
| ## Source of the Wins |
| |
| We mentioned that Clang suffers from a considerable number of instruction cache misses. This can be measured with `perf`: |
| ```bash |
| $ ln -fs $CPATH/clang-7.org $CPATH/clang-7 |
| $ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48 |
| ... |
| 16,366,101,626,647 instructions |
| 359,996,216,537 L1-icache-misses |
| ``` |
| That's about 22 instruction cache misses per thousand instructions. As a rule of thumb, if the application |
| has over 10 misses per thousand instructions, it is a good indication that it will be improved by BOLT. |
| Now let's see how many misses are in the BOLTed binary: |
| ```bash |
| $ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7 |
| $ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48 |
| ... |
| 16,319,818,488,769 instructions |
| 244,888,677,972 L1-icache-misses |
| ``` |
| The number of misses per thousand instructions went down from 22 to 15, significantly reducing |
| the number of stalls in the CPU front-end. |
| Notice how the number of executed instructions stayed roughly the same. That's because we didn't |
| run any optimizations beyond the ones affecting the code layout. Other than instruction cache misses, |
| BOLT also reduces branch mispredictions, iTLB misses, and misses in L2 and L3. |
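| |
The misses-per-thousand-instructions figures can be reproduced directly from the `perf stat` counters. The helper below is a small sketch; the name `mpki` is chosen here for illustration, and it accepts the counts exactly as `perf stat` prints them, including the thousands separators.

```bash
# Print misses per thousand instructions.
# $1 = L1-icache-misses count, $2 = instructions count.
mpki() {
    # Strip the thousands separators that perf stat prints.
    misses=$(printf '%s' "$1" | tr -d ',')
    instructions=$(printf '%s' "$2" | tr -d ',')
    awk -v m="$misses" -v i="$instructions" 'BEGIN { printf "%.1f\n", m / i * 1000 }'
}

mpki 359,996,216,537 16,366,101,626,647   # PGO+LTO Clang, prints 22.0
mpki 244,888,677,972 16,319,818,488,769   # PGO+LTO+BOLT Clang, prints 15.0
```

The same helper can be pointed at counters from any other workload to see whether it crosses the rule-of-thumb threshold of 10 misses per thousand instructions.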
| |
| ## Using Clang for Other Applications |
| |
| We collected the profile for Clang while compiling its own source code. Is that enough to speed up |
| the compilation of other projects? We picked `mysqld`, an open-source database, to do the test. |
| |
| On our 48-core Haswell system, the build finished in 136.06 seconds using the *PGO+LTO* Clang and in 126.10 seconds using the *PGO+LTO+BOLT* Clang. |
| That's a noticeable improvement, but not as significant as the one we saw on Clang itself. |
| This is partially because the number of instruction cache misses is slightly lower in this scenario: 19 vs. 22 per thousand instructions. |
| Another reason is that Clang is run with a different set of options while building `mysqld` compared |
| to the training run. |
| |
| Different options exercise different code paths, and |
| if we trained without a specific option, we may have misplaced parts of the code responsible for handling it. |
| To test this theory, we collected another `perf` profile while building `mysqld`, and merged it with the existing profile |
| using the `merge-fdata` utility that comes with BOLT. Optimized with that profile, the *PGO+LTO+BOLT* Clang was able |
| to perform the `mysqld` build in 124.74 seconds, i.e. 11 seconds or 9% faster compared to the *PGO+LTO* Clang. |
| The merged profile didn't make the original Clang compilation slower either, while the number of profiled functions in Clang increased from 11,415 to 14,025. |
| |
| Ideally, the profile run should be done with a superset of all commonly used options. However, most of the improvement is expected with just the basic set. |
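| |
The merge step can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `merge_profiles` and the `MERGE_FDATA` override are invented for illustration, and the file names in the usage note are hypothetical. It relies on `merge-fdata` writing the combined profile to standard output.

```bash
# Merge several BOLT profiles into a single file.
# Usage: merge_profiles OUTPUT INPUT...
# MERGE_FDATA is a hypothetical override; it defaults to merge-fdata from $PATH.
merge_profiles() {
    out=$1
    shift
    # merge-fdata prints the combined profile on standard output.
    "${MERGE_FDATA:-merge-fdata}" "$@" > "$out"
}
```

With one profile per training workload, for example `clang-build.fdata` and `mysqld-build.fdata`, the call would be `merge_profiles merged.fdata clang-build.fdata mysqld-build.fdata`. The merged file is then passed to `llvm-bolt` with `-b merged.fdata` in place of the single-workload profile.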
| |
| ## Summary |
| |
| In this tutorial we demonstrated how to use BOLT to improve the |
| performance of the Clang compiler. Similarly, BOLT could be used to improve the performance |
| of GCC, or any other application suffering from a high number of instruction |
| cache misses. |
| |
| ---- |
| # Appendix |
| |
| ## Bootstrapping Clang-7 with PGO and LTO |
| |
| Below we describe detailed steps to build Clang, and make it ready for BOLT |
| optimizations. If you already have the build setup, you can skip this section, |
| except for the last step that adds the `-Wl,-q` linker flag to the final build. |
| |
| ### Getting Clang-7 Sources |
| |
| Set `$TOPLEV` to the directory of your preference where you would like to do |
| builds, e.g. `TOPLEV=~/clang-7/`. Then run the following commands to clone the `release/7.x` |
| branch of the LLVM monorepo: |
| ```bash |
| $ mkdir ${TOPLEV} |
| $ cd ${TOPLEV} |
| $ git clone --branch=release/7.x https://github.com/llvm/llvm-project.git |
| ``` |
| |
| ### Building Stage 1 Compiler |
| |
| Stage 1 will be the first build we are going to do, and we will be using the |
| default system compiler to build Clang. If your system lacks a compiler, use |
| your distribution package manager to install one that supports C++11. In this |
| example we are going to use GCC. In addition to the compiler, you will need the |
| `cmake` and `ninja` packages. Note that we disable the build of certain |
| compiler-rt components that are known to cause build issues at release/7.x. |
| ```bash |
| $ mkdir ${TOPLEV}/stage1 |
| $ cd ${TOPLEV}/stage1 |
| $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ |
| -DCMAKE_BUILD_TYPE=Release \ |
| -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_ASM_COMPILER=gcc \ |
| -DLLVM_ENABLE_PROJECTS="clang;lld" \ |
| -DLLVM_ENABLE_RUNTIMES="compiler-rt" \ |
| -DCOMPILER_RT_BUILD_SANITIZERS=OFF -DCOMPILER_RT_BUILD_XRAY=OFF \ |
| -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \ |
| -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage1/install |
| $ ninja install |
| ``` |
| |
| ### Building Stage 2 Compiler With Instrumentation |
| |
| Using the freshly-baked stage 1 Clang compiler, we are going to build Clang with |
| profile generation capabilities: |
| ```bash |
| $ mkdir ${TOPLEV}/stage2-prof-gen |
| $ cd ${TOPLEV}/stage2-prof-gen |
| $ CPATH=${TOPLEV}/stage1/install/bin/ |
| $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ |
| -DCMAKE_BUILD_TYPE=Release \ |
| -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ |
| -DLLVM_ENABLE_PROJECTS="clang;lld" \ |
| -DLLVM_USE_LINKER=lld -DLLVM_BUILD_INSTRUMENTED=ON \ |
| -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-gen/install |
| $ ninja install |
| ``` |
| |
| ### Generating Profile for PGO |
| |
| While there are many ways to obtain the profile data, we are going to use the |
| source code already at our disposal, i.e. we are going to collect the profile |
| while building Clang itself: |
| ```bash |
| $ mkdir ${TOPLEV}/stage3-train |
| $ cd ${TOPLEV}/stage3-train |
| $ CPATH=${TOPLEV}/stage2-prof-gen/install/bin |
| $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ |
| -DCMAKE_BUILD_TYPE=Release \ |
| -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ |
| -DLLVM_ENABLE_PROJECTS="clang" \ |
| -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3-train/install |
| $ ninja clang |
| ``` |
| Once the build is completed, the profile files will be saved under |
| `${TOPLEV}/stage2-prof-gen/profiles`. We will merge them before they can be |
| passed back into Clang: |
| ```bash |
| $ cd ${TOPLEV}/stage2-prof-gen/profiles |
| $ ${TOPLEV}/stage1/install/bin/llvm-profdata merge -output=clang.profdata * |
| ``` |
| |
| ### Building Clang with PGO and LTO |
| |
| Now the profile can be used to guide optimizations to produce better code for |
| our scenario, i.e. building Clang. We will also enable link-time optimizations |
| to allow cross-module inlining and other optimizations. Finally, we are going to |
| add one extra step that is useful for BOLT: a flag instructing the linker to |
| preserve relocations in the output binary. Note that this flag does not affect |
| the generated code or data used at runtime; it only writes metadata to the file |
| on disk: |
| ```bash |
| $ mkdir ${TOPLEV}/stage2-prof-use-lto |
| $ cd ${TOPLEV}/stage2-prof-use-lto |
| $ CPATH=${TOPLEV}/stage1/install/bin/ |
| $ export LDFLAGS="-Wl,-q" |
| $ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \ |
| -DCMAKE_BUILD_TYPE=Release \ |
| -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \ |
| -DLLVM_ENABLE_PROJECTS="clang;lld" \ |
| -DLLVM_ENABLE_LTO=Full \ |
| -DLLVM_PROFDATA_FILE=${TOPLEV}/stage2-prof-gen/profiles/clang.profdata \ |
| -DLLVM_USE_LINKER=lld \ |
| -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-use-lto/install |
| $ ninja install |
| ``` |
| Now we have a Clang compiler that can build itself much faster. As we will see, |
| it builds other applications faster as well, and, with BOLT, the compile time |
| can be improved even further. |