| # BOLT |
| |
| BOLT is a post-link optimizer developed to speed up large applications. |
| It achieves the improvements by optimizing the application's code layout based on |
| an execution profile gathered by a sampling profiler, such as the Linux `perf` tool. |
| An overview of the ideas implemented in BOLT, along with a discussion of its |
| potential and current results, is available in the |
| [CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/). |
| |
| ## Input Binary Requirements |
| |
| BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries |
| should have an unstripped symbol table, and, to get maximum performance gains, |
| they should be linked with relocations (the `--emit-relocs` or `-q` linker flag). |
| |
| BOLT disassembles functions and reconstructs the control flow graph (CFG) |
| before it runs optimizations. Since this is a nontrivial task, |
| especially when indirect branches are present, we rely on certain heuristics |
| to accomplish it. These heuristics have been tested on code generated by the |
| Clang and GCC compilers. The main requirement for C/C++ code is not to rely |
| on code layout properties, such as function pointer deltas. |
| Assembly code can be processed too. Requirements for it include a clear |
| separation of code and data, with data objects being placed into data |
| sections/segments. If indirect jumps are used for intra-function control |
| transfer (e.g., jump tables), the code patterns should match those |
| generated by Clang/GCC. |
| |
| NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition` |
| compiler option. Since GCC 8 enables this option by default, you have to |
| explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if |
| you are compiling with GCC 8 or above. |
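
As a minimal sketch, a GCC build that follows both recommendations above (block partitioning disabled, relocations kept by the linker) might look like this. The function name, the `-O2` level, and the file names in the usage line are assumptions for illustration; only the two BOLT-related flags come from this document.

```shell
# Hypothetical helper: build a BOLT-friendly binary with GCC 8 or later.
# $1 = output binary; remaining arguments = source files.
build_for_bolt() {
    out=$1
    shift
    # -fno-reorder-blocks-and-partition: turn off the GCC default that BOLT cannot handle.
    # -Wl,--emit-relocs: pass --emit-relocs to the linker so relocations are kept.
    # -O2 is an assumed optimization level, not a BOLT requirement.
    gcc -O2 -fno-reorder-blocks-and-partition -Wl,--emit-relocs -o "$out" "$@"
}
```

For example, `build_for_bolt app app.c` would produce `app` from a placeholder source file `app.c`.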
| |
| NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM and GCC |
| compilers. It offers several benefits over the previous DWARF v4. Currently, |
| BOLT's support for v5 is a work in progress. While you will be able to |
| optimize binaries produced by the latest compilers, until the support is |
| complete, you will not be able to update the debug info with |
| `-update-debug-sections`. To temporarily work around the issue, we recommend |
| compiling binaries with the `-gdwarf-4` option, which forces DWARF v4 output. |
| |
| PIE and .so support has been added recently. Please report bugs if you |
| encounter any issues. |
| |
| ## Installation |
| |
| ### Docker Image |
| |
| You can build and use a Docker image containing BOLT with our [Dockerfile](utils/docker/Dockerfile). |
| Alternatively, you can build BOLT manually using the steps below. |
| |
| ### Manual Build |
| |
| BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM |
| tools. The build process is not much different from a regular LLVM build. |
| The following instructions assume that you are running under Linux. |
| |
| Start by cloning the LLVM repo: |
| |
| ``` |
| > git clone https://github.com/llvm/llvm-project.git |
| > mkdir build |
| > cd build |
| > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt" |
| > ninja bolt |
| ``` |
| |
| `llvm-bolt` will be available under `bin/`. Add this directory to your `PATH` to |
| ensure the rest of the commands in this tutorial work. |
| |
| ## Optimizing BOLT's Performance |
| |
| BOLT runs many internal passes in parallel. If you foresee heavy usage of |
| BOLT, you can improve the processing time by linking against one of the memory |
| allocation libraries with good support for concurrency. For example, to use jemalloc: |
| |
| ``` |
| > sudo yum install jemalloc-devel |
| > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt .... |
| ``` |
| Or, if you would rather use tcmalloc: |
| ``` |
| > sudo yum install gperftools-devel |
| > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt .... |
| ``` |
| |
| ## Usage |
| |
| For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](docs/OptimizingClang.md). |
| |
| ### Step 0 |
| |
| In order to allow BOLT to re-arrange functions (in addition to re-arranging |
| code within functions) in your program, it needs a little help from the linker. |
| Add `--emit-relocs` to the final link step of your application. You can verify |
| the presence of relocations by checking for the `.rela.text` section in the binary. |
| BOLT will also report if it detects relocations while processing the binary. |
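
One way to perform this check is with `readelf`, which lists the section headers of an ELF file. The sketch below wraps the check in a small function; the function name is hypothetical, and it assumes `readelf` (from binutils) is installed.

```shell
# Hypothetical helper: succeed if the ELF file "$1" contains a .rela.text section.
has_text_relocs() {
    # readelf -S prints the section headers; -W keeps long lines untruncated.
    # grep -F matches the section name literally; -q suppresses output.
    readelf -S -W "$1" 2>/dev/null | grep -q -F '.rela.text'
}
```

A possible use is `has_text_relocs ./app && echo "relocations present"`, where `./app` stands for your linked binary.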
| |
| ### Step 1: Collect Profile |
| |
| This step is different for different kinds of executables. If you can invoke |
| your program to run on a representative input from a command line, then check |
| the **For Applications** section below. If your program typically runs as a |
| server/service, then skip to the **For Services** section. |
| |
| The version of the `perf` command used for the following steps has to support |
| the `-F brstack` option. We recommend using `perf` version 4.5 or later. |
| |
| #### For Applications |
| |
| This assumes you can run your program from a command line with a typical input. |
| In this case, simply prepend the command line invocation with `perf`: |
| ``` |
| $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ... |
| ``` |
| |
| #### For Services |
| |
| Once you get the service deployed and warmed up, it is time to collect perf |
| data with LBR (branch information). The exact perf command to use will depend |
| on the service. E.g., to collect the data for all processes running on the |
| server for the next 3 minutes use: |
| ``` |
| $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180 |
| ``` |
| |
| Depending on the application, you may need more samples to be included with |
| your profile. It's hard to tell upfront what would be a sweet spot for your |
| application. We recommend that the profile cover 1B instructions as reported |
| by BOLT's `-dyno-stats` option. If you need to increase the number of samples |
| in the profile, you can either run the `sleep` command for longer or use the |
| `-F<N>` option with `perf` to increase the sampling frequency. |
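
To make both knobs explicit, the service command above can be parameterized by sampling frequency and duration. This is only a sketch: the function name is hypothetical, and suitable values depend on your workload.

```shell
# Hypothetical helper: system-wide LBR profile with an explicit sampling frequency.
# $1 = samples per second (passed to perf -F); $2 = collection time in seconds.
record_service_profile() {
    perf record -e cycles:u -j any,u -a -F "$1" -o perf.data -- sleep "$2"
}
```

For instance, `record_service_profile 5000 300` would request 5000 samples per second for five minutes; both numbers are illustrative, not recommendations.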
| |
| Note that for profile collection we recommend using cycle events and not |
| `BR_INST_RETIRED.*`. Empirically, we found that cycle events produce better results. |
| |
| If collecting a profile with branch information is not possible, e.g., when you run on |
| a VM or on hardware that does not support it, then you can use only sample |
| events, such as cycles. In this case, the quality of the profile information |
| would not be as good, and performance gains with BOLT are expected to be lower. |
| |
| #### With instrumentation |
| |
| If `perf record` is not available to you, you may collect a profile by first |
| instrumenting the binary with BOLT and then running it. |
| ``` |
| llvm-bolt <executable> -instrument -o <instrumented-executable> |
| ``` |
| |
| After you run `<instrumented-executable>` with the desired workload, its BOLT |
| profile should be ready for you in `/tmp/prof.fdata`, and you can skip |
| **Step 2**. |
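
Put together, the instrumentation route might be scripted roughly as follows. This is a sketch under assumptions: the function name and the `.inst`/`.bolt` output suffixes are hypothetical, and the optimization flags are simply the ones used in **Step 3** of this guide.

```shell
# Hypothetical helper: collect a profile via instrumentation, then optimize.
# $1 = path to the original binary (e.g. ./app); remaining arguments = workload arguments.
bolt_via_instrumentation() {
    bin=$1
    shift
    # 1. Produce an instrumented copy of the binary.
    llvm-bolt "$bin" -instrument -o "$bin.inst" || return 1
    # 2. Run the workload; the profile is expected in /tmp/prof.fdata afterwards.
    "$bin.inst" "$@" || return 1
    # 3. Optimize the original binary with the collected profile.
    llvm-bolt "$bin" -o "$bin.bolt" -data=/tmp/prof.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -split-eh -dyno-stats
}
```

A call such as `bolt_via_instrumentation ./app <args>` would leave the optimized binary in `./app.bolt`.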
| |
| Run BOLT with the `-help` option and check the category "BOLT instrumentation |
| options" for a quick reference on instrumentation knobs. |
| |
| ### Step 2: Convert Profile to BOLT Format |
| |
| NOTE: you can skip this step and feed `perf.data` directly to BOLT using the |
| experimental `-p perf.data` option. |
| |
| For this step, you will need the `perf.data` file collected in the previous step and |
| a copy of the binary that was running. The binary must either be |
| unstripped or have its symbol table intact (i.e., running `strip -g` is |
| okay). |
| |
| Make sure `perf` is in your `PATH`, and execute `perf2bolt`: |
| ``` |
| $ perf2bolt -p perf.data -o perf.fdata <executable> |
| ``` |
| |
| This command will aggregate branch data from `perf.data` and store it in a |
| format that is both more compact and more resilient to binary modifications. |
| |
| If the profile was collected without LBRs, you will need to add the `-nl` flag to |
| the command line above. |
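
For the non-LBR case, collection and conversion could be combined as sketched below. The function name is hypothetical. The `perf record` line is the application command from **Step 1** with the `-j any,u` branch option dropped, and `-nl` is the flag described above.

```shell
# Hypothetical helper: profile a command without LBR, then convert for BOLT.
# $1 = binary to profile; remaining arguments are passed to it.
profile_without_lbr() {
    bin=$1
    shift
    # No -j option: perf records plain cycle samples without branch stacks.
    perf record -e cycles:u -o perf.data -- "$bin" "$@" || return 1
    # -nl tells perf2bolt that the profile contains no LBR data.
    perf2bolt -p perf.data -o perf.fdata -nl "$bin"
}
```

After `profile_without_lbr ./app <args>`, the resulting `perf.fdata` can be used in **Step 3** as usual.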
| |
| ### Step 3: Optimize with BOLT |
| |
| Once you have `perf.fdata` ready, you can use it for optimizations with |
| BOLT. Assuming your environment is set up to include the right path, execute |
| `llvm-bolt`: |
| ``` |
| $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats |
| ``` |
| |
| If you need updated debug info, add the `-update-debug-sections` option |
| to the command above. The processing time will be slightly longer. |
| |
| For a full list of options see `-help`/`-help-hidden` output. |
| |
| The input binary for this step does not have to 100% match the binary used for |
| profile collection in **Step 1**. A mismatch can happen when you are doing active |
| development and the source code constantly changes, yet you want to benefit |
| from profile-guided optimizations. However, since the binary is not precisely the |
| same, the profile information could become invalid or stale, and BOLT will |
| report the number of functions with a stale profile. The higher the |
| number, the less performance improvement should be expected. Thus, it is |
| crucial to update `.fdata` for release branches. |
| |
| ## Multiple Profiles |
| |
| Suppose your application can run in different modes, and you can generate |
| a separate profile for each of them. To generate a single binary that can |
| benefit all modes (assuming the profiles don't contradict each other), you can |
| use the `merge-fdata` tool: |
| ``` |
| $ merge-fdata *.fdata > combined.fdata |
| ``` |
| Use `combined.fdata` for **Step 3** above to generate a universally optimized |
| binary. |
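
The merge and the optimization can also be chained in one helper, as in the sketch below. The function name and the `.bolt` output suffix are hypothetical; the flags are those from **Step 3**.

```shell
# Hypothetical helper: merge several .fdata profiles and optimize once.
# $1 = input binary; remaining arguments = .fdata files (for example, one per mode).
bolt_with_merged_profiles() {
    bin=$1
    shift
    # Combine all given profiles into a single file.
    merge-fdata "$@" > combined.fdata || return 1
    # Optimize the binary using the combined profile.
    llvm-bolt "$bin" -o "$bin.bolt" -data=combined.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -split-eh -dyno-stats
}
```

For example, `bolt_with_merged_profiles ./app mode1.fdata mode2.fdata`, where the two profile names are placeholders.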
| |
| ## License |
| |
| BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT). |