| ========================= |
| Performance Investigation |
| ========================= |
| |
| Multiple factors contribute to the time it takes to analyze a file with Clang Static Analyzer. |
| A translation unit contains multiple entry points, each of which take multiple steps to analyze. |
| |
| Performance analysis using ``-ftime-trace`` |
| =========================================== |
| |
| You can add the ``-ftime-trace=file.json`` option to break down the analysis time into individual entry points and steps within each entry point. |
| You can explore the generated JSON file in a Chromium browser using the ``chrome://tracing`` URL, |
| or using `perfetto <https://ui.perfetto.dev>`_ or `speedscope <https://speedscope.app>`_. |
| Once you narrow down to specific analysis steps you are interested in, you can more effectively employ heavier profilers, |
| such as `Perf <https://perfwiki.github.io/main/>`_ and `Callgrind <https://valgrind.org/docs/manual/cl-manual.html>`_. |
| |
| Each analysis step has a time scope in the trace, corresponds to processing of an exploded node, and is designated with a ``ProgramPoint``. |
| If the ``ProgramPoint`` is associated with a location, you can see it on the scope metadata label. |
| |
| Here is an example of a time trace produced with |
| |
| .. code-block:: bash |
| :caption: Clang Static Analyzer invocation to generate a time trace of string.c analysis. |
| |
| clang -cc1 -analyze -verify clang/test/Analysis/string.c \ |
| -analyzer-checker=core,unix,alpha.unix.cstring,debug.ExprInspection \ |
| -ftime-trace=trace.json -ftime-trace-granularity=1 |
| |
| .. image:: ../images/speedscope.png |
| |
| On the speedscope screenshot above, under the first time ruler is the bird's-eye view of the entire trace that spans a little over 60 milliseconds. |
| Under the second ruler (focused on the 18.09-18.13ms time point) you can see a narrowed-down portion. |
| The second box ("HandleCode memset...") that spans entire screen (and actually extends beyond it) corresponds to the analysis of ``memset16_region_cast()`` entry point that is defined in the "string.c" test file on line 1627. |
| Below it, you can find multiple sub-scopes each corresponding to processing of a single exploded node. |
| |
| - First: a ``PostStmt`` for some statement on line 1634. This scope has a selected subscope "CheckerManager::runCheckersForCallEvent (Pre)" that takes 5 microseconds. |
| - Four other nodes, too small to be discernible at this zoom level |
| - Last on this screenshot: another ``PostStmt`` for a statement on line 1635. |
| |
| In addition to the ``-ftime-trace`` option, you can use ``-ftime-trace-granularity`` to fine-tune the time trace. |
| |
| - ``-ftime-trace-granularity=NN`` dumps only time scopes that are longer than NN microseconds. |
| - ``-ftime-trace-verbose`` enables some additional dumps in the frontend related to template instantiations. |
| At the moment, it has no effect on the traces from the static analyzer. |
| |
| Note: Both Chrome-tracing and speedscope tools might struggle with time traces above 100 MB in size. |
| Luckily, in most cases the default max-steps boundary of 225 000 produces the traces of approximately that size |
| for a single entry point. |
| You can use ``-analyze-function=get_global_options`` together with ``-ftime-trace`` to narrow down analysis to a specific entry point. |
| |
| |
| Performance analysis using ``perf`` |
| =================================== |
| |
| `Perf <https://perfwiki.github.io/main/>`_ is a tool for conducting sampling-based profiling. |
| It's easy to start profiling, you only have 2 prerequisites. |
| Build with ``-fno-omit-frame-pointer`` and debug info (``-g``). |
| You can use release builds, but probably the easiest is to set the ``CMAKE_BUILD_TYPE=RelWithDebInfo`` |
| along with ``CMAKE_CXX_FLAGS="-fno-omit-frame-pointer"`` when configuring ``llvm``. |
| Here is how to `get started <https://llvm.org/docs/CMake.html#quick-start>`_ if you are in trouble. |
| |
| .. code-block:: bash |
| :caption: Running the Clang Static Analyzer through ``perf`` to gather samples of the execution. |
| |
| # -F: Sampling frequency, use `-F max` for maximal frequency |
| # -g: Enable call-graph recording for both kernel and user space |
| perf record -F 99 -g -- clang -cc1 -analyze -verify clang/test/Analysis/string.c \ |
| -analyzer-checker=core,unix,alpha.unix.cstring,debug.ExprInspection |
| |
| Once you have the profile data, you can use it to produce a Flame graph. |
| A Flame graph is a visual representation of the stack frames of the samples. |
| Common stack frame prefixes are squashed together, making up a wider bar. |
| The wider the bar, the more time was spent under that particular stack frame, |
| giving a sense of how the overall execution time was spent. |
| |
| Clone the `FlameGraph <https://github.com/brendangregg/FlameGraph>`_ git repository, |
| as we will use some scripts from there to convert the ``perf`` samples into a Flame graph. |
| It's also useful to check out Brendan Gregg's (the author of FlameGraph) |
| `homepage <https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html>`_. |
| |
| |
| .. code-block:: bash |
| :caption: Converting the ``perf`` profile into a Flamegraph, then opening it in Firefox. |
| |
| perf script | /path/to/FlameGraph/stackcollapse-perf.pl > perf.folded |
| /path/to/FlameGraph/flamegraph.pl perf.folded > perf.svg |
| firefox perf.svg |
| |
| .. image:: ../images/flamegraph.png |
| |
| |
| Performance analysis using ``uftrace`` |
| ====================================== |
| |
| `uftrace <https://github.com/namhyung/uftrace/wiki/Tutorial#getting-started>`_ is a great tool to generate rich profile data |
| that you can use to focus and drill down into the timeline of your application. |
| We will use it to generate Chromium trace JSON. |
| In contrast to ``perf``, this approach statically instruments every function, so it should be more precise and thorough than the sampling-based approaches like ``perf``. |
| In contrast to using ``-ftime-trace``, functions don't need to opt-in to be profiled using ``llvm::TimeTraceScope``. |
| All functions are profiled due to automatic static instrumentation. |
| |
| There is only one prerequisite to use this tool. |
| You need to build the binary you are about to instrument using ``-pg`` or ``-finstrument-functions``. |
| This will make it run substantially slower but allows rich instrumentation. |
| It will also consume many gigabites of storage for a single trace unless filter flags are used during recording. |
| |
| .. code-block:: bash |
| :caption: Recording with ``uftrace``, then dumping the result as a Chrome trace JSON. |
| |
| uftrace record clang -cc1 -analyze -verify clang/test/Analysis/string.c \ |
| -analyzer-checker=core,unix,alpha.unix.cstring,debug.ExprInspection |
| uftrace dump --filter=".*::AnalysisConsumer::HandleTranslationUnit" --time-filter=300 --chrome > trace.json |
| |
| .. image:: ../images/uftrace_detailed.png |
| |
| In this picture, you can see the functions below the Static Analyzer's entry point, which takes at least 300 nanoseconds to run, visualized by Chrome's ``about:tracing`` page |
| You can also see how deep function calls we may have due to AST visitors. |
| |
| Using different filters can reduce the number of functions to record. |
| For the common options, refer to the ``uftrace`` `documentation <https://github.com/namhyung/uftrace/blob/master/doc/uftrace-record.md#common-options>`_. |
| |
| Similar filters can be applied for dumping too. That way you can reuse the same (detailed) |
| recording to selectively focus on some special part using a refinement of the filter flags. |
| Remember, the trace JSON needs to fit into Chrome's ``about:tracing`` or `speedscope <https://speedscope.app>`_, |
| thus it needs to be of a limited size. |
| If you do not apply filters on recording, you will collect a large trace and every dump operation |
| would need to sieve through the much larger recording which may be annoying if done repeatedly. |
| |
| If the trace JSON is still too large to load, have a look at the dump as plain text and look for frequent entries that refer to non-interesting parts. |
| Once you have some of those, add them as ``--hide`` flags to the ``uftrace dump`` call. |
| To see what functions appear frequently in the trace, use this command: |
| |
| .. code-block:: bash |
| |
| cat trace.json | grep -Po '"name":"(.+)"' | sort | uniq -c | sort -nr | head -n 50 |
| |
| ``uftrace`` can also dump the report as a Flame graph using ``uftrace dump --framegraph``. |