| ========================= |
| Compiling CUDA with clang |
| ========================= |
| |
| .. contents:: |
| :local: |
| |
| Introduction |
| ============ |
| |
| This document describes how to compile CUDA code with clang, and gives some |
| details about LLVM and clang's CUDA implementations. |
| |
| This document assumes a basic familiarity with CUDA. Information about CUDA |
| programming can be found in the |
| `CUDA programming guide |
| <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_. |
| |
| Compiling CUDA Code |
| =================== |
| |
| Prerequisites |
| ------------- |
| |
CUDA has been supported since LLVM 3.9. Clang currently supports CUDA 7.0
through 12.1. If clang detects a newer CUDA version, it will issue a warning
and will attempt to use the detected CUDA SDK as if it were CUDA 12.1.
| |
| Before you build CUDA code, you'll need to have installed the CUDA SDK. See |
| `NVIDIA's CUDA installation guide |
| <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for |
details. Note that clang `may not support
| <https://bugs.llvm.org/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed by |
| some Linux package managers. Clang does attempt to deal with specific details of |
| CUDA installation on a handful of common Linux distributions, but in general the |
| most reliable way to make it work is to install CUDA in a single directory from |
NVIDIA's `.run` package and specify its location via the `--cuda-path=...` argument.
| |
CUDA compilation is supported on Linux. Compilation on MacOS and Windows may or
may not work and currently has no maintainers.
| |
| Invoking clang |
| -------------- |
| |
| Invoking clang for CUDA compilation works similarly to compiling regular C++. |
| You just need to be aware of a few additional flags. |
| |
| You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_ |
| program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're |
| compiling CUDA code by noticing that your filename ends with ``.cu``. |
| Alternatively, you can pass ``-x cuda``.) |
| |
| To build and run, run the following commands, filling in the parts in angle |
| brackets as described below: |
| |
| .. code-block:: console |
| |
| $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \ |
| -L<CUDA install path>/<lib64 or lib> \ |
| -lcudart_static -ldl -lrt -pthread |
| $ ./axpy |
| y[0] = 2 |
| y[1] = 4 |
| y[2] = 6 |
| y[3] = 8 |
| |
| On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get |
| "CUDA driver version is insufficient for CUDA runtime version" errors when you |
| run your program. |
| |
* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.
| |
| Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise, |
| pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code |
| always have the same pointer widths, so if you're compiling 64-bit code for |
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0, the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.
| |
| * ``<GPU arch>`` -- the `compute capability |
| <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you |
| want to run your program on a GPU with compute capability of 3.5, specify |
| ``--cuda-gpu-arch=sm_35``. |
| |
| Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``; |
| only ``sm_XX`` is currently supported. However, clang always includes PTX in |
| its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be |
| forwards-compatible with e.g. ``sm_35`` GPUs. |
| |
| You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs. |
| |
| The `-L` and `-l` flags only need to be passed when linking. When compiling, |
| you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install |
| the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``. |
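
For example, a hypothetical compile-then-link invocation targeting two GPU
architectures might look like this (the architectures and paths are
illustrative):

.. code-block:: console

  $ clang++ -c axpy.cu -o axpy.o --cuda-path=/usr/local/cuda \
      --cuda-gpu-arch=sm_50 --cuda-gpu-arch=sm_70
  $ clang++ axpy.o -o axpy -L/usr/local/cuda/lib64 \
      -lcudart_static -ldl -lrt -pthread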
| |
| Flags that control numerical code |
| --------------------------------- |
| |
| If you're using GPUs, you probably care about making numerical code run fast. |
| GPU hardware allows for more control over numerical operations than most CPUs, |
| but this results in more compiler options for you to juggle. |
| |
| Flags you may wish to tweak include: |
| |
| * ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when |
| compiling CUDA) Controls whether the compiler emits fused multiply-add |
| operations. |
| |
| * ``off``: never emit fma operations, and prevent ptxas from fusing multiply |
| and add instructions. |
| * ``on``: fuse multiplies and adds within a single statement, but never |
| across statements (C11 semantics). Prevent ptxas from fusing other |
| multiplies and adds. |
| * ``fast``: fuse multiplies and adds wherever profitable, even across |
| statements. Doesn't prevent ptxas from fusing additional multiplies and |
| adds. |
| |
  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code (see the sketch after this list).
| |
| * ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled, |
| floating point operations may flush `denormal |
| <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0. |
| Operations on denormal numbers are often much slower than the same operations |
| on normal numbers. |
| |
| * ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the |
| compiler may emit calls to faster, approximate versions of transcendental |
| functions, instead of using the slower, fully IEEE-compliant versions. For |
| example, this flag allows clang to emit the ptx ``sin.approx.f32`` |
| instruction. |
| |
| This is implied by ``-ffast-math``. |
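
As a concrete sketch of what ``-ffp-contract`` controls, consider a simple
kernel (the kernel itself is illustrative):

.. code-block:: c++

  // With -ffp-contract=fast (the CUDA default), the multiply and add below may
  // be fused into a single fma instruction; its unrounded intermediate product
  // can make the result differ slightly from the unfused computation. With
  // -ffp-contract=off, the two operations are kept separate.
  __global__ void axpy_kernel(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      y[i] = a * x[i] + y[i];
  }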
| |
| Standard library support |
| ======================== |
| |
| In clang and nvcc, most of the C++ standard library is not supported on the |
| device side. |
| |
| ``<math.h>`` and ``<cmath>`` |
| ---------------------------- |
| |
| In clang, ``math.h`` and ``cmath`` are available and `pass |
| <https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/math_h.cu>`_ |
| `tests |
| <https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/cmath.cu>`_ |
| adapted from libc++'s test suite. |
| |
| In nvcc ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof`` |
| in namespace std (e.g. ``std::sinf``) are not available, and where the standard |
| calls for overloads that take integral arguments, these are usually not |
| available. |
| |
.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }
| |
``std::complex``
| ------------------ |
| |
| nvcc does not officially support ``std::complex``. It's an error to use |
| ``std::complex`` in ``__device__`` code, but it often works in ``__host__ |
| __device__`` code due to nvcc's interpretation of the "wrong-side rule" (see |
| below). However, we have heard from implementers that it's possible to get |
| into situations where nvcc will omit a call to an ``std::complex`` function, |
| especially when compiling without optimizations. |
| |
| As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is |
| tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++ |
| newer than 2016-11-16. |
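
For example, the following sketch is expected to compile with clang, assuming a
supported standard library as described above:

.. code-block:: c++

  #include <complex>

  // With clang, std::complex arithmetic can be used directly in device code.
  __device__ std::complex<float> square(std::complex<float> z) {
    return z * z;
  }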
| |
| ``<algorithm>`` |
| --------------- |
| |
| In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and |
| ``std::max``) become constexpr. You can therefore use these in device code, |
| when compiling with clang. |
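
For example, a minimal sketch, assuming you compile with ``-std=c++14`` or
newer:

.. code-block:: c++

  #include <algorithm>

  // std::min and std::max are constexpr in C++14, so clang can call them from
  // device code.
  __device__ int clamp_to_byte(int x) {
    return std::min(std::max(x, 0), 255);
  }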
| |
| Detecting clang vs NVCC from code |
| ================================= |
| |
| Although clang's CUDA implementation is largely compatible with NVCC's, you may |
| still want to detect when you're compiling CUDA code specifically with clang. |
| |
| This is tricky, because NVCC may invoke clang as part of its own compilation |
| process! For example, NVCC uses the host compiler's preprocessor when |
| compiling for device code, and that host compiler may in fact be clang. |
| |
| When clang is actually compiling CUDA code -- rather than being used as a |
| subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is |
| defined only in device mode (but will be defined if NVCC is using clang as a |
| preprocessor). So you can use the following incantations to detect clang CUDA |
| compilation, in host and device modes: |
| |
| .. code-block:: c++ |
| |
| #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__) |
| // clang compiling CUDA code, host mode. |
| #endif |
| |
| #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__) |
| // clang compiling CUDA code, device mode. |
| #endif |
| |
| Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can |
| detect NVCC specifically by looking for ``__NVCC__``. |
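
For example, you might combine these macros to distinguish nvcc from clang
during CUDA compilation:

.. code-block:: c++

  #if defined(__CUDACC__)
    // CUDA compilation, by either clang or nvcc.
    #if defined(__NVCC__)
      // nvcc is compiling this code.
    #elif defined(__clang__) && defined(__CUDA__)
      // clang is compiling this code (in either host or device mode).
    #endif
  #endif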
| |
| Dialect Differences Between clang and nvcc |
| ========================================== |
| |
| There is no formal CUDA spec, and clang and nvcc speak slightly different |
| dialects of the language. Below, we describe some of the differences. |
| |
| This section is painful; hopefully you can skip this section and live your life |
| blissfully unaware. |
| |
| Compilation Models |
| ------------------ |
| |
| Most of the differences between clang and nvcc stem from the different |
| compilation models used by clang and nvcc. nvcc uses *split compilation*, |
| which works roughly as follows: |
| |
| * Run a preprocessor over the input ``.cu`` file to split it into two source |
| files: ``H``, containing source code for the host, and ``D``, containing |
| source code for the device. |
| |
| * For each GPU architecture ``arch`` that we're compiling for, do: |
| |
| * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file for |
| ``P_arch``. |
| |
| * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file, |
| ``S_arch``, containing GPU machine code (SASS) for ``arch``. |
| |
| * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a |
| single "fat binary" file, ``F``. |
| |
| * Compile ``H`` using an external host compiler (gcc, clang, or whatever you |
| like). ``F`` is packaged up into a header file which is force-included into |
| ``H``; nvcc generates code that calls into this header to e.g. launch |
| kernels. |
| |
| clang uses *merged parsing*. This is similar to split compilation, except all |
| of the host and device code is present and must be semantically-correct in both |
| compilation steps. |
| |
| * For each GPU architecture ``arch`` that we're compiling for, do: |
| |
| * Compile the input ``.cu`` file for device, using clang. ``__host__`` code |
| is parsed and must be semantically correct, even though we're not |
| generating code for the host at this time. |
| |
| The output of this step is a ``ptx`` file ``P_arch``. |
| |
| * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike |
| nvcc, clang always generates SASS code. |
| |
| * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a |
| single fat binary file, ``F``. |
| |
* Compile the input ``.cu`` file for host, using clang. ``__device__`` code is
  parsed and must be semantically correct, even though we're not generating
  code for the device at this time.
| |
| ``F`` is passed to this compilation, and clang includes it in a special ELF |
| section, where it can be found by tools like ``cuobjdump``. |
| |
| (You may ask at this point, why does clang need to parse the input file |
| multiple times? Why not parse it just once, and then use the AST to generate |
| code for the host and each device architecture? |
| |
| Unfortunately this can't work because we have to define different macros during |
| host compilation and during device compilation for each GPU architecture.) |
| |
| clang's approach allows it to be highly robust to C++ edge cases, as it doesn't |
| need to decide at an early stage which declarations to keep and which to throw |
| away. But it has some consequences you should be aware of. |
| |
| Overloading Based on ``__host__`` and ``__device__`` Attributes |
| --------------------------------------------------------------- |
| |
| Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__`` |
| functions", and "``__host__ __device__`` functions", respectively. Functions |
| with no attributes behave the same as H. |
| |
| nvcc does not allow you to create H and D functions with the same signature: |
| |
| .. code-block:: c++ |
| |
| // nvcc: error - function "foo" has already been defined |
| __host__ void foo() {} |
| __device__ void foo() {} |
| |
| However, nvcc allows you to "overload" H and D functions with different |
| signatures: |
| |
| .. code-block:: c++ |
| |
| // nvcc: no error |
| __host__ void foo(int) {} |
| __device__ void foo() {} |
| |
| In clang, the ``__host__`` and ``__device__`` attributes are part of a |
| function's signature, and so it's legal to have H and D functions with |
| (otherwise) the same signature: |
| |
| .. code-block:: c++ |
| |
| // clang: no error |
| __host__ void foo() {} |
| __device__ void foo() {} |
| |
| HD functions cannot be overloaded by H or D functions with the same signature: |
| |
| .. code-block:: c++ |
| |
| // nvcc: error - function "foo" has already been defined |
| // clang: error - redefinition of 'foo' |
| __host__ __device__ void foo() {} |
| __device__ void foo() {} |
| |
| // nvcc: no error |
| // clang: no error |
| __host__ __device__ void bar(int) {} |
| __device__ void bar() {} |
| |
| When resolving an overloaded function, clang considers the host/device |
| attributes of the caller and callee. These are used as a tiebreaker during |
| overload resolution. See `IdentifyCUDAPreference |
| <https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules, |
| but at a high level they are: |
| |
| * D functions prefer to call other Ds. HDs are given lower priority. |
| |
| * Similarly, H functions prefer to call other Hs, or ``__global__`` functions |
| (with equal priority). HDs are given lower priority. |
| |
| * HD functions prefer to call other HDs. |
| |
| When compiling for device, HDs will call Ds with lower priority than HD, and |
| will call Hs with still lower priority. If it's forced to call an H, the |
| program is malformed if we emit code for this HD function. We call this the |
| "wrong-side rule", see example below. |
| |
| The rules are symmetrical when compiling for host. |
| |
| Some examples: |
| |
| .. code-block:: c++ |
| |
| __host__ void foo(); |
| __device__ void foo(); |
| |
| __host__ void bar(); |
| __host__ __device__ void bar(); |
| |
| __host__ void test_host() { |
| foo(); // calls H overload |
| bar(); // calls H overload |
| } |
| |
| __device__ void test_device() { |
| foo(); // calls D overload |
| bar(); // calls HD overload |
| } |
| |
| __host__ __device__ void test_hd() { |
| foo(); // calls H overload when compiling for host, otherwise D overload |
| bar(); // always calls HD overload |
| } |
| |
| Wrong-side rule example: |
| |
| .. code-block:: c++ |
| |
| __host__ void host_only(); |
| |
| // We don't codegen inline functions unless they're referenced by a |
| // non-inline function. inline_hd1() is called only from the host side, so |
| // does not generate an error. inline_hd2() is called from the device side, |
| // so it generates an error. |
| inline __host__ __device__ void inline_hd1() { host_only(); } // no error |
| inline __host__ __device__ void inline_hd2() { host_only(); } // error |
| |
| __host__ void host_fn() { inline_hd1(); } |
| __device__ void device_fn() { inline_hd2(); } |
| |
| // This function is not inline, so it's always codegen'ed on both the host |
| // and the device. Therefore, it generates an error. |
| __host__ __device__ void not_inline_hd() { host_only(); } |
| |
| For the purposes of the wrong-side rule, templated functions also behave like |
| ``inline`` functions: They aren't codegen'ed unless they're instantiated |
| (usually as part of the process of invoking them). |
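
For example, a sketch following the pattern above (the names are hypothetical):

.. code-block:: c++

  __host__ void host_only();

  // Like an inline function, a template is codegen'ed only when instantiated.
  // Instantiating templated_hd from host code is fine; instantiating it from
  // device code would violate the wrong-side rule, because the emitted device
  // code would have to call the host-only function.
  template <typename T>
  __host__ __device__ void templated_hd() { host_only(); }

  __host__ void host_caller() { templated_hd<int>(); }        // no error
  // __device__ void device_caller() { templated_hd<int>(); } // error if enabled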
| |
| clang's behavior with respect to the wrong-side rule matches nvcc's, except |
| nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call |
| ``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s |
| call to ``host_only`` entirely, or it may try to generate code for |
| ``host_only`` on the device. What you get seems to depend on whether or not |
| the compiler chooses to inline ``host_only``. |
| |
| Member functions, including constructors, may be overloaded using H and D |
| attributes. However, destructors cannot be overloaded. |
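
For example (a sketch with a hypothetical class):

.. code-block:: c++

  struct Buffer {
    // Constructors, like other member functions, may be overloaded on the
    // __host__ and __device__ attributes...
    __host__ Buffer() { /* host-side initialization */ }
    __device__ Buffer() { /* device-side initialization */ }

    // ...but destructors may not be overloaded; the class gets a single
    // destructor, here marked __host__ __device__ so both sides can use it.
    __host__ __device__ ~Buffer() {}
  };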
| |
| Clang Warnings for Host and Device Function Declarations |
| -------------------------------------------------------- |
| |
Clang can emit warnings when it detects that host (H) and device (D) functions
are declared or defined with the same signature. These warnings are not enabled
by default.
| |
| To enable these warnings, use the following compiler flag: |
| |
| .. code-block:: console |
| |
| -Wnvcc-compat |
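
For example, code like the following (legal in clang, rejected by nvcc) should
trigger the warning:

.. code-block:: c++

  // With -Wnvcc-compat, clang warns here: nvcc does not allow __host__ and
  // __device__ functions whose signatures differ only in those attributes.
  __host__ void foo() {}
  __device__ void foo() {}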
| |
| Using a Different Class on Host/Device |
| -------------------------------------- |
| |
| Occasionally you may want to have a class with different host/device versions. |
| |
| If all of the class's members are the same on the host and device, you can just |
| provide overloads for the class's member functions. |
| |
| However, if you want your class to have different members on host/device, you |
| won't be able to provide working H and D overloads in both classes. In this |
| case, clang is likely to be unhappy with you. |
| |
| .. code-block:: c++ |
| |
  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo(). The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif
| |
| We posit that you don't really want to have classes with different members on H |
| and D. For example, if you were to pass one of these as a parameter to a |
| kernel, it would have a different layout on H and D, so would not work |
| properly. |
| |
| To make code like this compatible with clang, we recommend you separate it out |
| into two classes. If you need to write code that works on both host and |
| device, consider writing an overloaded wrapper function that returns different |
| types on host and device. |
| |
| .. code-block:: c++ |
| |
| struct HostS { ... }; |
| struct DeviceS { ... }; |
| |
| __host__ HostS MakeStruct() { return HostS(); } |
| __device__ DeviceS MakeStruct() { return DeviceS(); } |
| |
| // Now host and device code can call MakeStruct(). |
| |
| Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow |
| you to overload based on the H/D attributes. Here's an idiom that works with |
| both clang and nvcc: |
| |
| .. code-block:: c++ |
| |
| struct HostS { ... }; |
| struct DeviceS { ... }; |
| |
| #ifdef __NVCC__ |
| #ifndef __CUDA_ARCH__ |
| __host__ HostS MakeStruct() { return HostS(); } |
| #else |
| __device__ DeviceS MakeStruct() { return DeviceS(); } |
| #endif |
| #else |
| __host__ HostS MakeStruct() { return HostS(); } |
| __device__ DeviceS MakeStruct() { return DeviceS(); } |
| #endif |
| |
| // Now host and device code can call MakeStruct(). |
| |
| Hopefully you don't have to do this sort of thing often. |
| |
| Optimizations |
| ============= |
| |
| Modern CPUs and GPUs are architecturally quite different, so code that's fast |
| on a CPU isn't necessarily fast on a GPU. We've made a number of changes to |
| LLVM to make it generate good GPU code. Among these changes are: |
| |
| * `Straight-line scalar optimizations <https://docs.google.com/document/d/1momWzKFf4D6h8H3YlfgKQ3qeZy5ayvMRh6yR-Xn2hUE>`_ -- These |
| reduce redundancy within straight-line code. |
| |
| * `Aggressive speculative execution |
| <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ |
| -- This is mainly for promoting straight-line scalar optimizations, which are |
| most effective on code along dominator paths. |
| |
| * `Memory space inference |
| <https://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ -- |
| In PTX, we can operate on pointers that are in a particular "address space" |
| (global, shared, constant, or local), or we can operate on pointers in the |
| "generic" address space, which can point to anything. Operations in a |
| non-generic address space are faster, but pointers in CUDA are not explicitly |
| annotated with their address space, so it's up to LLVM to infer it where |
| possible. |
| |
| * `Bypassing 64-bit divides |
| <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ -- |
| This was an existing optimization that we enabled for the PTX backend. |
| |
| 64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs. |
| Many of the 64-bit divides in our benchmarks have a divisor and dividend |
| which fit in 32-bits at runtime. This optimization provides a fast path for |
| this common case. |
| |
* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.
| |
  (Programmers can force unrolling and inlining using clang's `loop unrolling pragmas
  <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)
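
For instance, a minimal sketch of forcing both (the kernel and names are
illustrative):

.. code-block:: c++

  __device__ inline __attribute__((always_inline)) float scale(float x) {
    return 2.0f * x;
  }

  __global__ void sum4(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll  // fully unroll this fixed-trip-count loop
    for (int j = 0; j < 4; ++j)
      acc += scale(in[4 * i + j]);
    out[i] = acc;
  }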
| |
| Publication |
| =========== |
| |
| The team at Google published a paper in CGO 2016 detailing the optimizations |
| they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name: |
| The relevant tools are now just vanilla clang/LLVM. |
| |
| | `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_ |
| | Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt |
| | *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)* |
| | |
| | `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_ |
| | |
| | `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_ |
| |
| Obtaining Help |
| ============== |
| |
| To obtain help on LLVM in general and its CUDA support, see `the LLVM |
| community <https://llvm.org/docs/#mailing-lists>`_. |