| ========================= | 
 | Compiling CUDA with clang | 
 | ========================= | 
 |  | 
 | .. contents:: | 
 |    :local: | 
 |  | 
 | Introduction | 
 | ============ | 
 |  | 
 | This document describes how to compile CUDA code with clang, and gives some | 
 | details about LLVM and clang's CUDA implementations. | 
 |  | 
 | This document assumes a basic familiarity with CUDA. Information about CUDA | 
 | programming can be found in the | 
 | `CUDA programming guide | 
 | <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_. | 
 |  | 
 | Compiling CUDA Code | 
 | =================== | 
 |  | 
 | Prerequisites | 
 | ------------- | 
 |  | 
CUDA is supported since LLVM 3.9. Clang currently supports CUDA 7.0 through
10.1. If clang detects a newer CUDA version, it will issue a warning and will
attempt to use the detected CUDA SDK as if it were CUDA 10.1.
 |  | 
 | Before you build CUDA code, you'll need to have installed the CUDA SDK.  See | 
 | `NVIDIA's CUDA installation guide | 
 | <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for | 
details.  Note that clang `may not support
<https://bugs.llvm.org/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed by
some Linux package managers. Clang does attempt to deal with specific details of
CUDA installation on a handful of common Linux distributions, but in general the
most reliable way to make it work is to install CUDA in a single directory from
NVIDIA's `.run` package and specify its location via the `--cuda-path=...`
argument.
 |  | 
CUDA compilation is supported on Linux. Compilation on MacOS and Windows may or
may not work and currently has no maintainers.
 |  | 
 | Invoking clang | 
 | -------------- | 
 |  | 
 | Invoking clang for CUDA compilation works similarly to compiling regular C++. | 
 | You just need to be aware of a few additional flags. | 
 |  | 
 | You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_ | 
 | program as a toy example.  Save it as ``axpy.cu``.  (Clang detects that you're | 
 | compiling CUDA code by noticing that your filename ends with ``.cu``. | 
 | Alternatively, you can pass ``-x cuda``.) | 
 |  | 
 | To build and run, run the following commands, filling in the parts in angle | 
 | brackets as described below: | 
 |  | 
 | .. code-block:: console | 
 |  | 
 |   $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \ | 
 |       -L<CUDA install path>/<lib64 or lib>             \ | 
 |       -lcudart_static -ldl -lrt -pthread | 
 |   $ ./axpy | 
 |   y[0] = 2 | 
 |   y[1] = 4 | 
 |   y[2] = 6 | 
 |   y[3] = 8 | 
 |  | 
 | On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get | 
 | "CUDA driver version is insufficient for CUDA runtime version" errors when you | 
 | run your program. | 
 |  | 
* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
 |   Typically, ``/usr/local/cuda``. | 
 |  | 
 |   Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise, | 
 |   pass e.g. ``-L/usr/local/cuda/lib``.  (In CUDA, the device code and host code | 
 |   always have the same pointer widths, so if you're compiling 64-bit code for | 
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0, the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.
 |  | 
 | * ``<GPU arch>`` -- the `compute capability | 
 |   <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you | 
 |   want to run your program on a GPU with compute capability of 3.5, specify | 
 |   ``--cuda-gpu-arch=sm_35``. | 
 |  | 
 |   Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``; | 
 |   only ``sm_XX`` is currently supported.  However, clang always includes PTX in | 
 |   its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be | 
 |   forwards-compatible with e.g. ``sm_35`` GPUs. | 
 |  | 
 |   You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs. | 
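
  For example, a hypothetical invocation targeting two architectures might
  look like:

  .. code-block:: console

    $ clang++ axpy.cu -o axpy \
        --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_60 \
        -L<CUDA install path>/lib64 -lcudart_static -ldl -lrt -pthread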
 |  | 
 | The `-L` and `-l` flags only need to be passed when linking.  When compiling, | 
 | you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install | 
 | the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``. | 
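
For example, if your CUDA SDK lives in a hypothetical ``/opt/cuda-10.1``, a
compile-only invocation might look like:

.. code-block:: console

  $ clang++ -c axpy.cu --cuda-gpu-arch=sm_35 --cuda-path=/opt/cuda-10.1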
 |  | 
 | Flags that control numerical code | 
 | --------------------------------- | 
 |  | 
 | If you're using GPUs, you probably care about making numerical code run fast. | 
 | GPU hardware allows for more control over numerical operations than most CPUs, | 
 | but this results in more compiler options for you to juggle. | 
 |  | 
 | Flags you may wish to tweak include: | 
 |  | 
 | * ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when | 
 |   compiling CUDA) Controls whether the compiler emits fused multiply-add | 
 |   operations. | 
 |  | 
 |   * ``off``: never emit fma operations, and prevent ptxas from fusing multiply | 
 |     and add instructions. | 
 |   * ``on``: fuse multiplies and adds within a single statement, but never | 
 |     across statements (C11 semantics).  Prevent ptxas from fusing other | 
 |     multiplies and adds. | 
 |   * ``fast``: fuse multiplies and adds wherever profitable, even across | 
 |     statements.  Doesn't prevent ptxas from fusing additional multiplies and | 
 |     adds. | 
 |  | 
  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code (see the sketch after this list).
 |  | 
 | * ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled, | 
 |   floating point operations may flush `denormal | 
 |   <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0. | 
 |   Operations on denormal numbers are often much slower than the same operations | 
 |   on normal numbers. | 
 |  | 
 | * ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the | 
 |   compiler may emit calls to faster, approximate versions of transcendental | 
 |   functions, instead of using the slower, fully IEEE-compliant versions.  For | 
 |   example, this flag allows clang to emit the ptx ``sin.approx.f32`` | 
 |   instruction. | 
 |  | 
 |   This is implied by ``-ffast-math``. | 
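
As an illustration of why ``-ffp-contract`` matters, here is a minimal sketch
(``axpy_elem`` is a hypothetical helper, not part of any API):

.. code-block:: c++

  // Under -ffp-contract=on or fast, clang may compile a * x + y into a
  // single fused multiply-add.  The product a * x is then not rounded
  // before the addition, so the result can differ in the last bit from
  // the unfused computation.
  __device__ float axpy_elem(float a, float x, float y) {
    return a * x + y;
  }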
 |  | 
 | Standard library support | 
 | ======================== | 
 |  | 
 | In clang and nvcc, most of the C++ standard library is not supported on the | 
 | device side. | 
 |  | 
 | ``<math.h>`` and ``<cmath>`` | 
 | ---------------------------- | 
 |  | 
 | In clang, ``math.h`` and ``cmath`` are available and `pass | 
 | <https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/math_h.cu>`_ | 
 | `tests | 
 | <https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/cmath.cu>`_ | 
 | adapted from libc++'s test suite. | 
 |  | 
In nvcc, ``math.h`` and ``cmath`` are mostly available.  Versions of ``::foof``
 | in namespace std (e.g. ``std::sinf``) are not available, and where the standard | 
 | calls for overloads that take integral arguments, these are usually not | 
 | available. | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   #include <math.h> | 
  #include <cmath>
 |  | 
 |   // clang is OK with everything in this function. | 
 |   __device__ void test() { | 
 |     std::sin(0.); // nvcc - ok | 
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
 |     sin(0);       // nvcc - same as above. | 
 |  | 
 |     sinf(0.);       // nvcc - ok | 
 |     std::sinf(0.);  // nvcc - no such function | 
 |   } | 
 |  | 
 | ``<std::complex>`` | 
 | ------------------ | 
 |  | 
 | nvcc does not officially support ``std::complex``.  It's an error to use | 
 | ``std::complex`` in ``__device__`` code, but it often works in ``__host__ | 
 | __device__`` code due to nvcc's interpretation of the "wrong-side rule" (see | 
 | below).  However, we have heard from implementers that it's possible to get | 
 | into situations where nvcc will omit a call to an ``std::complex`` function, | 
 | especially when compiling without optimizations. | 
 |  | 
As of 2016-11-16, clang supports ``std::complex`` without these caveats.  It is
tested with libstdc++ 4.8.5 and newer; with libc++, only versions newer than
2016-11-16 are known to work.
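
For example, a minimal sketch of device-side ``std::complex`` use with clang
(``cmul`` is a hypothetical helper name):

.. code-block:: c++

  #include <complex>

  // With clang, std::complex arithmetic works directly in device code.
  __device__ std::complex<float> cmul(std::complex<float> a,
                                      std::complex<float> b) {
    return a * b;
  }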
 |  | 
 | ``<algorithm>`` | 
 | --------------- | 
 |  | 
 | In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and | 
 | ``std::max``) become constexpr.  You can therefore use these in device code, | 
 | when compiling with clang. | 
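
For example, a minimal sketch (compile with clang and ``-std=c++14`` or newer;
``clamp_to_byte`` is a hypothetical helper name):

.. code-block:: c++

  #include <algorithm>

  // std::min and std::max are constexpr in C++14, so clang allows calling
  // them from device code.
  __device__ int clamp_to_byte(int x) {
    return std::min(std::max(x, 0), 255);
  }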
 |  | 
 | Detecting clang vs NVCC from code | 
 | ================================= | 
 |  | 
 | Although clang's CUDA implementation is largely compatible with NVCC's, you may | 
 | still want to detect when you're compiling CUDA code specifically with clang. | 
 |  | 
 | This is tricky, because NVCC may invoke clang as part of its own compilation | 
 | process!  For example, NVCC uses the host compiler's preprocessor when | 
 | compiling for device code, and that host compiler may in fact be clang. | 
 |  | 
 | When clang is actually compiling CUDA code -- rather than being used as a | 
 | subtool of NVCC's -- it defines the ``__CUDA__`` macro.  ``__CUDA_ARCH__`` is | 
 | defined only in device mode (but will be defined if NVCC is using clang as a | 
 | preprocessor).  So you can use the following incantations to detect clang CUDA | 
 | compilation, in host and device modes: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__) | 
 |   // clang compiling CUDA code, host mode. | 
 |   #endif | 
 |  | 
 |   #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__) | 
 |   // clang compiling CUDA code, device mode. | 
 |   #endif | 
 |  | 
 | Both clang and nvcc define ``__CUDACC__`` during CUDA compilation.  You can | 
 | detect NVCC specifically by looking for ``__NVCC__``. | 
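
For example, a sketch combining these macros:

.. code-block:: c++

  #if defined(__NVCC__)
  // Compilation driven by nvcc (which may still run clang as a subtool).
  #elif defined(__clang__) && defined(__CUDA__)
  // CUDA compilation driven by clang itself.
  #endif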
 |  | 
 | Dialect Differences Between clang and nvcc | 
 | ========================================== | 
 |  | 
 | There is no formal CUDA spec, and clang and nvcc speak slightly different | 
 | dialects of the language.  Below, we describe some of the differences. | 
 |  | 
 | This section is painful; hopefully you can skip this section and live your life | 
 | blissfully unaware. | 
 |  | 
 | Compilation Models | 
 | ------------------ | 
 |  | 
 | Most of the differences between clang and nvcc stem from the different | 
 | compilation models used by clang and nvcc.  nvcc uses *split compilation*, | 
 | which works roughly as follows: | 
 |  | 
 |  * Run a preprocessor over the input ``.cu`` file to split it into two source | 
 |    files: ``H``, containing source code for the host, and ``D``, containing | 
 |    source code for the device. | 
 |  | 
 |  * For each GPU architecture ``arch`` that we're compiling for, do: | 
 |  | 
   * Compile ``D`` using nvcc proper.  The result of this is a ``ptx`` file,
     ``P_arch``.
 |  | 
 |    * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file, | 
 |      ``S_arch``, containing GPU machine code (SASS) for ``arch``. | 
 |  | 
 |  * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a | 
 |    single "fat binary" file, ``F``. | 
 |  | 
 |  * Compile ``H`` using an external host compiler (gcc, clang, or whatever you | 
 |    like).  ``F`` is packaged up into a header file which is force-included into | 
 |    ``H``; nvcc generates code that calls into this header to e.g. launch | 
 |    kernels. | 
 |  | 
clang uses *merged parsing*.  This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.
 |  | 
 |   * For each GPU architecture ``arch`` that we're compiling for, do: | 
 |  | 
 |     * Compile the input ``.cu`` file for device, using clang.  ``__host__`` code | 
 |       is parsed and must be semantically correct, even though we're not | 
 |       generating code for the host at this time. | 
 |  | 
 |       The output of this step is a ``ptx`` file ``P_arch``. | 
 |  | 
 |     * Invoke ``ptxas`` to generate a SASS file, ``S_arch``.  Note that, unlike | 
 |       nvcc, clang always generates SASS code. | 
 |  | 
 |   * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a | 
 |     single fat binary file, ``F``. | 
 |  | 
  * Compile the input ``.cu`` file for host, using clang.  ``__device__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the device at this time.
 |  | 
 |     ``F`` is passed to this compilation, and clang includes it in a special ELF | 
 |     section, where it can be found by tools like ``cuobjdump``. | 
 |  | 
 | (You may ask at this point, why does clang need to parse the input file | 
 | multiple times?  Why not parse it just once, and then use the AST to generate | 
 | code for the host and each device architecture? | 
 |  | 
 | Unfortunately this can't work because we have to define different macros during | 
 | host compilation and during device compilation for each GPU architecture.) | 
 |  | 
 | clang's approach allows it to be highly robust to C++ edge cases, as it doesn't | 
 | need to decide at an early stage which declarations to keep and which to throw | 
 | away.  But it has some consequences you should be aware of. | 
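
You can observe the two halves of this model yourself.  For example, assuming
the ``axpy.cu`` file from earlier, the following invocations ask clang to
perform only the device-side or only the host-side compilation:

.. code-block:: console

  $ clang++ --cuda-device-only --cuda-gpu-arch=sm_35 -S axpy.cu -o axpy.ptx
  $ clang++ --cuda-host-only -S axpy.cu -o axpy.s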
 |  | 
 | Overloading Based on ``__host__`` and ``__device__`` Attributes | 
 | --------------------------------------------------------------- | 
 |  | 
 | Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__`` | 
 | functions", and "``__host__ __device__`` functions", respectively.  Functions | 
 | with no attributes behave the same as H. | 
 |  | 
 | nvcc does not allow you to create H and D functions with the same signature: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   // nvcc: error - function "foo" has already been defined | 
 |   __host__ void foo() {} | 
 |   __device__ void foo() {} | 
 |  | 
 | However, nvcc allows you to "overload" H and D functions with different | 
 | signatures: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   // nvcc: no error | 
 |   __host__ void foo(int) {} | 
 |   __device__ void foo() {} | 
 |  | 
 | In clang, the ``__host__`` and ``__device__`` attributes are part of a | 
 | function's signature, and so it's legal to have H and D functions with | 
 | (otherwise) the same signature: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   // clang: no error | 
 |   __host__ void foo() {} | 
 |   __device__ void foo() {} | 
 |  | 
 | HD functions cannot be overloaded by H or D functions with the same signature: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   // nvcc: error - function "foo" has already been defined | 
 |   // clang: error - redefinition of 'foo' | 
 |   __host__ __device__ void foo() {} | 
 |   __device__ void foo() {} | 
 |  | 
 |   // nvcc: no error | 
 |   // clang: no error | 
 |   __host__ __device__ void bar(int) {} | 
 |   __device__ void bar() {} | 
 |  | 
 | When resolving an overloaded function, clang considers the host/device | 
 | attributes of the caller and callee.  These are used as a tiebreaker during | 
 | overload resolution.  See `IdentifyCUDAPreference | 
 | <https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules, | 
 | but at a high level they are: | 
 |  | 
 |  * D functions prefer to call other Ds.  HDs are given lower priority. | 
 |  | 
 |  * Similarly, H functions prefer to call other Hs, or ``__global__`` functions | 
 |    (with equal priority).  HDs are given lower priority. | 
 |  | 
 |  * HD functions prefer to call other HDs. | 
 |  | 
 |    When compiling for device, HDs will call Ds with lower priority than HD, and | 
 |    will call Hs with still lower priority.  If it's forced to call an H, the | 
 |    program is malformed if we emit code for this HD function.  We call this the | 
 |    "wrong-side rule", see example below. | 
 |  | 
 |    The rules are symmetrical when compiling for host. | 
 |  | 
 | Some examples: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |    __host__ void foo(); | 
 |    __device__ void foo(); | 
 |  | 
 |    __host__ void bar(); | 
 |    __host__ __device__ void bar(); | 
 |  | 
 |    __host__ void test_host() { | 
 |      foo();  // calls H overload | 
 |      bar();  // calls H overload | 
 |    } | 
 |  | 
 |    __device__ void test_device() { | 
 |      foo();  // calls D overload | 
 |      bar();  // calls HD overload | 
 |    } | 
 |  | 
 |    __host__ __device__ void test_hd() { | 
 |      foo();  // calls H overload when compiling for host, otherwise D overload | 
 |      bar();  // always calls HD overload | 
 |    } | 
 |  | 
 | Wrong-side rule example: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   __host__ void host_only(); | 
 |  | 
 |   // We don't codegen inline functions unless they're referenced by a | 
 |   // non-inline function.  inline_hd1() is called only from the host side, so | 
 |   // does not generate an error.  inline_hd2() is called from the device side, | 
 |   // so it generates an error. | 
 |   inline __host__ __device__ void inline_hd1() { host_only(); }  // no error | 
 |   inline __host__ __device__ void inline_hd2() { host_only(); }  // error | 
 |  | 
 |   __host__ void host_fn() { inline_hd1(); } | 
 |   __device__ void device_fn() { inline_hd2(); } | 
 |  | 
 |   // This function is not inline, so it's always codegen'ed on both the host | 
 |   // and the device.  Therefore, it generates an error. | 
 |   __host__ __device__ void not_inline_hd() { host_only(); } | 
 |  | 
 | For the purposes of the wrong-side rule, templated functions also behave like | 
 | ``inline`` functions: They aren't codegen'ed unless they're instantiated | 
 | (usually as part of the process of invoking them). | 
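
For example, a sketch (``hd_template`` and ``host_caller`` are hypothetical
names):

.. code-block:: c++

  __host__ void host_only();

  // Like an inline function, this template is codegen'ed only when
  // instantiated, so it violates the wrong-side rule only if device code
  // instantiates it.
  template <typename T>
  __host__ __device__ void hd_template() { host_only(); }

  __host__ void host_caller() { hd_template<int>(); }  // ok: host side only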
 |  | 
 | clang's behavior with respect to the wrong-side rule matches nvcc's, except | 
 | nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call | 
 | ``not_inline_hd``.  In its generated code, nvcc may omit ``not_inline_hd``'s | 
 | call to ``host_only`` entirely, or it may try to generate code for | 
 | ``host_only`` on the device.  What you get seems to depend on whether or not | 
 | the compiler chooses to inline ``host_only``. | 
 |  | 
 | Member functions, including constructors, may be overloaded using H and D | 
 | attributes.  However, destructors cannot be overloaded. | 
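
For example, a sketch that clang accepts (nvcc rejects the same-signature
constructor pair; ``Widget`` is a hypothetical type):

.. code-block:: c++

  struct Widget {
    __host__ Widget() { /* host-side initialization */ }
    __device__ Widget() { /* device-side initialization */ }
    ~Widget() {}  // only one destructor; it cannot be overloaded on H/D
  };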
 |  | 
 | Using a Different Class on Host/Device | 
 | -------------------------------------- | 
 |  | 
 | Occasionally you may want to have a class with different host/device versions. | 
 |  | 
 | If all of the class's members are the same on the host and device, you can just | 
 | provide overloads for the class's member functions. | 
 |  | 
 | However, if you want your class to have different members on host/device, you | 
 | won't be able to provide working H and D overloads in both classes. In this | 
 | case, clang is likely to be unhappy with you. | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   #ifdef __CUDA_ARCH__ | 
 |   struct S { | 
 |     __device__ void foo() { /* use device_only */ } | 
 |     int device_only; | 
 |   }; | 
 |   #else | 
 |   struct S { | 
 |     __host__ void foo() { /* use host_only */ } | 
 |     double host_only; | 
 |   }; | 
 |  | 
 |   __device__ void test() { | 
 |     S s; | 
 |     // clang generates an error here, because during host compilation, we | 
 |     // have ifdef'ed away the __device__ overload of S::foo().  The __device__ | 
 |     // overload must be present *even during host compilation*. | 
    s.foo();
 |   } | 
 |   #endif | 
 |  | 
 | We posit that you don't really want to have classes with different members on H | 
 | and D.  For example, if you were to pass one of these as a parameter to a | 
 | kernel, it would have a different layout on H and D, so would not work | 
 | properly. | 
 |  | 
 | To make code like this compatible with clang, we recommend you separate it out | 
 | into two classes.  If you need to write code that works on both host and | 
 | device, consider writing an overloaded wrapper function that returns different | 
 | types on host and device. | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   struct HostS { ... }; | 
 |   struct DeviceS { ... }; | 
 |  | 
 |   __host__ HostS MakeStruct() { return HostS(); } | 
 |   __device__ DeviceS MakeStruct() { return DeviceS(); } | 
 |  | 
 |   // Now host and device code can call MakeStruct(). | 
 |  | 
 | Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow | 
 | you to overload based on the H/D attributes.  Here's an idiom that works with | 
 | both clang and nvcc: | 
 |  | 
 | .. code-block:: c++ | 
 |  | 
 |   struct HostS { ... }; | 
 |   struct DeviceS { ... }; | 
 |  | 
 |   #ifdef __NVCC__ | 
 |     #ifndef __CUDA_ARCH__ | 
 |       __host__ HostS MakeStruct() { return HostS(); } | 
 |     #else | 
 |       __device__ DeviceS MakeStruct() { return DeviceS(); } | 
 |     #endif | 
 |   #else | 
 |     __host__ HostS MakeStruct() { return HostS(); } | 
 |     __device__ DeviceS MakeStruct() { return DeviceS(); } | 
 |   #endif | 
 |  | 
 |   // Now host and device code can call MakeStruct(). | 
 |  | 
 | Hopefully you don't have to do this sort of thing often. | 
 |  | 
 | Optimizations | 
 | ============= | 
 |  | 
 | Modern CPUs and GPUs are architecturally quite different, so code that's fast | 
 | on a CPU isn't necessarily fast on a GPU.  We've made a number of changes to | 
 | LLVM to make it generate good GPU code.  Among these changes are: | 
 |  | 
 | * `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These | 
 |   reduce redundancy within straight-line code. | 
 |  | 
 | * `Aggressive speculative execution | 
 |   <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ | 
 |   -- This is mainly for promoting straight-line scalar optimizations, which are | 
 |   most effective on code along dominator paths. | 
 |  | 
 | * `Memory space inference | 
 |   <https://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ -- | 
 |   In PTX, we can operate on pointers that are in a particular "address space" | 
 |   (global, shared, constant, or local), or we can operate on pointers in the | 
 |   "generic" address space, which can point to anything.  Operations in a | 
 |   non-generic address space are faster, but pointers in CUDA are not explicitly | 
 |   annotated with their address space, so it's up to LLVM to infer it where | 
 |   possible. | 
 |  | 
 | * `Bypassing 64-bit divides | 
 |   <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ -- | 
 |   This was an existing optimization that we enabled for the PTX backend. | 
 |  | 
 |   64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs. | 
 |   Many of the 64-bit divides in our benchmarks have a divisor and dividend | 
 |   which fit in 32-bits at runtime. This optimization provides a fast path for | 
 |   this common case. | 
 |  | 
 | * Aggressive loop unrolling and function inlining -- Loop unrolling and | 
 |   function inlining need to be more aggressive for GPUs than for CPUs because | 
  control flow transfer on a GPU is more expensive. More aggressive unrolling and
 |   inlining also promote other optimizations, such as constant propagation and | 
 |   SROA, which sometimes speed up code by over 10x. | 
 |  | 
  (Programmers can force unrolling and inlining using clang's `loop unrolling pragmas
 |   <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_ | 
 |   and ``__attribute__((always_inline))``.) | 
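
  For example, a sketch using both mechanisms (the kernel is illustrative):

  .. code-block:: c++

    __device__ __attribute__((always_inline)) float square(float x) {
      return x * x;
    }

    __global__ void kernel(const float *in, float *out) {
      #pragma unroll
      for (int i = 0; i < 4; ++i)
        out[i] = square(in[i]);
    }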
 |  | 
 | Publication | 
 | =========== | 
 |  | 
 | The team at Google published a paper in CGO 2016 detailing the optimizations | 
 | they'd made to clang/LLVM.  Note that "gpucc" is no longer a meaningful name: | 
 | The relevant tools are now just vanilla clang/LLVM. | 
 |  | 
 | | `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_ | 
 | | Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt | 
 | | *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)* | 
 | | | 
 | | `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_ | 
 | | | 
 | | `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_ | 
 |  | 
 | Obtaining Help | 
 | ============== | 
 |  | 
 | To obtain help on LLVM in general and its CUDA support, see `the LLVM | 
 | community <https://llvm.org/docs/#mailing-lists>`_. |