=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported since LLVM 3.9. Clang currently supports CUDA 7.0 through
11.5. If clang detects a newer CUDA version, it will issue a warning and will
attempt to use the detected CUDA SDK as if it were CUDA 11.5.

Before you build CUDA code, you'll need to have installed the CUDA SDK. See
`NVIDIA's CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details. Note that clang `may not support
<https://bugs.llvm.org/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed by
some Linux package managers. Clang does attempt to deal with the specific
details of CUDA installation on a handful of common Linux distributions, but in
general the most reliable way to make it work is to install CUDA in a single
directory from NVIDIA's `.run` package and specify its location via the
`--cuda-path=...` argument.

CUDA compilation is supported on Linux. Compilation on MacOS and Windows may or
may not work and currently has no maintainers.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)
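
If you'd rather not fetch the gist, the sketch below is a comparable toy
program (an illustrative reconstruction, not a verbatim copy of the linked
example): it scales ``x`` by ``a`` on the device, so its output matches the
sample output shown below.

.. code-block:: c++

  #include <cstdio>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main() {
    const int kDataLen = 4;
    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Allocate device buffers and copy the input array over.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch one block of kDataLen threads; each thread scales one element.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy the result back (cudaMemcpy synchronizes with the kernel) and print.
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < kDataLen; ++i)
      std::printf("y[%d] = %g\n", i, host_y[i]);

    cudaFree(device_x);
    cudaFree(device_y);
    return 0;
  }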

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0 the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so a binary compiled with ``--cuda-gpu-arch=sm_30`` is
  forward-compatible with, e.g., ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.
  If you're not sure which compute capability your GPU has, one way to query it
  programmatically is sketched just after this list.
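
The following is a minimal sketch (an illustrative helper, not part of the toy
program above) that uses the CUDA runtime's ``cudaGetDeviceCount`` and
``cudaGetDeviceProperties`` APIs to print the compute capability of each GPU in
the system; compile it the same way as the toy program.

.. code-block:: c++

  #include <cstdio>

  int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    for (int i = 0; i < device_count; ++i) {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, i);
      // Compute capability X.Y corresponds to --cuda-gpu-arch=sm_XY.
      std::printf("Device %d: %s, compute capability %d.%d (sm_%d%d)\n",
                  i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
  }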

The `-L` and `-l` flags only need to be passed when linking. When compiling,
you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install
the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.
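
For example, a compile-only step that targets two GPU architectures and points
clang at a CUDA SDK installed outside the default locations might look like
this (the path and architectures here are only placeholders):

.. code-block:: console

  $ clang++ -c axpy.cu -o axpy.o \
      --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_60 \
      --cuda-path=/opt/cuda-11.5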

Flags that control numerical code
-----------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA). Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code. (A small example appears after this
  list.)

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
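
As an illustration of the contraction flag, here is a minimal (hypothetical)
device function. Under ``-ffp-contract=fast`` or ``on``, the multiply and add
below may be fused into a single fma instruction whose intermediate product is
not rounded; under ``-ffp-contract=off`` they remain two separately rounded
operations.

.. code-block:: c++

  __device__ float scaled_sum(float a, float x, float y) {
    // A candidate for contraction into an fma, depending on -ffp-contract.
    return a * x + y;
  }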

Standard library support
=========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
------------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/main/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc, ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) override is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);      // nvcc - ok
    std::sinf(0.); // nvcc - no such function
  }

``std::complex``
------------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.
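
For instance, with clang and a standard library in the supported range above,
device code along the following lines is expected to compile (an illustrative
sketch, not an excerpt from the test suite):

.. code-block:: c++

  #include <complex>

  __device__ std::complex<float> rotate(std::complex<float> z,
                                        std::complex<float> phase) {
    // Complex multiplication in __device__ code: an error under nvcc,
    // accepted by clang.
    return z * phase;
  }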

``<algorithm>``
-----------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr. You can therefore use these in device code,
when compiling with clang.
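
A small sketch, assuming you compile with ``-std=c++14`` (or later) in addition
to the flags described earlier:

.. code-block:: c++

  #include <algorithm>

  __device__ int clamp_index(int i, int max_index) {
    // std::min and std::max are constexpr in C++14, so clang accepts them in
    // device code.
    return std::min(std::max(i, 0), max_index);
  }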

Detecting clang vs NVCC from code
===================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.
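
For completeness, a sketch of the corresponding checks:

.. code-block:: c++

  #if defined(__CUDACC__)
  // Some CUDA compiler (clang or nvcc) is compiling this code as CUDA.
  #endif

  #if defined(__NVCC__)
  // NVCC is performing this CUDA compilation.
  #endif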

Dialect Differences Between clang and nvcc
============================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
--------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses *split compilation*,
which works roughly as follows:

* Run a preprocessor over the input ``.cu`` file to split it into two source
  files: ``H``, containing source code for the host, and ``D``, containing
  source code for the device.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
    ``P_arch``.

  * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
    ``S_arch``, containing GPU machine code (SASS) for ``arch``.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single "fat binary" file, ``F``.

* Compile ``H`` using an external host compiler (gcc, clang, or whatever you
  like). ``F`` is packaged up into a header file which is force-included into
  ``H``; nvcc generates code that calls into this header to e.g. launch
  kernels.

clang uses *merged parsing*. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the host at this time.

    The output of this step is a ``ptx`` file ``P_arch``.

  * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
    nvcc, clang always generates SASS code.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single fat binary file, ``F``.

* Compile the input ``.cu`` file for host, using clang. ``__device__`` code is
  parsed and must be semantically correct, even though we're not generating
  code for the device at this time.

  ``F`` is passed to this compilation, and clang includes it in a special ELF
  section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.
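
If you'd like to observe these sub-compilations for yourself, the following
invocations may help (using the toy ``axpy.cu`` from earlier): ``-###`` prints
the individual host and device commands clang would run, and
``--cuda-device-only -S`` stops after emitting PTX for the device side.

.. code-block:: console

  $ clang++ -### -c axpy.cu --cuda-gpu-arch=sm_35
  $ clang++ --cuda-device-only -S axpy.cu --cuda-gpu-arch=sm_35 -o axpy.ptx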

Overloading Based on ``__host__`` and ``__device__`` Attributes
----------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

* D functions prefer to call other Ds. HDs are given lower priority.

* Similarly, H functions prefer to call other Hs, or ``__global__`` functions
  (with equal priority). HDs are given lower priority.

* HD functions prefer to call other HDs.

  When compiling for device, an HD function will call a D with lower priority
  than another HD, and an H with still lower priority. If it's forced to call
  an H, the program is malformed if we emit code for that HD function. We call
  this the "wrong-side rule"; see the example below.

  The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo();  // calls H overload
    bar();  // calls H overload
  }

  __device__ void test_device() {
    foo();  // calls D overload
    bar();  // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo();  // calls H overload when compiling for host, otherwise D overload
    bar();  // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // does not generate an error. inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device. Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: They aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device. What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes. However, destructors cannot be overloaded.

Using a Different Class on Host/Device
----------------------------------------

Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo(). The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif

We posit that you don't really want to have classes with different members on H
and D. For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes. If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes. Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
    #ifndef __CUDA_ARCH__
      __host__ HostS MakeStruct() { return HostS(); }
    #else
      __device__ DeviceS MakeStruct() { return DeviceS(); }
    #endif
  #else
    __host__ HostS MakeStruct() { return HostS(); }
    __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
LLVM to make it generate good GPU code. Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <https://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything. Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32-bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)
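
For example, an illustrative sketch (not taken from any particular benchmark)
of both annotations in device code:

.. code-block:: c++

  // Ask the compiler to fully unroll this fixed-trip-count loop.
  __device__ float dot4(const float* a, const float* b) {
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 4; ++i)
      sum += a[i] * b[i];
    return sum;
  }

  // Ask the compiler to always inline this helper into its callers.
  __attribute__((always_inline)) inline __device__ float squared(float x) {
    return x * x;
  }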

Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
The relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <https://llvm.org/docs/#mailing-lists>`_.