annotate llvm/docs/CompileCudaWithLLVM.rst @ 173:0572611fdcc8 llvm10 llvm12

reorgnization done
author Shinji KONO <kono@ie.u-ryukyu.ac.jp>
date Mon, 25 May 2020 11:55:54 +0900
parents 1d019706d866
children c4bab56944e8
=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported since LLVM 3.9. Clang currently supports CUDA 7.0 through
10.1. If clang detects a newer CUDA version, it will issue a warning and will
attempt to use the detected CUDA SDK as if it were CUDA 10.1.

Before you build CUDA code, you'll need to have installed the CUDA SDK. See
`NVIDIA's CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details. Note that clang `may not support
<https://bugs.llvm.org/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed by
some Linux package managers. Clang does attempt to deal with specific details of
CUDA installation on a handful of common Linux distributions, but in general the
most reliable way to make it work is to install CUDA in a single directory from
NVIDIA's ``.run`` package and specify its location via the ``--cuda-path=...``
argument.

CUDA compilation is supported on Linux. Compilation on MacOS and Windows may or
may not work and currently has no maintainers.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0 the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
  forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.

The ``-L`` and ``-l`` flags only need to be passed when linking. When compiling,
you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install
the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can change the results of numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
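
The effect of contraction on rounding, described under ``-ffp-contract``, can
be reproduced on the host without a GPU. The following is a minimal sketch and
not part of clang's own tests: the names ``unfused``, ``fused``,
``fma_demo_a`` and ``fma_demo_b`` are hypothetical, and the ``volatile``
temporary is only there to keep the host compiler from contracting the
expression itself.

.. code-block:: c++

  #include <cmath>

  // Multiply, round to double, then add: two roundings.
  inline double unfused(double a, double b, double c) {
    volatile double product = a * b;  // force the intermediate rounding
    return product + c;
  }

  // Fused multiply-add: one rounding of the exact value a * b + c.
  inline double fused(double a, double b, double c) {
    return std::fma(a, b, c);
  }

  // With a = 1 + 2^-27 and b = 1 - 2^-27 the exact product is 1 - 2^-54,
  // which rounds to 1.0 in double precision.
  inline double fma_demo_a() { return 1.0 + std::ldexp(1.0, -27); }
  inline double fma_demo_b() { return 1.0 - std::ldexp(1.0, -27); }

With these inputs and ``c = -1.0``, the unfused version returns exactly zero,
while the fused version keeps the ``-2^-54`` residue that the intermediate
rounding would otherwise discard.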

Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc, ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``<std::complex>``
------------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr. You can therefore use these in device code,
when compiling with clang.
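
As a small illustration, the sketch below calls ``std::min`` and ``std::max``
from a ``__device__`` function. The function name ``clamp_to_block`` and the
limit ``256`` are hypothetical, and the fallback macro definitions are an
assumption added so that the snippet also builds as ordinary C++14 when no CUDA
compiler is involved.

.. code-block:: c++

  #include <algorithm>

  #ifndef __CUDACC__
  // Assumption: outside CUDA compilation, treat the attributes as no-ops.
  #define __host__
  #define __device__
  #endif

  // std::min and std::max are constexpr since C++14, so clang accepts these
  // calls in device code.
  __device__ inline int clamp_to_block(int n) {
    return std::max(0, std::min(n, 256));
  }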

Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.
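
The checks described in this section can be folded into a single helper. This
is only a sketch: the enumerator and function names are hypothetical, and
testing ``__NVCC__`` first simply follows the description above, so that a
clang preprocessor driven by nvcc is reported as nvcc.

.. code-block:: c++

  enum class CudaCompileMode {
    NotCuda,      // ordinary C++ compilation
    ClangHost,    // clang compiling CUDA code, host mode
    ClangDevice,  // clang compiling CUDA code, device mode
    Nvcc          // nvcc, possibly with clang as a subtool
  };

  constexpr CudaCompileMode cuda_compile_mode() {
  #if defined(__NVCC__)
    return CudaCompileMode::Nvcc;
  #elif defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
    return CudaCompileMode::ClangDevice;
  #elif defined(__clang__) && defined(__CUDA__)
    return CudaCompileMode::ClangHost;
  #else
    return CudaCompileMode::NotCuda;
  #endif
  }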

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses *split compilation*,
which works roughly as follows:

* Run a preprocessor over the input ``.cu`` file to split it into two source
  files: ``H``, containing source code for the host, and ``D``, containing
  source code for the device.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
    ``P_arch``.

  * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
    ``S_arch``, containing GPU machine code (SASS) for ``arch``.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single "fat binary" file, ``F``.

* Compile ``H`` using an external host compiler (gcc, clang, or whatever you
  like). ``F`` is packaged up into a header file which is force-included into
  ``H``; nvcc generates code that calls into this header to e.g. launch
  kernels.

clang uses *merged parsing*. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the host at this time.

    The output of this step is a ``ptx`` file ``P_arch``.

  * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
    nvcc, clang always generates SASS code.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single fat binary file, ``F``.

* Compile the input ``.cu`` file for host, using clang. ``__device__`` code is
  parsed and must be semantically correct, even though we're not generating
  code for the device at this time.

  ``F`` is passed to this compilation, and clang includes it in a special ELF
  section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)
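
To make the parenthetical above concrete, here is a sketch of code whose
preprocessed form differs from one pass to the next, so a single AST could not
serve them all. The function name ``compiled_for`` is hypothetical, and the
fallback macros are an assumption that lets the snippet build as plain C++.

.. code-block:: c++

  #ifndef __CUDACC__
  // Assumption: outside CUDA compilation, treat the attributes as no-ops.
  #define __host__
  #define __device__
  #endif

  // Returns 0 in the host pass. In each device pass it returns that pass's
  // __CUDA_ARCH__ value, for example 350 when compiling for sm_35.
  __host__ __device__ inline int compiled_for() {
  #ifdef __CUDA_ARCH__
    return __CUDA_ARCH__;
  #else
    return 0;
  #endif
  }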

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

* D functions prefer to call other Ds. HDs are given lower priority.

* Similarly, H functions prefer to call other Hs, or ``__global__`` functions
  (with equal priority). HDs are given lower priority.

* HD functions prefer to call other HDs.

  When compiling for device, HDs will call Ds with lower priority than HD, and
  will call Hs with still lower priority. If an HD is forced to call an H, the
  program is malformed if we emit code for this HD function. We call this the
  "wrong-side rule"; see the example below.

  The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo();  // calls H overload
    bar();  // calls H overload
  }

  __device__ void test_device() {
    foo();  // calls D overload
    bar();  // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo();  // calls H overload when compiling for host, otherwise D overload
    bar();  // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // does not generate an error. inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device. Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: They aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device. What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes. However, destructors cannot be overloaded.
anatofuz
parents:
diff changeset
420
anatofuz
parents:
diff changeset
421 Using a Different Class on Host/Device
anatofuz
parents:
diff changeset
422 --------------------------------------
anatofuz
parents:
diff changeset
423
Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo().  The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif

We posit that you don't really want to have classes with different members on H
and D.  For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes.  If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes.  Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
    #ifndef __CUDA_ARCH__
      __host__ HostS MakeStruct() { return HostS(); }
    #else
      __device__ DeviceS MakeStruct() { return DeviceS(); }
    #endif
  #else
    __host__ HostS MakeStruct() { return HostS(); }
    __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU.  We've made a number of changes to
LLVM to make it generate good GPU code.  Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <https://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything.  Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32 bits at runtime.  This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on GPUs is more expensive.  More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling pragmas
  <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)

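
To make the 64-bit divide bypass from the list above more concrete, here is a
hand-written sketch of the fast path in plain C++. It is only an illustration
of the idea: the real transformation is applied by LLVM to its IR, not to
source code, and the function name ``div_with_bypass`` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: take a 32-bit fast path when both operands fit in 32
// bits at runtime, and fall back to the full 64-bit divide otherwise.
uint64_t div_with_bypass(uint64_t dividend, uint64_t divisor) {
  assert(divisor != 0 && "division by zero is undefined");

  // OR-ing the operands lets a single test cover both of them: if no bit is
  // set in the upper 32 bits of the result, each operand fits in 32 bits.
  if (((dividend | divisor) >> 32) == 0) {
    // Fast path: a 32-bit divide, which is much cheaper on NVIDIA GPUs.
    return static_cast<uint32_t>(dividend) / static_cast<uint32_t>(divisor);
  }

  // Slow path: the original 64-bit divide.
  return dividend / divisor;
}
```

The result is the same on either path; the extra compare and branch only pay
off when most dividends and divisors turn out to be small at runtime, which is
the common case the list item describes.
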
Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM.  Note that "gpucc" is no longer a meaningful name:
The relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <https://llvm.org/docs/#mailing-lists>`_.