120
|
1 =========================
|
|
2 Compiling CUDA with clang
|
|
3 =========================
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
4
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
5 .. contents::
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
6 :local:
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
7
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
8 Introduction
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
9 ============
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
10
|
120
|
11 This document describes how to compile CUDA code with clang, and gives some
|
|
12 details about LLVM and clang's CUDA implementations.
|
|
13
|
|
14 This document assumes a basic familiarity with CUDA. Information about CUDA
|
|
15 programming can be found in the
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
16 `CUDA programming guide
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
17 <http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
18
|
120
|
19 Compiling CUDA Code
|
|
20 ===================
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
21
|
120
|
22 Prerequisites
|
|
23 -------------
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
24
|
120
|
25 CUDA is supported in llvm 3.9, but it's still in active development, so we
|
|
26 recommend you `compile clang/LLVM from HEAD
|
|
27 <http://llvm.org/docs/GettingStarted.html>`_.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
28
|
120
|
29 Before you build CUDA code, you'll need to have installed the appropriate
|
|
30 driver for your nvidia GPU and the CUDA SDK. See `NVIDIA's CUDA installation
|
|
31 guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_
|
|
32 for details. Note that clang `does not support
|
|
33 <https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
|
|
34 by many Linux package managers; you probably need to install nvidia's package.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
35
|
120
|
36 You will need CUDA 7.0, 7.5, or 8.0 to compile with clang.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
37
|
121
|
38 CUDA compilation is supported on Linux, on MacOS as of 2016-11-18, and on
|
|
39 Windows as of 2017-01-05.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
40
|
120
|
41 Invoking clang
|
|
42 --------------
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
43
|
120
|
44 Invoking clang for CUDA compilation works similarly to compiling regular C++.
|
|
45 You just need to be aware of a few additional flags.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
46
|
120
|
47 You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
|
|
48 program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
|
|
49 compiling CUDA code by noticing that your filename ends with ``.cu``.
|
|
50 Alternatively, you can pass ``-x cuda``.)
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
51
|
120
|
52 To build and run, run the following commands, filling in the parts in angle
|
|
53 brackets as described below:
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
54
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
55 .. code-block:: console
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
56
|
120
|
57 $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
|
|
58 -L<CUDA install path>/<lib64 or lib> \
|
|
59 -lcudart_static -ldl -lrt -pthread
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
60 $ ./axpy
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
61 y[0] = 2
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
62 y[1] = 4
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
63 y[2] = 6
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
64 y[3] = 8
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
65
|
120
|
66 On MacOS, replace `-lcudart_static` with `-lcudart`; otherwise, you may get
|
|
67 "CUDA driver version is insufficient for CUDA runtime version" errors when you
|
|
68 run your program.
|
|
69
|
|
70 * ``<CUDA install path>`` -- the directory where you installed CUDA SDK.
|
|
71 Typically, ``/usr/local/cuda``.
|
|
72
|
|
73 Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
|
|
74 pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
|
|
75 always have the same pointer widths, so if you're compiling 64-bit code for
|
|
76 the host, you're also compiling 64-bit code for the device.)
|
|
77
|
|
78 * ``<GPU arch>`` -- the `compute capability
|
|
79 <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
|
|
80 want to run your program on a GPU with compute capability of 3.5, specify
|
|
81 ``--cuda-gpu-arch=sm_35``.
|
|
82
|
|
83 Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
|
|
84 only ``sm_XX`` is currently supported. However, clang always includes PTX in
|
|
85 its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
|
|
86 forwards-compatible with e.g. ``sm_35`` GPUs.
|
|
87
|
|
88 You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.
|
|
89
|
|
90 The `-L` and `-l` flags only need to be passed when linking. When compiling,
|
|
91 you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install
|
|
92 the CUDA SDK into ``/usr/local/cuda``, ``/usr/local/cuda-7.0``, or
|
|
93 ``/usr/local/cuda-7.5``.
|
|
94
|
|
95 Flags that control numerical code
|
|
96 ---------------------------------
|
|
97
|
|
98 If you're using GPUs, you probably care about making numerical code run fast.
|
|
99 GPU hardware allows for more control over numerical operations than most CPUs,
|
|
100 but this results in more compiler options for you to juggle.
|
|
101
|
|
102 Flags you may wish to tweak include:
|
|
103
|
|
104 * ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
|
|
105 compiling CUDA) Controls whether the compiler emits fused multiply-add
|
|
106 operations.
|
|
107
|
|
108 * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
|
|
109 and add instructions.
|
|
110 * ``on``: fuse multiplies and adds within a single statement, but never
|
|
111 across statements (C11 semantics). Prevent ptxas from fusing other
|
|
112 multiplies and adds.
|
|
113 * ``fast``: fuse multiplies and adds wherever profitable, even across
|
|
114 statements. Doesn't prevent ptxas from fusing additional multiplies and
|
|
115 adds.
|
|
116
|
|
117 Fused multiply-add instructions can be much faster than the unfused
|
|
118 equivalents, but because the intermediate result in an fma is not rounded,
|
|
119 this flag can affect numerical code.
|
|
120
|
|
121 * ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
|
|
122 floating point operations may flush `denormal
|
|
123 <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
|
|
124 Operations on denormal numbers are often much slower than the same operations
|
|
125 on normal numbers.
|
|
126
|
|
127 * ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
|
|
128 compiler may emit calls to faster, approximate versions of transcendental
|
|
129 functions, instead of using the slower, fully IEEE-compliant versions. For
|
|
130 example, this flag allows clang to emit the ptx ``sin.approx.f32``
|
|
131 instruction.
|
|
132
|
|
133 This is implied by ``-ffast-math``.
|
|
134
|
|
135 Standard library support
|
|
136 ========================
|
|
137
|
|
138 In clang and nvcc, most of the C++ standard library is not supported on the
|
|
139 device side.
|
|
140
|
|
141 ``<math.h>`` and ``<cmath>``
|
|
142 ----------------------------
|
|
143
|
|
144 In clang, ``math.h`` and ``cmath`` are available and `pass
|
|
145 <https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/math_h.cu>`_
|
|
146 `tests
|
|
147 <https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/cmath.cu>`_
|
|
148 adapted from libc++'s test suite.
|
|
149
|
|
150 In nvcc ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
|
|
151 in namespace std (e.g. ``std::sinf``) are not available, and where the standard
|
|
152 calls for overloads that take integral arguments, these are usually not
|
|
153 available.
|
|
154
|
|
155 .. code-block:: c++
|
|
156
|
|
157 #include <math.h>
|
|
158 #include <cmath.h>
|
|
159
|
|
160 // clang is OK with everything in this function.
|
|
161 __device__ void test() {
|
|
162 std::sin(0.); // nvcc - ok
|
|
163 std::sin(0); // nvcc - error, because no std::sin(int) override is available.
|
|
164 sin(0); // nvcc - same as above.
|
|
165
|
|
166 sinf(0.); // nvcc - ok
|
|
167 std::sinf(0.); // nvcc - no such function
|
|
168 }
|
|
169
|
|
170 ``<std::complex>``
|
|
171 ------------------
|
|
172
|
|
173 nvcc does not officially support ``std::complex``. It's an error to use
|
|
174 ``std::complex`` in ``__device__`` code, but it often works in ``__host__
|
|
175 __device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
|
|
176 below). However, we have heard from implementers that it's possible to get
|
|
177 into situations where nvcc will omit a call to an ``std::complex`` function,
|
|
178 especially when compiling without optimizations.
|
|
179
|
|
180 As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
|
|
181 tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
|
|
182 newer than 2016-11-16.
|
|
183
|
|
184 ``<algorithm>``
|
|
185 ---------------
|
|
186
|
|
187 In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
|
|
188 ``std::max``) become constexpr. You can therefore use these in device code,
|
|
189 when compiling with clang.
|
|
190
|
|
191 Detecting clang vs NVCC from code
|
|
192 =================================
|
|
193
|
|
194 Although clang's CUDA implementation is largely compatible with NVCC's, you may
|
|
195 still want to detect when you're compiling CUDA code specifically with clang.
|
|
196
|
|
197 This is tricky, because NVCC may invoke clang as part of its own compilation
|
|
198 process! For example, NVCC uses the host compiler's preprocessor when
|
|
199 compiling for device code, and that host compiler may in fact be clang.
|
|
200
|
|
201 When clang is actually compiling CUDA code -- rather than being used as a
|
|
202 subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
|
|
203 defined only in device mode (but will be defined if NVCC is using clang as a
|
|
204 preprocessor). So you can use the following incantations to detect clang CUDA
|
|
205 compilation, in host and device modes:
|
|
206
|
|
207 .. code-block:: c++
|
|
208
|
|
209 #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
|
|
210 // clang compiling CUDA code, host mode.
|
|
211 #endif
|
|
212
|
|
213 #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
|
|
214 // clang compiling CUDA code, device mode.
|
|
215 #endif
|
|
216
|
|
217 Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
|
|
218 detect NVCC specifically by looking for ``__NVCC__``.
|
|
219
|
|
220 Dialect Differences Between clang and nvcc
|
|
221 ==========================================
|
|
222
|
|
223 There is no formal CUDA spec, and clang and nvcc speak slightly different
|
|
224 dialects of the language. Below, we describe some of the differences.
|
|
225
|
|
226 This section is painful; hopefully you can skip this section and live your life
|
|
227 blissfully unaware.
|
|
228
|
|
229 Compilation Models
|
|
230 ------------------
|
|
231
|
|
232 Most of the differences between clang and nvcc stem from the different
|
|
233 compilation models used by clang and nvcc. nvcc uses *split compilation*,
|
|
234 which works roughly as follows:
|
|
235
|
|
236 * Run a preprocessor over the input ``.cu`` file to split it into two source
|
|
237 files: ``H``, containing source code for the host, and ``D``, containing
|
|
238 source code for the device.
|
|
239
|
|
240 * For each GPU architecture ``arch`` that we're compiling for, do:
|
|
241
|
|
242 * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file for
|
|
243 ``P_arch``.
|
|
244
|
|
245 * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
|
|
246 ``S_arch``, containing GPU machine code (SASS) for ``arch``.
|
|
247
|
|
248 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
|
|
249 single "fat binary" file, ``F``.
|
|
250
|
|
251 * Compile ``H`` using an external host compiler (gcc, clang, or whatever you
|
|
252 like). ``F`` is packaged up into a header file which is force-included into
|
|
253 ``H``; nvcc generates code that calls into this header to e.g. launch
|
|
254 kernels.
|
|
255
|
|
256 clang uses *merged parsing*. This is similar to split compilation, except all
|
|
257 of the host and device code is present and must be semantically-correct in both
|
|
258 compilation steps.
|
|
259
|
|
260 * For each GPU architecture ``arch`` that we're compiling for, do:
|
|
261
|
|
262 * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
|
|
263 is parsed and must be semantically correct, even though we're not
|
|
264 generating code for the host at this time.
|
|
265
|
|
266 The output of this step is a ``ptx`` file ``P_arch``.
|
|
267
|
|
268 * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
|
|
269 nvcc, clang always generates SASS code.
|
|
270
|
|
271 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
|
|
272 single fat binary file, ``F``.
|
|
273
|
|
274 * Compile ``H`` using clang. ``__device__`` code is parsed and must be
|
|
275 semantically correct, even though we're not generating code for the device
|
|
276 at this time.
|
|
277
|
|
278 ``F`` is passed to this compilation, and clang includes it in a special ELF
|
|
279 section, where it can be found by tools like ``cuobjdump``.
|
|
280
|
|
281 (You may ask at this point, why does clang need to parse the input file
|
|
282 multiple times? Why not parse it just once, and then use the AST to generate
|
|
283 code for the host and each device architecture?
|
|
284
|
|
285 Unfortunately this can't work because we have to define different macros during
|
|
286 host compilation and during device compilation for each GPU architecture.)
|
|
287
|
|
288 clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
|
|
289 need to decide at an early stage which declarations to keep and which to throw
|
|
290 away. But it has some consequences you should be aware of.
|
|
291
|
|
292 Overloading Based on ``__host__`` and ``__device__`` Attributes
|
|
293 ---------------------------------------------------------------
|
|
294
|
|
295 Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
|
|
296 functions", and "``__host__ __device__`` functions", respectively. Functions
|
|
297 with no attributes behave the same as H.
|
|
298
|
|
299 nvcc does not allow you to create H and D functions with the same signature:
|
|
300
|
|
301 .. code-block:: c++
|
|
302
|
|
303 // nvcc: error - function "foo" has already been defined
|
|
304 __host__ void foo() {}
|
|
305 __device__ void foo() {}
|
|
306
|
|
307 However, nvcc allows you to "overload" H and D functions with different
|
|
308 signatures:
|
|
309
|
|
310 .. code-block:: c++
|
|
311
|
|
312 // nvcc: no error
|
|
313 __host__ void foo(int) {}
|
|
314 __device__ void foo() {}
|
|
315
|
|
316 In clang, the ``__host__`` and ``__device__`` attributes are part of a
|
|
317 function's signature, and so it's legal to have H and D functions with
|
|
318 (otherwise) the same signature:
|
|
319
|
|
320 .. code-block:: c++
|
|
321
|
|
322 // clang: no error
|
|
323 __host__ void foo() {}
|
|
324 __device__ void foo() {}
|
|
325
|
|
326 HD functions cannot be overloaded by H or D functions with the same signature:
|
|
327
|
|
328 .. code-block:: c++
|
|
329
|
|
330 // nvcc: error - function "foo" has already been defined
|
|
331 // clang: error - redefinition of 'foo'
|
|
332 __host__ __device__ void foo() {}
|
|
333 __device__ void foo() {}
|
|
334
|
|
335 // nvcc: no error
|
|
336 // clang: no error
|
|
337 __host__ __device__ void bar(int) {}
|
|
338 __device__ void bar() {}
|
|
339
|
|
340 When resolving an overloaded function, clang considers the host/device
|
|
341 attributes of the caller and callee. These are used as a tiebreaker during
|
|
342 overload resolution. See `IdentifyCUDAPreference
|
|
343 <http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
|
|
344 but at a high level they are:
|
|
345
|
|
346 * D functions prefer to call other Ds. HDs are given lower priority.
|
|
347
|
|
348 * Similarly, H functions prefer to call other Hs, or ``__global__`` functions
|
|
349 (with equal priority). HDs are given lower priority.
|
|
350
|
|
351 * HD functions prefer to call other HDs.
|
|
352
|
|
353 When compiling for device, HDs will call Ds with lower priority than HD, and
|
|
354 will call Hs with still lower priority. If it's forced to call an H, the
|
|
355 program is malformed if we emit code for this HD function. We call this the
|
|
356 "wrong-side rule", see example below.
|
|
357
|
|
358 The rules are symmetrical when compiling for host.
|
|
359
|
|
360 Some examples:
|
|
361
|
|
362 .. code-block:: c++
|
|
363
|
|
364 __host__ void foo();
|
|
365 __device__ void foo();
|
|
366
|
|
367 __host__ void bar();
|
|
368 __host__ __device__ void bar();
|
|
369
|
|
370 __host__ void test_host() {
|
|
371 foo(); // calls H overload
|
|
372 bar(); // calls H overload
|
|
373 }
|
|
374
|
|
375 __device__ void test_device() {
|
|
376 foo(); // calls D overload
|
|
377 bar(); // calls HD overload
|
|
378 }
|
|
379
|
|
380 __host__ __device__ void test_hd() {
|
|
381 foo(); // calls H overload when compiling for host, otherwise D overload
|
|
382 bar(); // always calls HD overload
|
|
383 }
|
|
384
|
|
385 Wrong-side rule example:
|
|
386
|
|
387 .. code-block:: c++
|
|
388
|
|
389 __host__ void host_only();
|
|
390
|
|
391 // We don't codegen inline functions unless they're referenced by a
|
|
392 // non-inline function. inline_hd1() is called only from the host side, so
|
|
393 // does not generate an error. inline_hd2() is called from the device side,
|
|
394 // so it generates an error.
|
|
395 inline __host__ __device__ void inline_hd1() { host_only(); } // no error
|
|
396 inline __host__ __device__ void inline_hd2() { host_only(); } // error
|
|
397
|
|
398 __host__ void host_fn() { inline_hd1(); }
|
|
399 __device__ void device_fn() { inline_hd2(); }
|
|
400
|
|
401 // This function is not inline, so it's always codegen'ed on both the host
|
|
402 // and the device. Therefore, it generates an error.
|
|
403 __host__ __device__ void not_inline_hd() { host_only(); }
|
|
404
|
|
405 For the purposes of the wrong-side rule, templated functions also behave like
|
|
406 ``inline`` functions: They aren't codegen'ed unless they're instantiated
|
|
407 (usually as part of the process of invoking them).
|
|
408
|
|
409 clang's behavior with respect to the wrong-side rule matches nvcc's, except
|
|
410 nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
|
|
411 ``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
|
|
412 call to ``host_only`` entirely, or it may try to generate code for
|
|
413 ``host_only`` on the device. What you get seems to depend on whether or not
|
|
414 the compiler chooses to inline ``host_only``.
|
|
415
|
|
416 Member functions, including constructors, may be overloaded using H and D
|
|
417 attributes. However, destructors cannot be overloaded.
|
|
418
|
|
419 Using a Different Class on Host/Device
|
|
420 --------------------------------------
|
|
421
|
|
422 Occasionally you may want to have a class with different host/device versions.
|
|
423
|
|
424 If all of the class's members are the same on the host and device, you can just
|
|
425 provide overloads for the class's member functions.
|
|
426
|
|
427 However, if you want your class to have different members on host/device, you
|
|
428 won't be able to provide working H and D overloads in both classes. In this
|
|
429 case, clang is likely to be unhappy with you.
|
|
430
|
|
431 .. code-block:: c++
|
|
432
|
|
433 #ifdef __CUDA_ARCH__
|
|
434 struct S {
|
|
435 __device__ void foo() { /* use device_only */ }
|
|
436 int device_only;
|
|
437 };
|
|
438 #else
|
|
439 struct S {
|
|
440 __host__ void foo() { /* use host_only */ }
|
|
441 double host_only;
|
|
442 };
|
|
443
|
|
444 __device__ void test() {
|
|
445 S s;
|
|
446 // clang generates an error here, because during host compilation, we
|
|
447 // have ifdef'ed away the __device__ overload of S::foo(). The __device__
|
|
448 // overload must be present *even during host compilation*.
|
|
449 S.foo();
|
|
450 }
|
|
451 #endif
|
|
452
|
|
453 We posit that you don't really want to have classes with different members on H
|
|
454 and D. For example, if you were to pass one of these as a parameter to a
|
|
455 kernel, it would have a different layout on H and D, so would not work
|
|
456 properly.
|
|
457
|
|
458 To make code like this compatible with clang, we recommend you separate it out
|
|
459 into two classes. If you need to write code that works on both host and
|
|
460 device, consider writing an overloaded wrapper function that returns different
|
|
461 types on host and device.
|
|
462
|
|
463 .. code-block:: c++
|
|
464
|
|
465 struct HostS { ... };
|
|
466 struct DeviceS { ... };
|
|
467
|
|
468 __host__ HostS MakeStruct() { return HostS(); }
|
|
469 __device__ DeviceS MakeStruct() { return DeviceS(); }
|
|
470
|
|
471 // Now host and device code can call MakeStruct().
|
|
472
|
|
473 Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
|
|
474 you to overload based on the H/D attributes. Here's an idiom that works with
|
|
475 both clang and nvcc:
|
|
476
|
|
477 .. code-block:: c++
|
|
478
|
|
479 struct HostS { ... };
|
|
480 struct DeviceS { ... };
|
|
481
|
|
482 #ifdef __NVCC__
|
|
483 #ifndef __CUDA_ARCH__
|
|
484 __host__ HostS MakeStruct() { return HostS(); }
|
|
485 #else
|
|
486 __device__ DeviceS MakeStruct() { return DeviceS(); }
|
|
487 #endif
|
|
488 #else
|
|
489 __host__ HostS MakeStruct() { return HostS(); }
|
|
490 __device__ DeviceS MakeStruct() { return DeviceS(); }
|
|
491 #endif
|
|
492
|
|
493 // Now host and device code can call MakeStruct().
|
|
494
|
|
495 Hopefully you don't have to do this sort of thing often.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
496
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
497 Optimizations
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
498 =============
|
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
499
|
120
|
500 Modern CPUs and GPUs are architecturally quite different, so code that's fast
|
|
501 on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
|
|
502 LLVM to make it generate good GPU code. Among these changes are:
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
503
|
120
|
504 * `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
|
|
505 reduce redundancy within straight-line code.
|
|
506
|
|
507 * `Aggressive speculative execution
|
|
508 <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
|
|
509 -- This is mainly for promoting straight-line scalar optimizations, which are
|
|
510 most effective on code along dominator paths.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
511
|
120
|
512 * `Memory space inference
|
|
513 <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
|
|
514 In PTX, we can operate on pointers that are in a paricular "address space"
|
|
515 (global, shared, constant, or local), or we can operate on pointers in the
|
|
516 "generic" address space, which can point to anything. Operations in a
|
|
517 non-generic address space are faster, but pointers in CUDA are not explicitly
|
|
518 annotated with their address space, so it's up to LLVM to infer it where
|
|
519 possible.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
520
|
120
|
521 * `Bypassing 64-bit divides
|
|
522 <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
|
|
523 This was an existing optimization that we enabled for the PTX backend.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
524
|
120
|
525 64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
|
|
526 Many of the 64-bit divides in our benchmarks have a divisor and dividend
|
|
527 which fit in 32-bits at runtime. This optimization provides a fast path for
|
|
528 this common case.
|
|
529
|
|
530 * Aggressive loop unrooling and function inlining -- Loop unrolling and
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
531 function inlining need to be more aggressive for GPUs than for CPUs because
|
120
|
532 control flow transfer in GPU is more expensive. More aggressive unrolling and
|
|
533 inlining also promote other optimizations, such as constant propagation and
|
|
534 SROA, which sometimes speed up code by over 10x.
|
|
535
|
|
536 (Programmers can force unrolling and inline using clang's `loop unrolling pragmas
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
537 <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
|
120
|
538 and ``__attribute__((always_inline))``.)
|
|
539
|
|
540 Publication
|
|
541 ===========
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
542
|
120
|
543 The team at Google published a paper in CGO 2016 detailing the optimizations
|
|
544 they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
|
|
545 The relevant tools are now just vanilla clang/LLVM.
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
546
|
120
|
547 | `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
|
|
548 | Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
|
|
549 | *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
|
550 |
|
|
551 | `Slides from the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
|
|
552 |
|
|
553 | `Tutorial given at CGO <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_
|
100
Miyagi Mitsuki <e135756@ie.u-ryukyu.ac.jp>
parents:
diff
changeset
|
554
|
120
|
555 Obtaining Help
|
|
556 ==============
|
|
557
|
|
558 To obtain help on LLVM in general and its CUDA support, see `the LLVM
|
|
559 community <http://llvm.org/docs/#mailing-lists>`_.
|