CbC/CbC_llvm: llvm/docs/CompileCudaWithLLVM.rst annotate

annotate llvm/docs/CompileCudaWithLLVM.rst @ 173:0572611fdcc8 llvm10 llvm12

reorgnization done

author	Shinji KONO <kono@ie.u-ryukyu.ac.jp>
date	Mon, 25 May 2020 11:55:54 +0900
parents	1d019706d866
children	c4bab56944e8

=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA has been supported since LLVM 3.9. Clang currently supports CUDA 7.0
through 10.1. If clang detects a newer CUDA version, it will issue a warning
and will attempt to use the detected CUDA SDK as if it were CUDA 10.1.

Before you build CUDA code, you'll need to have installed the CUDA SDK. See
`NVIDIA's CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details. Note that clang `may not support
<https://bugs.llvm.org/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed by
some Linux package managers. Clang does attempt to deal with specific details of
CUDA installation on a handful of common Linux distributions, but in general the
most reliable way to make it work is to install CUDA in a single directory from
NVIDIA's ``.run`` package and specify its location via the ``--cuda-path=...``
argument.

CUDA compilation is supported on Linux. Compilation on macOS and Windows may or
may not work and currently has no maintainers.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On macOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0, the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
  forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple archs.

The ``-L`` and ``-l`` flags only need to be passed when linking. When compiling,
you may also need to pass ``--cuda-path=/path/to/cuda`` if you didn't install
the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
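
The rounding difference that ``-ffp-contract`` controls can be reproduced on
the host, without a GPU. The following is a minimal sketch and is not taken
from clang: the helper names ``unfused_mul_add`` and ``fused_mul_add`` are
hypothetical, and ``std::fma`` stands in for a hardware fma instruction. The
``volatile`` temporary is there only to keep the compiler from contracting the
expression on its own.

.. code-block:: c++

  #include <cmath>

  // Hypothetical helper: multiply, round the product to double, then add.
  // The volatile store prevents the compiler from fusing the two operations.
  inline double unfused_mul_add(double a, double b, double c) {
    volatile double product = a * b;  // the product is rounded to double here
    return product + c;
  }

  // Hypothetical helper: std::fma rounds only once, after the addition.
  inline double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
  }

For example, with ``a = b = 1 + 2^-27`` and ``c = -(1 + 2^-26)``, the exact
product is ``1 + 2^-26 + 2^-54``. Rounding the product to double drops the
last term, so ``unfused_mul_add`` returns ``0``, while ``fused_mul_add``
returns ``2^-54``.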

Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.);  // nvcc - ok
    std::sin(0);   // nvcc - error, because no std::sin(int) overload is available.
    sin(0);        // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``<std::complex>``
------------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr. You can therefore use these in device code,
when compiling with clang.
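
As a small illustration, here is a sketch of a helper that relies on
``std::min`` and ``std::max``. The function name ``clamp_index`` is
hypothetical, and the fallback macro definitions are an assumption added only
so that the same snippet also builds as ordinary C++ when no CUDA compiler is
involved. It should be built with ``-std=c++14`` or newer.

.. code-block:: c++

  #include <algorithm>

  // Assumption for illustration: let the snippet build as plain C++ too.
  #if !defined(__CUDACC__)
  #define __host__
  #define __device__
  #endif

  // Hypothetical helper: clamp i into the valid index range [0, n - 1].
  // With clang in C++14 mode, the constexpr std::min and std::max can be
  // called from device code.
  __host__ __device__ inline int clamp_index(int i, int n) {
    return std::max(0, std::min(i, n - 1));
  }

A kernel could then use ``clamp_index(i, n)`` to keep an out-of-range thread
index inside an array of ``n`` elements.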

Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.
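
The macros above can be folded into a single query. This is only a sketch: the
enum ``CudaCompiler`` and the function ``cuda_compiler`` are hypothetical names
introduced for illustration, not something clang or nvcc provides.

.. code-block:: c++

  // Hypothetical names, used only for this sketch.
  enum class CudaCompiler { None, Clang, Nvcc };

  // __CUDA__ is tested together with __clang__ because __clang__ alone is
  // also defined when nvcc merely uses clang as its host compiler.
  constexpr CudaCompiler cuda_compiler() {
  #if defined(__clang__) && defined(__CUDA__)
    return CudaCompiler::Clang;  // clang is compiling the CUDA code itself
  #elif defined(__NVCC__)
    return CudaCompiler::Nvcc;   // nvcc is driving the compilation
  #else
    return CudaCompiler::None;   // not a CUDA compilation
  #endif
  }

Code that needs a compiler-specific workaround can then branch on
``cuda_compiler()`` in a ``static_assert`` or an ordinary ``if``.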

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses split compilation,
which works roughly as follows:

* Run a preprocessor over the input ``.cu`` file to split it into two source
  files: ``H``, containing source code for the host, and ``D``, containing
  source code for the device.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
    ``P_arch``.

  * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
    ``S_arch``, containing GPU machine code (SASS) for ``arch``.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single "fat binary" file, ``F``.

* Compile ``H`` using an external host compiler (gcc, clang, or whatever you
  like). ``F`` is packaged up into a header file which is force-included into
  ``H``; nvcc generates code that calls into this header to e.g. launch
  kernels.

clang uses merged parsing. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the host at this time.

    The output of this step is a ``ptx`` file ``P_arch``.

  * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
    nvcc, clang always generates SASS code.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single fat binary file, ``F``.

* Compile ``H`` using clang. ``__device__`` code is parsed and must be
  semantically correct, even though we're not generating code for the device
  at this time.

  ``F`` is passed to this compilation, and clang includes it in a special ELF
  section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)
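
To make the macro argument concrete, here is a minimal sketch of a function
whose body differs from one compilation pass to the next. The name
``compiled_for_arch`` is hypothetical, and the fallback macro definitions are
an assumption added only so that the snippet also builds as ordinary C++.

.. code-block:: c++

  // Assumption for illustration: let the snippet build as plain C++ too.
  #if !defined(__CUDACC__)
  #define __host__
  #define __device__
  #endif

  // Hypothetical helper: report which compilation pass produced this code.
  __host__ __device__ inline int compiled_for_arch() {
  #if defined(__CUDA_ARCH__)
    return __CUDA_ARCH__;  // device pass, e.g. 350 when targeting sm_35
  #else
    return 0;              // host pass: __CUDA_ARCH__ is not defined
  #endif
  }

Because the preprocessor selects a different ``return`` statement in each pass,
the host pass and each device pass see different token streams, so a single
parse could not serve all of them.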

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<https://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

* D functions prefer to call other Ds. HDs are given lower priority.

* Similarly, H functions prefer to call other Hs, or ``__global__`` functions
  (with equal priority). HDs are given lower priority.
1d019706d866 LLVM10 anatofuz parents: diff changeset	352
1d019706d866 LLVM10 anatofuz parents: diff changeset	353 * HD functions prefer to call other HDs.
1d019706d866 LLVM10 anatofuz parents: diff changeset	354
1d019706d866 LLVM10 anatofuz parents: diff changeset	355 When compiling for device, HDs will call Ds with lower priority than HD, and
1d019706d866 LLVM10 anatofuz parents: diff changeset	356 will call Hs with still lower priority. If an HD is forced to call an H, the
1d019706d866 LLVM10 anatofuz parents: diff changeset	357 program is malformed if we emit code for this HD function. We call this the
1d019706d866 LLVM10 anatofuz parents: diff changeset	358 "wrong-side rule"; see the example below.
1d019706d866 LLVM10 anatofuz parents: diff changeset	359
1d019706d866 LLVM10 anatofuz parents: diff changeset	360 The rules are symmetrical when compiling for host.
1d019706d866 LLVM10 anatofuz parents: diff changeset	361
1d019706d866 LLVM10 anatofuz parents: diff changeset	362 Some examples:
1d019706d866 LLVM10 anatofuz parents: diff changeset	363
1d019706d866 LLVM10 anatofuz parents: diff changeset	364 .. code-block:: c++
1d019706d866 LLVM10 anatofuz parents: diff changeset	365
1d019706d866 LLVM10 anatofuz parents: diff changeset	366 __host__ void foo();
1d019706d866 LLVM10 anatofuz parents: diff changeset	367 __device__ void foo();
1d019706d866 LLVM10 anatofuz parents: diff changeset	368
1d019706d866 LLVM10 anatofuz parents: diff changeset	369 __host__ void bar();
1d019706d866 LLVM10 anatofuz parents: diff changeset	370 __host__ __device__ void bar();
1d019706d866 LLVM10 anatofuz parents: diff changeset	371
1d019706d866 LLVM10 anatofuz parents: diff changeset	372 __host__ void test_host() {
1d019706d866 LLVM10 anatofuz parents: diff changeset	373 foo(); // calls H overload
1d019706d866 LLVM10 anatofuz parents: diff changeset	374 bar(); // calls H overload
1d019706d866 LLVM10 anatofuz parents: diff changeset	375 }
1d019706d866 LLVM10 anatofuz parents: diff changeset	376
1d019706d866 LLVM10 anatofuz parents: diff changeset	377 __device__ void test_device() {
1d019706d866 LLVM10 anatofuz parents: diff changeset	378 foo(); // calls D overload
1d019706d866 LLVM10 anatofuz parents: diff changeset	379 bar(); // calls HD overload
1d019706d866 LLVM10 anatofuz parents: diff changeset	380 }
1d019706d866 LLVM10 anatofuz parents: diff changeset	381
1d019706d866 LLVM10 anatofuz parents: diff changeset	382 __host__ __device__ void test_hd() {
1d019706d866 LLVM10 anatofuz parents: diff changeset	383 foo(); // calls H overload when compiling for host, otherwise D overload
1d019706d866 LLVM10 anatofuz parents: diff changeset	384 bar(); // always calls HD overload
1d019706d866 LLVM10 anatofuz parents: diff changeset	385 }
1d019706d866 LLVM10 anatofuz parents: diff changeset	386
1d019706d866 LLVM10 anatofuz parents: diff changeset	387 Wrong-side rule example:
1d019706d866 LLVM10 anatofuz parents: diff changeset	388
1d019706d866 LLVM10 anatofuz parents: diff changeset	389 .. code-block:: c++
1d019706d866 LLVM10 anatofuz parents: diff changeset	390
1d019706d866 LLVM10 anatofuz parents: diff changeset	391 __host__ void host_only();
1d019706d866 LLVM10 anatofuz parents: diff changeset	392
1d019706d866 LLVM10 anatofuz parents: diff changeset	393 // We don't codegen inline functions unless they're referenced by a
1d019706d866 LLVM10 anatofuz parents: diff changeset	394 // non-inline function. inline_hd1() is called only from the host side, so
1d019706d866 LLVM10 anatofuz parents: diff changeset	395 // does not generate an error. inline_hd2() is called from the device side,
1d019706d866 LLVM10 anatofuz parents: diff changeset	396 // so it generates an error.
1d019706d866 LLVM10 anatofuz parents: diff changeset	397 inline __host__ __device__ void inline_hd1() { host_only(); } // no error
1d019706d866 LLVM10 anatofuz parents: diff changeset	398 inline __host__ __device__ void inline_hd2() { host_only(); } // error
1d019706d866 LLVM10 anatofuz parents: diff changeset	399
1d019706d866 LLVM10 anatofuz parents: diff changeset	400 __host__ void host_fn() { inline_hd1(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	401 __device__ void device_fn() { inline_hd2(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	402
1d019706d866 LLVM10 anatofuz parents: diff changeset	403 // This function is not inline, so it's always codegen'ed on both the host
1d019706d866 LLVM10 anatofuz parents: diff changeset	404 // and the device. Therefore, it generates an error.
1d019706d866 LLVM10 anatofuz parents: diff changeset	405 __host__ __device__ void not_inline_hd() { host_only(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	406
1d019706d866 LLVM10 anatofuz parents: diff changeset	407 For the purposes of the wrong-side rule, templated functions also behave like
1d019706d866 LLVM10 anatofuz parents: diff changeset	408 ``inline`` functions: They aren't codegen'ed unless they're instantiated
1d019706d866 LLVM10 anatofuz parents: diff changeset	409 (usually as part of the process of invoking them).
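
As a minimal sketch of the templated case (the names ``templated_hd``, ``host_only`` and ``host_fn`` are illustrative, and the fallback macro definitions are an assumption added only so the snippet also builds as plain C++), a templated HD function that calls an H function is expected to be accepted as long as it is instantiated only from host code:

```cpp
// Fallback so the sketch also builds outside CUDA compilation. Both clang
// and nvcc define __CUDACC__ when compiling CUDA, so this is skipped there.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

__host__ int host_only() { return 42; }

// Like an inline function, this template is only codegen'ed where it is
// instantiated and referenced.
template <typename T>
__host__ __device__ T templated_hd() {
  return static_cast<T>(host_only());
}

// Instantiated from the host side only, so per the wrong-side rule no
// device code is emitted for templated_hd<int>.
__host__ int host_fn() { return templated_hd<int>(); }

// Uncommenting the following line would instantiate the template from the
// device side, which the wrong-side rule describes as an error:
// __device__ int device_fn() { return templated_hd<int>(); }
```

On the host, ``host_fn()`` simply forwards to ``host_only()``.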
1d019706d866 LLVM10 anatofuz parents: diff changeset	410
1d019706d866 LLVM10 anatofuz parents: diff changeset	411 clang's behavior with respect to the wrong-side rule matches nvcc's, except
1d019706d866 LLVM10 anatofuz parents: diff changeset	412 nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
1d019706d866 LLVM10 anatofuz parents: diff changeset	413 ``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
1d019706d866 LLVM10 anatofuz parents: diff changeset	414 call to ``host_only`` entirely, or it may try to generate code for
1d019706d866 LLVM10 anatofuz parents: diff changeset	415 ``host_only`` on the device. What you get seems to depend on whether or not
1d019706d866 LLVM10 anatofuz parents: diff changeset	416 the compiler chooses to inline ``host_only``.
1d019706d866 LLVM10 anatofuz parents: diff changeset	417
1d019706d866 LLVM10 anatofuz parents: diff changeset	418 Member functions, including constructors, may be overloaded using H and D
1d019706d866 LLVM10 anatofuz parents: diff changeset	419 attributes. However, destructors cannot be overloaded.
1d019706d866 LLVM10 anatofuz parents: diff changeset	420
1d019706d866 LLVM10 anatofuz parents: diff changeset	421 Using a Different Class on Host/Device
1d019706d866 LLVM10 anatofuz parents: diff changeset	422 --------------------------------------
1d019706d866 LLVM10 anatofuz parents: diff changeset	423
1d019706d866 LLVM10 anatofuz parents: diff changeset	424 Occasionally you may want to have a class with different host/device versions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	425
1d019706d866 LLVM10 anatofuz parents: diff changeset	426 If all of the class's members are the same on the host and device, you can just
1d019706d866 LLVM10 anatofuz parents: diff changeset	427 provide overloads for the class's member functions.
1d019706d866 LLVM10 anatofuz parents: diff changeset	428
1d019706d866 LLVM10 anatofuz parents: diff changeset	429 However, if you want your class to have different members on host/device, you
1d019706d866 LLVM10 anatofuz parents: diff changeset	430 won't be able to provide working H and D overloads in both classes. In this
1d019706d866 LLVM10 anatofuz parents: diff changeset	431 case, clang is likely to be unhappy with you.
1d019706d866 LLVM10 anatofuz parents: diff changeset	432
1d019706d866 LLVM10 anatofuz parents: diff changeset	433 .. code-block:: c++
1d019706d866 LLVM10 anatofuz parents: diff changeset	434
1d019706d866 LLVM10 anatofuz parents: diff changeset	435 #ifdef __CUDA_ARCH__
1d019706d866 LLVM10 anatofuz parents: diff changeset	436 struct S {
1d019706d866 LLVM10 anatofuz parents: diff changeset	437 __device__ void foo() { /* use device_only */ }
1d019706d866 LLVM10 anatofuz parents: diff changeset	438 int device_only;
1d019706d866 LLVM10 anatofuz parents: diff changeset	439 };
1d019706d866 LLVM10 anatofuz parents: diff changeset	440 #else
1d019706d866 LLVM10 anatofuz parents: diff changeset	441 struct S {
1d019706d866 LLVM10 anatofuz parents: diff changeset	442 __host__ void foo() { /* use host_only */ }
1d019706d866 LLVM10 anatofuz parents: diff changeset	443 double host_only;
1d019706d866 LLVM10 anatofuz parents: diff changeset	444 };
1d019706d866 LLVM10 anatofuz parents: diff changeset	445
1d019706d866 LLVM10 anatofuz parents: diff changeset	446 __device__ void test() {
1d019706d866 LLVM10 anatofuz parents: diff changeset	447 S s;
1d019706d866 LLVM10 anatofuz parents: diff changeset	448 // clang generates an error here, because during host compilation, we
1d019706d866 LLVM10 anatofuz parents: diff changeset	449 // have ifdef'ed away the __device__ overload of S::foo(). The __device__
1d019706d866 LLVM10 anatofuz parents: diff changeset	450 // overload must be present even during host compilation.
1d019706d866 LLVM10 anatofuz parents: diff changeset	451 s.foo();
1d019706d866 LLVM10 anatofuz parents: diff changeset	452 }
1d019706d866 LLVM10 anatofuz parents: diff changeset	453 #endif
1d019706d866 LLVM10 anatofuz parents: diff changeset	454
1d019706d866 LLVM10 anatofuz parents: diff changeset	455 We posit that you don't really want to have classes with different members on H
1d019706d866 LLVM10 anatofuz parents: diff changeset	456 and D. For example, if you were to pass one of these as a parameter to a
1d019706d866 LLVM10 anatofuz parents: diff changeset	457 kernel, it would have a different layout on H and D, so would not work
1d019706d866 LLVM10 anatofuz parents: diff changeset	458 properly.
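
To make the layout hazard concrete, here is a small host-side sketch. ``HostLayoutS`` and ``DeviceLayoutS`` are hypothetical names standing in for the two ``#ifdef`` branches of ``S`` above, and the size difference assumes a common ABI where ``int`` is 4 bytes and ``double`` is 8 bytes.

```cpp
// Hypothetical stand-ins for the two versions of S shown above.
struct DeviceLayoutS {
  int device_only;
};
struct HostLayoutS {
  double host_only;
};

// Reports whether the two definitions disagree on size. If they do, a
// kernel argument written by the host using one layout would be read by
// the device using the other.
constexpr bool layouts_differ() {
  return sizeof(HostLayoutS) != sizeof(DeviceLayoutS);
}
```

A ``static_assert(!layouts_differ(), "...")`` in shared code is one way to catch this kind of mismatch at compile time.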
1d019706d866 LLVM10 anatofuz parents: diff changeset	459
1d019706d866 LLVM10 anatofuz parents: diff changeset	460 To make code like this compatible with clang, we recommend you separate it out
1d019706d866 LLVM10 anatofuz parents: diff changeset	461 into two classes. If you need to write code that works on both host and
1d019706d866 LLVM10 anatofuz parents: diff changeset	462 device, consider writing an overloaded wrapper function that returns different
1d019706d866 LLVM10 anatofuz parents: diff changeset	463 types on host and device.
1d019706d866 LLVM10 anatofuz parents: diff changeset	464
1d019706d866 LLVM10 anatofuz parents: diff changeset	465 .. code-block:: c++
1d019706d866 LLVM10 anatofuz parents: diff changeset	466
1d019706d866 LLVM10 anatofuz parents: diff changeset	467 struct HostS { ... };
1d019706d866 LLVM10 anatofuz parents: diff changeset	468 struct DeviceS { ... };
1d019706d866 LLVM10 anatofuz parents: diff changeset	469
1d019706d866 LLVM10 anatofuz parents: diff changeset	470 __host__ HostS MakeStruct() { return HostS(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	471 __device__ DeviceS MakeStruct() { return DeviceS(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	472
1d019706d866 LLVM10 anatofuz parents: diff changeset	473 // Now host and device code can call MakeStruct().
1d019706d866 LLVM10 anatofuz parents: diff changeset	474
1d019706d866 LLVM10 anatofuz parents: diff changeset	475 Unfortunately, this idiom isn't compatible with nvcc, because nvcc doesn't allow
1d019706d866 LLVM10 anatofuz parents: diff changeset	476 you to overload based on the H/D attributes. Here's an idiom that works with
1d019706d866 LLVM10 anatofuz parents: diff changeset	477 both clang and nvcc:
1d019706d866 LLVM10 anatofuz parents: diff changeset	478
1d019706d866 LLVM10 anatofuz parents: diff changeset	479 .. code-block:: c++
1d019706d866 LLVM10 anatofuz parents: diff changeset	480
1d019706d866 LLVM10 anatofuz parents: diff changeset	481 struct HostS { ... };
1d019706d866 LLVM10 anatofuz parents: diff changeset	482 struct DeviceS { ... };
1d019706d866 LLVM10 anatofuz parents: diff changeset	483
1d019706d866 LLVM10 anatofuz parents: diff changeset	484 #ifdef __NVCC__
1d019706d866 LLVM10 anatofuz parents: diff changeset	485 #ifndef __CUDA_ARCH__
1d019706d866 LLVM10 anatofuz parents: diff changeset	486 __host__ HostS MakeStruct() { return HostS(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	487 #else
1d019706d866 LLVM10 anatofuz parents: diff changeset	488 __device__ DeviceS MakeStruct() { return DeviceS(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	489 #endif
1d019706d866 LLVM10 anatofuz parents: diff changeset	490 #else
1d019706d866 LLVM10 anatofuz parents: diff changeset	491 __host__ HostS MakeStruct() { return HostS(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	492 __device__ DeviceS MakeStruct() { return DeviceS(); }
1d019706d866 LLVM10 anatofuz parents: diff changeset	493 #endif
1d019706d866 LLVM10 anatofuz parents: diff changeset	494
1d019706d866 LLVM10 anatofuz parents: diff changeset	495 // Now host and device code can call MakeStruct().
1d019706d866 LLVM10 anatofuz parents: diff changeset	496
1d019706d866 LLVM10 anatofuz parents: diff changeset	497 Hopefully you don't have to do this sort of thing often.
1d019706d866 LLVM10 anatofuz parents: diff changeset	498
1d019706d866 LLVM10 anatofuz parents: diff changeset	499 Optimizations
1d019706d866 LLVM10 anatofuz parents: diff changeset	500 =============
1d019706d866 LLVM10 anatofuz parents: diff changeset	501
1d019706d866 LLVM10 anatofuz parents: diff changeset	502 Modern CPUs and GPUs are architecturally quite different, so code that's fast
1d019706d866 LLVM10 anatofuz parents: diff changeset	503 on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
1d019706d866 LLVM10 anatofuz parents: diff changeset	504 LLVM to make it generate good GPU code. Among these changes are:
1d019706d866 LLVM10 anatofuz parents: diff changeset	505
1d019706d866 LLVM10 anatofuz parents: diff changeset	506 * `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
1d019706d866 LLVM10 anatofuz parents: diff changeset	507 reduce redundancy within straight-line code.
1d019706d866 LLVM10 anatofuz parents: diff changeset	508
1d019706d866 LLVM10 anatofuz parents: diff changeset	509 * `Aggressive speculative execution
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	510 <https://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	511 -- This is mainly for promoting straight-line scalar optimizations, which are
1d019706d866 LLVM10 anatofuz parents: diff changeset	512 most effective on code along dominator paths.
1d019706d866 LLVM10 anatofuz parents: diff changeset	513
1d019706d866 LLVM10 anatofuz parents: diff changeset	514 * `Memory space inference
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	515 <https://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	516 In PTX, we can operate on pointers that are in a particular "address space"
1d019706d866 LLVM10 anatofuz parents: diff changeset	517 (global, shared, constant, or local), or we can operate on pointers in the
1d019706d866 LLVM10 anatofuz parents: diff changeset	518 "generic" address space, which can point to anything. Operations in a
1d019706d866 LLVM10 anatofuz parents: diff changeset	519 non-generic address space are faster, but pointers in CUDA are not explicitly
1d019706d866 LLVM10 anatofuz parents: diff changeset	520 annotated with their address space, so it's up to LLVM to infer it where
1d019706d866 LLVM10 anatofuz parents: diff changeset	521 possible.
1d019706d866 LLVM10 anatofuz parents: diff changeset	522
1d019706d866 LLVM10 anatofuz parents: diff changeset	523 * `Bypassing 64-bit divides
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	524 <https://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	525 This was an existing optimization that we enabled for the PTX backend.
1d019706d866 LLVM10 anatofuz parents: diff changeset	526
1d019706d866 LLVM10 anatofuz parents: diff changeset	527 64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
1d019706d866 LLVM10 anatofuz parents: diff changeset	528 Many of the 64-bit divides in our benchmarks have a divisor and dividend
1d019706d866 LLVM10 anatofuz parents: diff changeset	529 which fit in 32 bits at runtime. This optimization provides a fast path for
1d019706d866 LLVM10 anatofuz parents: diff changeset	530 this common case.
1d019706d866 LLVM10 anatofuz parents: diff changeset	531
1d019706d866 LLVM10 anatofuz parents: diff changeset	532 * Aggressive loop unrolling and function inlining -- Loop unrolling and
1d019706d866 LLVM10 anatofuz parents: diff changeset	533 function inlining need to be more aggressive for GPUs than for CPUs because
1d019706d866 LLVM10 anatofuz parents: diff changeset	534 control flow transfer on a GPU is more expensive. More aggressive unrolling and
1d019706d866 LLVM10 anatofuz parents: diff changeset	535 inlining also promote other optimizations, such as constant propagation and
1d019706d866 LLVM10 anatofuz parents: diff changeset	536 SROA, which sometimes speed up code by over 10x.
1d019706d866 LLVM10 anatofuz parents: diff changeset	537
1d019706d866 LLVM10 anatofuz parents: diff changeset	538 (Programmers can force unrolling and inlining using clang's `loop unrolling pragmas
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	539 <https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
150 1d019706d866 LLVM10 anatofuz parents: diff changeset	540 and ``__attribute__((always_inline))``.)
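
The 64-bit divide bypass in the list above can be pictured with a hand-written source-level analogue. This is only a sketch of the idea, not the LLVM pass itself; the function name ``div_with_fast_path`` is hypothetical.

```cpp
#include <cstdint>

// Source-level analogue of bypassing a slow 64-bit divide: if both
// operands fit in 32 bits at runtime, use the cheaper 32-bit divide.
uint64_t div_with_fast_path(uint64_t dividend, uint64_t divisor) {
  // OR the operands together; if the upper 32 bits of the result are
  // zero, then neither operand uses its upper 32 bits.
  if (((dividend | divisor) >> 32) == 0) {
    return static_cast<uint32_t>(dividend) / static_cast<uint32_t>(divisor);
  }
  // Slow path: at least one operand needs the full 64 bits.
  return dividend / divisor;
}
```

The extra compare and branch is worthwhile only when the narrow divide is much cheaper than the wide one, which is the situation described above for NVIDIA GPUs.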
1d019706d866 LLVM10 anatofuz parents: diff changeset	541
1d019706d866 LLVM10 anatofuz parents: diff changeset	542 Publication
1d019706d866 LLVM10 anatofuz parents: diff changeset	543 ===========
1d019706d866 LLVM10 anatofuz parents: diff changeset	544
1d019706d866 LLVM10 anatofuz parents: diff changeset	545 The team at Google published a paper in CGO 2016 detailing the optimizations
1d019706d866 LLVM10 anatofuz parents: diff changeset	546 they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
1d019706d866 LLVM10 anatofuz parents: diff changeset	547 The relevant tools are now just vanilla clang/LLVM.
1d019706d866 LLVM10 anatofuz parents: diff changeset	548
1d019706d866 LLVM10 anatofuz parents: diff changeset	549 | `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
1d019706d866 LLVM10 anatofuz parents: diff changeset	550 | Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
1d019706d866 LLVM10 anatofuz parents: diff changeset	551 | Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)
1d019706d866 LLVM10 anatofuz parents: diff changeset	552 |
1d019706d866 LLVM10 anatofuz parents: diff changeset	553 | `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
1d019706d866 LLVM10 anatofuz parents: diff changeset	554 |
1d019706d866 LLVM10 anatofuz parents: diff changeset	555 | `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_
1d019706d866 LLVM10 anatofuz parents: diff changeset	556
1d019706d866 LLVM10 anatofuz parents: diff changeset	557 Obtaining Help
1d019706d866 LLVM10 anatofuz parents: diff changeset	558 ==============
1d019706d866 LLVM10 anatofuz parents: diff changeset	559
1d019706d866 LLVM10 anatofuz parents: diff changeset	560 To obtain help on LLVM in general and its CUDA support, see `the LLVM
173 0572611fdcc8 reorgnization done Shinji KONO <kono@ie.u-ryukyu.ac.jp> parents: 150 diff changeset	561 community <https://llvm.org/docs/#mailing-lists>`_.
