===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document contains the user guides and the internals of compiling CUDA
C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
and developers who want to improve LLVM for GPUs. This document assumes a basic
familiarity with CUDA. Information about CUDA programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Build LLVM with CUDA Support
===================================

Below is a quick summary of downloading and building LLVM. Consult the `Getting
Started <http://llvm.org/docs/GettingStarted.html>`_ page for more details on
setting up LLVM.

#. Checkout LLVM

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Checkout Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Configure and build LLVM and Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ mkdir build
     $ cd build
     $ cmake [options] ..
     $ make
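
For example, a release build could be configured as follows. The two variables
shown are standard CMake options and are given only for illustration; any other
CMake options work as well:

.. code-block:: console

  $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local ..
  $ make
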
How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA Installation Guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.

Suppose you want to compile and run the following CUDA program (``axpy.cu``)
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <helper_cuda.h> // for checkCudaErrors

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
                               cudaMemcpyHostToDevice));

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    checkCudaErrors(cudaDeviceSynchronize());
    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
                               cudaMemcpyDeviceToHost));

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    checkCudaErrors(cudaDeviceReset());
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
samples installed for this example. ``<CUDA install path>`` is the root
directory where you installed the CUDA SDK, typically ``/usr/local/cuda``.
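
If you do not have the samples installed, a minimal stand-in for
``checkCudaErrors`` along the following lines suffices for this example (the
macro below is an illustrative sketch, not part of the CUDA toolkit):

.. code-block:: c++

  #include <cstdio>
  #include <cstdlib>

  // Hypothetical replacement for the samples' checkCudaErrors: print the
  // error string and abort if a CUDA runtime call fails.
  #define checkCudaErrors(call)                                     \
    do {                                                            \
      cudaError_t err = (call);                                     \
      if (err != cudaSuccess) {                                     \
        std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                     cudaGetErrorString(err), __FILE__, __LINE__);  \
        std::exit(EXIT_FAILURE);                                    \
      }                                                             \
    } while (0)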

Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction, out-of-order execution, and
superscalar issue, whereas a typical GPU has none of these. Due to such
differences, an optimization pipeline well-tuned for CPUs may not be suitable
for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code (an example of such redundancy is sketched after this
  list). Details can be found in the `design document for straight-line scalar
  optimizations <https://goo.gl/4Rb9As>`_.

* **Inferring memory spaces**. `This optimization
  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it (sketched below). Details can be found in
  the `design document for memory space inference <https://goo.gl/5wH2Ct>`_.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling
  and inlining (their usage is sketched below).

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  is mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths (see the example below).

* **Memory-space alias analysis**. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias (illustrated below). It has yet to be
  integrated into the new alias analysis infrastructure; the new
  infrastructure does not run target-specific alias analysis.

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ is
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs because the hardware lacks a divide unit. Many of
  the 64-bit divides in our benchmarks have a divisor and dividend which fit
  in 32 bits at runtime. This optimization provides a fast path for this
  common case (the runtime check is sketched below).
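
As an illustration of straight-line scalar optimizations, consider the
hypothetical kernel below (a minimal sketch; the kernel and variable names are
made up for the example). The three addresses differ only by a constant, so
straight-line strength reduction can derive the second and third from the
first with cheap additions instead of independent multiply-adds:

.. code-block:: c++

  __global__ void gather3(const float* p, int i, float* out) {
    // &p[i], &p[i + 1], and &p[i + 2] all share the computation p + i;
    // strength reduction rewrites the last two address computations as
    // (p + i) + 1 and (p + i) + 2.
    out[0] = p[i];
    out[1] = p[i + 1];
    out[2] = p[i + 2];
  }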
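
For memory-space inference, a minimal sketch (again with a hypothetical
kernel) looks like this. The pointer ``p`` has the generic type ``float*`` in
the source, but it provably points into shared memory, so the backend can emit
the faster shared-memory load and store instructions for it instead of generic
ones:

.. code-block:: c++

  __global__ void scale_via_shared(const float* in, float* out, float a) {
    __shared__ float smem[128];
    float* p = smem;                   // generic pointer into shared memory
    p[threadIdx.x] = in[threadIdx.x];  // inferred: shared-memory store
    __syncthreads();
    out[threadIdx.x] = a * p[threadIdx.x];  // inferred: shared-memory load
  }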
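
The unrolling pragma and inlining attribute are applied as in this
illustrative kernel (the names are made up for the example):

.. code-block:: c++

  __device__ __attribute__((always_inline)) float madd(float a, float b,
                                                       float c) {
    return a * b + c;
  }

  __global__ void dot8(const float* x, const float* y, float* out) {
    float acc = 0.0f;
  #pragma unroll
    for (int i = 0; i < 8; ++i)  // fully unrolled; each madd call inlined
      acc = madd(x[threadIdx.x * 8 + i], y[threadIdx.x * 8 + i], acc);
    out[threadIdx.x] = acc;
  }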
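
Speculative execution helps because code guarded by a branch is not
straight-line code. Conceptually (this sketch is not the exact transformation
the pass performs on the IR), a cheap guarded computation is hoisted above the
branch, which then collapses to a select:

.. code-block:: c++

  __device__ int speculate(bool pred, int a, int b, int x) {
    // Original form: the add executes only when pred is true.
    //   if (pred) x = a + b;
    // After speculation, the add runs unconditionally and the branch
    // becomes a select, exposing straight-line code to later passes.
    int t = a + b;
    return pred ? t : x;
  }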
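
Memory-space alias analysis supplies facts like the one in this hypothetical
kernel: the store targets shared memory while the loads read global memory, so
they cannot alias, and the compiler may forward the first loaded value instead
of reloading it:

.. code-block:: c++

  __global__ void no_alias(float* g, float* out) {
    __shared__ float tile[128];
    float v = g[threadIdx.x];   // load from global memory
    tile[threadIdx.x] = v;      // store to shared memory: cannot modify *g
    out[threadIdx.x] = v + g[threadIdx.x];  // reload can be folded to v
  }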
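
Finally, the divide bypass conceptually wraps each 64-bit divide in a runtime
check along these lines (a plain C++ sketch, not the actual code the pass
emits):

.. code-block:: c++

  #include <cstdint>

  std::uint64_t div64(std::uint64_t a, std::uint64_t b) {
    if (((a | b) >> 32) == 0) {
      // Both operands fit in 32 bits: take the fast 32-bit divide.
      return static_cast<std::uint32_t>(a) / static_cast<std::uint32_t>(b);
    }
    return a / b;  // slow full 64-bit divide
  }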