diff docs/CompileCudaWithLLVM.rst @ 120:1172e4bd9c6f
update 4.0.0

author:   mir3636
date:     Fri, 25 Nov 2016 19:14:25 +0900
parents:  7d135dc70f03
children: 803732b1fca8
=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported in llvm 3.9, but it's still in active development, so we
recommend you `compile clang/LLVM from HEAD
<http://llvm.org/docs/GettingStarted.html>`_.

Before you build CUDA code, you'll need to have installed the appropriate
driver for your nvidia GPU and the CUDA SDK. See `NVIDIA's CUDA installation
guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_
for details. Note that clang `does not support
<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
by many Linux package managers; you probably need to install nvidia's package.

You will need CUDA 7.0, 7.5, or 8.0 to compile with clang.

CUDA compilation is supported on Linux, and on MacOS as of XXXX-XX-XX. Windows
support is planned but not yet in place.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)
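If you'd rather not fetch the gist, the following minimal sketch should serve
in its place. It is adapted from the example that previously appeared in this
document, with the ``checkCudaErrors`` calls from the CUDA samples'
``helper_cuda.h`` elided so that it has no dependency on the samples; real
code should check the return value of each CUDA call.

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch the kernel: one block of kDataLen threads.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data back to the host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
  }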
To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the root directory where you installed the CUDA
  SDK. Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.)

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple
  archs.

The ``-L`` and ``-l`` flags only need to be passed when linking. When
compiling, you may also need to pass ``--cuda-path=/path/to/cuda`` if you
didn't install the CUDA SDK into ``/usr/local/cuda``, ``/usr/local/cuda-7.0``,
or ``/usr/local/cuda-7.5``.
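For example, a separate compile and link against an SDK installed in a
non-default location might look like the following sketch (the
``/opt/cuda-8.0`` path and the ``sm_60`` arch are illustrative assumptions;
substitute your own):

.. code-block:: console

  $ clang++ -c axpy.cu -o axpy.o --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_60 \
      --cuda-path=/opt/cuda-8.0
  $ clang++ axpy.o -o axpy -L/opt/cuda-8.0/lib64 \
      -lcudart_static -ldl -lrt -pthread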
Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
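As a sketch, a build that trades strict IEEE behavior for speed might enable
the latter two flags explicitly (reusing the ``axpy.cu`` example from above):

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
      -fcuda-flush-denormals-to-zero -fcuda-approx-transcendentals \
      -L<CUDA install path>/<lib64 or lib> -lcudart_static -ldl -lrt -pthread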
Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc, ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);      // nvcc - ok
    std::sinf(0.); // nvcc - no such function
  }

``std::complex``
----------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min`` and
``std::max``) become constexpr. You can therefore use these in device code
when compiling with clang, as in the sketch below.
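A minimal sketch, assuming you compile with ``-std=c++14`` (the kernel name is
made up):

.. code-block:: c++

  #include <algorithm>

  // Clamp each element to [lo, hi]. std::min and std::max are constexpr in
  // C++14, so clang can call them from device code.
  __global__ void clamp(float* x, float lo, float hi) {
    int i = threadIdx.x;
    x[i] = std::min(std::max(x[i], lo), hi);
  }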
Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``, as sketched below.
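For completeness, here is a sketch of the corresponding checks:

.. code-block:: c++

  #if defined(__CUDACC__)
  // CUDA compilation, by either clang or nvcc.
  #endif

  #if defined(__NVCC__)
  // nvcc compilation specifically (clang may still be involved as a subtool,
  // e.g. as nvcc's host preprocessor).
  #endif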
Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip it and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses *split compilation*,
which works roughly as follows:

* Run a preprocessor over the input ``.cu`` file to split it into two source
  files: ``H``, containing source code for the host, and ``D``, containing
  source code for the device.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
    ``P_arch``.

  * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
    ``S_arch``, containing GPU machine code (SASS) for ``arch``.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single "fat binary" file, ``F``.

* Compile ``H`` using an external host compiler (gcc, clang, or whatever you
  like). ``F`` is packaged up into a header file which is force-included into
  ``H``; nvcc generates code that calls into this header to e.g. launch
  kernels.

clang uses *merged parsing*. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the host at this time.

    The output of this step is a ``ptx`` file, ``P_arch``.

  * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
    nvcc, clang always generates SASS code.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single fat binary file, ``F``.

* Compile the input ``.cu`` file for host, using clang. ``__device__`` code is
  parsed and must be semantically correct, even though we're not generating
  code for the device at this time.

  ``F`` is passed to this compilation, and clang includes it in a special ELF
  section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.
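You can observe the two halves of merged parsing yourself using clang's
``--cuda-device-only`` and ``--cuda-host-only`` driver flags. A sketch, where
the output file names are illustrative:

.. code-block:: console

  $ clang++ --cuda-device-only -S --cuda-gpu-arch=sm_35 axpy.cu -o axpy.ptx
  $ clang++ --cuda-host-only -S axpy.cu -o axpy_host.s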
Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

* D functions prefer to call other Ds. HDs are given lower priority.

* Similarly, H functions prefer to call other Hs, or ``__global__`` functions
  (with equal priority). HDs are given lower priority.

* HD functions prefer to call other HDs.

  When compiling for device, HDs will call Ds with lower priority than HD, and
  will call Hs with still lower priority. If an HD is forced to call an H, the
  program is malformed if we emit code for this HD function. We call this the
  "wrong-side rule"; see the example below.

  The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo(); // calls H overload
    bar(); // calls H overload
  }

  __device__ void test_device() {
    foo(); // calls D overload
    bar(); // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo(); // calls H overload when compiling for host, otherwise D overload
    bar(); // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // does not generate an error. inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); } // no error
  inline __host__ __device__ void inline_hd2() { host_only(); } // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device. Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: They aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device. What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes. However, destructors cannot be overloaded.

Using a Different Class on Host/Device
--------------------------------------

Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo(). The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif

We posit that you don't really want to have classes with different members on H
and D. For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so it would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes. If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes. Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
    #ifndef __CUDA_ARCH__
      __host__ HostS MakeStruct() { return HostS(); }
    #else
      __device__ DeviceS MakeStruct() { return DeviceS(); }
    #endif
  #else
    __host__ HostS MakeStruct() { return HostS(); }
    __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
LLVM to make it generate good GPU code. Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which
  are most effective on code along dominator paths.

* `Memory space inference
  <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything. Operations in a
  non-generic address space are faster, but pointers in CUDA are not
  explicitly annotated with their address space, so it's up to LLVM to infer
  it where possible.

* `Bypassing 64-bit divides
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32-bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``, as sketched below.)
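For example, a minimal sketch (the helper and kernel names are made up):

.. code-block:: c++

  // Force the helper to be inlined into every caller.
  __device__ inline __attribute__((always_inline))
  float scale(float x, float a) { return a * x; }

  __global__ void scale16(float* x, float a) {
    // Ask clang to fully unroll this fixed trip-count loop.
    #pragma unroll
    for (int i = 0; i < 16; ++i)
      x[i] = scale(x[i], a);
  }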
Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
The relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.