Mercurial > hg > CbC > CbC_llvm
comparison polly/www/documentation/gpgpucodegen.html @ 150:1d019706d866
LLVM10
author | anatofuz |
---|---|
date | Thu, 13 Feb 2020 15:10:13 +0900 |
parents | |
children | 0572611fdcc8 |
comparison
equal
deleted
inserted
replaced
147:c2174574ed3a | 150:1d019706d866 |
---|---|
1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" | |
2 "http://www.w3.org/TR/html4/strict.dtd"> | |
3 <!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ --> | |
4 <html> | |
5 <head> | |
6 <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> | |
7 <title>Polly - GPGPU Code Generation</title> | |
8 <link type="text/css" rel="stylesheet" href="../menu.css"> | |
9 <link type="text/css" rel="stylesheet" href="../content.css"> | |
10 </head> | |
11 <body> | |
12 <div id="box"> | |
13 <!--#include virtual="../menu.html.incl"--> | |
14 <div id="content"> | |
15 <!--*********************************************************************--> | |
16 <h1>Polly - GPGPU Code Generation</h1> | |
17 <!--*********************************************************************--> | |
18 <p><em>WARNING: This project was part of the Google Summer of Code 2012. | |
19 It is currently not finished, but it is in the design and implementation stage. | |
20 The ideas/plans described here may not yet be implemented in Polly and may | |
21 change later on.</em></p> | |
22 | |
23 This project adds GPGPU code generation feature to Polly. | |
24 | |
25 <h2>Objective</h2> | |
26 <p>The overall objective of this GSoC project is to create a preliminary | |
27 implementation of GPGPU code generation for Polly. With this addition, users | |
28 can parallelize some perfectly nested loops with Polly to execute on a | |
29 heterogeneous platform, composed of CPU and GPU.</p> | |
30 <p>There are several successful projects about automatic source-to-source gpu | |
31 code transformation. C-to-CUDA[1] uses the standard Pluto algorithms for | |
32 computing an affine schedule and then applies a wavefront transformation to | |
33 obtain one sequential and n-1 parallel loops. The parallel loops are then | |
34 mapped onto the blocks and threads of GPU. PPCG[2] introduces some advanced | |
35 algorithms which can expose much more parallelism than other methods . And It | |
36 also introduces affine partition heuristics and code generation algorithms | |
37 for locality enhancement in the registers and shared memory.</p> | |
38 <p>Since automatic GPGPU code generation is quite a complex problem and what we | |
39 target is a low-level intermediate representation, LLVM IR, rather than a | |
40 high-level language source, it is important for us to set a proper objective | |
41 as a start step to give a complete solution to GPGPU code generation for LLVM | |
42 IR.</p> | |
43 <p>Firstly, we plan to target two kinds of relatively simple test cases. One is | |
44 comprised of pure parallel and perfectly nested loops, like the following | |
45 code.</p> | |
46 <pre> | |
47 parfor(int i=0 to M) | |
48 parfor(int j=0 to N) | |
49 LoopBody(i, j); | |
50 </pre> | |
51 <p>Another one is that all the loops in it are parallel except the inner-most | |
52 one, just like this:</p> | |
53 <pre> | |
54 parfor(int i=0 to M) | |
55 parfor(int j=0 to N) | |
56 non-parfor(int k=0 to K) | |
57 LoopBody(i, j, k); | |
58 </pre> | |
59 <p>The LoopBody part should be limited to instructions or functions calls | |
60 (intrinsics) which can be handled by LLVM's NVPTX backend.</p> | |
61 <p>On the other hand, we focus on building a preliminary and scalable framework | |
62 of GPGPU code generation for polly. Thus we plan to employ relatively simple | |
63 tiling and mapping algorithms and optimize them later.</p> | |
64 <h2>Work Flow</h2> | |
65 <h3>GPGPU Code Generation In General</h3> | |
66 <p>C-to-CUDA[1] and PPCG[2] propose similar steps to solve the automatic GPGPU | |
67 code generation problem.</p> | |
68 <li>Look for parallel loops.</li> | |
69 <li>Create a polyhedral model from the loops.</li> | |
70 <li>Tile and map the loops to GPU blocks and threads.</li> | |
71 <li>Determine where to place the data.</li> | |
72 <h3>What has been done in Polly</h3> | |
73 <p>Polly has implemented the 1st, 2nd and part of the 3rd of the above steps and | |
74 many other analysis and transformation passes.</p> | |
75 <h3>What to do in Polly</h3> | |
76 <p>Unlike many source-to-source optimizers such as C-to-CUDA and PPCG, Polly is | |
77 a low-level optimizer, which means we can't use a source-level compiler | |
78 (e.g. NVCC) to generate the final assembly for the device. We need manually | |
79 insert device driver API calls to execute the generated kernel assembly | |
80 text.</p> | |
81 <p>In this project, we assume that the device driver library has provided an | |
82 interface to launch kernels in the form of assembly text. Fortunately, most | |
83 of the mainstream GPU vendors provide such a feature in thier products (see | |
84 ptxjit of NVIDIA GPUs and CAL of AMD GPUs). Generally speaking, what we | |
85 are going to do in Polly is:</p> | |
86 <li>Find a way to tile the parallel loops.</li> | |
87 <li>Find a way to extract the loop body and transform it into thread-centric | |
88 parallel code.</li> | |
89 <li>Find a way to store/load the thread-centric code into/from a device module. | |
90 <li>Find a way to pass the target machine information and generate code of the | |
91 device module for the target. | |
92 <li>Find a way to map the tiled loop to GPU blocks and threads.</li> | |
93 <li>Find a way to insert CUDA synchronization operations on-demand. | |
94 <li>Find a way to generate the memory copy operations between a host and a | |
95 device.</li> | |
96 <li>Implement/Wrap a runtime library to serve as the execution engine for the | |
97 generated device code.</li> | |
98 | |
99 <h3>The Work Flow</h3> | |
100 <p>In this section, we assume that the host cpu is X86 and the device is NVIDIA | |
101 CUDA-compatible. we will use the following test case to describe our work | |
102 flow.</p> | |
103 <pre> | |
104 for(i = 0; i < 128; i++) | |
105 for(j = 0; j < 128; j++) | |
106 A[i][j] = i*128 + j; | |
107 </pre> | |
108 <p>The work flow of our code generator is as follows.</p> | |
109 <p>1.We first use Polly's jscop file importer to get a wanted 4-level parallel | |
110 tiled code.</p> | |
111 The "schedule" part of the pre-optimization jscop file is as the following: | |
112 <pre> | |
113 "schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, i0, 0, i1, 0] }" | |
114 </pre> | |
115 The jscop file describing the tiling transformation is: | |
116 <pre> | |
117 "schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, o0, o1, o2, o3]: | |
118 o0 >= 0 and o0 <= 7 and o1 >= 0 and o1 <= 15 and | |
119 o2 >= 0 and o2 <= 7 and o3 >= 0 and o3 <= 15 and | |
120 i0 = 16o0 + o1 and i1 = 16o2 + o3 }" | |
121 </pre> | |
122 We can test the schedule with the following command line. | |
123 <pre> | |
124 opt -load /path/to/polly/build/LLVMPolly.so -basicaa -polly-import-jscop | |
125 -polly-ast -analyze -q ./test.ll | |
126 -polly-import-jscop-postfix=transformed+gpu | |
127 </pre> | |
128 The output of this schedule is: | |
129 <pre> | |
130 for (c2=0;c2<=7;c2++) { | |
131 for (c3=0;c3<=15;c3++) { | |
132 for (c4=0;c4<=7;c4++) { | |
133 for (c5=0;c5<=15;c5++) { | |
134 Stmt_for_body3(16*c2+c3,16*c4+c5); | |
135 } | |
136 } | |
137 } | |
138 } | |
139 </pre> | |
140 Now we get a 4-dimensional parallel loops with a single SCoP statement in it. | |
141 <p>2.We then extract the loop body (or the inner-most non-parallel loop) into a | |
142 LLVM function, tagging it with PTX_Kernel call convention.</p> | |
143 <p>3.We extract the PTX_kernel function into a temporary module, set the target | |
144 triple (e.g. nvptx64-unknown-linux) for the module, transform the temporary | |
145 module into a string, store it in the original module and erase the | |
146 PTX_kernel function.</p> | |
147 <p>4.We replace the loops with their GPGPU counterpart. The GPGPU part of code | |
148 is composed of a call to the llvm.codegen intrinsic and function calls to our | |
149 GPU runtime library.</p> | |
150 <p>5.Finally, we generate the executable program with <em>llc</em> or run the | |
151 optimized LLVM IRs with a JIT compiler like <em>lli</em>.</p> | |
152 <h2>Usage</h2> | |
153 <p>1. Apply the llvm.codegen intrinsic patch to LLVM code base.</p> | |
154 <pre>cd /path/to/llvm/source | |
155 git am /path/to/polly/source/utils/0001-Add-llvm.codegen-intrinsic.patch</pre> | |
156 <p>2. Build the test case.</p> | |
157 <pre>/path/to/polly/source/test/create_ll.sh test.c</pre> | |
158 <p>3. Get and edit the jscop file (take function "gpu_codegen" as an example). | |
159 </p> | |
160 <pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa | |
161 -polly-export-jscop ./test.ll | |
162 cp gpu_codegen___%for.cond---%for.end8.jscop | |
163 gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu | |
164 vi gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu</pre> | |
165 <p><em>(Please refer to section "The Work Flow" on how to edit the "schedule" | |
166 part of a statement)</em></p> | |
167 <p>4. Optimize the code with GPGPU code generation.</p> | |
168 <pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basicaa | |
169 -polly-import-jscop-postfix=transformed+gpu -enable-polly-gpgpu | |
170 -polly-gpgpu-triple=nvptx64-unknown-unknown -polly-codegen ./test.ll -S | |
171 -o test.gpued.ll</pre> | |
172 <p>5. Build the final assembly and executable.</p> | |
173 <pre>llc test.gpued.ll -o test.s | |
174 gcc test.s -lGPURuntime -o test</pre> | |
175 <p><em>(Please make sure that LD_LIBRARY_PATH is set properly so that | |
176 /path/to/polly/build/lib/libGPURuntime.so is visible to gcc.)</em></p> | |
177 <h2>TODO List</h2> | |
178 | |
179 <table class="wikitable" cellpadding="2"> | |
180 <tbody> | |
181 <tr style="background: rgb(239, 239, 239)"> | |
182 <th width="400px"> Tasks</th> | |
183 <th width="150px"> Status </th> | |
184 <th> Owner </th> | |
185 </tr> | |
186 <tr> | |
187 <th align="left">Tiling the Parallel Loops with An External Jscop File</th> | |
188 <td align="center" class='open'>Open, In Design</td> | |
189 <td>Yabin Hu</td> | |
190 </tr> | |
191 <tr> | |
192 <th align="left">GPU Runtime Library Implementation</th> | |
193 <td align="center" class='inprogress'>Coding Finished, In Reviewing</td> | |
194 <td></td> | |
195 </tr> | |
196 <tr> | |
197 <th align="left">llvm.codegen Intrinsic Implementation</th> | |
198 <td align="center" class='inprogress'>Codeing Finished, To Be Reviewed</td> | |
199 <td></td> | |
200 </tr> | |
201 <tr> | |
202 <th align="left">Code Generation For Host</th> | |
203 <td align="center" class='inprogress'>50% Done</td> | |
204 <td></td> | |
205 </tr> | |
206 | |
207 </tbody></table> | |
208 | |
209 <h2>References</h2> | |
210 <li type="1" value="1"> | |
211 <em>Automatic C-to-CUDA Code Generation for Affine Programs. </em><br /> | |
212 Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan.<br /> | |
213 International Conference on Compiler Construction (CC) 2010.<br /> | |
214 </li> | |
215 <li type="1"><em>PPCG Project</em><br /> | |
216 <a href="http://freecode.com/projects/ppcg">http://freecode.com/projects/ppcg | |
217 </a></li> | |
218 <li type="1"> | |
219 <em>Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the | |
220 Answer. </em><br /> | |
221 Chris Gregg and Kim Hazelwood<br /> | |
222 International Symposium on Performance Analysis of Systems and Software | |
223 (ISPASS) 2011. | |
224 </li> | |
225 <p></p> | |
226 </div> | |
227 </div> | |
228 </body> | |
229 </html> |