| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| <!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ --> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <title>Polly - GPGPU Code Generation</title> |
| <link type="text/css" rel="stylesheet" href="../menu.css"> |
| <link type="text/css" rel="stylesheet" href="../content.css"> |
| </head> |
| <body> |
| <div id="box"> |
| <!--#include virtual="../menu.html.incl"--> |
| <div id="content"> |
| <!--*********************************************************************--> |
| <h1>Polly - GPGPU Code Generation</h1> |
| <!--*********************************************************************--> |
| <p><em>WARNING: This project was part of the Google Summer of Code 2012. |
| It is currently not finished, but it is in the design and implementation stage. |
| The ideas/plans described here may not yet be implemented in Polly and may |
| change later on.</em></p> |
| |
| This project adds GPGPU code generation feature to Polly. |
| |
| <h2>Objective</h2> |
| <p>The overall objective of this GSoC project is to create a preliminary |
| implementation of GPGPU code generation for Polly. With this addition, users |
| can parallelize some perfectly nested loops with Polly to execute on a |
| heterogeneous platform, composed of CPU and GPU.</p> |
| <p>There are several successful projects about automatic source-to-source gpu |
| code transformation. C-to-CUDA[1] uses the standard Pluto algorithms for |
| computing an affine schedule and then applies a wavefront transformation to |
| obtain one sequential and n-1 parallel loops. The parallel loops are then |
| mapped onto the blocks and threads of GPU. PPCG[2] introduces some advanced |
| algorithms which can expose much more parallelism than other methods . And It |
| also introduces affine partition heuristics and code generation algorithms |
| for locality enhancement in the registers and shared memory.</p> |
| <p>Since automatic GPGPU code generation is quite a complex problem and what we |
| target is a low-level intermediate representation, LLVM IR, rather than a |
| high-level language source, it is important for us to set a proper objective |
| as a start step to give a complete solution to GPGPU code generation for LLVM |
| IR.</p> |
| <p>Firstly, we plan to target two kinds of relatively simple test cases. One is |
| comprised of pure parallel and perfectly nested loops, like the following |
| code.</p> |
| <pre> |
| parfor(int i=0 to M) |
| parfor(int j=0 to N) |
| LoopBody(i, j); |
| </pre> |
| <p>Another one is that all the loops in it are parallel except the inner-most |
| one, just like this:</p> |
| <pre> |
| parfor(int i=0 to M) |
| parfor(int j=0 to N) |
| non-parfor(int k=0 to K) |
| LoopBody(i, j, k); |
| </pre> |
| <p>The LoopBody part should be limited to instructions or functions calls |
| (intrinsics) which can be handled by LLVM's NVPTX backend.</p> |
| <p>On the other hand, we focus on building a preliminary and scalable framework |
| of GPGPU code generation for polly. Thus we plan to employ relatively simple |
| tiling and mapping algorithms and optimize them later.</p> |
| <h2>Work Flow</h2> |
| <h3>GPGPU Code Generation In General</h3> |
| <p>C-to-CUDA[1] and PPCG[2] propose similar steps to solve the automatic GPGPU |
| code generation problem.</p> |
| <li>Look for parallel loops.</li> |
| <li>Create a polyhedral model from the loops.</li> |
| <li>Tile and map the loops to GPU blocks and threads.</li> |
| <li>Determine where to place the data.</li> |
| <h3>What has been done in Polly</h3> |
| <p>Polly has implemented the 1st, 2nd and part of the 3rd of the above steps and |
| many other analysis and transformation passes.</p> |
| <h3>What to do in Polly</h3> |
| <p>Unlike many source-to-source optimizers such as C-to-CUDA and PPCG, Polly is |
| a low-level optimizer, which means we can't use a source-level compiler |
| (e.g. NVCC) to generate the final assembly for the device. We need manually |
| insert device driver API calls to execute the generated kernel assembly |
| text.</p> |
| <p>In this project, we assume that the device driver library has provided an |
| interface to launch kernels in the form of assembly text. Fortunately, most |
| of the mainstream GPU vendors provide such a feature in their products (see |
| ptxjit of NVIDIA GPUs and CAL of AMD GPUs). Generally speaking, what we |
| are going to do in Polly is:</p> |
| <li>Find a way to tile the parallel loops.</li> |
| <li>Find a way to extract the loop body and transform it into thread-centric |
| parallel code.</li> |
| <li>Find a way to store/load the thread-centric code into/from a device module. |
| <li>Find a way to pass the target machine information and generate code of the |
| device module for the target. |
| <li>Find a way to map the tiled loop to GPU blocks and threads.</li> |
| <li>Find a way to insert CUDA synchronization operations on-demand. |
| <li>Find a way to generate the memory copy operations between a host and a |
| device.</li> |
| <li>Implement/Wrap a runtime library to serve as the execution engine for the |
| generated device code.</li> |
| |
| <h3>The Work Flow</h3> |
| <p>In this section, we assume that the host cpu is X86 and the device is NVIDIA |
| CUDA-compatible. we will use the following test case to describe our work |
| flow.</p> |
| <pre> |
| for(i = 0; i < 128; i++) |
| for(j = 0; j < 128; j++) |
| A[i][j] = i*128 + j; |
| </pre> |
| <p>The work flow of our code generator is as follows.</p> |
| <p>1.We first use Polly's jscop file importer to get a wanted 4-level parallel |
| tiled code.</p> |
| The "schedule" part of the pre-optimization jscop file is as the following: |
| <pre> |
| "schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, i0, 0, i1, 0] }" |
| </pre> |
| The jscop file describing the tiling transformation is: |
| <pre> |
| "schedule" : "{ Stmt_for_body3[i0, i1] -> schedule[0, o0, o1, o2, o3]: |
| o0 >= 0 and o0 <= 7 and o1 >= 0 and o1 <= 15 and |
| o2 >= 0 and o2 <= 7 and o3 >= 0 and o3 <= 15 and |
| i0 = 16o0 + o1 and i1 = 16o2 + o3 }" |
| </pre> |
| We can test the schedule with the following command line. |
| <pre> |
| opt -load /path/to/polly/build/LLVMPolly.so -basic-aa -polly-import-jscop |
| -polly-ast -analyze -q ./test.ll |
| -polly-import-jscop-postfix=transformed+gpu |
| </pre> |
| The output of this schedule is: |
| <pre> |
| for (c2=0;c2<=7;c2++) { |
| for (c3=0;c3<=15;c3++) { |
| for (c4=0;c4<=7;c4++) { |
| for (c5=0;c5<=15;c5++) { |
| Stmt_for_body3(16*c2+c3,16*c4+c5); |
| } |
| } |
| } |
| } |
| </pre> |
| Now we get a 4-dimensional parallel loops with a single SCoP statement in it. |
| <p>2.We then extract the loop body (or the inner-most non-parallel loop) into a |
| LLVM function, tagging it with PTX_Kernel call convention.</p> |
| <p>3.We extract the PTX_kernel function into a temporary module, set the target |
| triple (e.g. nvptx64-unknown-linux) for the module, transform the temporary |
| module into a string, store it in the original module and erase the |
| PTX_kernel function.</p> |
| <p>4.We replace the loops with their GPGPU counterpart. The GPGPU part of code |
| is composed of a call to the llvm.codegen intrinsic and function calls to our |
| GPU runtime library.</p> |
| <p>5.Finally, we generate the executable program with <em>llc</em> or run the |
| optimized LLVM IRs with a JIT compiler like <em>lli</em>.</p> |
| <h2>Usage</h2> |
| <p>1. Apply the llvm.codegen intrinsic patch to LLVM code base.</p> |
| <pre>cd /path/to/llvm/source |
| git am /path/to/polly/source/utils/0001-Add-llvm.codegen-intrinsic.patch</pre> |
| <p>2. Build the test case.</p> |
| <pre>/path/to/polly/source/test/create_ll.sh test.c</pre> |
| <p>3. Get and edit the jscop file (take function "gpu_codegen" as an example). |
| </p> |
| <pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basic-aa |
| -polly-export-jscop ./test.ll |
| cp gpu_codegen___%for.cond---%for.end8.jscop |
| gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu |
| vi gpu_codegen___%for.cond---%for.end8.jscop.transformed+gpu</pre> |
| <p><em>(Please refer to section "The Work Flow" on how to edit the "schedule" |
| part of a statement)</em></p> |
| <p>4. Optimize the code with GPGPU code generation.</p> |
| <pre>opt -load /path/to/polly/build/lib/LLVMPolly.so -basic-aa |
| -polly-import-jscop-postfix=transformed+gpu -enable-polly-gpgpu |
| -polly-gpgpu-triple=nvptx64-unknown-unknown -polly-codegen ./test.ll -S |
| -o test.gpued.ll</pre> |
| <p>5. Build the final assembly and executable.</p> |
| <pre>llc test.gpued.ll -o test.s |
| gcc test.s -lGPURuntime -o test</pre> |
| <p><em>(Please make sure that LD_LIBRARY_PATH is set properly so that |
| /path/to/polly/build/lib/libGPURuntime.so is visible to gcc.)</em></p> |
| <h2>TODO List</h2> |
| |
| <table class="wikitable" cellpadding="2"> |
| <tbody> |
| <tr style="background: rgb(239, 239, 239)"> |
| <th width="400px"> Tasks</th> |
| <th width="150px"> Status </th> |
| <th> Owner </th> |
| </tr> |
| <tr> |
| <th align="left">Tiling the Parallel Loops with An External Jscop File</th> |
| <td align="center" class='open'>Open, In Design</td> |
| <td>Yabin Hu</td> |
| </tr> |
| <tr> |
| <th align="left">GPU Runtime Library Implementation</th> |
| <td align="center" class='inprogress'>Coding Finished, In Reviewing</td> |
| <td></td> |
| </tr> |
| <tr> |
| <th align="left">llvm.codegen Intrinsic Implementation</th> |
| <td align="center" class='inprogress'>Codeing Finished, To Be Reviewed</td> |
| <td></td> |
| </tr> |
| <tr> |
| <th align="left">Code Generation For Host</th> |
| <td align="center" class='inprogress'>50% Done</td> |
| <td></td> |
| </tr> |
| |
| </tbody></table> |
| |
| <h2>References</h2> |
| <li type="1" value="1"> |
| <em>Automatic C-to-CUDA Code Generation for Affine Programs. </em><br /> |
| Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan.<br /> |
| International Conference on Compiler Construction (CC) 2010.<br /> |
| </li> |
| <li type="1"><em>PPCG Project</em><br /> |
| <a href="http://freecode.com/projects/ppcg">http://freecode.com/projects/ppcg |
| </a></li> |
| <li type="1"> |
| <em>Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the |
| Answer. </em><br /> |
| Chris Gregg and Kim Hazelwood<br /> |
| International Symposium on Performance Analysis of Systems and Software |
| (ISPASS) 2011. |
| </li> |
| <p></p> |
| </div> |
| </div> |
| </body> |
| </html> |