| <!--#include virtual="../../header.incl" --> |
| |
| <div class="www_sectiontitle">Third LLVM Performance Workshop at CGO</div> |
| |
| <ul> |
| <li><b>What</b>: Third LLVM Performance Workshop at CGO</li> |
| <li><b>When</b>: <b>Sunday February 17th</b>, 2019</li> |
| <li><b>Where</b>: <b>Georgetown University Room</b>, Washington DC, USA</li> |
| </ul> |
| |
| <p> |
| An LLVM Performance Workshop will be held at CGO 2019. The workshop |
| is co-located with CC, HPCA, and PPoPP. It takes place at <a |
| href="http://cgo.org/cgo2019/venue/">Marriott Marquis</a> |
in Washington DC.
</p>

<p>
If you are interested in attending the workshop, please register at the
<a href="http://cgo.org/cgo2019/workshops.html">CGO website</a>.
</p>
| |
| <div class="www_sectiontitle">Preliminary Schedule</div> |
| <p> |
| <table width="100%"> |
| <tr><td><b>Time</b></td> <td><b>Room</b></td> <td><b>Speaker</b></td> <td><b>Title</b></td> <td> </td></tr> |
| <tr> |
| <td>9:00</td> |
| <td>tba</td> |
| <td>Joel E. Denny</td> |
| <td>Clacc: Translating OpenACC to OpenMP in Clang</td> |
| <td><a href="#jed">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>9:40</td> |
| <td>tba</td> |
| <td>Ayal Zaks</td> |
| <td>Tiling Loops for Scratch-Pad Memories</td> |
| <td><a href="#az">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>10:20-10:40</td> |
| <td> </td> |
| <td colspan=3>Break</td> |
| </tr> |
| <tr> |
| <td>10:40</td> |
| <td>tba</td> |
| <td>Brian Homerding</td> |
| <td>Enabling math function call optimization for DOE proxy applications</td> |
| <td><a href="#bh">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>11:20</td> |
| <td>tba</td> |
| <td>Alexandru Susu</td> |
| <td>Emulating Arithmetic Operations with LLVM's Instruction Selection Pass</td> |
| <td><a href="#as">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>12:00-13:30</td> |
| <td> </td> |
| <td colspan=3>Lunch</td> |
| </tr> |
| <tr> |
| <td>13:40</td> |
| <td>tba</td> |
| <td>Simon Moll</td> |
| <td>Multi-dimensional Vectorization in LLVM</td> |
| <td><a href="#sm">[Abstract]</a> |
| </td> |
| </tr> |
| <tr> |
| <td>14:20</td> |
| <td>tba</td> |
| <td>Johannes Doerfert</td> |
| <td>Performance Gap Exploration with LLVM</td> |
| <td><a href="#jd">[Abstract]</a> |
| </td> |
| </tr> |
| <tr> |
| <td>15:00-15:20</td> |
| <td> </td> |
| <td colspan=3>Break</td> |
| </tr> |
| <tr> |
| <td>15:20</td> |
| <td>tba</td> |
| <td> </td> |
<td>LLVM Q&amp;A Panel: <b>Questions Welcome</b></td>
| <td> </td> |
| </tr> |
| <tr> |
| <td>16:00</td> |
| <td> </td> |
| <td colspan=3>Workshop ends.</td> |
| </tr> |
| </table> |
| </p> |
| |
| |
| <div class="www_sectiontitle">Abstracts</div> |
| <p> |
| <ul> |
| <li> <a id="jed"><b>Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter</b>: Clacc: Translating OpenACC to OpenMP in Clang</a> |
| <p> |
| |
| OpenACC was launched in 2010 as a portable programming model for heterogeneous |
| accelerators. Although various implementations already exist, no extensible, |
| open-source, production-quality compiler support is available to the community. |
| This deficiency poses a serious risk for HPC application developers targeting |
| GPUs and other accelerators, and it limits experimentation and progress for the |
| OpenACC specification. To address this deficiency, Clacc is a recent effort |
| funded by the US Exascale Computing Project to develop production OpenACC |
| compiler support for Clang and LLVM. A key feature of the Clacc design is to |
| translate OpenACC to OpenMP to build on Clang's existing OpenMP compiler and |
| runtime support. In this talk, we describe the Clacc goals and design. We |
| also describe the challenges that we have encountered so far in our prototyping |
| efforts, and we present some early performance results. |
| |
| </p> |
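<p>
A minimal sketch of the kind of prescriptive source-to-source mapping the talk
describes. The function names are illustrative and the clause choice is
simplified; the actual Clacc translation also handles data clauses,
reductions, and gang/worker/vector mappings.
</p>

```c
/* OpenACC input: one combined parallel-loop construct. */
void saxpy_acc(int n, float a, const float *x, float *y) {
  #pragma acc parallel loop
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

/* A corresponding OpenMP translation (clauses simplified): the OpenACC
   construct is mapped onto OpenMP's target offloading constructs so that
   Clang's existing OpenMP support can compile and run it. */
void saxpy_omp(int n, float a, const float *x, float *y) {
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```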
| </li> |
| |
| <li> <a id="az"><b>Ayal Zaks, Michael Zuckerman, and Dorit Nuzman</b>: Tiling Loops for Scratch-Pad Memories</a> |
| <p> |
| |
| Tiling a loop is a well-known code transformation that helps optimize temporal |
| locality. Tiling is important for systems that have caches in order to achieve |
| high performance. For systems that are based on scratch-pad memories or |
| software-managed caches, tiling is vital in order for code to be functional. |
| Furthermore, due to the high overhead of transferring data between main memory |
| and scratch-pad memory, it is desirable to tile several loops together. Lastly, |
| if such data transfers can be executed asynchronously and in parallel to |
| processing the data in the scratch-pad memories, careful scheduling of the |
| transfers and double-buffering of the data are desired in order to hide data |
| transfer overheads. In this work we show how multiple loops can be tiled |
| together in order to execute them efficiently on systems with scratch-pad |
| memories. |
| |
| </p> |
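<p>
A minimal single-loop sketch of the transformation, assuming a scratch-pad
that holds <tt>TILE</tt> elements (the buffer and the <tt>memcpy</tt> calls
stand in for scratch-pad memory and DMA transfers; the talk's contribution is
tiling several loops together and overlapping the transfers, which this
sketch omits):
</p>

```c
#include <string.h>

#define TILE 64

/* Process n elements in tiles that fit the scratch-pad: copy a tile in,
   compute on it locally, copy the result back out. */
void scale_tiled(const int *in, int *out, int n, int factor) {
  int spad[TILE];                              /* stand-in for scratch-pad */
  for (int t = 0; t < n; t += TILE) {
    int len = (n - t < TILE) ? n - t : TILE;   /* last tile may be short */
    memcpy(spad, in + t, len * sizeof(int));   /* modeled DMA-in */
    for (int i = 0; i < len; ++i)              /* compute on local memory */
      spad[i] *= factor;
    memcpy(out + t, spad, len * sizeof(int));  /* modeled DMA-out */
  }
}
```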
| </li> |
| |
| <li> <a id="bh"><b>Brian Homerding</b>: Enabling math function call optimization for DOE proxy applications</a> |
| <p> |
| |
| The US Department of Energy proxy applications are simplified applications that |
| are representative of the important code for various scientific computing |
workloads. Our performance analysis work on these proxy applications has
revealed some areas where Clang can improve when compared to GCC and vendor
compilers. Among these is the limited ability to apply optimizations to math
function calls when we care about errno. This talk will discuss modeling the
memory behavior of math functions using function attributes in order to enable
these optimizations, along with our subsequent work to extend the
attributes&rsquo; coverage and use.
| |
| </p> |
| </li> |
| |
| <li> <a id="as"><b>Alexandru Susu</b>: Emulating Arithmetic Operations with LLVM's Instruction Selection Pass</a> |
| <p> |
| |
The Connex-S wide research vector processor has a simple design with 16-bit
integer lanes, since many embedded applications can make good use of narrow
integer types.
| |
For completeness, however, our back end for Connex-S needs to lower code that
efficiently emulates arithmetic operations for non-native types such as 32-bit
integer and 16-bit floating point. To simplify the work of the compiler writer,
we devise a method to generate the code that lowers these operations inside
LLVM's instruction selection pass.
| |
| We also implement in the Connex-S processor simple lane gating techniques to |
| minimize energy consumption for vector code with a high degree of control |
divergence, as is the case for routines emulating floating-point operations.
| |
| </p> |
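<p>
A minimal scalar sketch of the kind of expansion such lowering produces
(illustrative, not the back end's actual output): a 32-bit add on 16-bit
lanes splits each operand into halves and propagates the carry.
</p>

```c
#include <stdint.h>

/* Emulate a 32-bit add using only 16-bit quantities, as a back end might
   expand the operation during instruction selection on a 16-bit target. */
uint32_t add32_on16(uint32_t a, uint32_t b) {
  uint16_t alo = (uint16_t)(a & 0xFFFF), ahi = (uint16_t)(a >> 16);
  uint16_t blo = (uint16_t)(b & 0xFFFF), bhi = (uint16_t)(b >> 16);
  uint16_t lo = (uint16_t)(alo + blo);
  uint16_t carry = lo < alo;            /* unsigned wraparound => carry out */
  uint16_t hi = (uint16_t)(ahi + bhi + carry);
  return ((uint32_t)hi << 16) | lo;
}
```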
| </li> |
| |
| <li> <a id="sm"><b>Simon Moll, Shrey Sharma, Matthias Kurtenacker, and Sebastian Hack</b>: Multi-dimensional Vectorization in LLVM</a> |
| <p> |
| |
| Loop vectorization is a classic technique to exploit SIMD instructions in a |
| productive way. In multi-dimensional vectorization, multiple loops of a loop |
| nest are vectorized at once. This exposes opportunities for data reuse, |
| register tiling and more efficient memory accesses. In this work, we present |
| TensorRV, a multi-dimensional vectorization framework for LLVM IR. TensorRV is |
| a generalization of the Region Vectorizer, a general purpose outer-loop and |
| whole-function vectorizer, to the multi-dimensional setting. We evaluate |
TensorRV on a set of stencil codes and on matrix transpose. We find that
stencil codes benefit from the reduced number of load instructions, with a
speedup of 1.45x on the NEC SX-Aurora TSUBASA. The multi-loop vectorized matrix
transpose leverages efficient SIMD shuffle instructions on AVX512, for which we
report a speedup of 3.27x.
| |
| </p> |
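<p>
A scalar sketch of what vectorizing two loops of a nest exposes for
transpose: both loops are strip-mined by a block size <tt>B</tt>, and each
BxB block could then be handled with SIMD shuffles instead of scalar moves
(the block size and code are illustrative, not TensorRV itself).
</p>

```c
#define B 4

/* Transpose an n x n row-major matrix in BxB blocks. The two inner loops
   over a block are the candidate for multi-dimensional vectorization. */
void transpose_blocked(const int *src, int *dst, int n) {
  for (int ib = 0; ib < n; ib += B)
    for (int jb = 0; jb < n; jb += B)
      for (int i = ib; i < ib + B && i < n; ++i)
        for (int j = jb; j < jb + B && j < n; ++j)
          dst[j * n + i] = src[i * n + j];
}
```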
| </li> |
| |
| <li> <a id="jd"><b>Johannes Doerfert, Brian Homerding and Hal Finkel</b>: Performance Gap Exploration with LLVM</a> |
| <p> |
| |
Compilers are limited by the static information directly or indirectly
encoded in the program. Low-level languages such as C and C++ are especially
problematic, as their weak type systems and relaxed memory semantics allow for
various, sometimes non-obvious, behaviors. Since compilers have to preserve the
program semantics for all program executions, the existence of exceptional
behavior can prevent optimizations that the developer would consider valid and
might even expect. Analyses that guarantee the absence of such disruptive and
unlikely situations are consequently an indispensable part of an optimizing
compiler. However, these analyses have to be approximative and limited in
scope. Global and exact static analysis, taking into account all potential
inputs to the program, is simply infeasible for any non-trivial program.
| |
Even if a user knows the structure of all inputs ever passed to the program, it
is not easy to encode such information. The conservatively correct compiler
consequently cannot match the expectations of a developer with superior
knowledge.
| |
| In this talk, we present a method to automatically measure the effect missing |
| static information has on the optimizations applied to a given program. As a |
| result, we generate an optimistically optimized program version which, compared |
| to the original, defines a performance gap that can be closed by better |
| analyses and programmer annotations. |
| |
Our evaluation of six already-optimized proxy kernels for high-performance
applications exposed a compiler flaw that caused a roughly 6x slowdown, as well
as opportunities to achieve speedups of up to 20.6%. This clearly indicates
that static uncertainty can result in poor performance, but also that compilers
need to utilize available information more effectively.
| |
| </p> |
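<p>
A small example of the kind of missing static information at issue here
(illustrative, not taken from the talk): without <tt>restrict</tt>, the
compiler must assume the pointers may alias and keep loads and stores
ordered; the annotated version licenses the optimizations an optimistic
compile would apply.
</p>

```c
/* Conservative version: dst and src may alias, so the compiler must
   reload src[i] on every iteration and cannot freely vectorize. */
void scale(float *dst, const float *src, int n, float f) {
  for (int i = 0; i < n; ++i)
    dst[i] = f * src[i];
}

/* With restrict, the developer's superior knowledge (no aliasing) is
   encoded statically, closing part of the performance gap. */
void scale_opt(float *restrict dst, const float *restrict src,
               int n, float f) {
  for (int i = 0; i < n; ++i)
    dst[i] = f * src[i];
}
```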
| </li> |
| </ul> |
| </p> |
| |
| <p> |
| Workshop organization: Johannes Doerfert, Sebastian Pop, Aditya Kumar. |
| </p> |
| |
| <!-- *********************************************************************** --> |
| <hr> |
| |
| <!--#include virtual="../../footer.incl" --> |