<!--#include virtual="../../header.incl" -->
<div class="www_sectiontitle">Third LLVM Performance Workshop at CGO</div>
<ul>
<li><b>What</b>: Third LLVM Performance Workshop at CGO</li>
<li><b>When</b>: <b>Sunday February 17th</b>, 2019</li>
<li><b>Where</b>: <b>Georgetown University Room</b>, Washington DC, USA</li>
</ul>
<p>
An LLVM Performance Workshop will be held at CGO 2019. The workshop
is co-located with CC, HPCA, and PPoPP. It takes place at the <a
href="http://cgo.org/cgo2019/venue/">Marriott Marquis</a>
in Washington DC.
If you are interested in attending the workshop, please register at the
<a href="http://cgo.org/cgo2019/workshops.html">CGO website</a>.
</p>
<div class="www_sectiontitle">Preliminary Schedule</div>
<p>
<table width="100%">
<tr><td><b>Time</b></td> <td><b>Room</b></td> <td><b>Speaker</b></td> <td><b>Title</b></td> <td>&nbsp;</td></tr>
<tr>
<td>9:00</td>
<td>tba</td>
<td>Joel E. Denny</td>
<td>Clacc: Translating OpenACC to OpenMP in Clang</td>
<td><a href="#jed">[Abstract]</a> </td>
</tr>
<tr>
<td>9:40</td>
<td>tba</td>
<td>Ayal Zaks</td>
<td>Tiling Loops for Scratch-Pad Memories</td>
<td><a href="#az">[Abstract]</a> </td>
</tr>
<tr>
<td>10:20-10:40</td>
<td>&nbsp;</td>
<td colspan=3>Break</td>
</tr>
<tr>
<td>10:40</td>
<td>tba</td>
<td>Brian Homerding</td>
<td>Enabling math function call optimization for DOE proxy applications</td>
<td><a href="#bh">[Abstract]</a> </td>
</tr>
<tr>
<td>11:20</td>
<td>tba</td>
<td>Alexandru Susu</td>
<td>Emulating Arithmetic Operations with LLVM's Instruction Selection Pass</td>
<td><a href="#as">[Abstract]</a> </td>
</tr>
<tr>
<td>12:00-13:30</td>
<td>&nbsp;</td>
<td colspan=3>Lunch</td>
</tr>
<tr>
<td>13:40</td>
<td>tba</td>
<td>Simon Moll</td>
<td>Multi-dimensional Vectorization in LLVM</td>
<td><a href="#sm">[Abstract]</a>
</td>
</tr>
<tr>
<td>14:20</td>
<td>tba</td>
<td>Johannes Doerfert</td>
<td>Performance Gap Exploration with LLVM</td>
<td><a href="#jd">[Abstract]</a>
</td>
</tr>
<tr>
<td>15:00-15:20</td>
<td>&nbsp;</td>
<td colspan=3>Break</td>
</tr>
<tr>
<td>15:20</td>
<td>tba</td>
<td>&nbsp;</td>
<td>LLVM Q&amp;A Panel: <b>Questions Welcome</b></td>
<td>&nbsp;</td>
</tr>
<tr>
<td>16:00</td>
<td>&nbsp;</td>
<td colspan=3>Workshop ends.</td>
</tr>
</table>
</p>
<div class="www_sectiontitle">Abstracts</div>
<p>
<ul>
<li> <a id="jed"><b>Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter</b>: Clacc: Translating OpenACC to OpenMP in Clang</a>
<p>
OpenACC was launched in 2010 as a portable programming model for heterogeneous
accelerators. Although various implementations already exist, no extensible,
open-source, production-quality compiler support is available to the community.
This deficiency poses a serious risk for HPC application developers targeting
GPUs and other accelerators, and it limits experimentation and progress for the
OpenACC specification. To address this deficiency, the US Exascale Computing
Project has funded Clacc, a recent effort to develop production OpenACC
compiler support for Clang and LLVM. A key feature of the Clacc design is that
it translates OpenACC to OpenMP in order to build on Clang's existing OpenMP
compiler and runtime support. In this talk, we describe the Clacc goals and
design. We also discuss the challenges that we have encountered so far in our
prototyping efforts, and we present some early performance results.
</p>
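<p>
To make the translation concrete, below is a minimal sketch of the kind of
mapping Clacc performs: an OpenACC parallel loop and one plausible OpenMP
offloading equivalent. The exact directives and data clauses are an
illustrative assumption here, not Clacc's verbatim output.
</p>
<pre>
/* OpenACC input (illustrative). */
void saxpy_acc(int n, float a, const float *x, float *y) {
  #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i &lt; n; ++i)
    y[i] += a * x[i];
}

/* One plausible OpenMP translation (assumed, for illustration). */
void saxpy_omp(int n, float a, const float *x, float *y) {
  #pragma omp target teams distribute parallel for \
      map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i &lt; n; ++i)
    y[i] += a * x[i];
}
</pre>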
</li>
<li> <a id="az"><b>Ayal Zaks, Michael Zuckerman, and Dorit Nuzman</b>: Tiling Loops for Scratch-Pad Memories</a>
<p>
Tiling a loop is a well-known code transformation that helps optimize temporal
locality. On systems with caches, tiling is important for achieving high
performance; on systems based on scratch-pad memories or software-managed
caches, tiling is vital for code to be functional at all. Furthermore, due to
the high overhead of transferring data between main memory and scratch-pad
memory, it is desirable to tile several loops together. Lastly, if such data
transfers can be executed asynchronously and in parallel with processing the
data in the scratch-pad memories, careful scheduling of the transfers and
double-buffering of the data are desired in order to hide data-transfer
overheads. In this work we show how multiple loops can be tiled together in
order to execute them efficiently on systems with scratch-pad memories.
</p>
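<p>
As a rough illustration of why tiling is functionally necessary on such
systems, here is a minimal C sketch assuming a small scratch-pad and
hypothetical DMA helpers dma_in()/dma_out(); it shows only the copy-in,
compute, copy-out structure, not the authors' multi-loop tiling or
double-buffering scheme.
</p>
<pre>
#define TILE 256                       /* assumed scratch-pad capacity */

extern void dma_in(float *spm, const float *mem, int n);   /* hypothetical */
extern void dma_out(float *mem, const float *spm, int n);  /* hypothetical */

void scale(int n, float *a, float factor) {
  static float spm[TILE];              /* stands in for the scratch-pad */
  for (int t = 0; t &lt; n; t += TILE) {
    int len = (n - t &lt; TILE) ? (n - t) : TILE;
    dma_in(spm, a + t, len);           /* transfer tile into scratch-pad */
    for (int i = 0; i &lt; len; ++i)
      spm[i] *= factor;                /* compute on the resident tile */
    dma_out(a + t, spm, len);          /* transfer tile back to memory */
  }
}
</pre>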
</li>
<li> <a id="bh"><b>Brian Homerding</b>: Enabling math function call optimization for DOE proxy applications</a>
<p>
The US Department of Energy proxy applications are simplified applications that
are representative of the important code for various scientific computing
workloads. Our performance analysis work on these proxy applications has
revealed some areas where Clang can improve when compared to GCC and vendor
compilers. Among these is the limited ability to apply optimizations to math
function calls when we care about errno. This talk will discuss modeling the
memory behavior of math functions using function attributes in order to enable
these optimizations, along with our subsequent work to extend the attributes’
coverage and use.
</p>
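<p>
As a minimal illustration of the missed optimization in question (assuming
the compiler must treat sin() as possibly writing errno): the two identical
calls below cannot be folded into one unless the math function's memory
behavior is modeled, e.g., via function attributes or -fno-math-errno.
</p>
<pre>
#include &lt;math.h&gt;

double f(double x) {
  /* If sin() may write errno, it has a side effect, so the compiler
     must emit both calls; once its memory behavior is modeled, the
     two calls can be folded into one. */
  return sin(x) + sin(x);
}
</pre>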
</li>
<li> <a id="as"><b>Alexandru Susu</b>: Emulating Arithmetic Operations with LLVM's Instruction Selection Pass</a>
<p>
The Connex-S wide research vector processor has a simple design with 16-bit
integer lanes, since many embedded applications can make good use of narrow
integer types.
For completeness, however, our back end for Connex-S needs to lower code that
efficiently emulates arithmetic operations on non-native types such as 32-bit
integers and 16-bit floating point. To simplify the work of the compiler
writer, we devise a method to generate the code that lowers these operations
inside LLVM's instruction selection pass.
We also implement simple lane-gating techniques in the Connex-S processor to
minimize energy consumption for vector code with a high degree of control
divergence, as is the case for routines emulating floating-point operations.
</p>
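<p>
For intuition, here is a hypothetical C sketch of emulating a 32-bit addition
with 16-bit operations, in the spirit of the lowering described; the actual
Connex-S code generation inside instruction selection is more involved.
</p>
<pre>
#include &lt;stdint.h&gt;

/* Emulate a 32-bit add using only 16-bit arithmetic (illustrative). */
uint32_t add32_via_16(uint16_t a_lo, uint16_t a_hi,
                      uint16_t b_lo, uint16_t b_hi) {
  uint16_t lo    = (uint16_t)(a_lo + b_lo);
  uint16_t carry = lo &lt; a_lo;              /* carry out of the low half */
  uint16_t hi    = (uint16_t)(a_hi + b_hi + carry);
  return ((uint32_t)hi &lt;&lt; 16) | lo;
}
</pre>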
</li>
<li> <a id="sm"><b>Simon Moll, Shrey Sharma, Matthias Kurtenacker, and Sebastian Hack</b>: Multi-dimensional Vectorization in LLVM</a>
<p>
Loop vectorization is a classic technique to exploit SIMD instructions in a
productive way. In multi-dimensional vectorization, multiple loops of a loop
nest are vectorized at once. This exposes opportunities for data reuse,
register tiling and more efficient memory accesses. In this work, we present
TensorRV, a multi-dimensional vectorization framework for LLVM IR. TensorRV is
a generalization of the Region Vectorizer, a general purpose outer-loop and
whole-function vectorizer, to the multi-dimensional setting. We evaluate
TensorRV on a set of stencil codes and on matrix transpose. We find that
stencil codes benefit from the reduction in load instructions, with a speedup
of 1.45x on the NEC SX-Aurora TSUBASA. Multi-loop-vectorized matrix transpose
leverages efficient SIMD shuffle instructions on AVX512, for which we report a
speedup of 3.27x.
</p>
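<p>
The kind of loop nest this targets is sketched below: a matrix transpose whose
two loops, vectorized together, turn strided scalar accesses into SIMD
shuffles. The scalar C version is shown only for illustration; it is not
TensorRV output.
</p>
<pre>
/* Transpose: vectorizing i and j together lets a multi-dimensional
   vectorizer load/store whole tiles and permute them with shuffles. */
void transpose(int n, float *restrict dst, const float *restrict src) {
  for (int i = 0; i &lt; n; ++i)
    for (int j = 0; j &lt; n; ++j)
      dst[j * n + i] = src[i * n + j];
}
</pre>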
</li>
<li> <a id="jd"><b>Johannes Doerfert, Brian Homerding and Hal Finkel</b>: Performance Gap Exploration with LLVM</a>
<p>
Compilers are limited by the static information directly or indirectly
encoded in the program. Low-level languages such as C and C++ are especially
problematic, as their weak type systems and relaxed memory semantics allow for
various, sometimes non-obvious, behaviors. Since compilers have to preserve
the program semantics for all program executions, the existence of exceptional
behavior can prevent optimizations that the developer would consider valid and
might even expect. Analyses that guarantee the absence of such disruptive and
unlikely situations are consequently an indispensable part of an optimizing
compiler. However, these analyses have to be approximate and limited in scope.
Global, exact static analysis that considers all potential inputs to the
program is simply infeasible for any non-trivial program.
Even if a user knows the structure of all inputs ever passed to the program,
such information is not easy to encode. A conservatively correct compiler
consequently cannot match the expectations of a developer with superior
knowledge.
In this talk, we present a method to automatically measure the effect missing
static information has on the optimizations applied to a given program. As a
result, we generate an optimistically optimized program version which, compared
to the original, defines a performance gap that can be closed by better
analyses and programmer annotations.
Our evaluation of six already-optimized proxy kernels for high-performance
applications exposed a compiler flaw that caused an approximately 6x slowdown,
as well as opportunities to achieve speedups of up to 20.6%. This clearly
indicates that static uncertainty can result in poor performance, but also
that compilers need to utilize available information more effectively.
</p>
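<p>
A minimal example of such missing static information, with hypothetical
function names: without restrict the compiler must assume dst and src may
alias, which can block vectorization; the annotated variant is the kind of
optimistically optimized version against which a performance gap can be
measured.
</p>
<pre>
/* Conservative: dst and src may alias, limiting optimization. */
void scale_copy(int n, float s, float *dst, const float *src) {
  for (int i = 0; i &lt; n; ++i)
    dst[i] = s * src[i];
}

/* Optimistic: restrict asserts no aliasing, licensing vectorization. */
void scale_copy_opt(int n, float s, float *restrict dst,
                    const float *restrict src) {
  for (int i = 0; i &lt; n; ++i)
    dst[i] = s * src[i];
}
</pre>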
</li>
</ul>
</p>
<p>
Workshop organization: Johannes Doerfert, Sebastian Pop, Aditya Kumar.
</p>
<!-- *********************************************************************** -->
<hr>
<!--#include virtual="../../footer.incl" -->