<!--#include virtual="../../header.incl" -->
<div class="www_sectiontitle">Second LLVM Performance Workshop at CGO</div>
<ul>
<li><b>What</b>: Second LLVM Performance Workshop at CGO</li>
<li><b>When</b>: Saturday February 24th, 2018</li>
<li><b>Where</b>: Vienna, Austria</li>
</ul>
<p>
An LLVM Performance Workshop will be held at CGO 2018. The workshop
is co-located with CC, HPCA, and PPoPP. It takes place at the <a
href="http://cgo.org/cgo2018/venue/"> Austria Trend Eventhotel Pyramide</a>
in Vienna.
If you are interested in attending the workshop, please register at the
<a href="http://cgo.org/cgo2018/workshops.html">CGO website</a>.
</p>
<div class="www_sectiontitle">Schedule</div>
<p>
<table width="100%">
<tr><td><b>Time</b></td> <td><b>Room</b></td> <td><b>Speaker</b></td> <td><b>Title</b></td> <td>&nbsp;</td></tr>
<tr>
<td>9:15</td>
<td>Europa 2</td>
<td>Maha Kooli</td>
<td>How to Evaluate "In-Memory Computing" Performances without Hardware Measurements?</td>
<td><a href="#mk">[Abstract]</a> </td>
</tr>
<tr>
<td>10:00-10:30</td>
<td>&nbsp;</td>
<td colspan=3>Coffee Break</td>
</tr>
<tr>
<td>10:30</td>
<td>Europa 2</td>
<td>Arsène Pérard-Gayot</td>
<td>Optimizing LLVM IR for Guided Vectorization</td>
<td><a href="#apg">[Abstract]</a> </td>
</tr>
<tr>
<td>11:15</td>
<td>Europa 2</td>
<td>Siddharth Shankar Swain</td>
<td>Efficient use of memory by reducing size of AST dumps in cross file analysis by clang static analyzer</td>
<td><a href="#sss">[Abstract]</a> </td>
</tr>
<tr>
<td>12:00-13:30</td>
<td>&nbsp;</td>
<td colspan=3>Lunch</td>
</tr>
<tr>
<td>13:30</td>
<td>Europa 2</td>
<td>Julian Hammer</td>
<td>Cache-aware Scheduling and Performance Modeling with LLVM-Polly and Kerncraft</td>
<td><a href="#jh">[Abstract]</a>
<a href="./slides/hammer_2018_CGO_LLVM_perf.pdf">[Slides]</a>
</td>
</tr>
<tr>
<td>14:15</td>
<td>Europa 2</td>
<td>Alexander Matz</td>
<td>Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation</td>
<td><a href="#am">[Abstract]</a>
<a href="./slides/matz_2018_CGO_LLVM_perf.pdf">[Slides]</a>
</td>
</tr>
<tr>
<td>15:00-15:30</td>
<td>&nbsp;</td>
<td colspan=3>Coffee Break</td>
</tr>
<tr>
<td>15:30</td>
<td>Europa 2</td>
<td>William Moses</td>
<td>Tensor Comprehensions</td>
<td><a href="#wm">[Abstract]</a> </td>
</tr>
<tr>
<td>16:15</td>
<td>Europa 2</td>
<td>&nbsp;</td>
<td>LLVM Q&amp;A Panel: <b>Questions Welcome</b></td>
<td>&nbsp;</td>
</tr>
<tr>
<td>17:00</td>
<td>&nbsp;</td>
<td colspan=3>Workshop ends.</td>
</tr>
</table>
</p>
<div class="www_sectiontitle">Abstracts</div>
<p>
<ul>
<li> <a id="jh"><b>Julian Hammer, Johannes Doerfert, Georg Hager, Gerhard
Wellein and Sebastian Hack</b>: Cache-aware Scheduling and
Performance Modeling with LLVM-Polly and Kerncraft</a>
&nbsp;<a href="./slides/hammer_2018_CGO_LLVM_perf.pdf">[Slides]</a>
<p>
LLVM/Polly is the polyhedral optimizer of the LLVM project. While a
serious integration effort is currently under way, Polly still lacks
basic support for some essential optimizations. In this work we replace
the fixed tile-size policy employed by Polly with an access- and
hardware-dependent one. In contrast to Polly's scheduling, our tile-size
selection targets spatial rather than temporal locality. It is based on
analytic performance modeling using the Layer Conditions model, extended
to cope with the non-affine accesses and non-perfectly nested loops found
in many real-world codes. Nevertheless, it is best suited for
linear-sequential accesses as found in stencil computations.
</p>
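<p>
To give a flavor of the approach, here is a toy sketch of a
layer-condition-based tile-width choice, assuming a 2D 5-point stencil on
doubles and a fixed cache budget; the model presented in the talk is
considerably more general.
</p>
<pre>
// Toy illustration of a layer-condition-based tile-width choice for a 2D
// 5-point stencil on doubles (simplified; not the talk's actual model).
#include &lt;cstddef&gt;
#include &lt;iostream&gt;

// Simplified layer condition: the rows ("layers") touched while updating one
// row must fit in the cache. A 5-point stencil touches three consecutive rows
// of the source array; only half of the cache is budgeted, to leave room for
// other data.
std::size_t tileWidth(std::size_t cacheBytes, std::size_t layers = 3,
                      std::size_t elemBytes = sizeof(double)) {
  return (cacheBytes / 2) / (layers * elemBytes);
}

int main() {
  std::cout &lt;&lt; tileWidth(32 * 1024) &lt;&lt; "\n"; // 32 KiB L1 -> 682 doubles
}
</pre>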
</li>
<li> <a id="mk"><b>Maha Kooli, Henri-Pierre Charles, Jean-Philippe Noel and
Bastien Giraud</b>: How to Evaluate "In-Memory Computing"
Performances without Hardware Measurements? </a>
<p>
This paper presents a software platform to evaluate the performance of an
In-Memory Computing architecture based on an emerging memory that embeds
computing abilities. The platform includes emulation tools based on the
Low Level Virtual Machine (LLVM). It makes it possible to experiment with
applications early, before the hardware system is fully designed, and to
generate execution traces. These execution traces are then analyzed to
evaluate the system's performance.
</p>
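<p>
As a rough illustration of how LLVM can be used to generate such traces, the
sketch below instruments every load and store with a call to a runtime hook.
It is written against a recent LLVM with the new pass manager; the hook name
__trace_access is hypothetical and pass registration is omitted. This is not
the authors' platform, only a minimal example of the technique.
</p>
<pre>
// Minimal sketch: instrument loads and stores to emit an execution trace.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

struct TraceAccesses : PassInfoMixin&lt;TraceAccesses&gt; {
  PreservedAnalyses run(Function &amp;F, FunctionAnalysisManager &amp;) {
    Module &amp;M = *F.getParent();
    LLVMContext &amp;Ctx = M.getContext();
    // Declare: void __trace_access(ptr Addr, i32 IsWrite)
    FunctionCallee Hook = M.getOrInsertFunction(
        "__trace_access",
        FunctionType::get(Type::getVoidTy(Ctx),
                          {PointerType::getUnqual(Ctx), Type::getInt32Ty(Ctx)},
                          /*isVarArg=*/false));
    for (BasicBlock &amp;BB : F)
      for (Instruction &amp;I : BB) {
        if (auto *LD = dyn_cast&lt;LoadInst&gt;(&amp;I)) {
          IRBuilder&lt;&gt; B(LD); // insert the hook call before the load
          B.CreateCall(Hook, {LD->getPointerOperand(), B.getInt32(0)});
        } else if (auto *ST = dyn_cast&lt;StoreInst&gt;(&amp;I)) {
          IRBuilder&lt;&gt; B(ST); // insert the hook call before the store
          B.CreateCall(Hook, {ST->getPointerOperand(), B.getInt32(1)});
        }
      }
    return PreservedAnalyses::none();
  }
};
</pre>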
</li>
<li> <a id="apg"><b> Arsène Pérard-Gayot, Richard Membarth, Philipp
Slusallek, Simon Moll, Roland Leißa and Sebastian Hack</b>:
Optimizing LLVM IR for Guided Vectorization</a>
<p>
Guided vectorization takes a scalar program (operating on a single
element of data) and transforms it into a vectorized program (operating
on multiple elements at once). The performance of the vectorized
program strongly depends on the precision of the analyses performed by
the vectorizing compiler, and the quality of the target code generator.
In particular, these analyses must determine whether an expression is
the same for all lanes (uniform) or not. Since divergent control flow
is expensive, the compiler should ensure that control flow remains
uniform whenever possible. In this talk, we present data layout
transformations and optimizations on LLVM IR that improve both the
analyses and the quality of the code generated by RV, a state-of-the-art
vectorization framework. We show that, using RV combined with our
optimizations, auto-vectorized ray-tracing kernels perform within 10%
of implementations manually vectorized by experts.
</p>
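<p>
A toy scalar kernel makes the uniform/divergent distinction concrete; the
comments mark which branch a vectorizer can keep as a single scalar branch
and which it must mask (an illustrative example, not code from RV).
</p>
<pre>
#include &lt;cstddef&gt;

float shade(float xi, int mode) {
  // 'mode' has the same value for all lanes of a vectorized call, so this
  // branch is uniform: it stays a single scalar branch after vectorization.
  if (mode == 0)
    return 2.0f * xi;
  // 'xi' differs per lane, so this branch is divergent: both sides must be
  // executed under a mask, which is what makes divergence expensive.
  if (xi &gt; 0.0f)
    return xi;
  return -xi;
}

// A guided vectorizer such as RV replaces the loop body with a call to a
// SIMD version of shade() that processes several elements of x at once.
void shadeAll(float *x, std::size_t n, int mode) {
  for (std::size_t i = 0; i &lt; n; ++i)
    x[i] = shade(x[i], mode);
}
</pre>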
</li>
<li> <a id="sss"><b>Siddharth Shankar Swain</b>: Efficient use of memory by
reducing size of AST dumps in cross file analysis by clang static
analyzer</a>
<p>
Clang SA works well for function calls within a translation unit, but
when execution reaches a function implemented in another TU, the analyzer
skips the analysis of the called function's definition. The
cross-translation-unit (CTU) analysis feature was developed to handle
such cross-file bugs. The CTU model consists of two passes. The first
pass dumps the AST of every translation unit and builds an index mapping
each function to its corresponding AST. In the second pass, when a
function external to the current TU is reached during analysis, the
location of its definition is looked up in the function definition index,
and the definition is imported from the containing AST binary into the
caller's context using the ASTImporter class. The dumped ASTs need to be
stored for the duration of the analysis. For a large code base this can
be a problem, and we have seen it in practice: the analysis stops due to
memory shortage. Reducing the size of the ASTs therefore helps Clang SA
scale to larger code bases, in CTU analysis and in general. We use two
methods:
</p>
<p>
1) Using an outlining method on the source code to find ASTs that
share common factors or subtrees. We throw away those ASTs that
will not match any other AST, thereby reducing the number of ASTs
dumped in memory.
</p>
<p>
2) A tree pruning technique that keeps only those parts of the tree
necessary for cross-translation-unit analysis and eliminates the rest
to decrease the size of the tree. The necessary parts of the tree can
be found via the dependency path in the exploded graph, which contains
the instructions that depend on the function call and its execution.
Note that only branches none of whose children is a function call
should be pruned.
</p>
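<p>
As a toy illustration of the pruning rule in method 2, the sketch below keeps
a node only if it is a call or has a call among its descendants; it operates
on a simplified stand-in tree, not on Clang's AST.
</p>
<pre>
#include &lt;memory&gt;
#include &lt;vector&gt;

struct Node {
  bool isCall = false;
  std::vector&lt;std::unique_ptr&lt;Node&gt;&gt; children;
};

// Returns true if the subtree rooted at 'n' must be kept. Children whose
// subtrees contain no call are erased, shrinking the tree that has to be
// kept in memory for cross-translation-unit analysis.
bool pruneKeepingCalls(Node &amp;n) {
  bool keep = n.isCall;
  auto &amp;c = n.children;
  for (auto it = c.begin(); it != c.end();) {
    if (pruneKeepingCalls(**it)) {
      keep = true;
      ++it;
    } else {
      it = c.erase(it); // no call anywhere below: prune this branch
    }
  }
  return keep;
}
</pre>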
</li>
<li> <a id="am"><b>Alexander Matz and Holger Fröning</b>: Enabling
Automatic Partitioning of Data-Parallel Kernels with Polyhedral
Compilation </a> &nbsp;<a href="./slides/matz_2018_CGO_LLVM_perf.pdf">[Slides]</a>
<p>
Data-parallel accelerators are pervasive in today's computing
landscape due to their high energy efficiency and performance. GPUs,
in particular, are very successful; they utilize the
Bulk-Synchronous Parallel (BSP) programming model to expose the
parallelism available in an application core to the hardware.
Programming a single GPU using the BSP model (in the form of OpenCL
or CUDA) adds moderate complexity and is usually manageable.
</p>
<p>
If more than a single GPU is to be used, however, all data transfers
and kernel executions have to be orchestrated manually in order to
achieve good performance. This is tedious and error-prone. Given the
regular nature of many GPU kernels, this orchestration and the
distribution of work should be possible automatically.
</p>
<p>
In this talk, we present an approach to automatically partition
single-GPU CUDA applications for execution on multiple GPUs, together
with a preliminary performance analysis. We use polyhedral compilation
to extract the memory access patterns of GPU kernels, and a
lightweight runtime system to synchronize device buffers and
orchestrate kernel execution. The runtime system uses code generated
by polyhedral compilation to track the state of device buffers before
and after each kernel execution, and issues minimal data movements
where required. Partitioned kernels need to be extended to compute
only a subset of the original execution grid. Our preliminary
performance analysis shows speedups of up to 12x for three model
applications taken from the Berkeley Dwarfs.
</p>
<p>
Although we focus on NVIDIA CUDA applications in this talk, we see no
conceptual difference in applying this approach to alternative
implementations of the BSP programming model (e.g. OpenCL).
</p>
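<p>
The sketch below shows, in plain C++ as a stand-in for a CUDA kernel, what
extending a kernel to a sub-grid looks like: the per-index body is unchanged,
and each device iterates only its slice of the original grid (function and
parameter names are illustrative assumptions, not the talk's implementation).
</p>
<pre>
#include &lt;cstddef&gt;

// Original data-parallel body: one logical thread per index i in [0, n).
void saxpyBody(std::size_t i, float a, const float *x, float *y) {
  y[i] = a * x[i] + y[i];
}

// Partitioned launch: a device is assigned the sub-grid [begin, end); the
// runtime system is responsible for making x and y coherent on that device
// before the launch and for reconciling written ranges afterwards.
void launchSlice(std::size_t begin, std::size_t end, float a, const float *x,
                 float *y) {
  for (std::size_t i = begin; i &lt; end; ++i)
    saxpyBody(i, a, x, y);
}
</pre>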
</li>
<li> <a id="wm"><b>William Moses</b>: Tensor Comprehensions</a>
<p>
TBA.
</p>
</li>
</ul>
</p>
<div class="www_sectiontitle">Call for Speakers</div>
<p>
We invite speakers from academia and industry to present their work on
topics including, but not limited to:
</p>
<ul>
<li>improving performance and size of code generated by LLVM,</li>
<li>improving performance of LLVM's runtime libraries,</li>
<li>improving the security of generated code,</li>
<li>tools developed with LLVM for performance analysis,</li>
<li>performance tracking over time,</li>
<li>compiler flags, annotations and remarks to understand and improve
performance,</li>
<li>any other topic related to improving and maintaining the performance
and quality of LLVM generated code.</li>
</ul>
<p>
While the primary focus of the workshop is on these topics, we welcome any
submission related to the LLVM compiler infrastructure, its sub-projects
(clang, lldb, Polly, ...), as well as its use in industry and academia.
</p>
<p>
We are looking for:
</p>
<ul>
<li>keynote speakers,</li>
<li>technical presentations: 30 minutes plus questions and discussion,</li>
<li>tutorials,</li>
<li>BOFs.</li>
</ul>
<p>
Proposals should provide enough information for the review committee to
judge the quality of the submission. Proposals can be submitted in the
form of an extended abstract, full paper, or slides, via
<a href="https://easychair.org/conferences/?conf=llvmcgo2018">EasyChair
LLVM-CGO 2018</a>.
The deadline for submissions is December 22, 2017. Speakers
will be notified of acceptance or rejection by January 5, 2018.
</p>
<p>
Workshop organization: Johannes Doerfert, Renato Golin, Aditya Kumar,
Sebastian Pop, Hal Finkel, and Tanya Lattner.
</p>
<!-- *********************************************************************** -->
<hr>
<!--#include virtual="../../footer.incl" -->