| <!--#include virtual="../../header.incl" --> |
| |
| <div class="www_sectiontitle">Second LLVM Performance Workshop at CGO</div> |
| |
| <ul> |
| <li><b>What</b>: Second LLVM Performance Workshop at CGO</li> |
| <li><b>When</b>: Saturday February 24th, 2018</li> |
| <li><b>Where</b>: Vienna, Austria</li> |
| </ul> |
| |
| <p> |
| An LLVM Performance Workshop will be held at CGO 2018. The workshop |
| is co-located with CC, HPCA, and PPoPP. It takes place at the <a |
| href="http://cgo.org/cgo2018/venue/"> Austria Trend Eventhotel Pyramide</a> |
| in Vienna. |
| |
| If you are interested in attending the workshop, please register at the |
| <a href="http://cgo.org/cgo2018/workshops.html">CGO website</a>. |
| </p> |
| |
| <div class="www_sectiontitle">Schedule</div> |
| <p> |
| <table width="100%"> |
| <tr><td><b>Time</b></td> <td><b>Room</b></td> <td><b>Speaker</b></td> <td><b>Title</b></td> <td> </td></tr> |
| <tr> |
| <td>9:15</td> |
| <td>Europa 2</td> |
| <td>Maha Kooli</td> |
| <td>How to Evaluate "In-Memory Computing" Performances without Hardware Measurements?</td> |
| <td><a href="#mk">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>10:00-10:30</td> |
| <td> </td> |
| <td colspan=3>Coffee Break</td> |
| </tr> |
| <tr> |
| <td>10:30</td> |
| <td>Europa 2</td> |
| <td>Arsène Pérard-Gayot</td> |
| <td>Optimizing LLVM IR for Guided Vectorization</td> |
| <td><a href="#apg">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>11:15</td> |
| <td>Europa 2</td> |
| <td>Siddharth Shankar Swain</td> |
| <td>Efficient use of memory by reducing size of AST dumps in cross file analysis by clang static analyzer</td> |
| <td><a href="#sss">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>12:00-13:30</td> |
| <td> </td> |
| <td colspan=3>Lunch</td> |
| </tr> |
| <tr> |
| <td>13:30</td> |
| <td>Europa 2</td> |
| <td>Julian Hammer</td> |
| <td>Cache-aware Scheduling and Performance Modeling with LLVM-Polly and Kerncraft</td> |
| <td><a href="#jh">[Abstract]</a> |
| <a href="./slides/hammer_2018_CGO_LLVM_perf.pdf">[Slides]</a> |
| </td> |
| </tr> |
| <tr> |
| <td>14:15</td> |
| <td>Europa 2</td> |
| <td>Alexander Matz</td> |
| <td>Enabling Automatic Partitioning of Data-Parallel Kernels with Polyhedral Compilation</td> |
| <td><a href="#am">[Abstract]</a> |
| <a href="./slides/matz_2018_CGO_LLVM_perf.pdf">[Slides]</a> |
| </td> |
| </tr> |
| <tr> |
| <td>15:00-15:30</td> |
| <td> </td> |
| <td colspan=3>Coffee Break</td> |
| </tr> |
| <tr> |
| <td>15:30</td> |
| <td>Europa 2</td> |
| <td>William Moses</td> |
| <td>Tensor Comprehensions</td> |
| <td><a href="#wm">[Abstract]</a> </td> |
| </tr> |
| <tr> |
| <td>16:15</td> |
| <td>Europa 2</td> |
| <td> </td> |
| <td>LLVM Q&amp;A Panel: <b>Questions Welcome</b></td> |
| <td> </td> |
| </tr> |
| <tr> |
| <td>17:00</td> |
| <td> </td> |
| <td colspan=3>Workshop ends.</td> |
| </tr> |
| </table> |
| </p> |
| |
| |
| <div class="www_sectiontitle">Abstracts</div> |
| <p> |
| <ul> |
| <li> <a id="jh"><b>Julian Hammer, Johannes Doerfert, Georg Hager, Gerhard |
| Wellein and Sebastian Hack</b>: Cache-aware Scheduling and |
| Performance Modeling with LLVM-Polly and Kerncraft |
| </a> <a href="./slides/hammer_2018_CGO_LLVM_perf.pdf">[Slides]</a> |
| <p> |
| |
| LLVM/Polly is the polyhedral optimizer of the LLVM project. While there |
| currently is a serious integration effort going on, Polly still lacks |
| basic support for essential optimizations. In this work we replace the |
| fixed tile-sizes policy employed by Polly with an access- and |
| hardware-dependent one. In contrast to Polly's scheduling, our tile-size selection |
| targets spatial instead of temporal locality. The proposed tile-size |
| selection is based on analytic performance modeling using the Layer |
| Conditions model, and extended to cope with non-affine accesses and |
| non-perfectly nested loops, which are found in many real-world codes. |
| Nevertheless, it is best suited for linear-sequential accesses as found |
| in stencil computations. |
| |
| </p> |
| </li> |
| |
| <li> <a id="mk"><b>Maha Kooli, Henri-Pierre Charles, Jean-Philippe Noel and |
| Bastien Giraud</b>: How to Evaluate "In-Memory Computing" |
| Performances without Hardware Measurements? </a> |
| <p> |
| |
| This paper presents a software platform for evaluating the performance of |
| an In-Memory Computing architecture based on emerging memory that embeds |
| computing abilities. The platform includes emulation tools that are based |
| on the Low Level Virtual Machine (LLVM). It allows applications to be tried |
| out early, before the hardware system is fully designed, and generates |
| execution traces. These execution traces are then analyzed to evaluate |
| the system performance. |
| |
| </p> |
| </li> |
| |
| <li> <a id="apg"><b> Arsène Pérard-Gayot, Richard Membarth, Philipp |
| Slusallek, Simon Moll, Roland Leißa and Sebastian Hack</b>: |
| Optimizing LLVM IR for Guided Vectorization</a> |
| |
| <p> |
| |
| Guided vectorization takes a scalar program (operating on a single |
| element of data) and transforms it into a vectorized program (operating |
| on multiple elements at once). The performance of the vectorized |
| program strongly depends on the precision of the analyses performed by |
| the vectorizing compiler, and the quality of the target code generator. |
| In particular, these analyses must determine whether an expression is |
| the same for all lanes (uniform) or not. Since divergent control flow |
| is expensive, the compiler should ensure that it remains uniform |
| whenever possible. In this presentation, we present data layout |
| transformations and optimizations on LLVM IR that improve both the |
| analyses and the generated code quality of RV, a state-of-the-art |
| vectorizing framework. We show that, using RV combined with our |
| optimizations, auto-vectorized ray-tracing kernels perform within 10% |
| of manually-vectorized implementations by experts. |
| |
| </p> |
| </li> |
| |
| <li> <a id="sss"><b>Siddharth Shankar Swain</b>: Efficient use of memory by |
| reducing size of AST dumps in cross file analysis by clang static |
| analyzer</a> |
| |
| <p> |
| The Clang Static Analyzer (Clang SA) works well with function calls within a |
| translation unit (TU). When execution reaches a function implemented in |
| another TU, the analyzer skips the analysis of the called function's |
| definition. To handle cross-file bugs, the cross translation unit (CTU) |
| analysis feature was developed. The CTU model consists of two passes. |
| The first pass dumps the ASTs of all translation units and creates a map |
| from each function to its corresponding AST. In the second pass, when a |
| TU-external function is reached during the analysis, the location of the |
| definition of that function is looked up in the function definition index |
| and the definition is imported from the containing AST binary into the |
| caller's context using the ASTImporter class. During the analysis, we need |
| to store the dumped ASTs temporarily. For a large code base this can be a |
| problem, and we have seen in practice that the code analysis stops due to |
| memory shortage. Reducing the size of the ASTs can help Clang SA scale to |
| larger code bases, not only in CTU analysis but also in the general case. |
| We basically use two methods: |
| </p> |
| |
| <p> |
| 1) Using an outlining method on the source code to find ASTs that |
| share common factors or subtrees. We throw away those ASTs that |
| won't match any other AST, thereby reducing the number of ASTs dumped in |
| memory. |
| </p> |
| |
| <p> |
| 2) A tree pruning technique that keeps only those parts of the tree necessary |
| for cross translation unit analysis and eliminates the rest to |
| decrease the size of the tree. The necessary part of the tree can be found |
| by finding the dependency path in the exploded graph where |
| instructions dependent on the function call/execution will be |
| present. Note that only those branches that have no function call |
| among their children should be pruned. |
| </p> |
| </li> |
| |
| <li> <a id="am"><b>Alexander Matz and Holger Fröning</b>: Enabling |
| Automatic Partitioning of Data-Parallel Kernels with Polyhedral |
| Compilation </a> <a href="./slides/matz_2018_CGO_LLVM_perf.pdf">[Slides]</a> |
| <p> |
| |
| Data-parallel accelerators are pervasive in today's computing |
| landscape due to their high energy-efficiency and performance. GPUs, |
| in particular, are very successful and utilize the |
| Bulk-Synchronous-Parallel programming model to expose the available |
| parallelism in an application core to the hardware. Programming a |
| single GPU using the BSP programming model (in the form of OpenCL and |
| CUDA) adds moderate complexity and is usually manageable. |
| |
| </p> |
| <p> |
| |
| If more than a single GPU is to be used, however, all data transfers |
| and kernel executions have to be orchestrated manually in order to |
| achieve good performance. This is tedious and error prone. Given the |
| regular nature of many GPU kernels, this orchestration and the |
| distribution of work should be possible automatically. |
| |
| </p> |
| <p> |
| |
| In this talk, we present an approach to automatically partition |
| single-GPU CUDA applications for execution on multiple GPUs and a |
| preliminary performance analysis. We use polyhedral compilation for |
| the extraction of the memory access patterns of GPU kernels and a |
| light-weight runtime-system to synchronize device buffers and |
| orchestrate kernel execution. The runtime-system utilizes code |
| generated by polyhedral compilation to keep track of the state of |
| device buffers before and after each kernel execution and issues |
| minimal data movements if required. Partitioned kernels need to be |
| extended to only compute a subset of the original execution grid. Our |
| preliminary performance analysis shows speedups of up to 12x for |
| three model applications taken from the Berkeley Dwarves. |
| |
| </p> |
| <p> |
| |
| Although we focus on NVIDIA CUDA applications in this talk, we see no |
| conceptual differences for this approach with regard to alternative |
| implementations of the BSP programming model (e.g. OpenCL). |
| |
| </p> |
| </li> |
| |
| <li> <a id="wm"><b>William Moses</b>: Tensor Comprehensions</a> |
| <p> |
| TBA. |
| </p> |
| </li> |
| </ul> |
| </p> |
| |
| <div class="www_sectiontitle">Call for Speakers</div> |
| |
| <p> |
| We invite speakers from academia and industry to present their work on the |
| following list of topics (including, but not limited to): |
| </p> |
| <ul> |
| <li>improving performance and size of code generated by LLVM,</li> |
| <li>improving performance of LLVM's runtime libraries,</li> |
| <li>improving the security of generated code,</li> |
| <li>tools developed with LLVM for performance analysis,</li> |
| <li>performance tracking over time,</li> |
| <li>compiler flags, annotations and remarks to understand and improve |
| performance,</li> |
| <li>any other topic related to improving and maintaining the performance |
| and quality of LLVM generated code.</li> |
| </ul> |
| <p> |
| While the primary focus of the workshop is on these topics, we welcome any |
| submission related to the LLVM compiler infrastructure, its sub-projects |
| (clang, lldb, Polly, ...), as well as its use in industry and academia. |
| </p> |
| |
| <p> |
| We are looking for: |
| </p> |
| <ul> |
| <li>keynote speakers,</li> |
| <li>technical presentations: 30 minutes plus questions and discussion,</li> |
| <li>tutorials,</li> |
| <li>BOFs.</li> |
| </ul> |
| |
| <p> |
| Proposals should provide enough information for the review committee to be |
| able to judge the quality of the submission. Proposals can be submitted in |
| the form of an extended abstract, full paper, or slides. Proposals should be |
| submitted to |
| <a href="https://easychair.org/conferences/?conf=llvmcgo2018">Easychair |
| LLVM-CGO 2018</a>. |
| |
| The deadline for receiving submissions is December 22, 2017. Speakers |
| will be notified of acceptance or rejection by January 5. |
| </p> |
| |
| <p> |
| Workshop organization: Johannes Doerfert, Renato Golin, Aditya Kumar, |
| Sebastian Pop, Hal Finkel, and Tanya Lattner. |
| </p> |
| |
| <!-- *********************************************************************** --> |
| <hr> |
| |
| <!--#include virtual="../../footer.incl" --> |