
 <!--#include virtual="../../header.incl" -->

 <div class="www_sectiontitle">LLVM Performance Workshop at CGO</div>

 <ul>
   <li><b>What</b>: LLVM Performance Workshop at CGO</li>
   <li><b>When</b>: Saturday February 4th, 2017</li>
   <li><b>Where</b>: Austin, Texas, USA</li>
 </ul>

 <p>
   An LLVM Performance Workshop will be held at CGO 2017. The workshop
   is co-located with CC, HPCA, and PPoPP.  If you are interested in
   attending the workshop, please register at the
   <a href="http://cgo.org/cgo2017/workshops.html">CGO website</a>.
 </p>

 <div class="www_sectiontitle">Program</div>
 <p>
 The workshop takes place at the <a href="http://cgo.org/cgo2017/travel-information.html">Hilton Hotel</a> in
 downtown Austin (500 East 4th St).
 </p>
 <p>
 <font color="red"><b>Update:</b></font> If you indicated this morning that you wanted to join us for dinner, here's the location of the restaurant: <a href="http://www.manuels.com/">Manuel's Downtown</a>, 310 Congress Avenue, Austin, TX 78701. We have a reservation at <b>5pm</b> (dinner is at your own expense). The restaurant is within walking distance from the hotel.
 </p>
 <p>
  <table border="1">
    <tr><th>Time</th> <th>Room</th> <th>Speaker</th> <th>Title</th> <th>&nbsp;</th></tr>
   <tr>
     <td>7:30-8:30</td>
     <td>616AB</td>
     <td colspan=3>Breakfast</td>
   </tr>
   <tr>
     <td>&nbsp;</td>
     <td>&nbsp;</td>
     <td colspan=3><b>Session 1: Parallel Code Generation</b></td>
   </tr>
   <tr>
     <td>8:30am</td>
     <td>400/402</td>
     <td>Johannes Doerfert (Saarland University)</td>
     <td>Polyhedral "Driven" Optimizations on Real Codes</td>
     <td><a href="#doerfert">[Abstract]</a> [<a href="Polyhedral-Driven-Optimizations-on-Real-Codes.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>9:00am</td>
     <td>400/402</td>
     <td>Tobias Grosser (ETH Zurich)</td>
     <td>Polly-ACC - Accelerator support with Polly-ACC</td>
     <td><a href="#grosser">[Abstract]</a> [<a href="Polly-ACC-Transparent-Compilation-to-Heterogeneous-Hardware.pptx">Slides</a>]</td>
   </tr>
   <tr>
     <td>9:30am</td>
     <td>400/402</td>
     <td>Tao Schardl and William Moses (MIT)</td>
     <td>The Tapir Extension to LLVM's Intermediate Representation for Fork-Join Parallelism</td>
     <td><a href="#schardl">[Abstract]</a> [<a href="tapir-llvm.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>10:00-10:30</td>
     <td>616AB</td>
     <td colspan=3>Break</td>
   </tr>
   <tr>
     <td>&nbsp;</td>
     <td>&nbsp;</td>
     <td colspan=3><b>Session 2: Performance in Libraries and Languages</b></td>
   </tr>
   <tr>
     <td>10:30am</td>
     <td>400/402</td>
     <td>Hal Finkel (Argonne National Laboratory)</td>
     <td>Modeling restrict-qualified pointers in LLVM</td>
     <td><a href="#finkel">[Abstract]</a> [<a href="Restrict-Qualified-Pointers-in-LLVM.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>11am</td>
     <td>400/402</td>
     <td>Pranav Bhandarkar, Anshuman Dasgupta, Ron Lieberman, Dan Palermo (Qualcomm Innovation Center), Dillon Sharlet and Andrew Adams (Google)</td>
     <td>Halide for Hexagon DSP with Hexagon Vector eXtensions (HVX) using LLVM</td>
     <td><a href="#bhandarkar">[Abstract]</a> [<a href="Halide-for-Hexagon-DSP-with-Hexagon-Vector-eXtensions-HVX-using-LLVM.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>11:30am</td>
     <td>400/402</td>
     <td>Aditya Kumar and Sebastian Pop (Samsung Austin R&amp;D Center)</td>
     <td>Performance analysis of libcxx</td>
     <td><a href="#kumar">[Abstract]</a> [<a href="Performance-analysis-of-libcxx.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>12:00-1:30</td>
     <td>&nbsp;</td>
     <td colspan=3>Lunch</td>
   </tr>
   <tr>
     <td>&nbsp;</td>
     <td>&nbsp;</td>
     <td colspan=3><b>Session 3: Whole-application performance tuning</b></td>
   </tr>
   <tr>
     <td>1:30pm</td>
     <td>400/402</td>
     <td>Brian Railing (CMU)</td>
     <td>Improving LLVM Instrumentation Overheads</td>
     <td><a href="#railing">[Abstract]</a> [<a href="Improving-LLVM-Instrumentation-Overheads.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>2pm</td>
     <td>400/402</td>
     <td>Sergei Larin, Harsha Jagasia and Tobias Edler von Koch (Qualcomm Innovation Center)</td>
     <td>Impact of the current LLVM inlining strategy on complex embedded application memory utilization and performance</td>
     <td><a href="#larin">[Abstract]</a> [<a href="Impact-of-the-current-LLVM-inlining-strategy.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>2:30pm</td>
     <td>400/402</td>
     <td>Mehdi Amini (Apple)</td>
     <td>LTO/ThinLTO BoF</td>
     <td><a href="#amini">[Abstract]</a></td>
   </tr>
   <tr>
     <td>3:00-3:30</td>
     <td>616AB</td>
     <td colspan=3>Break</td>
   </tr>
   <tr>
     <td>&nbsp;</td>
     <td>&nbsp;</td>
     <td colspan=3><b>Session 4: Backend optimizations</b></td>
   </tr>
   <tr>
     <td>3:30pm</td>
     <td>400/402</td>
     <td>Krzysztof Parzyszek (Qualcomm Innovation Center)</td>
     <td>Register Data Flow framework</td>
     <td><a href="#krzy">[Abstract]</a> [<a href="Register-Data-Flow-Framework.pptx">Slides</a>]</td>
   </tr>
   <tr>
     <td>4pm</td>
     <td>400/402</td>
     <td>Evandro Menezes, Sebastian Pop and Aditya Kumar (Samsung Austin R&amp;D Center)</td>
     <td>Efficient clustering of case statements for indirect branch predictors</td>
     <td><a href="#menezes">[Abstract]</a> [<a href="Efficient-clustering-of-case-statements-for-indirect-branch-prediction.pdf">Slides</a>]</td>
   </tr>
   <tr>
     <td>4:30pm</td>
     <td>&nbsp;</td>
     <td colspan=3>Workshop ends.</td>
   </tr>
  </table>
 </p>

 <div class="www_sectiontitle">Abstracts</div>
 <p>
   <ul>
     <li> <a id="krzy"><b>Krzysztof Parzyszek</b>: Register Data Flow framework</a>
       <p>
         Register Data Flow is a framework implemented in LLVM that enables
         data-flow optimizations on machine IR after register allocation. While
         most of the data-flow optimizations on machine IR take place during the
         SSA phase, when virtual registers obey the static single assignment
         form, passes like pseudo-instruction expansion or frame index
         replacement may expose opportunities for further optimizations. At the
         same time, data-flow analysis is much more complicated after register
         allocation, and implementing compiler passes that require it may not
         seem like a worthwhile investment. The intent of RDF is to abstract this
         analysis and provide access to it through a familiar and convenient
         interface.
       </p>
       <p>
         The central concept in RDF is a data-flow graph, which emulates SSA. In
         contrast to the SSA-based optimization phase where SSA is a part of the
         program representation, the RDF data-flow graph is a separate, auxiliary
         structure. It can be built on demand and it does not require any
         modifications to the program. Traversal of the graph can provide
         information about reaching definitions of any given register access, as
         well as reached definitions and reached uses for register
         definitions. The graph provides connections for easily locating the
         corresponding elements of the machine IR. A utility class that
         recalculates basic block live-in information is implemented to make
         writing whole-function optimizations easier. In this talk, I will give
         an overview of RDF and its use in the Hexagon backend.
       </p>
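       <p>
         As a rough illustration of what "reaching definitions of any given
         register access" means, the sketch below links each register use in a
         single straight-line block to the definition that reaches it. It is
         not the RDF implementation or its API: the function name
         <code>link_uses_to_defs</code> and the (defs, uses) instruction
         encoding are assumptions made for this example, and control flow,
         phi nodes, and register aliasing are all ignored.
       </p>
<pre>
```python
def link_uses_to_defs(instructions):
    """Map each register use to the index of the definition that reaches it.

    instructions is a list of (defs, uses) pairs of register names for one
    straight-line block (a hypothetical encoding, not LLVM machine IR).
    A use with no earlier definition in the block is linked to None,
    meaning the value is live-in to the block.
    """
    last_def = {}   # register name -> index of its most recent definition
    links = {}      # (instruction index, register) -> reaching definition
    for index, (defs, uses) in enumerate(instructions):
        for reg in uses:     # uses read the state before this instruction
            links[(index, reg)] = last_def.get(reg)
        for reg in defs:     # then the instruction's own definitions apply
            last_def[reg] = index
    return links


block = [
    (["r0"], []),            # 0: r0 = ...
    (["r1"], ["r0"]),        # 1: r1 = f(r0)
    (["r0"], ["r0", "r1"]),  # 2: r0 = g(r0, r1)
    ([], ["r0", "r2"]),      # 3: use r0, r2
]
links = link_uses_to_defs(block)
```
</pre>
       <p>
         Here the use of <code>r0</code> in instruction 3 is linked to
         instruction 2, while <code>r2</code> has no definition in the block
         and is treated as live-in.
       </p>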
     </li>
     <li> <a id="schardl"><b>Tao Schardl and William Moses</b>: The Tapir Extension to LLVM's Intermediate Representation for Fork-Join Parallelism</a>
       <p>
         This talk explores how fork-join parallelism, as supported by
         dynamic-multithreading concurrency platforms such as Cilk and
         OpenMP, can be embedded into a compiler's intermediate
         representation (IR). Mainstream compilers typically treat parallel
         linguistic constructs as syntactic sugar for function calls into a
         parallel runtime. These calls prevent the compiler from performing
         optimizations across parallel control flow. Remedying this
         situation, however, is generally thought to require an extensive
         reworking of compiler analyses and code transformations to handle
         parallel semantics.
       </p>
       <p>
         Tapir is a compiler IR that represents logically parallel tasks
         asymmetrically in the program's control flow graph. Tapir allows
         the compiler to optimize across parallel control flow with only
         minor changes to its existing analyses and code transformations. To
         prototype Tapir in the LLVM compiler, for example, we added or
         modified approximately 5000 lines of LLVM's approximately
         3-million-line codebase. Tapir enables many traditional compiler
         optimizations for serial code, including loop-invariant-code motion,
         common-subexpression elimination, and tail-recursion elimination, to
         optimize across parallel control flow, as well as purely parallel
         optimizations.
       </p>
       <p>
         This work was conducted in collaboration with Charles E. Leiserson.
         The proposal is a preliminary copy of our paper on Tapir, which will
         appear at PPoPP 2017. This talk will focus on the technical details
         of implementing Tapir in LLVM.
       </p>
     </li>
     <li> <a id="kumar"><b>Aditya Kumar, Sebastian Pop, and Laxman Sole</b>: Performance analysis of libcxx</a>
       <p>
         We will discuss recent improvements to libcxx and planned future
         work. This includes improvements to standard library functions such as
         string::find and basic_streambuf::xsgetn. These functions were
         suboptimal and we got huge improvements after optimizing
         them. Similarly, we enabled the inlining of the constructor and
         destructor of std::string. We will present a systematic analysis of function
         attributes in libc++ and the places where we added missing
         attributes. We will present a comparative analysis of clang-libc++
         vs. gcc-libstdc++ on representative benchmarks. Finally, we will talk
         about our contributions to google-benchmark, which comes with libc++, to
         help keep track of performance regressions.
       </p>
     </li>
     <li> <a id="finkel"><b>Hal Finkel</b>: Modeling restrict-qualified pointers in LLVM</a>
       <p>
         It is not always possible for a compiler to statically determine enough
         about the pointer-aliasing properties of a program, especially for
         functions which need to be considered in isolation, to generate the
         highest-performance code possible. Multiversioning can be employed but
         its effectiveness is limited by the combinatorially-large number of
         potential configurations. To address these practical problems, the C
         standard introduced the restrict keyword which can adorn pointer
         variables. The restrict keyword can be used by the programmer to convey
         pointer-aliasing information to the optimizer. Often, this is
         information that is difficult or impossible for the optimizer to deduce
         on its own.
       </p>
       <p>
         The semantics of restrict, however, are subtle and rely on source-level
         constructs that are not generally represented within LLVM's
         IR. Maximally maintaining the aliasing information correctly in the face
         of function inlining and other code-motion transformations, without
         interfering with those transformations, is not trivial. While LLVM has
         long used restrict-qualified pointers that are function arguments, and an
         initial phase of this work provided a way to preserve this information
         in the face of function inlining, I'll describe a new scheme in LLVM
         that allows the representation of aliasing information from block-local
         restrict-qualified pointers as well. This more-general class of
         restrict-qualified pointers is widely used in scientific code.
       </p>
       <p>
         In this talk, I'll cover the use cases for restrict-qualified pointers,
         the difficulties in representing their semantics at the IR level, why
         the existing aliasing metadata cannot represent restrict-qualified
         pointers effectively, how the proposed representation allows the
         preservation of these semantics with minimal impact to the optimizer,
         and how the optimizer can use this information to generate
         higher-performance code. I'll also discuss how this scheme relates to
         others related to pointer variables (e.g. TBAA and alignment
         assumptions).
       </p>
     </li>
     <li> <a id="amini"><b>Mehdi Amini</b>: LTO/ThinLTO BoF</a>
       <p>
         LTO is an important technique for getting the maximum performance from
         the compiler. We presented the ThinLTO model and implementation in LLVM
         at the last LLVM Dev Meeting. This provided the audience with a good
         overview of the high-level flow of ThinLTO and the three-phase split
         involved.
       </p>
       <p>
         The proposal for this BoF is to gather and discuss the existing
         user experience, the current limitations, and the features folks
         expect the most from ThinLTO. We can also go over the
         optimizations currently in development upstream.
       </p>
     </li>
     <li> <a id="doerfert"><b>Johannes Doerfert</b>: Polyhedral "Driven" Optimizations on Real Codes</a>
       <p>In this talk I will present polyhedral "driven" optimizations on real
         codes.  The term polyhedral "driven" is used as there are two flavors of
         optimization I want to discuss (depending on my progress and the
         duration of the talk).
       </p>
       <p>
         The first follows the classical approach applied by LLVM/Polly but with
         special consideration of general benchmarks like SPEC. I will show how
         LLVM/Polly can be used to perform beneficial optimizations in (at least)
         libquantum, hmmer, lbm and bzip2. I will also discuss what I think is
         needed to identify such optimization opportunities automatically.
       </p>
       <p>
         The second polyhedral driven optimization I want to present is a
         conceptual follow-up of the "Polyhedral Info" GSoC project. This project
         was the first attempt to augment LLVM analysis and transformation passes
         with polyhedral information.  While the project was built on top of
         LLVM/Polly, I will present an alternative approach. First I will
         introduce a modular, demand-driven and caching polyhedral program
         analysis that natively integrates into the existing LLVM pipeline. Then
         I will show how to utilize this analysis in existing LLVM optimizations
         to improve performance. Finally, I will use the polyhedral analysis to
         derive new, complex control flow optimizations that are not, or only in
         a simpler form, present in LLVM.
       </p>
     </li>
     <li> <a id="grosser"><b>Tobias Grosser</b>: Polly-ACC - Accelerator support with Polly-ACC</a>
       <p>
         Programming today's increasingly complex heterogeneous hardware is
         difficult, as it commonly requires the use of data-parallel languages,
         pragma annotations, specialized libraries, or DSL compilers. Adding
         explicit accelerator support into a larger code base is not only costly,
         but also introduces additional complexity that hinders long-term
         maintenance. We propose a new heterogeneous compiler that brings us
         closer to the dream of automatic accelerator mapping. Starting from a
         sequential compiler IR, we automatically generate a hybrid executable
         that - in combination with a new data management system - transparently
         offloads suitable code regions. Our approach is almost regression free
         for a wide range of applications while improving a range of compute
         kernels as well as two full SPEC CPU applications. We expect our work to
         reduce the initial cost of accelerator usage and to free developer time
         to investigate algorithmic changes.
       </p>
     </li>
     <li> <a id="railing"><b>Brian Railing</b>: Improving LLVM Instrumentation Overheads</a>
       <p>
         The behavior and structure of a shared-memory parallel program can be
         characterized by a task graph that encodes the instructions, memory
         accesses, and dependencies of each piece of parallel work. The task
         graph representation can encode the actions of any threading library and
         is agnostic to the target architecture. Contech [1] is an LLVM-based
         tool that generates a task graph representation, by instrumenting the
         program when it is compiled such that it ultimately outputs a task graph
         when executed. This paper describes several approaches to reducing the
         overhead of Contech's instrumentation by augmenting the static compiler
         analysis.
       </p>
       <p>
         The additional analyses are able to first determine similar memory
         address calculations in the LLVM intermediate representation and elide
         them from the instrumentation to reduce the data recorded, an approach
         only previously attempted with dynamic binary instrumentation based on
         common registers [2] [3]. Second, this analysis is supplemented by
         performing tail duplication which increases the memory operations in a
         single basic block and therefore may provide further opportunities to
         elide instrumentation, without compromising the accuracy or detail of
         the data recorded. These optimizations reduce the data recorded by 22%,
         which has a proportionate decrease in overhead from 3.7x to 3.3x for
         PARSEC benchmarks.
       </p>
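       <p>
         The idea of eliding instrumentation for similar address calculations
         can be sketched as follows. This is not Contech's analysis: the
         function name <code>plan_instrumentation</code> and the
         (base, offset) access encoding are assumptions for illustration. The
         sketch assumes that every access in a basic block is a constant
         offset from a named base pointer that is not redefined in the block,
         so only the first access per base needs its address recorded at run
         time.
       </p>
<pre>
```python
def plan_instrumentation(block_accesses):
    """Decide which memory accesses in one basic block must be recorded.

    Each access is a (base, offset) pair, where base names a pointer value
    and offset is a compile-time constant (a hypothetical encoding). Only
    the first access per base is recorded at run time; later ones are kept
    as a static delta from that recorded access.
    """
    recorded = []     # indices of accesses whose address is logged
    elided = {}       # access index -> (index of recorded access, delta)
    first_seen = {}   # base -> (index, offset) of the recorded access
    for index, (base, offset) in enumerate(block_accesses):
        if base in first_seen:
            ref_index, ref_offset = first_seen[base]
            elided[index] = (ref_index, offset - ref_offset)
        else:
            first_seen[base] = (index, offset)
            recorded.append(index)
    return recorded, elided


accesses = [("p", 0), ("p", 8), ("q", 4), ("p", 16)]
recorded, elided = plan_instrumentation(accesses)
```
</pre>
       <p>
         In this example two of the four accesses are elided, and their
         addresses can be rebuilt offline from the recorded access plus the
         static delta. Larger basic blocks, such as those produced by tail
         duplication, give this kind of pass more accesses to match.
       </p>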
       <p>
         [1] B. P. Railing, E. R. Hein, and T. M. Conte. "Contech: Efficiently
         Generating Dynamic Task Graphs for Arbitrary Parallel Programs". In: ACM
         Trans. Archit. Code Optim. 12.2 (July 2015), 25:1-25:24.
       </p>
       <p>
         [2] Q. Zhao, I. Cutcutache, and W.-F. Wong. "Pipa: Pipelined Profiling
         and Analysis on Multi-core Systems". In: Proceedings of the 6th Annual
         IEEE/ACM International Symposium on Code Generation and
         Optimization. CGO '08. Boston, MA, USA: ACM, 2008, pp. 185-194.
       </p>
       <p>
         [3] K. Jee et al. "ShadowReplica: Efficient Parallelization of Dynamic
         Data Flow Tracking". In: Proceedings of the 2013 ACM SIGSAC Conference
         on Computer &#38; Communications Security. CCS '13. Berlin, Germany:
         ACM, 2013, pp. 235-246.
       </p>
     </li>
     <li> <a id="menezes"><b>Evandro Menezes, Sebastian Pop, and Aditya Kumar</b>: Efficient clustering of case statements for indirect branch predictors</a>
       <p>
         We present an O(n log n) algorithm as implemented in LLVM to compile a
         switch statement into jump tables. To generate jump tables that can be
         efficiently predicted by current hardware branch predictors, we added an
         upper bound on the number of entries in each generated jump table. This
         modification of the previously best known algorithm reduces the
         complexity from O(n^2) to O(n log n).  We illustrate the performance
         achieved by the improved algorithm on the Samsung Exynos-M1 processor
         running several benchmarks.
       </p>
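       <p>
         To make the effect of a bound on jump table size concrete, here is a
         minimal greedy sketch that splits sorted case values into jump-table
         candidates. It is not the algorithm implemented in LLVM: the function
         name <code>cluster_cases</code> and the <code>max_entries</code> and
         <code>min_density</code> thresholds are assumptions chosen for
         illustration. After sorting, the scan is linear, so the whole sketch
         runs in O(n log n).
       </p>
<pre>
```python
def cluster_cases(case_values, max_entries=8, min_density=0.4):
    """Greedily group sorted case values into jump-table candidates.

    A cluster is closed as soon as adding the next value would make the
    table span more than max_entries slots, or would drop its density
    (cases per slot) under min_density. Both thresholds are illustrative.
    """
    clusters = []
    current = []
    for value in sorted(set(case_values)):  # sorting dominates: O(n log n)
        if current:
            span = value - current[0] + 1        # slots the table would need
            density = (len(current) + 1) / span  # fraction of used slots
            if span > max_entries or min_density > density:
                clusters.append(current)
                current = []
        current.append(value)
    if current:
        clusters.append(current)
    return clusters


sparse = cluster_cases([1, 2, 3, 4, 100, 101, 102, 50])
bounded = cluster_cases(range(10), max_entries=4)
```
</pre>
       <p>
         With <code>max_entries=4</code>, ten consecutive cases are split into
         tables of at most four entries each, where an unbounded scheme would
         emit one ten-entry table.
       </p>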
     </li>
     <li> <a id="bhandarkar"><b>Pranav Bhandarkar, Anshuman Dasgupta, Ron Lieberman, Dan Palermo, Dillon Sharlet, and Andrew Adams</b>: Halide for Hexagon DSP with Hexagon Vector eXtensions (HVX) using LLVM</a>
       <p>
         Halide is a domain specific language that endeavors to make it easier to
         construct large and composite image processing applications. Halide is
         unique in its design approach to decoupling the algorithm from the
         organization (schedule) of the computation. Algorithms once written and
         tested for correctness can then be continually tuned for performance as
         Halide allows for easily changing the schedule - tiling, parallelizing,
         prefetching or vectorizing different dimensions of the loop nest that
         form the structure of the algorithm.
       </p>
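       <p>
         The separation of algorithm and schedule can be illustrated without
         Halide itself. The plain-Python sketch below (not Halide code; the
         names <code>brighten</code>, <code>run_row_major</code>, and
         <code>run_tiled</code> are made up for this example) defines one
         per-pixel computation and evaluates it under two loop orders. Both
         orders produce the same image, which is why a schedule can be tuned
         after the algorithm has been tested for correctness.
       </p>
<pre>
```python
def brighten(pixel):
    """The algorithm: what each output pixel is, independent of loop order."""
    return min(pixel + 10, 255)


def run_row_major(image):
    """Schedule A: visit pixels row by row."""
    return [[brighten(p) for p in row] for row in image]


def run_tiled(image, tile=2):
    """Schedule B: visit pixels tile by tile (tile size is an assumption)."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    out[y][x] = brighten(image[y][x])
    return out


image = [[0, 100, 250], [5, 6, 7], [200, 249, 255]]
row_major_result = run_row_major(image)
tiled_result = run_tiled(image)
```
</pre>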
       <p>
         Halide programs are transformed into the Halide Intermediate
         Representation (IR) by the Halide compiler. This IR is analyzed and
         optimized before generating LLVM bitcode for the target
         requested. Halide links with the LLVM optimizer and codegen libraries
         for supported targets, and uses these to generate object code.
       </p>
       <p>
         In this workshop, we will present our work on retargeting Halide to the
         Hexagon DSP with focus on the Hexagon Vector eXtensions (HVX).
       </p>
       <p>
         Our workshop will present the Halide constructs used in a simple blur
         5x5, the corresponding Halide IR, and a few of the important LLVM
         Hexagon passes which generate HVX vector instructions.
       </p>
       <p>
         We will demonstrate compilation using LLVM.org and Halide.org tools, and
         execution of the blur 5x5 pipeline on a Snapdragon 820 development board
         using the Halide Hexagon offloader. In particular we will demonstrate
         the various improvements which can be realized with scheduling changes.
       </p>
     </li>
     <li> <a id="larin"><b>Sergei Larin, Harsha Jagasia and Tobias Edler von Koch</b>: Impact of the current LLVM inlining strategy on complex embedded application memory utilization and performance</a>
       <p>
         Sophisticated embedded applications with an extensive and fine degree of
         memory management present a unique challenge to contemporary tool
         chains. Like many open source projects, LLVM tunes its core
         optimization tradeoffs for common cases and a set of common
         architectures. Even with back-end-specific hooks, it is not always
         possible to exert an appropriate degree of control over some key
         optimizations. We propose a case study on "in-depth" analysis of LLVM
         PGO-assisted inlining in a complex embedded application.
       </p>
       <p>
         The program in question is a large-scale embedded networking application
         designed to be custom tuned for a variety of actual embedded platforms
         with a range of memory and performance constraints. It makes heavy use
         of linker scripts to configure and fine-tune memory assignment to
         ultimately guarantee optimal performance in a constrained memory
         environment while being extremely power conscious.
       </p>
       <p>
         The moment a tool chain addresses a non-uniform memory model, a "one
         size fits all" approach to optimizations like inlining stops being
         optimal. For instance, based on section assignment, completely unknown
         to the compiler, inlining takes place in areas that face different
         cost/benefit tradeoffs. The content of L1 and L2 Icache should not be
         "enlarged" even if performance can theoretically improve. Inlining
         across such section boundaries is also inadvisable, since control
         flow exchange (jump) between sections destined for different levels of
         the memory hierarchy can have unexpected performance
         implications. Finally, tightly budgeted low-level and high-performance
         memories might swell beyond their physical limits.
       </p>
       <p>
         The current state of the LLVM inliner is somewhat transitional in
         anticipation of structural updates to the pass manager, and as such it
         still relies strongly on heuristic + PGO based inline cost
         computation. In such a situation the introduction of back-end hooks might
         allow targets to fine-tune inlining decisions to some degree, but they
         still fall far short of the degree of control needed by the
         above-described systems. An additional challenge is posed by the high
         degree of complexity involved in capturing actual system run-time
         behavior, and even in collecting appropriate traces to generate
         meaningful PGO data. Battery-powered embedded chips rarely have
         sophisticated tracing capabilities, yet present extremely complex
         run-time environments.
       </p>
     </li>
   </ul>
 </p>

 <div class="www_sectiontitle">Call for Speakers</div>

 <p>
   We invite speakers from academia and industry to present their work on the
   following list of topics (including, but not limited to):
   <ul>
     <li>improving performance and size of code generated by LLVM,</li>
     <li>improving performance of LLVM's runtime libraries,</li>
     <li>tools developed with LLVM for performance analysis of compiler generated code,</li>
     <li>bots and trackers of performance over time,</li>
     <li>improving the security of generated code,</li>
     <li>any other topic related to improving and maintaining the performance and quality of LLVM generated code.</li>
   </ul>
   While the primary focus of the workshop is on these topics, we welcome any
   submission related to the LLVM compiler infrastructure, its sub-projects
   (Clang, Linker, libraries), and its use in industry and academia.
 </p>

 <p>
   We are looking for:
 </p>
 <ul>
   <li>keynote speakers,</li>
   <li>technical presentations: 30 minutes plus questions and discussion,</li>
   <li>tutorials,</li>
   <li>BOFs.</li>
 </ul>

 <p>
   Proposals should provide enough information for the review committee to be
   able to judge the quality of the submission. Proposals can be submitted in
   the form of an extended abstract, full paper, or slides.  Proposals should be
   submitted to
   <a href="https://easychair.org/conferences/?conf=llvmcgo2017">Easychair
   LLVM-CGO 2017</a>.  The deadline for receiving submissions is December 1st,
   2016.  Speakers will be notified of acceptance or rejection by December 15.
 </p>

 <p>
   Workshop organization: Sebastian Pop, Aditya Kumar, Tobias Edler von Koch, and
   Tanya Lattner.
 </p>

 <!-- *********************************************************************** -->
 <hr>

 <!--#include virtual="../../footer.incl" -->