commit	39db5753f993abcc4289dd165e8297a4e28f4b0a	[log] [tgz]
author	David Green <david.green@arm.com>	Thu Jan 21 21:03:41 2021 +0000
committer	David Green <david.green@arm.com>	Thu Jan 21 21:03:41 2021 +0000
tree	9bc109f1ef2eaee7dbb1207ec64448de4bcfdea8
parent	2f03528f5e7fd9df0a12091392e000c697497262 [diff]

[LV][ARM] Inloop reduction cost modelling

This adds cost modelling for the inloop vectorization added in
745bf6cf4471. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add(%m)
  ->
  R = VMLADAV A, B

There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.

Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.

This patch attempt to improve the costmodelling of in-loop reductions
by:
 - Adding some pattern matching in the loop vectorizer cost model to
   match extended reduction patterns that are optionally extended and/or
   MLA patterns. This marks the cost of the reduction instruction correctly
   and the sext/zext/mul leading up to it as free, which is otherwise
   difficult to tell and may get a very high cost. (In the long run this
   can hopefully be replaced by vplan producing a single node and costing
   it correctly, but that is not yet something that vplan can do).
 - getExtendedAddReductionCost is added to query the cost of these
   extended reduction patterns.
 - Expanded the ARM costs to account for these expanded sizes, which is a
   fairly simple change in itself.
 - Some minor alterations to allow inloop reduction larger than the highest
   vector width and i64 MVE reductions.
 - An extra InLoopReductionImmediateChains map was added to the vectorizer
   for it to efficiently detect which instructions are reductions in the
   cost model.
 - The tests have some updates to show what I believe is optimal
   vectorization and where we are now.

Put together this can greatly improve performance for reduction loop
under MVE.

Differential Revision: https://reviews.llvm.org/D93476

9 files changed

tree: 9bc109f1ef2eaee7dbb1207ec64448de4bcfdea8

README.md

The LLVM Compiler Infrastructure

This directory and its sub-directories contain source code for LLVM, a toolkit for the construction of highly optimized compilers, optimizers, and run-time environments.

The README briefly describes how to get started with building LLVM. For more information on how to contribute to the LLVM project, please take a look at the Contributing to LLVM guide.

Getting Started with the LLVM System

Taken from https://llvm.org/docs/GettingStarted.html.

Overview

Welcome to the LLVM project!

The LLVM project has multiple components. The core of the project is itself called “LLVM”. This contains all of the tools, libraries, and header files needed to process intermediate representations and converts it into object files. Tools include an assembler, disassembler, bitcode analyzer, and bitcode optimizer. It also contains basic regression tests.

C-like languages use the Clang front end. This component compiles C, C++, Objective-C, and Objective-C++ code into LLVM bitcode -- and from there into object files, using LLVM.

Other components include: the libc++ C++ standard library, the LLD linker, and more.

Getting the Source Code and Building LLVM

The LLVM Getting Started documentation may be out of date. The Clang Getting Started page might have more accurate information.

This is an example work-flow and configuration to get and build the LLVM source:

Checkout LLVM (including related sub-projects like Clang):
- git clone https://github.com/llvm/llvm-project.git
- Or, on windows, git clone --config core.autocrlf=false https://github.com/llvm/llvm-project.git
Configure and build LLVM and Clang:
- cd llvm-project
- mkdir build
- cd build
- cmake -G <generator> [options] ../llvm
  Some common build system generators are:
  - Ninja --- for generating Ninja build files. Most llvm developers use Ninja.
  - Unix Makefiles --- for generating make-compatible parallel makefiles.
  - Visual Studio --- for generating Visual Studio projects and solutions.
  - Xcode --- for generating Xcode projects.
  Some Common options:
  - -DLLVM_ENABLE_PROJECTS='...' --- semicolon-separated list of the LLVM sub-projects you'd like to additionally build. Can include any of: clang, clang-tools-extra, libcxx, libcxxabi, libunwind, lldb, compiler-rt, lld, polly, or debuginfo-tests.
    For example, to build LLVM, Clang, libcxx, and libcxxabi, use -DLLVM_ENABLE_PROJECTS="clang;libcxx;libcxxabi".
  - -DCMAKE_INSTALL_PREFIX=directory --- Specify for directory the full path name of where you want the LLVM tools and libraries to be installed (default /usr/local).
  - -DCMAKE_BUILD_TYPE=type --- Valid options for type are Debug, Release, RelWithDebInfo, and MinSizeRel. Default is Debug.
  - -DLLVM_ENABLE_ASSERTIONS=On --- Compile with assertion checks enabled (default is Yes for Debug builds, No for all other build types).
- cmake --build . [-- [options] <target>] or your build system specified above directly.
  - The default target (i.e. ninja or make) will build all of LLVM.
  - The check-all target (i.e. ninja check-all) will run the regression tests to ensure everything is in working order.
  - CMake will generate targets for each tool and library, and most LLVM sub-projects generate their own check-<project> target.
  - Running a serial build will be slow. To improve speed, try running a parallel build. That's done by default in Ninja; for make, use the option -j NNN, where NNN is the number of parallel jobs, e.g. the number of CPUs you have.
- For more information see CMake

Consult the Getting Started with LLVM page for detailed information on configuring and compiling LLVM. You can visit Directory Layout to learn about the layout of the source code tree.