docs/Proposals/VectorizationPlan.rst - llvm - Git at Google

 ==================
 Vectorization Plan
 ==================

 .. contents::
    :local:

 Abstract
 ========
 The vectorization transformation can be rather complicated, involving several
 potential alternatives, especially for outer-loops [1]_ but also possibly for
 innermost loops. These alternatives may have significant performance impact,
 both positive and negative. A cost model is therefore employed to identify the
 best alternative, including the alternative of avoiding any transformation
 altogether.

 The Vectorization Plan is an explicit model for describing vectorization
 candidates. It serves for both optimizing candidates including estimating their
 cost reliably, and for performing their final translation into IR. This
 facilitates dealing with multiple vectorization candidates.

 High-level Design
 =================

 Vectorization Workflow
 ----------------------
 VPlan-based vectorization involves three major steps, taking a "scenario-based
 approach" to vectorization planning:

 1. Legal Step: check if a loop can be legally vectorized; encode constraints and
    artifacts if so.
 2. Plan Step:

    a. Build initial VPlans following the constraints and decisions taken by
       Legal Step 1, and compute their cost.
    b. Apply optimizations to the VPlans, possibly forking additional VPlans.
       Prune sub-optimal VPlans having relatively high cost.
 3. Execute Step: materialize the best VPlan. Note that this is the only step
    that modifies the IR.

 Design Guidelines
 -----------------
 In what follows, the term "input IR" refers to code that is fed into the
 vectorizer whereas the term "output IR" refers to code that is generated by the
 vectorizer. The output IR contains code that has been vectorized or "widened"
 according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
 according to an Unroll Factor (UF).
 The design of VPlan follows several high-level guidelines:

 1. Analysis-like: building and manipulating VPlans must not modify the input IR.
    In particular, if the best option is not to vectorize at all, the
    vectorization process terminates before reaching Step 3, and compilation
    should proceed as if VPlans had not been built.

 2. Align Cost & Execute: each VPlan must support both estimating the cost and
    generating the output IR code, such that the cost estimation evaluates the
    to-be-generated code reliably.

 3. Support vectorizing additional constructs:

    a. Outer-loop vectorization. In particular, VPlan must be able to model the
       control-flow of the output IR which may include multiple basic-blocks and
       nested loops.
    b. SLP vectorization.
    c. Combinations of the above, including nested vectorization: vectorizing
       both an inner loop and an outer-loop at the same time (each with its own
       VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
       inside [4]_, (re)vectorizing input IR containing vector code.
    d. Function vectorization [2]_.

 4. Support multiple candidates efficiently. In particular, similar candidates
    related to a range of possible VF's and UF's must be represented efficiently.
    Potential versioning needs to be supported efficiently.

 5. Support vectorizing idioms, such as interleaved groups of strided loads or
    stores. This is achieved by modeling a sequence of output instructions using
    a "Recipe", which is responsible for computing its cost and generating its
    code.

 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
    such regions may need to be, for example, predicated and linearized, or
    replicated VF*UF times to handle scalarized and predicated instructions.
    Innerloops are also modelled as SESE regions.

 Low-level Design
 ================
 The low-level design of VPlan comprises of the following classes.

 :LoopVectorizationPlanner:
   A LoopVectorizationPlanner is designed to handle the vectorization of a loop
   or a loop nest. It can construct, optimize and discard one or more VPlans,
   each VPlan modelling a distinct way to vectorize the loop or the loop nest.
   Once the best VPlan is determined, including the best VF and UF, this VPlan
   drives the generation of output IR.

 :VPlan:
   A model of a vectorized candidate for a given input IR loop or loop nest. This
   candidate is represented using a Hierarchical CFG. VPlan supports estimating
   the cost and driving the generation of the output IR code it represents.

 :Hierarchical CFG:
   A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
   Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
   cross-Tile edges are lifted to connect Tiles instead of the original
   basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
   Region and Block are used rather than Tile [5]_ to avoid confusion with loop
   tiling.

 :VPBlockBase:
   The building block of the Hierarchical CFG. A pure-virtual base-class of
   VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
   control-flow relations with other VPBlocks. Note that in contrast to the IR
   BasicBlock, a VPBlockBase models its control-flow successors and predecessors
   directly, rather than through a Terminator branch or through predecessor
   branches that "use" the VPBlockBase.

 :VPBasicBlock:
   VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
   Hierarchical CFG. It represents a sequence of output IR instructions that will
   appear consecutively in an output IR basic-block. The instructions of this
   basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
   sequence of zero or more VPRecipes that model the cost and generation of the
   output IR instructions.

 :VPRegionBlock:
   VPRegionBlock is a subclass of VPBlockBase. It models a collection of
   VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
   CFG. A VPRegionBlock may indicate that its contents are to be replicated a
   constant number of times when output IR is generated, effectively representing
   a loop with constant trip-count that will be completely unrolled. This is used
   to support scalarized and predicated instructions with a single model for
   multiple candidate VF's and UF's.

 :VPRecipeBase:
   A pure-virtual base class modeling a sequence of one or more output IR
   instructions, possibly based on one or more input IR instructions. These
   input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
   may specify how its ingredients are to be transformed to produce the output IR
   instructions; e.g., cloned once, replicated multiple times or widened
   according to selected VF.

 :VPTransformState:
   Stores information used for generating output IR, passed from
   LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
   additional information down to VPBlocks and VPRecipes.

 Related LLVM components
 -----------------------
 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
    tree, where TSLP [3]_ adds Plan Step 2.b.

 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
    Polly [7]_.

 References
 ----------
 .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
     Nuzman and Ayal Zaks, PACT 2008.

 .. [2] "Proposal for function vectorization and loop vectorization with function
     calls", Xinmin Tian, [`cfe-dev
     <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
     March 2, 2016.
     See also `review <https://reviews.llvm.org/D22792>`_.

 .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
     Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.

 .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
     overhead", Hao Zhou and Jingling Xue, CGO 2016.

 .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
     Brian Koblenz, PLDI 1991

 .. [6] "Structural analysis: A new approach to flow analysis in optimizing
     compilers", M. Sharir, Journal of Computer Languages, Jan. 1980

 .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
     thesis, 2011.

 .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
     European LLVM Developers' Meeting 2017.
	==================
	Vectorization Plan
	==================

	.. contents::
	:local:

	Abstract
	========
	The vectorization transformation can be rather complicated, involving several
	potential alternatives, especially for outer-loops [1]_ but also possibly for
	innermost loops. These alternatives may have significant performance impact,
	both positive and negative. A cost model is therefore employed to identify the
	best alternative, including the alternative of avoiding any transformation
	altogether.

	The Vectorization Plan is an explicit model for describing vectorization
	candidates. It serves for both optimizing candidates including estimating their
	cost reliably, and for performing their final translation into IR. This
	facilitates dealing with multiple vectorization candidates.

	High-level Design
	=================

	Vectorization Workflow
	----------------------
	VPlan-based vectorization involves three major steps, taking a "scenario-based
	approach" to vectorization planning:

	1. Legal Step: check if a loop can be legally vectorized; encode constraints and
	artifacts if so.
	2. Plan Step:

	a. Build initial VPlans following the constraints and decisions taken by
	Legal Step 1, and compute their cost.
	b. Apply optimizations to the VPlans, possibly forking additional VPlans.
	Prune sub-optimal VPlans having relatively high cost.
	3. Execute Step: materialize the best VPlan. Note that this is the only step
	that modifies the IR.

	Design Guidelines
	-----------------
	In what follows, the term "input IR" refers to code that is fed into the
	vectorizer whereas the term "output IR" refers to code that is generated by the
	vectorizer. The output IR contains code that has been vectorized or "widened"
	according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
	according to an Unroll Factor (UF).
	The design of VPlan follows several high-level guidelines:

	1. Analysis-like: building and manipulating VPlans must not modify the input IR.
	In particular, if the best option is not to vectorize at all, the
	vectorization process terminates before reaching Step 3, and compilation
	should proceed as if VPlans had not been built.

	2. Align Cost & Execute: each VPlan must support both estimating the cost and
	generating the output IR code, such that the cost estimation evaluates the
	to-be-generated code reliably.

	3. Support vectorizing additional constructs:

	a. Outer-loop vectorization. In particular, VPlan must be able to model the
	control-flow of the output IR which may include multiple basic-blocks and
	nested loops.
	b. SLP vectorization.
	c. Combinations of the above, including nested vectorization: vectorizing
	both an inner loop and an outer-loop at the same time (each with its own
	VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
	inside [4]_, (re)vectorizing input IR containing vector code.
	d. Function vectorization [2]_.

	4. Support multiple candidates efficiently. In particular, similar candidates
	related to a range of possible VF's and UF's must be represented efficiently.
	Potential versioning needs to be supported efficiently.

	5. Support vectorizing idioms, such as interleaved groups of strided loads or
	stores. This is achieved by modeling a sequence of output instructions using
	a "Recipe", which is responsible for computing its cost and generating its
	code.

	6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
	such regions may need to be, for example, predicated and linearized, or
	replicated VF*UF times to handle scalarized and predicated instructions.
	Innerloops are also modelled as SESE regions.

	Low-level Design
	================
	The low-level design of VPlan comprises of the following classes.

	:LoopVectorizationPlanner:
	A LoopVectorizationPlanner is designed to handle the vectorization of a loop
	or a loop nest. It can construct, optimize and discard one or more VPlans,
	each VPlan modelling a distinct way to vectorize the loop or the loop nest.
	Once the best VPlan is determined, including the best VF and UF, this VPlan
	drives the generation of output IR.

	:VPlan:
	A model of a vectorized candidate for a given input IR loop or loop nest. This
	candidate is represented using a Hierarchical CFG. VPlan supports estimating
	the cost and driving the generation of the output IR code it represents.

	:Hierarchical CFG:
	A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
	Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
	cross-Tile edges are lifted to connect Tiles instead of the original
	basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
	Region and Block are used rather than Tile [5]_ to avoid confusion with loop
	tiling.

	:VPBlockBase:
	The building block of the Hierarchical CFG. A pure-virtual base-class of
	VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
	control-flow relations with other VPBlocks. Note that in contrast to the IR
	BasicBlock, a VPBlockBase models its control-flow successors and predecessors
	directly, rather than through a Terminator branch or through predecessor
	branches that "use" the VPBlockBase.

	:VPBasicBlock:
	VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
	Hierarchical CFG. It represents a sequence of output IR instructions that will
	appear consecutively in an output IR basic-block. The instructions of this
	basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
	sequence of zero or more VPRecipes that model the cost and generation of the
	output IR instructions.

	:VPRegionBlock:
	VPRegionBlock is a subclass of VPBlockBase. It models a collection of
	VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
	CFG. A VPRegionBlock may indicate that its contents are to be replicated a
	constant number of times when output IR is generated, effectively representing
	a loop with constant trip-count that will be completely unrolled. This is used
	to support scalarized and predicated instructions with a single model for
	multiple candidate VF's and UF's.

	:VPRecipeBase:
	A pure-virtual base class modeling a sequence of one or more output IR
	instructions, possibly based on one or more input IR instructions. These
	input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
	may specify how its ingredients are to be transformed to produce the output IR
	instructions; e.g., cloned once, replicated multiple times or widened
	according to selected VF.

	:VPTransformState:
	Stores information used for generating output IR, passed from
	LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
	additional information down to VPBlocks and VPRecipes.

	Related LLVM components
	-----------------------
	1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
	tree, where TSLP [3]_ adds Plan Step 2.b.

	2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
	Polly [7]_.

	References
	----------
	.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
	Nuzman and Ayal Zaks, PACT 2008.

	.. [2] "Proposal for function vectorization and loop vectorization with function
	calls", Xinmin Tian, [`cfe-dev
	<http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
	March 2, 2016.
	See also `review <https://reviews.llvm.org/D22792>`_.

	.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
	Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.

	.. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
	overhead", Hao Zhou and Jingling Xue, CGO 2016.

	.. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
	Brian Koblenz, PLDI 1991

	.. [6] "Structural analysis: A new approach to flow analysis in optimizing
	compilers", M. Sharir, Journal of Computer Languages, Jan. 1980

	.. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
	thesis, 2011.

	.. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
	European LLVM Developers' Meeting 2017.