blob: 287dbc15d7fc8260e98bab0c29cfd472a39ba2f5 [file] [edit]
====================================================
Loop Fusion in LLVM
====================================================
1. Introduction
===============
Loop fusion (also called loop jamming) is a compiler optimization that
merges two adjacent loops into a single loop, provided the
transformation preserves the program's original semantics. The
motivation is straightforward: by executing the bodies of two loops in
a single pass over the iteration space, we reduce loop overhead (fewer
branch instructions, fewer induction variable updates), improve
temporal data locality (data written by the first loop body and read
by the second is still in cache), and create new opportunities for
subsequent optimizations such as instruction scheduling and register
allocation.
LLVM's implementation resides in
``llvm/lib/Transforms/Scalar/LoopFuse.cpp`` and is based on
Christopher Barton's MSc thesis, *"Code Transformations to Augment the
Scope of Loop Fusion in a Production Compiler"*. The pass operates on
LLVM IR, leveraging several core analysis frameworks --
Scalar Evolution (SCEV), Dependence Analysis
(DA), and Dominator/Post-Dominator Trees -- to determine legality and
perform the CFG rewiring that fuses two loops into one.
2. Prerequisite Concepts
========================
The fusion pass relies on several standard LLVM loop concepts that
are documented elsewhere: simplified loop form, rotated loop form,
loop guard branches, dominator and post-dominator trees, Scalar
Evolution (SCEV), and trip count with peeling. The analysis that
drives the most involved legality check -- Dependence Analysis (DA)
-- is summarized below, because later sections refer directly to its
dependence kinds and direction vectors.
DA determines the dependence relation between pairs of memory
accesses. A dependence from instruction S1 to S2 is characterized by:
- **Flow (true) dependence**: S1 writes, S2 reads the same location
(read-after-write).
- **Anti dependence**: S1 reads, S2 writes (write-after-read).
- **Output dependence**: Both S1 and S2 write (write-after-write).
Each dependence carries a *direction vector* at each loop nesting
level, indicating whether the source iteration is less than (``<``),
equal to (``=``), or greater than (``>``) the sink iteration at that
level. A dependence with a ``>`` component at the current loop level
represents a *backward loop-carried dependence* (also called a
negative-distance dependence). Such dependences are the critical
hazard that loop fusion must respect.
3. High-Level Algorithm
=======================
The pass operates in a top-down, level-by-level fashion over the loop
nest tree. At each nesting depth, it:
1. **Collects fusion candidates**: Identifies each eligible loop and
partitions the eligible loops into ordered chains of
*control-flow equivalent, strictly adjacent* loops sorted in
dominance order.
2. **Attempts pairwise fusion**: Walks each chain linearly, testing
every consecutive pair ``(FC0, FC1)`` against the four legality
conditions. If all conditions hold, the pair is fused and replaced
by the fused loop in the chain, which is then considered for
further fusion with its successor.
3. **Descends to inner loops**: After processing all sibling groups at
the current depth, the pass descends one level and repeats.
This strategy means outermost loops are fused first. Fusing inner
loops is handled in subsequent iterations of the outer while-loop.
4. Phase 1: Candidate Collection
=================================
For each group of sibling loops (loops sharing a parent) at the
current depth, the pass gathers the loops that are eligible for
fusion.
4.1 Eligibility Check
----------------------
Each loop is first scanned for disqualifying properties. A loop is
rejected immediately if any of its blocks has its address taken, if
any instruction in the loop may throw, or if the loop contains a
volatile memory access. The pass cannot reason about exception-based
control flow or volatile ordering, so these are hard blockers.
Surviving loops must additionally satisfy four structural
requirements:
1. **Structural validity**: preheader, header, exiting block, exit
block, and latch all exist. This implicitly requires simplified
form.
2. **Computable trip count**: SCEV must be able to compute a
loop-invariant backedge-taken count. If not, the pass cannot
compare trip counts.
3. **Simplified form**: the loop must be in simplified form (single
preheader, dedicated exit blocks, single latch).
4. **Rotated form**: the loop must be in rotated form (do-while
shape, with the exit test at the latch). The exiting block must
be the latch.
If any check fails, the loop is discarded and an optimization remark
is emitted explaining why.
4.2 Grouping by Adjacency
---------------------------
Eligible loops are partitioned into ordered chains based on *strict
adjacency*. Two loops FC0 and FC1 are strictly adjacent if:
- **Unguarded loops**: FC0's exit block is exactly FC1's preheader
(i.e., there is no intervening code between the two loops).
- **Guarded loops**: FC0's entry block dominates FC1's entry block,
and FC0's exit block has a unique successor that is FC1's entry
block.
The algorithm appends each new loop to the first existing chain whose
last element is strictly adjacent to it. If no such chain exists, a
new chain is started. Because the input loops are already supplied in
dominance (program) order, this single-pass grouping correctly
partitions eligible loops into maximal chains of adjacent loops.
5. Phase 2: Pairwise Fusion Attempts
=====================================
The pass iterates over each chain and attempts to fuse consecutive
pairs ``(FC0, FC1)``. The following legality conditions are checked
in order, with early exit on failure.
5.1 Condition 1: Identical Trip Counts (Conformance)
------------------------------------------------------
The conformance check uses SCEV to retrieve the backedge-taken count
of both loops. If the SCEV expressions are identical (pointer equality
after canonicalization), the loops are conforming.
If they differ but both are small constants, the pass computes the
arithmetic difference. If the first loop has more iterations than the
second, and the difference does not exceed the command-line limit
``-loop-fusion-peel-max-count`` (default 0), the pass marks the pair
as eligible for peeling. Peeling the first loop by the difference
will equalize the trip counts.
The current implementation does not support the case where the second
loop has more iterations than the first.
5.2 Condition 2: Compatible Guard Structure
--------------------------------------------
Both loops must be either both guarded or both unguarded. If one is
guarded and the other is not, fusion is rejected. When both are
guarded, the guard-comparison check ensures that:
1. The guard condition instructions are structurally identical
(same opcode, same operands).
2. The true/false successors have the same polarity relative to the
loop (i.e., both guards branch to the preheader on the same
condition outcome).
Additionally, for guarded loops, the pass verifies that instructions
in FC0's exit block can be safely moved before FC1's exit block, and
that FC1's guard block instructions can be safely moved before FC0's
guard block terminator. These moves reuse LLVM's generic code-motion
safety helpers.
5.3 Condition 3: No Negative-Distance Dependencies
----------------------------------------------------
The dependence check is the most involved legality analysis. It
examines all pairs of memory accesses where at least one is a write:
- (FC0 writes) x (FC1 writes) -- output dependences
- (FC0 writes) x (FC1 reads) -- flow dependences
- (FC1 writes) x (FC0 reads) -- anti dependences
Additionally, it checks for *cross-loop def-use chains*: if any
instruction in FC1 uses a value defined inside FC0, fusion is rejected
outright. This is because after fusion, the definition from a given
iteration of the original FC0 would need to be available to the same
iteration of FC1, but the interleaved execution order may not preserve
this.
For each memory pair, the pass invokes one of three dependence
analysis strategies, selectable via
``--loop-fusion-dependence-analysis``. The default is DA-based
analysis (``da``); the SCEV-based and combined modes are retained as
opt-ins. The SCEV-based path is expected to be removed in a future
change.
**SCEV-based analysis**:
Rewrites the access function of FC0's instruction into FC1's loop
iteration space by substituting FC0's loop with FC1's loop inside
every add-recurrence subexpression. The pass then asks SCEV whether
FC0's address is known to be greater than or equal to FC1's address.
A non-negative difference means no backward dependence.
**DA-based analysis**:
Queries LLVM's dependence analysis on the pair. If no dependence
exists, fusion is safe. If a dependence exists, the pass examines its
direction vector:
1. At outer loop levels (levels above the current fusion level): if
any level has a direction that excludes equality (``EQ``), the
outer indices differ and the dependence does not constrain fusion
at the current level.
2. At the current level: if the direction excludes ``GT``
(greater-than), there is no backward loop-carried dependence, and
fusion is safe. For example, a pure ``LT`` direction indicates a
forward dependence like ``A[i] = ...; ... = A[i-1]``, which
remains valid after fusion.
3. Loop-invariant (scalar) non-anti dependences at the current level
are also safe.
**Combined analysis**:
Accepts the pair if either SCEV-based or DA-based analysis approves
it.
5.4 Condition 4: Empty or Movable Preheader
---------------------------------------------
FC1's preheader must be empty (containing only the terminator branch)
or all its instructions must be safely movable. The preheader-motion
check classifies each preheader instruction as either hoistable (to
FC0's preheader) or sinkable (into FC1's body after the fused loop):
- **Hoisting**: An instruction can be hoisted if all its operands
either dominate FC0's preheader target or have already been
classified as hoistable. PHI nodes cannot be hoisted. Memory
instructions can be hoisted only if they have no flow, anti, or
output dependences with non-hoisted preheader instructions or with
FC0's memory accesses.
- **Sinking**: An instruction can be sunk if none of its users are
inside FC1's loop body, and it has no flow or output dependences
with FC1's memory accesses.
- Atomic and volatile instructions cannot be moved.
If any instruction is neither hoistable nor sinkable, fusion is
abandoned for this pair.
5.5 Profitability
------------------
The profitability check currently always reports that fusion is
beneficial. This is intentional for testing coverage and is expected
to evolve to include cost-model heuristics (e.g., register pressure
estimation, cache footprint analysis) in the future.
6. Phase 3: The Fusion Transformation
======================================
Once all legality checks pass, the transformation proceeds in two
stages: an optional peeling step and the actual CFG rewiring.
6.1 Loop Peeling (Optional)
-----------------------------
If the trip counts differ by a constant ``d``, the pass peels ``d``
iterations from FC0. This extracts the first ``d`` iterations as
straight-line code before the loop, so the remaining loop has the
same trip count as FC1. After peeling:
1. The post-dominator tree is recalculated.
2. FC0's cached block pointers are refreshed.
3. The peeled iteration blocks' branches are rewritten to remove edges
to FC1's preheader, ensuring FC0's entry block still dominates
FC1's entry block.
6.2 CFG Rewiring (Non-Guarded Loops)
--------------------------------------
For non-guarded loops, the CFG transformation performs these steps:
1. **Merge preheaders**: Move all instructions from FC1's preheader
into FC0's preheader (before the terminator).
2. **Rewire FC0's exiting block**: Instead of branching to FC1's
preheader (which is about to be deleted), the exiting block now
targets FC1's header directly.
3. **Delete FC1's preheader**: It has no predecessors left; replace
its terminator with ``unreachable``.
4. **Move PHI nodes**: All PHI nodes from FC1's header are moved to
FC0's header. If a PHI has no uses, it is deleted.
5. **Introduce intermediate PHI nodes**: If FC0's exiting block is
not the latch (a rare case given the rotated form requirement),
new PHI nodes are inserted in FC1's header. These select the
loop-carried value from FC0 when arriving via FC0's latch, or
``poison`` when arriving via FC0's exiting block (which means the
loop is exiting and the value is dead).
6. **Reconnect latches**:
- FC0's latch, which previously branched back to FC0's header, now
branches to FC1's header.
- FC1's latch, which previously branched back to FC1's header, now
branches to FC0's header.
- FC0's latch branch is simplified to unconditional (both targets
are now FC1's header).
7. **Transfer loop blocks**: All basic blocks belonging to FC1 are
moved into FC0. All child loops of FC1 are re-parented under FC0.
8. **Merge latches**: Instructions from FC0's old latch are moved to
FC1's latch (which is now the single latch of the fused loop). If
FC0's latch has a unique successor, the blocks are merged.
9. **Cleanup**: FC1 is erased from the function's loop information.
SCEV caches for both loops are invalidated, and the dominator and
post-dominator trees are updated in a single batched transaction.
The resulting fused loop has:
- FC0's preheader as its preheader.
- FC0's header as its header.
- A body containing blocks from both FC0 and FC1.
- FC1's latch as its latch.
- FC1's exit block as its exit block.
6.3 CFG Rewiring (Guarded Loops)
----------------------------------
The guarded-loop fusion path handles the additional complexity of
guard branches and exit blocks:
1. FC0's guard is updated to use FC1's non-loop block as its bypass
target, effectively making FC0's guard protect both loops.
2. FC0's exit block instructions are moved to FC1's exit block.
3. FC1's guard block is disconnected and deleted.
4. FC0's exit block is disconnected and deleted.
5. The latch rewiring and block transfer proceed identically to the
non-guarded case.
6.4 Post-Fusion Bookkeeping
-----------------------------
After fusion, the fused loop replaces the original pair in the chain
and becomes the new left operand for the next pairwise attempt. This
enables *cascading fusion*: if three adjacent loops A, B, C are all
fusible, A+B is fused first, then (A+B)+C is attempted.
FC1's loop is also recorded as removed so that it is skipped when the
pass descends to inner nesting levels.
7. Limitations
===============
The current implementation is purely opportunistic: it fuses only
loop pairs that already satisfy the four legality conditions. It does
not reshape the surrounding code to create new fusion opportunities,
and several legality checks are intentionally conservative.
7.1 Algorithmic Scope
----------------------
- **No loop reshaping to enable fusion.** The pass does not insert
guards, rotate loops, run loop-simplify, or otherwise modify loops
that miss a legality condition. Loops that are not already in
rotated, simplified form with a computable trip count are discarded
as ineligible.
- **No profitability cost model.** The profitability check
unconditionally reports every legal pair as beneficial. Fusion
proceeds regardless of cache footprint, register pressure, or any
interaction with downstream vectorization.
- **Cross-loop def-use chains are always rejected.** If any
instruction in FC1 uses a value produced inside FC0, the pair is
rejected outright. Patterns such as a reduction whose result is
consumed by the next loop are blocked even when the producing value
is loop-invariant and could legally be hoisted.
7.2 Trip-Count Equalization
----------------------------
- **Peeling is disabled by default.** The maximum number of
iterations the pass may peel is controlled by the command-line
option ``-loop-fusion-peel-max-count``, which defaults to ``0``.
Peeling therefore never fires unless the user opts in explicitly.
- **Peeling shrinks only the first loop.** If the second loop has
more iterations than the first, fusion is rejected. There is no
attempt to peel the second loop or extend the first.
- **Peeling requires constant trip counts with a single exit.** Both
loops must have a small constant trip count for the peel distance
to be computed. Symbolic trip-count differences are not handled.
7.4 Preheader and Block Handling
---------------------------------
- **Volatile or atomic preheader instructions block fusion.** Such
instructions cannot be moved, so a non-empty preheader that
contains them prevents fusion.
- **Memory-reading preheader instructions are handled
conservatively.** The pass does try to move memory-reading
preheader instructions, but only when dependence checks prove
hoisting or sinking is safe. This conservative filtering can still
reject fusible-looking cases.
- **FC1's preheader must reduce to a single terminator.** After
hoisting and sinking, the simple (non-guarded) fusion path requires
FC1's preheader to contain exactly one instruction -- its branch.
Any instruction that is neither hoistable nor sinkable prevents
fusion rather than being preserved by block merging.
- **Blocks are rewired, not merged.** Preheaders, latches, and exit
blocks are reconnected to new successors during fusion, but never
merged with their neighbors. The fused Control-Flow Graph (CFG)
therefore retains more basic blocks than strictly necessary.
7.5 Eligibility Blockers
-------------------------
A loop is rejected during candidate construction if any of the
following properties holds:
- A block in the loop has its address taken (for example, used as a
``blockaddress`` operand).
- Any instruction in the loop may throw.
- The loop contains a volatile ``load`` or ``store``.
These conditions are hard blockers, not soft preferences: the pass
never attempts to work around them.
**Atomic operations are not reasoned about.** Atomic loads, stores,
``atomicrmw``, ``cmpxchg``, and ``fence`` instructions inside the
loop body are tracked only as ordinary memory reads or writes. The
pass does not inspect their memory ordering (``unordered``,
``monotonic``, ``acquire``, ``release``, ``seq_cst``), and the
underlying dependence analysis reasons about address-based data
dependence rather than inter-thread synchronization. Fusing two
loops that contain non-``unordered`` atomics can therefore change
observable multi-threaded behavior even when the dependence check
reports no conflict.