| ==================================================== |
| Loop Fusion in LLVM |
| ==================================================== |
| |
| 1. Introduction |
| =============== |
| |
| Loop fusion (also called loop jamming) is a compiler optimization that |
| merges two adjacent loops into a single loop, provided the |
| transformation preserves the program's original semantics. The |
| motivation is straightforward: by executing the bodies of two loops in |
| a single pass over the iteration space, we reduce loop overhead (fewer |
| branch instructions, fewer induction variable updates), improve |
| temporal data locality (data written by the first loop body and read |
| by the second is still in cache), and create new opportunities for |
| subsequent optimizations such as instruction scheduling and register |
| allocation. |
| |
| LLVM's implementation resides in |
| ``llvm/lib/Transforms/Scalar/LoopFuse.cpp`` and is based on |
| Christopher Barton's MSc thesis, *"Code Transformations to Augment the |
| Scope of Loop Fusion in a Production Compiler"*. The pass operates on |
| LLVM IR, leveraging several core analysis frameworks -- |
| Scalar Evolution (SCEV), Dependence Analysis |
| (DA), and Dominator/Post-Dominator Trees -- to determine legality and |
| perform the CFG rewiring that fuses two loops into one. |
| |
| 2. Prerequisite Concepts |
| ======================== |
| |
| The fusion pass relies on several standard LLVM loop concepts that |
| are documented elsewhere: simplified loop form, rotated loop form, |
| loop guard branches, dominator and post-dominator trees, Scalar |
| Evolution (SCEV), and trip count with peeling. The analysis that |
| drives the most involved legality check -- Dependence Analysis (DA) |
| -- is summarized below, because later sections refer directly to its |
| dependence kinds and direction vectors. |
| |
| DA determines the dependence relation between pairs of memory |
| accesses. A dependence from instruction S1 to S2 is characterized by: |
| |
| - **Flow (true) dependence**: S1 writes, S2 reads the same location |
| (read-after-write). |
| - **Anti dependence**: S1 reads, S2 writes (write-after-read). |
| - **Output dependence**: Both S1 and S2 write (write-after-write). |
| |
| Each dependence carries a *direction vector* at each loop nesting |
| level, indicating whether the source iteration is less than (``<``), |
| equal to (``=``), or greater than (``>``) the sink iteration at that |
| level. A dependence with a ``>`` component at the current loop level |
| represents a *backward loop-carried dependence* (also called a |
| negative-distance dependence). Such dependences are the critical |
| hazard that loop fusion must respect. |
| |
| |
| 3. High-Level Algorithm |
| ======================= |
| |
| The pass operates in a top-down, level-by-level fashion over the loop |
| nest tree. At each nesting depth, it: |
| |
| 1. **Collects fusion candidates**: Identifies each eligible loop and |
| partitions the eligible loops into ordered chains of |
| *control-flow equivalent, strictly adjacent* loops sorted in |
| dominance order. |
| 2. **Attempts pairwise fusion**: Walks each chain linearly, testing |
| every consecutive pair ``(FC0, FC1)`` against the four legality |
| conditions. If all conditions hold, the pair is fused and replaced |
| by the fused loop in the chain, which is then considered for |
| further fusion with its successor. |
| 3. **Descends to inner loops**: After processing all sibling groups at |
| the current depth, the pass descends one level and repeats. |
| |
| This strategy means outermost loops are fused first. Fusing inner |
| loops is handled in subsequent iterations of the outer while-loop. |
| |
| |
| 4. Phase 1: Candidate Collection |
| ================================= |
| |
| For each group of sibling loops (loops sharing a parent) at the |
| current depth, the pass gathers the loops that are eligible for |
| fusion. |
| |
| 4.1 Eligibility Check |
| ---------------------- |
| |
| Each loop is first scanned for disqualifying properties. A loop is |
| rejected immediately if any of its blocks has its address taken, if |
| any instruction in the loop may throw, or if the loop contains a |
| volatile memory access. The pass cannot reason about exception-based |
| control flow or volatile ordering, so these are hard blockers. |
| |
| Surviving loops must additionally satisfy four structural |
| requirements: |
| |
| 1. **Structural validity**: preheader, header, exiting block, exit |
| block, and latch all exist. This implicitly requires simplified |
| form. |
| 2. **Computable trip count**: SCEV must be able to compute a |
| loop-invariant backedge-taken count. If not, the pass cannot |
| compare trip counts. |
| 3. **Simplified form**: the loop must be in simplified form (single |
| preheader, dedicated exit blocks, single latch). |
| 4. **Rotated form**: the loop must be in rotated form (do-while |
| shape, with the exit test at the latch). The exiting block must |
| be the latch. |
| |
| If any check fails, the loop is discarded and an optimization remark |
| is emitted explaining why. |
| |
| 4.2 Grouping by Adjacency |
| --------------------------- |
| |
| Eligible loops are partitioned into ordered chains based on *strict |
| adjacency*. Two loops FC0 and FC1 are strictly adjacent if: |
| |
| - **Unguarded loops**: FC0's exit block is exactly FC1's preheader |
| (i.e., there is no intervening code between the two loops). |
| - **Guarded loops**: FC0's entry block dominates FC1's entry block, |
| and FC0's exit block has a unique successor that is FC1's entry |
| block. |
| |
| The algorithm appends each new loop to the first existing chain whose |
| last element is strictly adjacent to it. If no such chain exists, a |
| new chain is started. Because the input loops are already supplied in |
| dominance (program) order, this single-pass grouping correctly |
| partitions eligible loops into maximal chains of adjacent loops. |
| |
| |
| 5. Phase 2: Pairwise Fusion Attempts |
| ===================================== |
| |
| The pass iterates over each chain and attempts to fuse consecutive |
| pairs ``(FC0, FC1)``. The following legality conditions are checked |
| in order, with early exit on failure. |
| |
| 5.1 Condition 1: Identical Trip Counts (Conformance) |
| ------------------------------------------------------ |
| |
| The conformance check uses SCEV to retrieve the backedge-taken count |
| of both loops. If the SCEV expressions are identical (pointer equality |
| after canonicalization), the loops are conforming. |
| |
| If they differ but both are small constants, the pass computes the |
| arithmetic difference. If the first loop has more iterations than the |
| second, and the difference does not exceed the command-line limit |
| ``-loop-fusion-peel-max-count`` (default 0), the pass marks the pair |
| as eligible for peeling. Peeling the first loop by the difference |
| will equalize the trip counts. |
| |
| The current implementation does not support the case where the second |
| loop has more iterations than the first. |
| |
| 5.2 Condition 2: Compatible Guard Structure |
| -------------------------------------------- |
| |
| Both loops must be either both guarded or both unguarded. If one is |
| guarded and the other is not, fusion is rejected. When both are |
| guarded, the guard-comparison check ensures that: |
| |
| 1. The guard condition instructions are structurally identical |
| (same opcode, same operands). |
| 2. The true/false successors have the same polarity relative to the |
| loop (i.e., both guards branch to the preheader on the same |
| condition outcome). |
| |
| Additionally, for guarded loops, the pass verifies that instructions |
| in FC0's exit block can be safely moved before FC1's exit block, and |
| that FC1's guard block instructions can be safely moved before FC0's |
| guard block terminator. These moves reuse LLVM's generic code-motion |
| safety helpers. |
| |
| 5.3 Condition 3: No Negative-Distance Dependencies |
| ---------------------------------------------------- |
| |
| The dependence check is the most involved legality analysis. It |
| examines all pairs of memory accesses where at least one is a write: |
| |
| - (FC0 writes) x (FC1 writes) -- output dependences |
| - (FC0 writes) x (FC1 reads) -- flow dependences |
| - (FC1 writes) x (FC0 reads) -- anti dependences |
| |
| Additionally, it checks for *cross-loop def-use chains*: if any |
| instruction in FC1 uses a value defined inside FC0, fusion is rejected |
| outright. This is because after fusion, the definition from a given |
| iteration of the original FC0 would need to be available to the same |
| iteration of FC1, but the interleaved execution order may not preserve |
| this. |
| |
| For each memory pair, the pass invokes one of three dependence |
| analysis strategies, selectable via |
| ``--loop-fusion-dependence-analysis``. The default is DA-based |
| analysis (``da``); the SCEV-based and combined modes are retained as |
| opt-ins. The SCEV-based path is expected to be removed in a future |
| change. |
| |
| **SCEV-based analysis**: |
| Rewrites the access function of FC0's instruction into FC1's loop |
| iteration space by substituting FC0's loop with FC1's loop inside |
| every add-recurrence subexpression. The pass then asks SCEV whether |
| FC0's address is known to be greater than or equal to FC1's address. |
| A non-negative difference means no backward dependence. |
| |
| **DA-based analysis**: |
| Queries LLVM's dependence analysis on the pair. If no dependence |
| exists, fusion is safe. If a dependence exists, the pass examines its |
| direction vector: |
| |
| 1. At outer loop levels (levels above the current fusion level): if |
| any level has a direction that excludes equality (``EQ``), the |
| outer indices differ and the dependence does not constrain fusion |
| at the current level. |
| 2. At the current level: if the direction excludes ``GT`` |
| (greater-than), there is no backward loop-carried dependence, and |
| fusion is safe. For example, a pure ``LT`` direction indicates a |
| forward dependence like ``A[i] = ...; ... = A[i-1]``, which |
| remains valid after fusion. |
| 3. Loop-invariant (scalar) non-anti dependences at the current level |
| are also safe. |
| |
| **Combined analysis**: |
| Accepts the pair if either SCEV-based or DA-based analysis approves |
| it. |
| |
| 5.4 Condition 4: Empty or Movable Preheader |
| --------------------------------------------- |
| |
| FC1's preheader must be empty (containing only the terminator branch) |
| or all its instructions must be safely movable. The preheader-motion |
| check classifies each preheader instruction as either hoistable (to |
| FC0's preheader) or sinkable (into FC1's body after the fused loop): |
| |
| - **Hoisting**: An instruction can be hoisted if all its operands |
| either dominate FC0's preheader target or have already been |
| classified as hoistable. PHI nodes cannot be hoisted. Memory |
| instructions can be hoisted only if they have no flow, anti, or |
| output dependences with non-hoisted preheader instructions or with |
| FC0's memory accesses. |
| - **Sinking**: An instruction can be sunk if none of its users are |
| inside FC1's loop body, and it has no flow or output dependences |
| with FC1's memory accesses. |
| - Atomic and volatile instructions cannot be moved. |
| |
| If any instruction is neither hoistable nor sinkable, fusion is |
| abandoned for this pair. |
| |
| 5.5 Profitability |
| ------------------ |
| |
| The profitability check currently always reports that fusion is |
| beneficial. This is intentional for testing coverage and is expected |
| to evolve to include cost-model heuristics (e.g., register pressure |
| estimation, cache footprint analysis) in the future. |
| |
| |
| 6. Phase 3: The Fusion Transformation |
| ====================================== |
| |
| Once all legality checks pass, the transformation proceeds in two |
| stages: an optional peeling step and the actual CFG rewiring. |
| |
| 6.1 Loop Peeling (Optional) |
| ----------------------------- |
| |
| If the trip counts differ by a constant ``d``, the pass peels ``d`` |
| iterations from FC0. This extracts the first ``d`` iterations as |
| straight-line code before the loop, so the remaining loop has the |
| same trip count as FC1. After peeling: |
| |
| 1. The post-dominator tree is recalculated. |
| 2. FC0's cached block pointers are refreshed. |
| 3. The peeled iteration blocks' branches are rewritten to remove edges |
| to FC1's preheader, ensuring FC0's entry block still dominates |
| FC1's entry block. |
| |
| 6.2 CFG Rewiring (Non-Guarded Loops) |
| -------------------------------------- |
| |
| For non-guarded loops, the CFG transformation performs these steps: |
| |
| 1. **Merge preheaders**: Move all instructions from FC1's preheader |
| into FC0's preheader (before the terminator). |
| |
| 2. **Rewire FC0's exiting block**: Instead of branching to FC1's |
| preheader (which is about to be deleted), the exiting block now |
| targets FC1's header directly. |
| |
| 3. **Delete FC1's preheader**: It has no predecessors left; replace |
| its terminator with ``unreachable``. |
| |
| 4. **Move PHI nodes**: All PHI nodes from FC1's header are moved to |
| FC0's header. If a PHI has no uses, it is deleted. |
| |
| 5. **Introduce intermediate PHI nodes**: If FC0's exiting block is |
| not the latch (a rare case given the rotated form requirement), |
| new PHI nodes are inserted in FC1's header. These select the |
| loop-carried value from FC0 when arriving via FC0's latch, or |
| ``poison`` when arriving via FC0's exiting block (which means the |
| loop is exiting and the value is dead). |
| |
| 6. **Reconnect latches**: |
| |
| - FC0's latch, which previously branched back to FC0's header, now |
| branches to FC1's header. |
| - FC1's latch, which previously branched back to FC1's header, now |
| branches to FC0's header. |
| - FC0's latch branch is simplified to unconditional (both targets |
| are now FC1's header). |
| |
| 7. **Transfer loop blocks**: All basic blocks belonging to FC1 are |
| moved into FC0. All child loops of FC1 are re-parented under FC0. |
| |
| 8. **Merge latches**: Instructions from FC0's old latch are moved to |
| FC1's latch (which is now the single latch of the fused loop). If |
| FC0's latch has a unique successor, the blocks are merged. |
| |
| 9. **Cleanup**: FC1 is erased from the function's loop information. |
| SCEV caches for both loops are invalidated, and the dominator and |
| post-dominator trees are updated in a single batched transaction. |
| |
| The resulting fused loop has: |
| |
| - FC0's preheader as its preheader. |
| - FC0's header as its header. |
| - A body containing blocks from both FC0 and FC1. |
| - FC1's latch as its latch. |
| - FC1's exit block as its exit block. |
| |
| 6.3 CFG Rewiring (Guarded Loops) |
| ---------------------------------- |
| |
| The guarded-loop fusion path handles the additional complexity of |
| guard branches and exit blocks: |
| |
| 1. FC0's guard is updated to use FC1's non-loop block as its bypass |
| target, effectively making FC0's guard protect both loops. |
| 2. FC0's exit block instructions are moved to FC1's exit block. |
| 3. FC1's guard block is disconnected and deleted. |
| 4. FC0's exit block is disconnected and deleted. |
| 5. The latch rewiring and block transfer proceed identically to the |
| non-guarded case. |
| |
| 6.4 Post-Fusion Bookkeeping |
| ----------------------------- |
| |
| After fusion, the fused loop replaces the original pair in the chain |
| and becomes the new left operand for the next pairwise attempt. This |
| enables *cascading fusion*: if three adjacent loops A, B, C are all |
| fusible, A+B is fused first, then (A+B)+C is attempted. |
| |
| FC1's loop is also recorded as removed so that it is skipped when the |
| pass descends to inner nesting levels. |
| |
| |
| 7. Limitations |
| =============== |
| |
| The current implementation is purely opportunistic: it fuses only |
| loop pairs that already satisfy the four legality conditions. It does |
| not reshape the surrounding code to create new fusion opportunities, |
| and several legality checks are intentionally conservative. |
| |
| 7.1 Algorithmic Scope |
| ---------------------- |
| |
| - **No loop reshaping to enable fusion.** The pass does not insert |
| guards, rotate loops, run loop-simplify, or otherwise modify loops |
| that miss a legality condition. Loops that are not already in |
| rotated, simplified form with a computable trip count are discarded |
| as ineligible. |
| - **No profitability cost model.** The profitability check |
| unconditionally reports every legal pair as beneficial. Fusion |
| proceeds regardless of cache footprint, register pressure, or any |
| interaction with downstream vectorization. |
| - **Cross-loop def-use chains are always rejected.** If any |
| instruction in FC1 uses a value produced inside FC0, the pair is |
| rejected outright. Patterns such as a reduction whose result is |
| consumed by the next loop are blocked even when the producing value |
| is loop-invariant and could legally be hoisted. |
| |
| 7.2 Trip-Count Equalization |
| ---------------------------- |
| |
| - **Peeling is disabled by default.** The maximum number of |
| iterations the pass may peel is controlled by the command-line |
| option ``-loop-fusion-peel-max-count``, which defaults to ``0``. |
| Peeling therefore never fires unless the user opts in explicitly. |
| - **Peeling shrinks only the first loop.** If the second loop has |
| more iterations than the first, fusion is rejected. There is no |
| attempt to peel the second loop or extend the first. |
| - **Peeling requires constant trip counts with a single exit.** Both |
| loops must have a small constant trip count for the peel distance |
| to be computed. Symbolic trip-count differences are not handled. |
| |
| |
| 7.4 Preheader and Block Handling |
| --------------------------------- |
| |
| - **Volatile or atomic preheader instructions block fusion.** Such |
| instructions cannot be moved, so a non-empty preheader that |
| contains them prevents fusion. |
| - **Memory-reading preheader instructions are handled |
| conservatively.** The pass does try to move memory-reading |
| preheader instructions, but only when dependence checks prove |
| hoisting or sinking is safe. This conservative filtering can still |
| reject fusible-looking cases. |
| - **FC1's preheader must reduce to a single terminator.** After |
| hoisting and sinking, the simple (non-guarded) fusion path requires |
| FC1's preheader to contain exactly one instruction -- its branch. |
| Any instruction that is neither hoistable nor sinkable prevents |
| fusion rather than being preserved by block merging. |
| - **Blocks are rewired, not merged.** Preheaders, latches, and exit |
| blocks are reconnected to new successors during fusion, but never |
| merged with their neighbors. The fused Control-Flow Graph (CFG) |
| therefore retains more basic blocks than strictly necessary. |
| |
| 7.5 Eligibility Blockers |
| ------------------------- |
| |
| A loop is rejected during candidate construction if any of the |
| following properties holds: |
| |
| - A block in the loop has its address taken (for example, used as a |
| ``blockaddress`` operand). |
| - Any instruction in the loop may throw. |
| - The loop contains a volatile ``load`` or ``store``. |
| |
| These conditions are hard blockers, not soft preferences: the pass |
| never attempts to work around them. |
| |
| **Atomic operations are not reasoned about.** Atomic loads, stores, |
| ``atomicrmw``, ``cmpxchg``, and ``fence`` instructions inside the |
| loop body are tracked only as ordinary memory reads or writes. The |
| pass does not inspect their memory ordering (``unordered``, |
| ``monotonic``, ``acquire``, ``release``, ``seq_cst``), and the |
| underlying dependence analysis reasons about address-based data |
| dependence rather than inter-thread synchronization. Fusing two |
| loops that contain non-``unordered`` atomics can therefore change |
| observable multi-threaded behavior even when the dependence check |
| reports no conflict. |
| |