docs/LoopFusion.rst - llvm-project/llvm - Git at Google

 ====================================================
 Loop Fusion in LLVM
 ====================================================

 1. Introduction
 ===============

 Loop fusion (also called loop jamming) is a compiler optimization that
 merges two adjacent loops into a single loop, provided the
 transformation preserves the program's original semantics. The
 motivation is straightforward: by executing the bodies of two loops in
 a single pass over the iteration space, we reduce loop overhead (fewer
 branch instructions, fewer induction variable updates), improve
 temporal data locality (data written by the first loop body and read
 by the second is still in cache), and create new opportunities for
 subsequent optimizations such as instruction scheduling and register
 allocation.

 LLVM's implementation resides in
 ``llvm/lib/Transforms/Scalar/LoopFuse.cpp`` and is based on
 Christopher Barton's MSc thesis, *"Code Transformations to Augment the
 Scope of Loop Fusion in a Production Compiler"*. The pass operates on
 LLVM IR, leveraging several core analysis frameworks --
 Scalar Evolution (SCEV), Dependence Analysis
 (DA), and Dominator/Post-Dominator Trees -- to determine legality and
 perform the CFG rewiring that fuses two loops into one.

 2. Prerequisite Concepts
 ========================

 The fusion pass relies on several standard LLVM loop concepts that
 are documented elsewhere: simplified loop form, rotated loop form,
 loop guard branches, dominator and post-dominator trees, Scalar
 Evolution (SCEV), and trip count with peeling. The analysis that
 drives the most involved legality check -- Dependence Analysis (DA)
 -- is summarized below, because later sections refer directly to its
 dependence kinds and direction vectors.

 DA determines the dependence relation between pairs of memory
 accesses. A dependence from instruction S1 to S2 is characterized by:

 - **Flow (true) dependence**: S1 writes, S2 reads the same location
   (read-after-write).
 - **Anti dependence**: S1 reads, S2 writes (write-after-read).
 - **Output dependence**: Both S1 and S2 write (write-after-write).

 Each dependence carries a *direction vector* at each loop nesting
 level, indicating whether the source iteration is less than (``<``),
 equal to (``=``), or greater than (``>``) the sink iteration at that
 level. A dependence with a ``>`` component at the current loop level
 represents a *backward loop-carried dependence* (also called a
 negative-distance dependence). Such dependences are the critical
 hazard that loop fusion must respect.


 3. High-Level Algorithm
 =======================

 The pass operates in a top-down, level-by-level fashion over the loop
 nest tree. At each nesting depth, it:

 1. **Collects fusion candidates**: Identifies each eligible loop and
    partitions the eligible loops into ordered chains of
    *control-flow equivalent, strictly adjacent* loops sorted in
    dominance order.
 2. **Attempts pairwise fusion**: Walks each chain linearly, testing
    every consecutive pair ``(FC0, FC1)`` against the four legality
    conditions. If all conditions hold, the pair is fused and replaced
    by the fused loop in the chain, which is then considered for
    further fusion with its successor.
 3. **Descends to inner loops**: After processing all sibling groups at
    the current depth, the pass descends one level and repeats.

 This strategy means outermost loops are fused first. Fusing inner
 loops is handled in subsequent iterations of the outer while-loop.


 4. Phase 1: Candidate Collection
 =================================

 For each group of sibling loops (loops sharing a parent) at the
 current depth, the pass gathers the loops that are eligible for
 fusion.

 4.1 Eligibility Check
 ----------------------

 Each loop is first scanned for disqualifying properties. A loop is
 rejected immediately if any of its blocks has its address taken, if
 any instruction in the loop may throw, or if the loop contains a
 volatile memory access. The pass cannot reason about exception-based
 control flow or volatile ordering, so these are hard blockers.

 Surviving loops must additionally satisfy four structural
 requirements:

 1. **Structural validity**: preheader, header, exiting block, exit
    block, and latch all exist. This implicitly requires simplified
    form.
 2. **Computable trip count**: SCEV must be able to compute a
    loop-invariant backedge-taken count. If not, the pass cannot
    compare trip counts.
 3. **Simplified form**: the loop must be in simplified form (single
    preheader, dedicated exit blocks, single latch).
 4. **Rotated form**: the loop must be in rotated form (do-while
    shape, with the exit test at the latch). The exiting block must
    be the latch.

 If any check fails, the loop is discarded and an optimization remark
 is emitted explaining why.

 4.2 Grouping by Adjacency
 ---------------------------

 Eligible loops are partitioned into ordered chains based on *strict
 adjacency*. Two loops FC0 and FC1 are strictly adjacent if:

 - **Unguarded loops**: FC0's exit block is exactly FC1's preheader
   (i.e., there is no intervening code between the two loops).
 - **Guarded loops**: FC0's entry block dominates FC1's entry block,
   and FC0's exit block has a unique successor that is FC1's entry
   block.

 The algorithm appends each new loop to the first existing chain whose
 last element is strictly adjacent to it. If no such chain exists, a
 new chain is started. Because the input loops are already supplied in
 dominance (program) order, this single-pass grouping correctly
 partitions eligible loops into maximal chains of adjacent loops.


 5. Phase 2: Pairwise Fusion Attempts
 =====================================

 The pass iterates over each chain and attempts to fuse consecutive
 pairs ``(FC0, FC1)``. The following legality conditions are checked
 in order, with early exit on failure.

 5.1 Condition 1: Identical Trip Counts (Conformance)
 ------------------------------------------------------

 The conformance check uses SCEV to retrieve the backedge-taken count
 of both loops. If the SCEV expressions are identical (pointer equality
 after canonicalization), the loops are conforming.

 If they differ but both are small constants, the pass computes the
 arithmetic difference. If the first loop has more iterations than the
 second, and the difference does not exceed the command-line limit
 ``-loop-fusion-peel-max-count`` (default 0), the pass marks the pair
 as eligible for peeling. Peeling the first loop by the difference
 will equalize the trip counts.

 The current implementation does not support the case where the second
 loop has more iterations than the first.

 5.2 Condition 2: Compatible Guard Structure
 --------------------------------------------

 Both loops must be either both guarded or both unguarded. If one is
 guarded and the other is not, fusion is rejected. When both are
 guarded, the guard-comparison check ensures that:

 1. The guard condition instructions are structurally identical
    (same opcode, same operands).
 2. The true/false successors have the same polarity relative to the
    loop (i.e., both guards branch to the preheader on the same
    condition outcome).

 Additionally, for guarded loops, the pass verifies that instructions
 in FC0's exit block can be safely moved before FC1's exit block, and
 that FC1's guard block instructions can be safely moved before FC0's
 guard block terminator. These moves reuse LLVM's generic code-motion
 safety helpers.

 5.3 Condition 3: No Negative-Distance Dependencies
 ----------------------------------------------------

 The dependence check is the most involved legality analysis. It
 examines all pairs of memory accesses where at least one is a write:

 - (FC0 writes) x (FC1 writes) -- output dependences
 - (FC0 writes) x (FC1 reads) -- flow dependences
 - (FC1 writes) x (FC0 reads) -- anti dependences

 Additionally, it checks for *cross-loop def-use chains*: if any
 instruction in FC1 uses a value defined inside FC0, fusion is rejected
 outright. This is because after fusion, the definition from a given
 iteration of the original FC0 would need to be available to the same
 iteration of FC1, but the interleaved execution order may not preserve
 this.

 For each memory pair, the pass invokes one of three dependence
 analysis strategies, selectable via
 ``--loop-fusion-dependence-analysis``. The default is DA-based
 analysis (``da``); the SCEV-based and combined modes are retained as
 opt-ins. The SCEV-based path is expected to be removed in a future
 change.

 **SCEV-based analysis**:
 Rewrites the access function of FC0's instruction into FC1's loop
 iteration space by substituting FC0's loop with FC1's loop inside
 every add-recurrence subexpression. The pass then asks SCEV whether
 FC0's address is known to be greater than or equal to FC1's address.
 A non-negative difference means no backward dependence.

 **DA-based analysis**:
 Queries LLVM's dependence analysis on the pair. If no dependence
 exists, fusion is safe. If a dependence exists, the pass examines its
 direction vector:

 1. At outer loop levels (levels above the current fusion level): if
    any level has a direction that excludes equality (``EQ``), the
    outer indices differ and the dependence does not constrain fusion
    at the current level.
 2. At the current level: if the direction excludes ``GT``
    (greater-than), there is no backward loop-carried dependence, and
    fusion is safe. For example, a pure ``LT`` direction indicates a
    forward dependence like ``A[i] = ...; ... = A[i-1]``, which
    remains valid after fusion.
 3. Loop-invariant (scalar) non-anti dependences at the current level
    are also safe.

 **Combined analysis**:
 Accepts the pair if either SCEV-based or DA-based analysis approves
 it.

 5.4 Condition 4: Empty or Movable Preheader
 ---------------------------------------------

 FC1's preheader must be empty (containing only the terminator branch)
 or all its instructions must be safely movable. The preheader-motion
 check classifies each preheader instruction as either hoistable (to
 FC0's preheader) or sinkable (into FC1's body after the fused loop):

 - **Hoisting**: An instruction can be hoisted if all its operands
   either dominate FC0's preheader target or have already been
   classified as hoistable. PHI nodes cannot be hoisted. Memory
   instructions can be hoisted only if they have no flow, anti, or
   output dependences with non-hoisted preheader instructions or with
   FC0's memory accesses.
 - **Sinking**: An instruction can be sunk if none of its users are
   inside FC1's loop body, and it has no flow or output dependences
   with FC1's memory accesses.
 - Atomic and volatile instructions cannot be moved.

 If any instruction is neither hoistable nor sinkable, fusion is
 abandoned for this pair.

 5.5 Profitability
 ------------------

 The profitability check currently always reports that fusion is
 beneficial. This is intentional for testing coverage and is expected
 to evolve to include cost-model heuristics (e.g., register pressure
 estimation, cache footprint analysis) in the future.


 6. Phase 3: The Fusion Transformation
 ======================================

 Once all legality checks pass, the transformation proceeds in two
 stages: an optional peeling step and the actual CFG rewiring.

 6.1 Loop Peeling (Optional)
 -----------------------------

 If the trip counts differ by a constant ``d``, the pass peels ``d``
 iterations from FC0. This extracts the first ``d`` iterations as
 straight-line code before the loop, so the remaining loop has the
 same trip count as FC1. After peeling:

 1. The post-dominator tree is recalculated.
 2. FC0's cached block pointers are refreshed.
 3. The peeled iteration blocks' branches are rewritten to remove edges
    to FC1's preheader, ensuring FC0's entry block still dominates
    FC1's entry block.

 6.2 CFG Rewiring (Non-Guarded Loops)
 --------------------------------------

 For non-guarded loops, the CFG transformation performs these steps:

 1. **Merge preheaders**: Move all instructions from FC1's preheader
    into FC0's preheader (before the terminator).

 2. **Rewire FC0's exiting block**: Instead of branching to FC1's
    preheader (which is about to be deleted), the exiting block now
    targets FC1's header directly.

 3. **Delete FC1's preheader**: It has no predecessors left; replace
    its terminator with ``unreachable``.

 4. **Move PHI nodes**: All PHI nodes from FC1's header are moved to
    FC0's header. If a PHI has no uses, it is deleted.

 5. **Introduce intermediate PHI nodes**: If FC0's exiting block is
    not the latch (a rare case given the rotated form requirement),
    new PHI nodes are inserted in FC1's header. These select the
    loop-carried value from FC0 when arriving via FC0's latch, or
    ``poison`` when arriving via FC0's exiting block (which means the
    loop is exiting and the value is dead).

 6. **Reconnect latches**:

    - FC0's latch, which previously branched back to FC0's header, now
      branches to FC1's header.
    - FC1's latch, which previously branched back to FC1's header, now
      branches to FC0's header.
    - FC0's latch branch is simplified to unconditional (both targets
      are now FC1's header).

 7. **Transfer loop blocks**: All basic blocks belonging to FC1 are
    moved into FC0. All child loops of FC1 are re-parented under FC0.

 8. **Merge latches**: Instructions from FC0's old latch are moved to
    FC1's latch (which is now the single latch of the fused loop). If
    FC0's latch has a unique successor, the blocks are merged.

 9. **Cleanup**: FC1 is erased from the function's loop information.
    SCEV caches for both loops are invalidated, and the dominator and
    post-dominator trees are updated in a single batched transaction.

 The resulting fused loop has:

 - FC0's preheader as its preheader.
 - FC0's header as its header.
 - A body containing blocks from both FC0 and FC1.
 - FC1's latch as its latch.
 - FC1's exit block as its exit block.

 6.3 CFG Rewiring (Guarded Loops)
 ----------------------------------

 The guarded-loop fusion path handles the additional complexity of
 guard branches and exit blocks:

 1. FC0's guard is updated to use FC1's non-loop block as its bypass
    target, effectively making FC0's guard protect both loops.
 2. FC0's exit block instructions are moved to FC1's exit block.
 3. FC1's guard block is disconnected and deleted.
 4. FC0's exit block is disconnected and deleted.
 5. The latch rewiring and block transfer proceed identically to the
    non-guarded case.

 6.4 Post-Fusion Bookkeeping
 -----------------------------

 After fusion, the fused loop replaces the original pair in the chain
 and becomes the new left operand for the next pairwise attempt. This
 enables *cascading fusion*: if three adjacent loops A, B, C are all
 fusible, A+B is fused first, then (A+B)+C is attempted.

 FC1's loop is also recorded as removed so that it is skipped when the
 pass descends to inner nesting levels.


 7. Limitations
 ===============

 The current implementation is purely opportunistic: it fuses only
 loop pairs that already satisfy the four legality conditions. It does
 not reshape the surrounding code to create new fusion opportunities,
 and several legality checks are intentionally conservative.

 7.1 Algorithmic Scope
 ----------------------

 - **No loop reshaping to enable fusion.** The pass does not insert
   guards, rotate loops, run loop-simplify, or otherwise modify loops
   that miss a legality condition. Loops that are not already in
   rotated, simplified form with a computable trip count are discarded
   as ineligible.
 - **No profitability cost model.** The profitability check
   unconditionally reports every legal pair as beneficial. Fusion
   proceeds regardless of cache footprint, register pressure, or any
   interaction with downstream vectorization.
 - **Cross-loop def-use chains are always rejected.** If any
   instruction in FC1 uses a value produced inside FC0, the pair is
   rejected outright. Patterns such as a reduction whose result is
   consumed by the next loop are blocked even when the producing value
   is loop-invariant and could legally be hoisted.

 7.2 Trip-Count Equalization
 ----------------------------

 - **Peeling is disabled by default.** The maximum number of
   iterations the pass may peel is controlled by the command-line
   option ``-loop-fusion-peel-max-count``, which defaults to ``0``.
   Peeling therefore never fires unless the user opts in explicitly.
 - **Peeling shrinks only the first loop.** If the second loop has
   more iterations than the first, fusion is rejected. There is no
   attempt to peel the second loop or extend the first.
 - **Peeling requires constant trip counts with a single exit.** Both
   loops must have a small constant trip count for the peel distance
   to be computed. Symbolic trip-count differences are not handled.


 7.4 Preheader and Block Handling
 ---------------------------------

 - **Volatile or atomic preheader instructions block fusion.** Such
   instructions cannot be moved, so a non-empty preheader that
   contains them prevents fusion.
 - **Memory-reading preheader instructions are handled
   conservatively.** The pass does try to move memory-reading
   preheader instructions, but only when dependence checks prove
   hoisting or sinking is safe. This conservative filtering can still
   reject fusible-looking cases.
 - **FC1's preheader must reduce to a single terminator.** After
   hoisting and sinking, the simple (non-guarded) fusion path requires
   FC1's preheader to contain exactly one instruction -- its branch.
   Any instruction that is neither hoistable nor sinkable prevents
   fusion rather than being preserved by block merging.
 - **Blocks are rewired, not merged.** Preheaders, latches, and exit
   blocks are reconnected to new successors during fusion, but never
   merged with their neighbors. The fused Control-Flow Graph (CFG)
   therefore retains more basic blocks than strictly necessary.

 7.5 Eligibility Blockers
 -------------------------

 A loop is rejected during candidate construction if any of the
 following properties holds:

 - A block in the loop has its address taken (for example, used as a
   ``blockaddress`` operand).
 - Any instruction in the loop may throw.
 - The loop contains a volatile ``load`` or ``store``.

 These conditions are hard blockers, not soft preferences: the pass
 never attempts to work around them.

 **Atomic operations are not reasoned about.** Atomic loads, stores,
 ``atomicrmw``, ``cmpxchg``, and ``fence`` instructions inside the
 loop body are tracked only as ordinary memory reads or writes. The
 pass does not inspect their memory ordering (``unordered``,
 ``monotonic``, ``acquire``, ``release``, ``seq_cst``), and the
 underlying dependence analysis reasons about address-based data
 dependence rather than inter-thread synchronization. Fusing two
 loops that contain non-``unordered`` atomics can therefore change
 observable multi-threaded behavior even when the dependence check
 reports no conflict.
	====================================================
	Loop Fusion in LLVM
	====================================================

	1. Introduction
	===============

	Loop fusion (also called loop jamming) is a compiler optimization that
	merges two adjacent loops into a single loop, provided the
	transformation preserves the program's original semantics. The
	motivation is straightforward: by executing the bodies of two loops in
	a single pass over the iteration space, we reduce loop overhead (fewer
	branch instructions, fewer induction variable updates), improve
	temporal data locality (data written by the first loop body and read
	by the second is still in cache), and create new opportunities for
	subsequent optimizations such as instruction scheduling and register
	allocation.

	LLVM's implementation resides in
	``llvm/lib/Transforms/Scalar/LoopFuse.cpp`` and is based on
	Christopher Barton's MSc thesis, *"Code Transformations to Augment the
	Scope of Loop Fusion in a Production Compiler"*. The pass operates on
	LLVM IR, leveraging several core analysis frameworks --
	Scalar Evolution (SCEV), Dependence Analysis
	(DA), and Dominator/Post-Dominator Trees -- to determine legality and
	perform the CFG rewiring that fuses two loops into one.

	2. Prerequisite Concepts
	========================

	The fusion pass relies on several standard LLVM loop concepts that
	are documented elsewhere: simplified loop form, rotated loop form,
	loop guard branches, dominator and post-dominator trees, Scalar
	Evolution (SCEV), and trip count with peeling. The analysis that
	drives the most involved legality check -- Dependence Analysis (DA)
	-- is summarized below, because later sections refer directly to its
	dependence kinds and direction vectors.

	DA determines the dependence relation between pairs of memory
	accesses. A dependence from instruction S1 to S2 is characterized by:

	- Flow (true) dependence: S1 writes, S2 reads the same location
	(read-after-write).
	- Anti dependence: S1 reads, S2 writes (write-after-read).
	- Output dependence: Both S1 and S2 write (write-after-write).

	Each dependence carries a direction vector at each loop nesting
	level, indicating whether the source iteration is less than (``<``),
	equal to (``=``), or greater than (``>``) the sink iteration at that
	level. A dependence with a ``>`` component at the current loop level
	represents a backward loop-carried dependence (also called a
	negative-distance dependence). Such dependences are the critical
	hazard that loop fusion must respect.


	3. High-Level Algorithm
	=======================

	The pass operates in a top-down, level-by-level fashion over the loop
	nest tree. At each nesting depth, it:

	1. Collects fusion candidates: Identifies each eligible loop and
	partitions the eligible loops into ordered chains of
	control-flow equivalent, strictly adjacent loops sorted in
	dominance order.
	2. Attempts pairwise fusion: Walks each chain linearly, testing
	every consecutive pair ``(FC0, FC1)`` against the four legality
	conditions. If all conditions hold, the pair is fused and replaced
	by the fused loop in the chain, which is then considered for
	further fusion with its successor.
	3. Descends to inner loops: After processing all sibling groups at
	the current depth, the pass descends one level and repeats.

	This strategy means outermost loops are fused first. Fusing inner
	loops is handled in subsequent iterations of the outer while-loop.


	4. Phase 1: Candidate Collection
	=================================

	For each group of sibling loops (loops sharing a parent) at the
	current depth, the pass gathers the loops that are eligible for
	fusion.

	4.1 Eligibility Check
	----------------------

	Each loop is first scanned for disqualifying properties. A loop is
	rejected immediately if any of its blocks has its address taken, if
	any instruction in the loop may throw, or if the loop contains a
	volatile memory access. The pass cannot reason about exception-based
	control flow or volatile ordering, so these are hard blockers.

	Surviving loops must additionally satisfy four structural
	requirements:

	1. Structural validity: preheader, header, exiting block, exit
	block, and latch all exist. This implicitly requires simplified
	form.
	2. Computable trip count: SCEV must be able to compute a
	loop-invariant backedge-taken count. If not, the pass cannot
	compare trip counts.
	3. Simplified form: the loop must be in simplified form (single
	preheader, dedicated exit blocks, single latch).
	4. Rotated form: the loop must be in rotated form (do-while
	shape, with the exit test at the latch). The exiting block must
	be the latch.

	If any check fails, the loop is discarded and an optimization remark
	is emitted explaining why.

	4.2 Grouping by Adjacency
	---------------------------

	Eligible loops are partitioned into ordered chains based on *strict
	adjacency*. Two loops FC0 and FC1 are strictly adjacent if:

	- Unguarded loops: FC0's exit block is exactly FC1's preheader
	(i.e., there is no intervening code between the two loops).
	- Guarded loops: FC0's entry block dominates FC1's entry block,
	and FC0's exit block has a unique successor that is FC1's entry
	block.

	The algorithm appends each new loop to the first existing chain whose
	last element is strictly adjacent to it. If no such chain exists, a
	new chain is started. Because the input loops are already supplied in
	dominance (program) order, this single-pass grouping correctly
	partitions eligible loops into maximal chains of adjacent loops.


	5. Phase 2: Pairwise Fusion Attempts
	=====================================

	The pass iterates over each chain and attempts to fuse consecutive
	pairs ``(FC0, FC1)``. The following legality conditions are checked
	in order, with early exit on failure.

	5.1 Condition 1: Identical Trip Counts (Conformance)
	------------------------------------------------------

	The conformance check uses SCEV to retrieve the backedge-taken count
	of both loops. If the SCEV expressions are identical (pointer equality
	after canonicalization), the loops are conforming.

	If they differ but both are small constants, the pass computes the
	arithmetic difference. If the first loop has more iterations than the
	second, and the difference does not exceed the command-line limit
	``-loop-fusion-peel-max-count`` (default 0), the pass marks the pair
	as eligible for peeling. Peeling the first loop by the difference
	will equalize the trip counts.

	The current implementation does not support the case where the second
	loop has more iterations than the first.

	5.2 Condition 2: Compatible Guard Structure
	--------------------------------------------

	Both loops must be either both guarded or both unguarded. If one is
	guarded and the other is not, fusion is rejected. When both are
	guarded, the guard-comparison check ensures that:

	1. The guard condition instructions are structurally identical
	(same opcode, same operands).
	2. The true/false successors have the same polarity relative to the
	loop (i.e., both guards branch to the preheader on the same
	condition outcome).

	Additionally, for guarded loops, the pass verifies that instructions
	in FC0's exit block can be safely moved before FC1's exit block, and
	that FC1's guard block instructions can be safely moved before FC0's
	guard block terminator. These moves reuse LLVM's generic code-motion
	safety helpers.

	5.3 Condition 3: No Negative-Distance Dependencies
	----------------------------------------------------

	The dependence check is the most involved legality analysis. It
	examines all pairs of memory accesses where at least one is a write:

	- (FC0 writes) x (FC1 writes) -- output dependences
	- (FC0 writes) x (FC1 reads) -- flow dependences
	- (FC1 writes) x (FC0 reads) -- anti dependences

	Additionally, it checks for cross-loop def-use chains: if any
	instruction in FC1 uses a value defined inside FC0, fusion is rejected
	outright. This is because after fusion, the definition from a given
	iteration of the original FC0 would need to be available to the same
	iteration of FC1, but the interleaved execution order may not preserve
	this.

	For each memory pair, the pass invokes one of three dependence
	analysis strategies, selectable via
	``--loop-fusion-dependence-analysis``. The default is DA-based
	analysis (``da``); the SCEV-based and combined modes are retained as
	opt-ins. The SCEV-based path is expected to be removed in a future
	change.

	SCEV-based analysis:
	Rewrites the access function of FC0's instruction into FC1's loop
	iteration space by substituting FC0's loop with FC1's loop inside
	every add-recurrence subexpression. The pass then asks SCEV whether
	FC0's address is known to be greater than or equal to FC1's address.
	A non-negative difference means no backward dependence.

	DA-based analysis:
	Queries LLVM's dependence analysis on the pair. If no dependence
	exists, fusion is safe. If a dependence exists, the pass examines its
	direction vector:

	1. At outer loop levels (levels above the current fusion level): if
	any level has a direction that excludes equality (``EQ``), the
	outer indices differ and the dependence does not constrain fusion
	at the current level.
	2. At the current level: if the direction excludes ``GT``
	(greater-than), there is no backward loop-carried dependence, and
	fusion is safe. For example, a pure ``LT`` direction indicates a
	forward dependence like ``A[i] = ...; ... = A[i-1]``, which
	remains valid after fusion.
	3. Loop-invariant (scalar) non-anti dependences at the current level
	are also safe.

	Combined analysis:
	Accepts the pair if either SCEV-based or DA-based analysis approves
	it.

	5.4 Condition 4: Empty or Movable Preheader
	---------------------------------------------

	FC1's preheader must be empty (containing only the terminator branch)
	or all its instructions must be safely movable. The preheader-motion
	check classifies each preheader instruction as either hoistable (to
	FC0's preheader) or sinkable (into FC1's body after the fused loop):

	- Hoisting: An instruction can be hoisted if all its operands
	either dominate FC0's preheader target or have already been
	classified as hoistable. PHI nodes cannot be hoisted. Memory
	instructions can be hoisted only if they have no flow, anti, or
	output dependences with non-hoisted preheader instructions or with
	FC0's memory accesses.
	- Sinking: An instruction can be sunk if none of its users are
	inside FC1's loop body, and it has no flow or output dependences
	with FC1's memory accesses.
	- Atomic and volatile instructions cannot be moved.

	If any instruction is neither hoistable nor sinkable, fusion is
	abandoned for this pair.

	5.5 Profitability
	------------------

	The profitability check currently always reports that fusion is
	beneficial. This is intentional for testing coverage and is expected
	to evolve to include cost-model heuristics (e.g., register pressure
	estimation, cache footprint analysis) in the future.


	6. Phase 3: The Fusion Transformation
	======================================

	Once all legality checks pass, the transformation proceeds in two
	stages: an optional peeling step and the actual CFG rewiring.

	6.1 Loop Peeling (Optional)
	-----------------------------

	If the trip counts differ by a constant ``d``, the pass peels ``d``
	iterations from FC0. This extracts the first ``d`` iterations as
	straight-line code before the loop, so the remaining loop has the
	same trip count as FC1. After peeling:

	1. The post-dominator tree is recalculated.
	2. FC0's cached block pointers are refreshed.
	3. The peeled iteration blocks' branches are rewritten to remove edges
	to FC1's preheader, ensuring FC0's entry block still dominates
	FC1's entry block.

	6.2 CFG Rewiring (Non-Guarded Loops)
	--------------------------------------

	For non-guarded loops, the CFG transformation performs these steps:

	1. Merge preheaders: Move all instructions from FC1's preheader
	into FC0's preheader (before the terminator).

	2. Rewire FC0's exiting block: Instead of branching to FC1's
	preheader (which is about to be deleted), the exiting block now
	targets FC1's header directly.

	3. Delete FC1's preheader: It has no predecessors left; replace
	its terminator with ``unreachable``.

	4. Move PHI nodes: All PHI nodes from FC1's header are moved to
	FC0's header. If a PHI has no uses, it is deleted.

	5. Introduce intermediate PHI nodes: If FC0's exiting block is
	not the latch (a rare case given the rotated form requirement),
	new PHI nodes are inserted in FC1's header. These select the
	loop-carried value from FC0 when arriving via FC0's latch, or
	``poison`` when arriving via FC0's exiting block (which means the
	loop is exiting and the value is dead).

	6. Reconnect latches:

	- FC0's latch, which previously branched back to FC0's header, now
	branches to FC1's header.
	- FC1's latch, which previously branched back to FC1's header, now
	branches to FC0's header.
	- FC0's latch branch is simplified to unconditional (both targets
	are now FC1's header).

	7. Transfer loop blocks: All basic blocks belonging to FC1 are
	moved into FC0. All child loops of FC1 are re-parented under FC0.

	8. Merge latches: Instructions from FC0's old latch are moved to
	FC1's latch (which is now the single latch of the fused loop). If
	FC0's latch has a unique successor, the blocks are merged.

	9. Cleanup: FC1 is erased from the function's loop information.
	SCEV caches for both loops are invalidated, and the dominator and
	post-dominator trees are updated in a single batched transaction.

	The resulting fused loop has:

	- FC0's preheader as its preheader.
	- FC0's header as its header.
	- A body containing blocks from both FC0 and FC1.
	- FC1's latch as its latch.
	- FC1's exit block as its exit block.

	6.3 CFG Rewiring (Guarded Loops)
	----------------------------------

	The guarded-loop fusion path handles the additional complexity of
	guard branches and exit blocks:

	1. FC0's guard is updated to use FC1's non-loop block as its bypass
	target, effectively making FC0's guard protect both loops.
	2. FC0's exit block instructions are moved to FC1's exit block.
	3. FC1's guard block is disconnected and deleted.
	4. FC0's exit block is disconnected and deleted.
	5. The latch rewiring and block transfer proceed identically to the
	non-guarded case.

	6.4 Post-Fusion Bookkeeping
	-----------------------------

	After fusion, the fused loop replaces the original pair in the chain
	and becomes the new left operand for the next pairwise attempt. This
	enables cascading fusion: if three adjacent loops A, B, C are all
	fusible, A+B is fused first, then (A+B)+C is attempted.

	FC1's loop is also recorded as removed so that it is skipped when the
	pass descends to inner nesting levels.


	7. Limitations
	===============

	The current implementation is purely opportunistic: it fuses only
	loop pairs that already satisfy the four legality conditions. It does
	not reshape the surrounding code to create new fusion opportunities,
	and several legality checks are intentionally conservative.

	7.1 Algorithmic Scope
	----------------------

	- No loop reshaping to enable fusion. The pass does not insert
	guards, rotate loops, run loop-simplify, or otherwise modify loops
	that miss a legality condition. Loops that are not already in
	rotated, simplified form with a computable trip count are discarded
	as ineligible.
	- No profitability cost model. The profitability check
	unconditionally reports every legal pair as beneficial. Fusion
	proceeds regardless of cache footprint, register pressure, or any
	interaction with downstream vectorization.
	- Cross-loop def-use chains are always rejected. If any
	instruction in FC1 uses a value produced inside FC0, the pair is
	rejected outright. Patterns such as a reduction whose result is
	consumed by the next loop are blocked even when the producing value
	is loop-invariant and could legally be hoisted.

	7.2 Trip-Count Equalization
	----------------------------

	- Peeling is disabled by default. The maximum number of
	iterations the pass may peel is controlled by the command-line
	option ``-loop-fusion-peel-max-count``, which defaults to ``0``.
	Peeling therefore never fires unless the user opts in explicitly.
	- Peeling shrinks only the first loop. If the second loop has
	more iterations than the first, fusion is rejected. There is no
	attempt to peel the second loop or extend the first.
	- Peeling requires constant trip counts with a single exit. Both
	loops must have a small constant trip count for the peel distance
	to be computed. Symbolic trip-count differences are not handled.


	7.4 Preheader and Block Handling
	---------------------------------

	- Volatile or atomic preheader instructions block fusion. Such
	instructions cannot be moved, so a non-empty preheader that
	contains them prevents fusion.
	- **Memory-reading preheader instructions are handled
	conservatively.** The pass does try to move memory-reading
	preheader instructions, but only when dependence checks prove
	hoisting or sinking is safe. This conservative filtering can still
	reject fusible-looking cases.
	- FC1's preheader must reduce to a single terminator. After
	hoisting and sinking, the simple (non-guarded) fusion path requires
	FC1's preheader to contain exactly one instruction -- its branch.
	Any instruction that is neither hoistable nor sinkable prevents
	fusion rather than being preserved by block merging.
	- Blocks are rewired, not merged. Preheaders, latches, and exit
	blocks are reconnected to new successors during fusion, but never
	merged with their neighbors. The fused Control-Flow Graph (CFG)
	therefore retains more basic blocks than strictly necessary.

	7.5 Eligibility Blockers
	-------------------------

	A loop is rejected during candidate construction if any of the
	following properties holds:

	- A block in the loop has its address taken (for example, used as a
	``blockaddress`` operand).
	- Any instruction in the loop may throw.
	- The loop contains a volatile ``load`` or ``store``.

	These conditions are hard blockers, not soft preferences: the pass
	never attempts to work around them.

	Atomic operations are not reasoned about. Atomic loads, stores,
	``atomicrmw``, ``cmpxchg``, and ``fence`` instructions inside the
	loop body are tracked only as ordinary memory reads or writes. The
	pass does not inspect their memory ordering (``unordered``,
	``monotonic``, ``acquire``, ``release``, ``seq_cst``), and the
	underlying dependence analysis reasons about address-based data
	dependence rather than inter-thread synchronization. Fusing two
	loops that contain non-``unordered`` atomics can therefore change
	observable multi-threaded behavior even when the dependence check
	reports no conflict.