commit: 60bdf0965441ef244a4fd79e4cd056359b9d31d5
author: Sergei Barannikov <barannikov88@gmail.com>
Sat Sep 20 04:58:53 2025 +0300
committer: GitHub <noreply@github.com>
Sat Sep 20 01:58:53 2025 +0000
tree: a8d1bcfa607c57009edfe9bbdcbb071c5651e329
parent: f6c711b426300d2f8b914fc7debe6d09c026740e

[TableGen][DecoderEmitter] Rework table construction/emission (#155889)

### Current state

We have the FilterChooser class, which can be thought of as a **tree of
encodings**. Tree nodes are instances of FilterChooser itself, and come
in two types:

* A node containing a single encoding that has *constant* bits in the
specified bit range, a.k.a. singleton node.
* A node containing only child nodes, where each child represents a set
of encodings that have the same *constant* bits in the specified bit
range.

Either of these nodes can have an additional child, which represents a
set of encodings that have some *unknown* bits in the same bit range.
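
The shape of this tree can be modelled roughly as follows. This is a minimal sketch for illustration only, not the actual FilterChooser code: the `EncodingNode` type, its field names, and the `countEncodings` helper are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Hypothetical model of one node of the encoding tree (not FilterChooser).
struct EncodingNode {
  // Bit range [StartBit, StartBit + NumBits) inspected at this node.
  unsigned StartBit = 0;
  unsigned NumBits = 0;

  // Singleton node: holds exactly one encoding, identified by EncodingID.
  bool IsSingleton = false;
  unsigned EncodingID = 0;

  // Non-singleton node: one child per constant value of the bit range.
  std::map<uint64_t, std::unique_ptr<EncodingNode>> Children;

  // Optional extra child: encodings with some unknown bits in the range.
  std::unique_ptr<EncodingNode> Unknown;
};

// Counts the encodings reachable from N, including the "unknown" subtree.
unsigned countEncodings(const EncodingNode &N) {
  unsigned Count = N.IsSingleton ? 1 : 0;
  for (const auto &Entry : N.Children)
    Count += countEncodings(*Entry.second);
  if (N.Unknown)
    Count += countEncodings(*N.Unknown);
  return Count;
}
```

Note that nothing in such a node says which opcodes will be emitted for it or how many bytes they take, which is what makes the structure high level.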

As can be seen, the **data structure is very high level**.

The encoding tree represented by FilterChooser is then converted into a
finite-state machine (FSM), represented as a **byte array**. The
translation is straightforward: for each node of the tree we emit a
sequence of opcodes that check encoding bits and predicates for each
encoding. For a singleton node we also emit a terminal "decode" opcode.

The translation is done in one go, and this has negative consequences:

* We miss optimization opportunities.
* We have to use "fixups" when encoding transitions in the FSM since we
don't know the size of the data we want to jump over in advance. We have
to emit the data first and then fix up the location of the jump. This
means the fixup size has to be large enough to encode the longest jump,
so **most of the transitions are encoded inefficiently**.
* Finally, when converting the FSM into human-readable form, we have to
**decode the byte array we've just emitted**. This is also done in one
go, so we **can't do any pretty printing**.
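
To make the fixup problem concrete, here is a minimal sketch of fixup-based emission. The 3-byte field width, the little-endian layout, and the helper names are assumptions made for illustration; they are not taken from the existing emitter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed width of every jump field; chosen only for this illustration.
constexpr std::size_t FixupWidth = 3;

// Reserves a fixed-width jump field and returns its position in the table.
// The jump distance is not known yet, so the field is zero-filled for now.
std::size_t emitJumpPlaceholder(std::vector<uint8_t> &Table) {
  std::size_t Pos = Table.size();
  Table.insert(Table.end(), FixupWidth, 0);
  return Pos;
}

// Patches the field at Pos so that it jumps to the current end of the table.
void resolveFixup(std::vector<uint8_t> &Table, std::size_t Pos) {
  std::size_t Delta = Table.size() - (Pos + FixupWidth);
  assert(Delta < (std::size_t(1) << (8 * FixupWidth)) && "jump too long");
  // Store the distance in little-endian order across the reserved bytes.
  for (std::size_t I = 0; I != FixupWidth; ++I)
    Table[Pos + I] = static_cast<uint8_t>(Delta >> (8 * I));
}
```

With this scheme, a jump over two bytes of data still occupies the full three-byte field, because the field had to be reserved before the distance was known.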

### This PR

We introduce an intermediary data structure, the decoder tree, that can be
thought of as an **AST of the decoder program**.
This data structure is **low level** and as such allows for optimization
and analysis.
It resolves all the issues listed above. We can now:
* Emit more optimal opcode sequences.
* Compute the size of the data to be emitted in advance, avoiding
fixups.
* Do pretty printing.
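
The second point can be sketched as follows: once the program exists as a tree, the size of a subtree can be computed before anything is emitted, so a jump over it can use a variable-length encoding. This is a simplified illustration under assumptions, not the code from this patch: the `Node` type is hypothetical, and ULEB128 is used here only as an example of a variable-length encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Number of bytes needed to encode Value as ULEB128.
std::size_t ulebSize(uint64_t Value) {
  std::size_t Size = 0;
  do {
    Value >>= 7;
    ++Size;
  } while (Value != 0);
  return Size;
}

// Appends Value to Out in ULEB128 form (7 data bits per byte, low bits first).
void emitULEB(std::vector<uint8_t> &Out, uint64_t Value) {
  do {
    uint8_t Byte = Value & 0x7f;
    Value >>= 7;
    if (Value != 0)
      Byte |= 0x80; // more bytes follow
    Out.push_back(Byte);
  } while (Value != 0);
}

// Hypothetical AST node: some bytes of its own, plus an optional guarded
// child that the decoder jumps over when the node's check fails.
struct Node {
  std::vector<uint8_t> Payload;
  std::unique_ptr<Node> Guarded;
};

// Size in bytes of the serialized subtree, computed without emitting it.
// (Recomputed recursively for brevity; a real emitter could cache it.)
std::size_t computeSize(const Node &N) {
  std::size_t Size = N.Payload.size();
  if (N.Guarded) {
    std::size_t ChildSize = computeSize(*N.Guarded);
    Size += ulebSize(ChildSize) + ChildSize;
  }
  return Size;
}

// Emits the subtree; the jump distance is known up front, so no fixup.
void emit(const Node &N, std::vector<uint8_t> &Out) {
  Out.insert(Out.end(), N.Payload.begin(), N.Payload.end());
  if (N.Guarded) {
    emitULEB(Out, computeSize(*N.Guarded));
    emit(*N.Guarded, Out);
  }
}
```

In this sketch a jump over fewer than 128 bytes takes a single byte, and only longer jumps pay for a wider field.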

Serialization is done by a new class, DecoderTableEmitter, which
converts the AST into an FSM in **textual form**, streamed right into the
output file.
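
As a rough illustration of why emitting from a tree makes pretty printing easy, the sketch below streams one line per node and indents by nesting depth. The `PrintNode` type, the `emitText` function, and the opcode spellings in the usage note are hypothetical and do not reflect the actual DecoderTableEmitter interface or output format.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST node for illustration; names are not from the patch.
struct PrintNode {
  std::string Text; // textual form of the opcode and its operands
  std::vector<std::unique_ptr<PrintNode>> Children;
};

// Streams one line per node, indenting children by two spaces per level.
// The nesting depth is available here because we walk the tree directly
// instead of decoding an already flattened byte array.
void emitText(const PrintNode &N, std::ostream &OS, unsigned Depth = 0) {
  OS << std::string(2 * Depth, ' ') << N.Text << ",\n";
  for (const auto &Child : N.Children)
    emitText(*Child, OS, Depth + 1);
}
```

For example, a root node with text `OP_CheckField, 0, 4, 1` and one child `OP_Decode, 7` would be streamed as two lines, the second indented by two spaces.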

### Results
* The new approach immediately resulted in 12% total table size savings
across all in-tree targets, without implementing any optimizations on
the AST. Many tables see a ~20% size reduction.
* The generated file is much more readable.
* The implementation is arguably simpler and more straightforward (the
diff is only about +150 to +200 lines, which feels rather small for the benefits
the change gives).

19 files changed
