| llvm-ir2vec - IR2Vec Embedding Generation Tool |
| ============================================== |
| |
| .. program:: llvm-ir2vec |
| |
| SYNOPSIS |
| -------- |
| |
| :program:`llvm-ir2vec` [*options*] *input-file* |
| |
| DESCRIPTION |
| ----------- |
| |
| :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It |
| generates IR2Vec embeddings for LLVM IR and supports triplet generation |
| for vocabulary training. It provides two main operation modes: |
| |
| 1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary |
| training from LLVM IR. |
| |
| 2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary |
| at different granularity levels (instruction, basic block, or function). |
| |
| The tool is designed to facilitate machine learning applications that work with |
| LLVM IR by converting the IR into numerical representations that can be used by |
| ML models. |
| |
| .. note:: |
| |
| For information about using IR2Vec programmatically within LLVM passes and |
| the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_ |
| section in the MLGO documentation. |
| |
| OPERATION MODES |
| --------------- |
| |
| Triplet Generation Mode |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets |
| consisting of opcodes, types, and operands. These triplets can be used to train |
| vocabularies for embedding generation. |
| |
| Usage: |
| |
| .. code-block:: bash |
| |
| llvm-ir2vec --mode=triplets input.bc -o triplets.txt |
| |
| Embedding Generation Mode |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to |
| generate numerical embeddings for LLVM IR at different levels of granularity. |
| |
| Example Usage: |
| |
| .. code-block:: bash |
| |
| llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt |
| |
| OPTIONS |
| ------- |
| |
| .. option:: --mode=<mode> |
| |
| Specify the operation mode. Valid values are: |
| |
| * ``triplets`` - Generate triplets for vocabulary training |
| * ``embeddings`` - Generate embeddings using trained vocabulary (default) |
| |
| .. option:: --level=<level> |
| |
| Specify the embedding generation level. Valid values are: |
| |
| * ``inst`` - Generate instruction-level embeddings |
| * ``bb`` - Generate basic block-level embeddings |
| * ``func`` - Generate function-level embeddings (default) |
| |
| .. option:: --function=<name> |
| |
| Process only the specified function instead of all functions in the module. |
| |
| .. option:: --ir2vec-vocab-path=<path> |
| |
| Specify the path to the vocabulary file (required for embedding mode). |
| The vocabulary file should be in JSON format and contain the trained |
| vocabulary for embedding generation. See `llvm/lib/Analysis/models` |
| for pre-trained vocabulary files. |
| |
| .. option:: --ir2vec-opc-weight=<weight> |
| |
| Specify the weight for opcode embeddings (default: 1.0). This controls |
| the relative importance of instruction opcodes in the final embedding. |
| |
| .. option:: --ir2vec-type-weight=<weight> |
| |
| Specify the weight for type embeddings (default: 0.5). This controls |
| the relative importance of type information in the final embedding. |
| |
| .. option:: --ir2vec-arg-weight=<weight> |
| |
| Specify the weight for argument embeddings (default: 0.2). This controls |
| the relative importance of operand information in the final embedding. |
| |
| .. option:: -o <filename> |
| |
| Specify the output filename. Use ``-`` to write to standard output (default). |
| |
| .. option:: --help |
| |
| Print a summary of command line options. |
| |
| .. note:: |
| |
| ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``, |
| ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding |
| mode. These options are ignored in triplet mode. |
| |
| INPUT FILE FORMAT |
| ----------------- |
| |
| :program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files |
| (``.ll``) as input. The input file should contain valid LLVM IR. |
| |
| OUTPUT FORMAT |
| ------------- |
| |
| Triplet Mode Output |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| In triplet mode, the output consists of lines containing space-separated triplets: |
| |
| .. code-block:: text |
| |
| <opcode> <type> <operand1> <operand2> ... |
| |
| Each line represents the information of one instruction, with the opcode, type, |
| and operands. |
| |
| Embedding Mode Output |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| In embedding mode, the output format depends on the specified level: |
| |
| * **Function Level**: One embedding vector per function |
| * **Basic Block Level**: One embedding vector per basic block, grouped by function |
| * **Instruction Level**: One embedding vector per instruction, grouped by basic block and function |
| |
| Each embedding is represented as a floating point vector. |
| |
| EXIT STATUS |
| ----------- |
| |
| :program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure. |
| |
| Common failure cases include: |
| |
| * Invalid or missing input file |
| * Missing or invalid vocabulary file (in embedding mode) |
| * Specified function not found in the module |
| * Invalid command line options |
| |
| SEE ALSO |
| -------- |
| |
| :doc:`../MLGO` |
| |
| For more information about the IR2Vec algorithm and approach, see: |
| `IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_. |