llvm-ir2vec - IR2Vec Embedding Generation Tool
==============================================

.. program:: llvm-ir2vec

SYNOPSIS
--------

:program:`llvm-ir2vec` [*subcommand*] [*options*]

DESCRIPTION
-----------

:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
generates IR2Vec embeddings for LLVM IR and supports triplet generation
for vocabulary training. The tool provides three main subcommands:

1. **triplets**: Generates numeric triplets in train2id format for vocabulary
   training from LLVM IR.

2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary
   training.

3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary
   at different granularity levels (instruction, basic block, or function).

The tool is designed to facilitate machine learning applications that work with
LLVM IR by converting the IR into numerical representations that can be used by
ML models. The `triplets` subcommand generates numeric IDs directly instead of string
triplets, streamlining the training data preparation workflow.

.. note::

   For information about using IR2Vec programmatically within LLVM passes and
   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
   section in the MLGO documentation.

OPERATION MODES
---------------

The Triplet Generation and Entity Mapping modes prepare the vocabulary and
training data for knowledge graph embeddings. The Embedding mode generates
embeddings from LLVM IR using a pre-trained vocabulary.

The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
by modeling the relationships between opcodes, types, and operands as a knowledge
graph. For this purpose, the Triplet Generation and Entity Mapping modes generate
triplets and entity mappings in the standard format used for knowledge graph
embedding training (see the
`OpenKE data format <https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>`_
for details).

See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how
these two modes are used to generate the triplets and entity mappings.

Triplet Generation
~~~~~~~~~~~~~~~~~~

With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts
numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
are generated in the standard format used for knowledge graph embedding training.
The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping
infrastructure, eliminating the need for string-to-ID preprocessing.

Usage:

.. code-block:: bash

   llvm-ir2vec triplets input.bc -o triplets_train2id.txt

Entity Mapping Generation
~~~~~~~~~~~~~~~~~~~~~~~~~

With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings
supported by IR2Vec in the standard format used for knowledge graph embedding
training. This subcommand outputs all supported entities (opcodes, types, and
operands) with their corresponding numeric IDs. The output does not depend on
any particular LLVM IR file, so no input file is needed.

Usage:

.. code-block:: bash

   llvm-ir2vec entities -o entity2id.txt

Embedding Generation
~~~~~~~~~~~~~~~~~~~~

With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
generate numerical embeddings for LLVM IR at different levels of granularity.

Example Usage:

.. code-block:: bash

   llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
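
The invocation above can also be assembled from a script. The following is a
minimal Python sketch, not part of the tool: ``build_embeddings_command`` and
``run_embeddings`` are hypothetical helper names, the only flags used are the
ones documented on this page, and ``run_embeddings`` assumes that
``llvm-ir2vec`` is available on ``PATH``.

```python
import subprocess
from typing import List, Optional

# Level names as documented for --level.
VALID_LEVELS = ("inst", "bb", "func")


def build_embeddings_command(
    input_file: str,
    vocab_path: str,
    level: str = "func",
    function: Optional[str] = None,
    output: str = "-",
    tool: str = "llvm-ir2vec",  # assumption: the binary is on PATH
) -> List[str]:
    """Assemble the argument list for the `embeddings` subcommand."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    cmd = [
        tool,
        "embeddings",
        f"--ir2vec-vocab-path={vocab_path}",
        f"--level={level}",
    ]
    if function is not None:
        # Restrict processing to a single function in the module.
        cmd.append(f"--function={function}")
    cmd += [input_file, "-o", output]
    return cmd


def run_embeddings(*args, **kwargs) -> str:
    """Run the tool and return its standard output.

    Raises subprocess.CalledProcessError on a non-zero exit status.
    """
    cmd = build_embeddings_command(*args, **kwargs)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Keeping command construction separate from execution makes the argument list
easy to inspect or log before anything is run.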

OPTIONS
-------

Global options:

.. option:: -o <filename>

   Specify the output filename. Use ``-`` to write to standard output (default).

.. option:: --help

   Print a summary of command line options.

Subcommand-specific options:

**embeddings** subcommand:

.. option:: <input-file>

   The input LLVM IR or bitcode file to process. This positional argument is
   required for the `embeddings` subcommand.

.. option:: --level=<level>

   Specify the embedding generation level. Valid values are:

   * ``inst`` - Generate instruction-level embeddings
   * ``bb`` - Generate basic block-level embeddings
   * ``func`` - Generate function-level embeddings (default)

.. option:: --function=<name>

   Process only the specified function instead of all functions in the module.

.. option:: --ir2vec-vocab-path=<path>

   Specify the path to the vocabulary file (required for embedding generation).
   The vocabulary file should be in JSON format and contain the trained
   vocabulary for embedding generation. See `llvm/lib/Analysis/models`
   for pre-trained vocabulary files.

.. option:: --ir2vec-opc-weight=<weight>

   Specify the weight for opcode embeddings (default: 1.0). This controls
   the relative importance of instruction opcodes in the final embedding.

.. option:: --ir2vec-type-weight=<weight>

   Specify the weight for type embeddings (default: 0.5). This controls
   the relative importance of type information in the final embedding.

.. option:: --ir2vec-arg-weight=<weight>

   Specify the weight for argument embeddings (default: 0.2). This controls
   the relative importance of operand information in the final embedding.
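
To give an intuition for how the three weights interact, the sketch below
combines an opcode vector, a type vector, and operand vectors as a plain
weighted sum, in the spirit of the IR2Vec paper. This is an illustrative
assumption only: this page does not specify the exact computation performed by
the tool, and ``combine_instruction_embedding`` is a hypothetical name. The
default weights are the ones listed above.

```python
from typing import List, Sequence


def combine_instruction_embedding(
    opcode_vec: Sequence[float],
    type_vec: Sequence[float],
    arg_vecs: Sequence[Sequence[float]],
    opc_weight: float = 1.0,   # default of --ir2vec-opc-weight
    type_weight: float = 0.5,  # default of --ir2vec-type-weight
    arg_weight: float = 0.2,   # default of --ir2vec-arg-weight
) -> List[float]:
    """Weighted sum of opcode, type, and operand vectors (assumed formula)."""
    dim = len(opcode_vec)
    if len(type_vec) != dim or any(len(a) != dim for a in arg_vecs):
        raise ValueError("all vectors must have the same dimension")
    # Start from the weighted opcode and type contributions.
    out = [opc_weight * o + type_weight * t for o, t in zip(opcode_vec, type_vec)]
    # Add each operand's contribution, scaled by the argument weight.
    for arg in arg_vecs:
        for i in range(dim):
            out[i] += arg_weight * arg[i]
    return out
```

Under this assumed formula, raising one weight relative to the others makes
the corresponding component dominate the resulting vector.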

**triplets** subcommand:

.. option:: <input-file>

   The input LLVM IR or bitcode file to process. This positional argument is
   required for the `triplets` subcommand.

**entities** subcommand:

No subcommand-specific options.

OUTPUT FORMAT
-------------

Triplet Mode Output
~~~~~~~~~~~~~~~~~~~

In triplet mode, the output consists of numeric triplets in train2id format,
preceded by a metadata header. The format is:

.. code-block:: text

   MAX_RELATIONS=<max_relations_count>
   <head_entity_id> <tail_entity_id> <relation_id>
   <head_entity_id> <tail_entity_id> <relation_id>
   ...

Each line after the metadata header represents one instruction relationship,
with numeric IDs for the head entity, tail entity, and relation, in that order.
The metadata header (``MAX_RELATIONS``) provides the maximum relation count for
post-processing and training setup.
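
A consumer of this output might read it as sketched below. This is a minimal
Python example, not part of the tool: ``parse_triplets`` is a hypothetical
helper, the fields are assumed to be separated by whitespace, and the sample
values are made up for illustration.

```python
from typing import List, Tuple

Triplet = Tuple[int, int, int]  # (head_entity_id, tail_entity_id, relation_id)


def parse_triplets(text: str) -> Tuple[int, List[Triplet]]:
    """Parse triplet-mode output into (max_relations, triplets)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("MAX_RELATIONS="):
        raise ValueError("expected a MAX_RELATIONS=<n> header on the first line")
    max_relations = int(lines[0].split("=", 1)[1])
    triplets: List[Triplet] = []
    for ln in lines[1:]:
        # Unpacking fails with ValueError unless the line has exactly 3 fields.
        head, tail, relation = (int(tok) for tok in ln.split())
        triplets.append((head, tail, relation))
    return max_relations, triplets


# Made-up sample data, for illustration only.
sample = "MAX_RELATIONS=3\n0 5 0\n5 9 2\n"
max_relations, triplets = parse_triplets(sample)
print(max_relations, triplets)
```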

Entity Mode Output
~~~~~~~~~~~~~~~~~~

In entity mode, the output consists of entity mappings in the following format:

.. code-block:: text

   <total_entities>
   <entity_string> <numeric_id>
   <entity_string> <numeric_id>
   ...

The first line contains the total number of entities. It is followed by one
entity mapping per line, with the entity string and numeric ID separated by a tab.
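
The entity file can be loaded into a dictionary along these lines. This is a
minimal Python sketch, not part of the tool: ``parse_entity2id`` is a
hypothetical helper, and the entity names in the sample are placeholders
rather than real IR2Vec entities.

```python
from typing import Dict


def parse_entity2id(text: str) -> Dict[str, int]:
    """Parse entity-mode output into a {entity_string: numeric_id} mapping."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty entity file")
    total = int(lines[0].strip())
    mapping: Dict[str, int] = {}
    for ln in lines[1:]:
        # Split on the last tab so the entity string itself is kept intact.
        name, ident = ln.rstrip().rsplit("\t", 1)
        mapping[name] = int(ident)
    if len(mapping) != total:
        raise ValueError(f"header says {total} entities, found {len(mapping)}")
    return mapping


# Placeholder sample data, for illustration only.
sample = "2\nentityA\t0\nentityB\t1\n"
print(parse_entity2id(sample))
```

Checking the parsed count against the header is a cheap way to detect a
truncated file.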

Embedding Mode Output
~~~~~~~~~~~~~~~~~~~~~

In embedding mode, the output format depends on the specified level:

* **Function Level**: One embedding vector per function
* **Basic Block Level**: One embedding vector per basic block, grouped by function
* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function

Each embedding is represented as a floating-point vector.

EXIT STATUS
-----------

:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.

Common failure cases include:

* Invalid or missing input file
* Missing or invalid vocabulary file (in embedding mode)
* Specified function not found in the module
* Invalid command line options

SEE ALSO
--------

:doc:`../MLGO`

For more information about the IR2Vec algorithm and approach, see:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.