[IR2Vec][llvm-ir2vec] Changing clEnumValN to cl::SubCommand (#151384)

Refactor llvm-ir2vec to use subcommands instead of a mode flag for better CLI usability.

- Converted the `--mode` flag to three distinct subcommands: `triplets`, `entities`, and `embeddings`
- Updated documentation, tests, and Python script

GitOrigin-RevId: 21f1f9558df3830ffa637def364e3c0cb0dbb3c0
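
Migration note: callers that still build old-style `--mode=<name>` argument lists can convert them mechanically. The Python sketch below is illustrative only. `migrate_args` and `SUBCOMMANDS` are hypothetical names that are not part of LLVM or of this patch. It assumes that only the `--mode=<name>` spelling needs handling and that an absent `--mode` meant `embeddings`, which was the old default.

```python
from typing import List

# The three subcommand names introduced by this change.
SUBCOMMANDS = ("triplets", "entities", "embeddings")


def migrate_args(args: List[str]) -> List[str]:
    """Rewrite old-style llvm-ir2vec arguments to the subcommand form.

    `args` excludes the binary name. Hypothetical helper, not part of LLVM.
    Assumption: only the `--mode=<name>` spelling is recognized.
    """
    # Arguments that already start with a subcommand are left untouched.
    if args and args[0] in SUBCOMMANDS:
        return list(args)

    mode = "embeddings"  # the old --mode flag defaulted to embeddings
    rest = []
    for arg in args:
        if arg.startswith("--mode="):
            mode = arg[len("--mode="):]
        else:
            rest.append(arg)

    if mode not in SUBCOMMANDS:
        raise ValueError(f"unknown mode: {mode!r}")

    # The subcommand comes first, directly after the binary name.
    return [mode] + rest
```

For example, `migrate_args(["--mode=triplets", "-", "-o", "-"])` yields the argument list that `generateTriplets.py` passes after this change. The helper does not drop `--level` or `--function` from `triplets` or `entities` invocations, although those options are now registered only for `embeddings`.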
diff --git a/docs/CommandGuide/llvm-ir2vec.rst b/docs/CommandGuide/llvm-ir2vec.rst
index 2f00c9f..0c9fb6e 100644
--- a/docs/CommandGuide/llvm-ir2vec.rst
+++ b/docs/CommandGuide/llvm-ir2vec.rst
@@ -6,27 +6,27 @@ SYNOPSIS -------- -:program:`llvm-ir2vec` [*options*] *input-file* +:program:`llvm-ir2vec` [*subcommand*] [*options*] DESCRIPTION ----------- :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It generates IR2Vec embeddings for LLVM IR and supports triplet generation -for vocabulary training. It provides three main operation modes: +for vocabulary training. The tool provides three main subcommands: -1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary +1. **triplets**: Generates numeric triplets in train2id format for vocabulary training from LLVM IR. -2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary +2. **entities**: Generates entity mapping files (entity2id.txt) for vocabulary training. -3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary +3. **embeddings**: Generates IR2Vec embeddings using a trained vocabulary at different granularity levels (instruction, basic block, or function). The tool is designed to facilitate machine learning applications that work with LLVM IR by converting the IR into numerical representations that can be used by -ML models. The triplet mode generates numeric IDs directly instead of string +ML models. The `triplets` subcommand generates numeric IDs directly instead of string triplets, streamlining the training data preparation workflow. .. note:: @@ -53,111 +53,115 @@ See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how these two modes are used to generate the triplets and entity mappings. -Triplet Generation Mode -~~~~~~~~~~~~~~~~~~~~~~~ +Triplet Generation +~~~~~~~~~~~~~~~~~~ -In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric -triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets -are generated in the standard format used for knowledge graph embedding training. 
-The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping +With the `triplets` subcommand, :program:`llvm-ir2vec` analyzes LLVM IR and extracts +numeric triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets +are generated in the standard format used for knowledge graph embedding training. +The tool outputs numeric IDs directly using the ir2vec::Vocabulary mapping infrastructure, eliminating the need for string-to-ID preprocessing. Usage: .. code-block:: bash - llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt + llvm-ir2vec triplets input.bc -o triplets_train2id.txt -Entity Mapping Generation Mode -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Entity Mapping Generation +~~~~~~~~~~~~~~~~~~~~~~~~~ -In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by -IR2Vec in the standard format used for knowledge graph embedding training. This -mode outputs all supported entities (opcodes, types, and operands) with their -corresponding numeric IDs, and is not specific for an LLVM IR file. +With the `entities` subcommand, :program:`llvm-ir2vec` generates the entity mappings +supported by IR2Vec in the standard format used for knowledge graph embedding +training. This subcommand outputs all supported entities (opcodes, types, and +operands) with their corresponding numeric IDs, and is not specific for an +LLVM IR file. Usage: .. code-block:: bash - llvm-ir2vec --mode=entities -o entity2id.txt + llvm-ir2vec entities -o entity2id.txt -Embedding Generation Mode -~~~~~~~~~~~~~~~~~~~~~~~~~~ +Embedding Generation +~~~~~~~~~~~~~~~~~~~~ -In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to +With the `embeddings` subcommand, :program:`llvm-ir2vec` uses a pre-trained vocabulary to generate numerical embeddings for LLVM IR at different levels of granularity. Example Usage: .. 
code-block:: bash - llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt + llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt OPTIONS ------- -.. option:: --mode=<mode> - - Specify the operation mode. Valid values are: - - * ``triplets`` - Generate triplets for vocabulary training - * ``entities`` - Generate entity mappings for vocabulary training - * ``embeddings`` - Generate embeddings using trained vocabulary (default) - -.. option:: --level=<level> - - Specify the embedding generation level. Valid values are: - - * ``inst`` - Generate instruction-level embeddings - * ``bb`` - Generate basic block-level embeddings - * ``func`` - Generate function-level embeddings (default) - -.. option:: --function=<name> - - Process only the specified function instead of all functions in the module. - -.. option:: --ir2vec-vocab-path=<path> - - Specify the path to the vocabulary file (required for embedding mode). - The vocabulary file should be in JSON format and contain the trained - vocabulary for embedding generation. See `llvm/lib/Analysis/models` - for pre-trained vocabulary files. - -.. option:: --ir2vec-opc-weight=<weight> - - Specify the weight for opcode embeddings (default: 1.0). This controls - the relative importance of instruction opcodes in the final embedding. - -.. option:: --ir2vec-type-weight=<weight> - - Specify the weight for type embeddings (default: 0.5). This controls - the relative importance of type information in the final embedding. - -.. option:: --ir2vec-arg-weight=<weight> - - Specify the weight for argument embeddings (default: 0.2). This controls - the relative importance of operand information in the final embedding. +Global options: .. option:: -o <filename> - Specify the output filename. Use ``-`` to write to standard output (default). + Specify the output filename. Use ``-`` to write to standard output (default). .. 
option:: --help - Print a summary of command line options. + Print a summary of command line options. -.. note:: +Subcommand-specific options: - ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``, - ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding - mode. These options are ignored in triplet and entity modes. +**embeddings** subcommand: -INPUT FILE FORMAT ------------------ +.. option:: <input-file> -:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files -(``.ll``) as input. The input file should contain valid LLVM IR. + The input LLVM IR or bitcode file to process. This positional argument is + required for the `embeddings` subcommand. + +.. option:: --level=<level> + + Specify the embedding generation level. Valid values are: + + * ``inst`` - Generate instruction-level embeddings + * ``bb`` - Generate basic block-level embeddings + * ``func`` - Generate function-level embeddings (default) + +.. option:: --function=<name> + + Process only the specified function instead of all functions in the module. + +.. option:: --ir2vec-vocab-path=<path> + + Specify the path to the vocabulary file (required for embedding generation). + The vocabulary file should be in JSON format and contain the trained + vocabulary for embedding generation. See `llvm/lib/Analysis/models` + for pre-trained vocabulary files. + +.. option:: --ir2vec-opc-weight=<weight> + + Specify the weight for opcode embeddings (default: 1.0). This controls + the relative importance of instruction opcodes in the final embedding. + +.. option:: --ir2vec-type-weight=<weight> + + Specify the weight for type embeddings (default: 0.5). This controls + the relative importance of type information in the final embedding. + +.. option:: --ir2vec-arg-weight=<weight> + + Specify the weight for argument embeddings (default: 0.2). This controls + the relative importance of operand information in the final embedding. 
+ + +**triplets** subcommand: + +.. option:: <input-file> + + The input LLVM IR or bitcode file to process. This positional argument is + required for the `triplets` subcommand. + +**entities** subcommand: + + No subcommand-specific options. OUTPUT FORMAT -------------
diff --git a/test/tools/llvm-ir2vec/embeddings.ll b/test/tools/llvm-ir2vec/embeddings.ll
index 993ea86..f9aa108 100644
--- a/test/tools/llvm-ir2vec/embeddings.ll
+++ b/test/tools/llvm-ir2vec/embeddings.ll
@@ -1,10 +1,10 @@
-; RUN: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
-; RUN: llvm-ir2vec --mode=embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
-; RUN: llvm-ir2vec --mode=embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
-; RUN: not llvm-ir2vec --mode=embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
-; RUN: llvm-ir2vec --mode=embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
-; RUN: llvm-ir2vec --mode=embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
-; RUN: llvm-ir2vec --mode=embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
+; RUN: llvm-ir2vec embeddings --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-DEFAULT
+; RUN: llvm-ir2vec embeddings --level=func --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL
+; RUN: llvm-ir2vec embeddings --level=func --function=abc --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-FUNC-LEVEL-ABC
+; RUN: not llvm-ir2vec embeddings --level=func --function=def --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-DEF
+; RUN: llvm-ir2vec embeddings --level=bb --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL
+; RUN: llvm-ir2vec embeddings --level=bb --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-BB-LEVEL-ABC-REPEAT
+; RUN: llvm-ir2vec embeddings --level=inst --function=abc_repeat --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s | FileCheck %s -check-prefix=CHECK-INST-LEVEL-ABC-REPEAT
 
 define dso_local noundef float @abc(i32 noundef %a, float noundef %b) #0 {
 entry:
diff --git a/test/tools/llvm-ir2vec/entities.ll b/test/tools/llvm-ir2vec/entities.ll
index 57c3d6f..737044c 100644
--- a/test/tools/llvm-ir2vec/entities.ll
+++ b/test/tools/llvm-ir2vec/entities.ll
@@ -1,4 +1,4 @@
-; RUN: llvm-ir2vec --mode=entities | FileCheck %s
+; RUN: llvm-ir2vec entities | FileCheck %s
 
 CHECK: 92
 CHECK-NEXT: Ret 0
diff --git a/test/tools/llvm-ir2vec/error-handling.ll b/test/tools/llvm-ir2vec/error-handling.ll
index c23c529..b944ea0 100644
--- a/test/tools/llvm-ir2vec/error-handling.ll
+++ b/test/tools/llvm-ir2vec/error-handling.ll
@@ -1,14 +1,7 @@
 ; Test error handling and input validation for llvm-ir2vec tool
 
-; RUN: not llvm-ir2vec --mode=embeddings %s 2>&1 | FileCheck %s -check-prefix=CHECK-NO-VOCAB
-
-; RUN: not llvm-ir2vec --mode=embeddings --function=nonexistent --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-NOT-FOUND
-
-; RUN: llvm-ir2vec --mode=triplets --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json --level=inst %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-LEVEL
-; RUN: llvm-ir2vec --mode=entities --level=inst %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-LEVEL
-
-; RUN: llvm-ir2vec --mode=triplets --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json --function=dummy %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-FUNC
-; RUN: llvm-ir2vec --mode=entities --function=dummy %s 2>&1 | FileCheck %s -check-prefix=CHECK-UNUSED-FUNC
+; RUN: not llvm-ir2vec embeddings %s 2>&1 | FileCheck %s -check-prefix=CHECK-NO-VOCAB
+; RUN: not llvm-ir2vec embeddings --function=nonexistent --ir2vec-vocab-path=%ir2vec_test_vocab_dir/dummy_3D_nonzero_opc_vocab.json %s 2>&1 | FileCheck %s -check-prefix=CHECK-FUNC-NOT-FOUND
 
 ; Simple test function for valid IR
 define i32 @test_func(i32 %a) {
@@ -18,5 +11,3 @@
 
 ; CHECK-NO-VOCAB: error: IR2Vec vocabulary file path not specified; You may need to set it using --ir2vec-vocab-path
 ; CHECK-FUNC-NOT-FOUND: Error: Function 'nonexistent' not found
-; CHECK-UNUSED-LEVEL: Warning: --level option is ignored
-; CHECK-UNUSED-FUNC: Warning: --function option is ignored
diff --git a/test/tools/llvm-ir2vec/triplets.ll b/test/tools/llvm-ir2vec/triplets.ll
index dcd1dc9..a7fd9e4 100644
--- a/test/tools/llvm-ir2vec/triplets.ll
+++ b/test/tools/llvm-ir2vec/triplets.ll
@@ -1,4 +1,4 @@
-; RUN: llvm-ir2vec --mode=triplets %s | FileCheck %s -check-prefix=TRIPLETS
+; RUN: llvm-ir2vec triplets %s | FileCheck %s -check-prefix=TRIPLETS
 
 define i32 @simple_add(i32 %a, i32 %b) {
 entry:
diff --git a/tools/llvm-ir2vec/llvm-ir2vec.cpp b/tools/llvm-ir2vec/llvm-ir2vec.cpp
index f6ed94b..8e17a4a 100644
--- a/tools/llvm-ir2vec/llvm-ir2vec.cpp
+++ b/tools/llvm-ir2vec/llvm-ir2vec.cpp
@@ -9,22 +9,22 @@ /// \file /// This file implements the IR2Vec embedding generation tool. /// -/// This tool provides three main modes: +/// This tool provides three main subcommands: /// -/// 1. Triplet Generation Mode (--mode=triplets): +/// 1. Triplet Generation (triplets): /// Generates numeric triplets (head, tail, relation) for vocabulary /// training. Output format: MAX_RELATION=N header followed by /// head\ttail\trelation lines. Relations: 0=Type, 1=Next, 2+=Arg0,Arg1,... -/// Usage: llvm-ir2vec --mode=triplets input.bc -o train2id.txt +/// Usage: llvm-ir2vec triplets input.bc -o train2id.txt /// -/// 2. Entities Generation Mode (--mode=entities): +/// 2. Entity Mappings (entities): /// Generates entity mappings for vocabulary training. /// Output format: <total_entities> header followed by entity\tid lines. -/// Usage: llvm-ir2vec --mode=entities input.bc -o entity2id.txt +/// Usage: llvm-ir2vec entities input.bc -o entity2id.txt /// -/// 3. Embedding Generation Mode (--mode=embeddings): +/// 3. Embedding Generation (embeddings): /// Generates IR2Vec embeddings using a trained vocabulary. 
-/// Usage: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json +/// Usage: llvm-ir2vec embeddings --ir2vec-vocab-path=vocab.json /// --level=func input.bc -o embeddings.txt Levels: --level=inst /// (instructions), --level=bb (basic blocks), --level=func (functions) /// (See IR2Vec.cpp for more embedding generation options) @@ -55,36 +55,33 @@ static cl::OptionCategory IR2VecToolCategory("IR2Vec Tool Options"); +// Subcommands +static cl::SubCommand + TripletsSubCmd("triplets", "Generate triplets for vocabulary training"); +static cl::SubCommand + EntitiesSubCmd("entities", + "Generate entity mappings for vocabulary training"); +static cl::SubCommand + EmbeddingsSubCmd("embeddings", + "Generate embeddings using trained vocabulary"); + +// Common options static cl::opt<std::string> InputFilename(cl::Positional, cl::desc("<input bitcode file or '-' for stdin>"), - cl::init("-"), cl::cat(IR2VecToolCategory)); + cl::init("-"), cl::sub(TripletsSubCmd), + cl::sub(EmbeddingsSubCmd), cl::cat(IR2VecToolCategory)); static cl::opt<std::string> OutputFilename("o", cl::desc("Output filename"), cl::value_desc("filename"), cl::init("-"), cl::cat(IR2VecToolCategory)); -enum ToolMode { - TripletMode, // Generate triplets for vocabulary training - EntityMode, // Generate entity mappings for vocabulary training - EmbeddingMode // Generate embeddings using trained vocabulary -}; - -static cl::opt<ToolMode> Mode( - "mode", cl::desc("Tool operation mode:"), - cl::values(clEnumValN(TripletMode, "triplets", - "Generate triplets for vocabulary training"), - clEnumValN(EntityMode, "entities", - "Generate entity mappings for vocabulary training"), - clEnumValN(EmbeddingMode, "embeddings", - "Generate embeddings using trained vocabulary")), - cl::init(EmbeddingMode), cl::cat(IR2VecToolCategory)); - +// Embedding-specific options static cl::opt<std::string> FunctionName("function", cl::desc("Process specific function only"), cl::value_desc("name"), cl::Optional, cl::init(""), - 
cl::cat(IR2VecToolCategory)); + cl::sub(EmbeddingsSubCmd), cl::cat(IR2VecToolCategory)); enum EmbeddingLevel { InstructionLevel, // Generate instruction-level embeddings @@ -93,14 +90,15 @@ }; static cl::opt<EmbeddingLevel> - Level("level", cl::desc("Embedding generation level (for embedding mode):"), + Level("level", cl::desc("Embedding generation level:"), cl::values(clEnumValN(InstructionLevel, "inst", "Generate instruction-level embeddings"), clEnumValN(BasicBlockLevel, "bb", "Generate basic block-level embeddings"), clEnumValN(FunctionLevel, "func", "Generate function-level embeddings")), - cl::init(FunctionLevel), cl::cat(IR2VecToolCategory)); + cl::init(FunctionLevel), cl::sub(EmbeddingsSubCmd), + cl::cat(IR2VecToolCategory)); namespace { @@ -291,7 +289,7 @@ Error processModule(Module &M, raw_ostream &OS) { IR2VecTool Tool(M); - if (Mode == EmbeddingMode) { + if (EmbeddingsSubCmd) { // Initialize vocabulary for embedding generation // Note: Requires --ir2vec-vocab-path option to be set auto VocabStatus = Tool.initializeVocabulary(); @@ -311,6 +309,7 @@ Tool.generateEmbeddings(OS); } } else { + // Both triplets and entities use triplet generation Tool.generateTriplets(OS); } return Error::success(); @@ -334,14 +333,6 @@ "See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more " "information.\n"); - // Validate command line options - if (Mode != EmbeddingMode) { - if (Level.getNumOccurrences() > 0) - errs() << "Warning: --level option is ignored\n"; - if (FunctionName.getNumOccurrences() > 0) - errs() << "Warning: --function option is ignored\n"; - } - std::error_code EC; raw_fd_ostream OS(OutputFilename, EC); if (EC) { @@ -349,7 +340,7 @@ return 1; } - if (Mode == EntityMode) { + if (EntitiesSubCmd) { // Just dump entity mappings without processing any IR IR2VecTool::generateEntityMappings(OS); return 0;
diff --git a/utils/mlgo-utils/IR2Vec/generateTriplets.py b/utils/mlgo-utils/IR2Vec/generateTriplets.py
index c48503e..80ac4c6 100644
--- a/utils/mlgo-utils/IR2Vec/generateTriplets.py
+++ b/utils/mlgo-utils/IR2Vec/generateTriplets.py
@@ -124,7 +124,7 @@ # Run llvm-ir2vec with opt's output as input ir2vec_proc = subprocess.Popen( - [self.ir2vec_binary, "--mode=triplets", "-", "-o", "-"], + [self.ir2vec_binary, "triplets", "-", "-o", "-"], stdin=opt_proc.stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE, @@ -223,7 +223,7 @@ def _generate_entity2id(self, output_file: Path) -> None: """Generate entity2id.txt using llvm-ir2vec""" subprocess.run( - [str(self.ir2vec_binary), "--mode=entities", "-o", str(output_file)], + [str(self.ir2vec_binary), "entities", "-o", str(output_file)], check=True, capture_output=True, )