| # 'gpu' Dialect |
| |
| Note: this dialect is more likely to change than others in the near future; use |
| with caution. |
| |
| This dialect provides middle-level abstractions for launching GPU kernels |
| following a programming model similar to that of CUDA or OpenCL. It provides |
| abstractions for kernel invocations (and may eventually provide those for device |
| management) that are not present at the lower level (e.g., as LLVM IR intrinsics |
| for GPUs). Its goal is to abstract away device- and driver-specific |
| manipulations to launch a GPU kernel and provide a simple path towards GPU |
| execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The |
| dialect uses `gpu` as its canonical prefix. |
| |
This dialect also abstracts away primitives commonly available in GPU code,
such as `gpu.thread_id` (an operation that returns the ID of threads within
| a thread block/workgroup along a given dimension). While the compilation |
| pipelines documented below expect such code to live inside a `gpu.module` and |
| `gpu.func`, these intrinsic wrappers may be used outside of this context. |
| |
| Intrinsic-wrapping operations should not expect that they have a parent of type |
| `gpu.func`. However, operations that deal in compiling and launching GPU functions, |
like `gpu.launch_func` or `gpu.binary`, may assume that the dialect's full layering
| is being used. |
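
For example, here is a minimal sketch (function name hypothetical) of
`gpu.thread_id` used inside a plain `func.func`, with no enclosing `gpu.module`
or `gpu.func`:

```
func.func @workitem_offset() -> index {
  // Query the thread/workitem ID along the x dimension; no gpu.func parent
  // is required for this intrinsic wrapper.
  %tid_x = gpu.thread_id x
  return %tid_x : index
}
```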
| |
| [TOC] |
| |
| ## GPU address spaces |
| |
| The GPU dialect exposes the `gpu.address_space` attribute, which currently has |
| three values: `global`, `workgroup`, and `private`. |
| |
These address spaces represent the types of buffer commonly seen in GPU
compilation. `global` memory resides in the GPU's device-wide global memory.
`workgroup` memory is a limited, per-workgroup resource: all threads in a
workgroup/thread block access the same values in `workgroup` memory. Finally,
`private` memory is used to represent `alloca`-like buffers that are private to
a single thread/workitem.
| |
These address spaces may be used as the memory space of `memref` values. The
`gpu.module`/`gpu.func` compilation pipeline will lower such memory space
usages to the correct address spaces on target platforms. Memory attributions
should be created with the correct memory space on the memref.
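
For instance, a minimal sketch (function and buffer names hypothetical) of
`memref` values carrying these memory spaces:

```
// Copy one element from a buffer in global memory into a thread-private,
// alloca-like scratch buffer.
func.func @copy_to_scratch(%arg0: memref<16xf32, #gpu.address_space<global>>) {
  %c0 = arith.constant 0 : index
  %scratch = memref.alloca() : memref<4xf32, #gpu.address_space<private>>
  %v = memref.load %arg0[%c0] : memref<16xf32, #gpu.address_space<global>>
  memref.store %v, %scratch[%c0] : memref<4xf32, #gpu.address_space<private>>
  return
}
```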
| |
| ## Memory attribution |
| |
Memory buffers are defined at the function level, either in `gpu.launch` or in
`gpu.func` ops. This encoding makes it clear where the memory belongs and makes
the lifetime of the memory visible. The memory is only accessible while the
kernel is launched/the function is currently invoked. This is stricter than
actual GPU implementations, but keeping static memory at the function level is
a convenient simplification. It is also always possible to pass pointers to the
workgroup memory into other functions, provided they expect the correct memory
space.
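
A minimal sketch (names hypothetical) of memory attributions on a `gpu.func`,
with the workgroup buffer passed to a function that expects the matching memory
space:

```
gpu.module @kernels {
  // A helper that expects a buffer in the workgroup memory space.
  func.func @use_buffer(%buf: memref<32xf32, #gpu.address_space<workgroup>>) {
    return
  }
  gpu.func @kernel()
      workgroup(%ws: memref<32xf32, #gpu.address_space<workgroup>>)
      private(%ps: memref<4xf32, #gpu.address_space<private>>)
      kernel {
    // %ws is shared by all threads in the workgroup; %ps is per-thread.
    func.call @use_buffer(%ws)
        : (memref<32xf32, #gpu.address_space<workgroup>>) -> ()
    gpu.return
  }
}
```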
| |
The buffers are considered live throughout the execution of the GPU function
body. The absence of memory attribution syntax means that the function does not
require special buffers. Rationale: although the underlying models declare
memory buffers at the module level, we chose to do it at the function level to
provide some structuring for the lifetime of those buffers. This avoids the
incentive to use the buffers for communicating between different kernels or
launches of the same kernel, which should be done through function arguments
instead. We also chose not to use an `alloca`-style approach, which would
require more complex lifetime analysis, following the principles of MLIR that
promote structure and representing analysis results in the IR.
| |
| ## GPU Compilation |
| ### Compilation overview |
The compilation process in the GPU dialect has two main stages: GPU module
serialization and the translation of offloading operations. Together these
stages can produce GPU binaries and the necessary code to execute them.
| |
An example of what the compilation workflow looks like:
| |
| ``` |
| mlir-opt example.mlir \ |
| --pass-pipeline="builtin.module( \ |
| gpu-kernel-outlining, \ # Outline gpu.launch body to a kernel. |
| nvvm-attach-target{chip=sm_90 O=3}, \ # Attach an NVVM target to a gpu.module op. |
| gpu.module(convert-gpu-to-nvvm), \ # Convert GPU to NVVM. |
| gpu-to-llvm, \ # Convert GPU to LLVM. |
| gpu-module-to-binary \ # Serialize GPU modules to binaries. |
| )" -o example-nvvm.mlir |
| mlir-translate example-nvvm.mlir \ |
| --mlir-to-llvmir \ # Obtain the translated LLVM IR. |
| -o example.ll |
| ``` |
| |
| This compilation process expects all GPU code to live in a `gpu.module` and |
| expects all kernels to be `gpu.func` operations. Non-kernel functions, like |
| device library calls, may be defined using `func.func` or other non-GPU dialect |
operations. This permits downstream systems to use these wrappers without
requiring them to adopt the GPU dialect's function operations, which might not
carry the information those systems want to attach to their functions.
Additionally, this allows `func.func` to be used for device-side library
functions inside `gpu.module`s.
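
A sketch of this layering (function names hypothetical): the kernel is a
`gpu.func`, while a device-side helper is a plain `func.func` in the same
`gpu.module`:

```
gpu.module @kernels {
  // Device-side library function defined with func.func rather than gpu.func.
  func.func @square(%x: f32) -> f32 {
    %0 = arith.mulf %x, %x : f32
    return %0 : f32
  }
  // The kernel itself must be a gpu.func marked with the kernel keyword.
  // Assumes the kernel is launched with a 32-thread workgroup.
  gpu.func @square_all(%buf: memref<32xf32>) kernel {
    %tid = gpu.thread_id x
    %v = memref.load %buf[%tid] : memref<32xf32>
    %sq = func.call @square(%v) : (f32) -> f32
    memref.store %sq, %buf[%tid] : memref<32xf32>
    gpu.return
  }
}
```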
| |
| ### Default NVVM Compilation Pipeline: gpu-lower-to-nvvm-pipeline |
| |
| The `gpu-lower-to-nvvm-pipeline` compilation pipeline serves as the default way |
| for NVVM target compilation within MLIR. This pipeline operates by lowering |
primary dialects (arith, memref, scf, vector, gpu, and nvgpu) to the NVVM
target. It
| begins by lowering GPU code region(s) to the specified NVVM compilation target |
| and subsequently handles the host code. |
| |
This pipeline requires explicitly parallel IR and does not perform GPU
parallelization itself. Any transformations needed to expose parallelism must
be applied before running this pipeline.
| |
It's designed to provide a generic solution for NVVM targets, generating NVVM
and LLVM dialect code compatible with `mlir-cpu-runner` or the execution
engine.
| |
| #### Example: |
| |
| Here's a snippet illustrating the use of primary dialects, including arith, |
| within GPU code execution: |
| |
| ``` |
| func.func @main() { |
| %c2 = arith.constant 2 : index |
| %c1 = arith.constant 1 : index |
| gpu.launch |
| blocks(%0, %1, %2) in (%3 = %c1, %4 = %c1, %5 = %c1) |
| threads(%6, %7, %8) in (%9 = %c2, %10 = %c1, %11 = %c1) { |
| gpu.printf "Hello from %d\n" %6 : index |
| gpu.terminator |
| } |
| return |
| } |
| ``` |
| |
The `gpu-lower-to-nvvm-pipeline` compiles this input code to NVVM as shown
below. It provides customization options like specifying the SM capability, PTX
version, and optimization level. Once compiled, the resulting IR is ready for
execution using `mlir-cpu-runner`. Alternatively, it can be translated into
LLVM IR, expanding its utility within the system.
| |
| ``` |
mlir-opt example.mlir -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_90a cubin-features=+ptx80 opt-level=3"
| ``` |
| |
| ### Module serialization |
Attributes implementing the GPU Target Attribute Interface handle the
serialization process and are called Target attributes. These attributes can be
attached to GPU modules, indicating the serialization scheme with which to
compile the module into a binary string.
| |
The `gpu-module-to-binary` pass searches for all nested GPU modules and
serializes each one using the target attributes attached to it, producing a
binary with an object for every target.
| |
| Example: |
| ``` |
| // Input: |
| gpu.module @kernels [#nvvm.target<chip = "sm_90">, #nvvm.target<chip = "sm_60">] { |
| ... |
| } |
| // mlir-opt --gpu-module-to-binary: |
| gpu.binary @kernels [ |
| #gpu.object<#nvvm.target<chip = "sm_90">, "sm_90 cubin">, |
| #gpu.object<#nvvm.target<chip = "sm_60">, "sm_60 cubin"> |
| ] |
| ``` |
| |
| ### Offloading LLVM translation |
| Attributes implementing the GPU Offloading LLVM Translation Attribute Interface |
| handle the translation of GPU binaries and kernel launches into LLVM |
| instructions and are called Offloading attributes. These attributes are |
| attached to GPU binary operations. |
| |
During the LLVM translation process, GPU binaries are translated using the
scheme provided by the Offloading attribute, turning the GPU binary into LLVM
instructions. Meanwhile, kernel launches are translated by searching for the
appropriate binary and invoking the procedure provided by its Offloading
attribute for translating kernel launches into LLVM instructions.
| |
| Example: |
| ``` |
| // Input: |
| // Binary with multiple objects but selecting the second one for embedding. |
| gpu.binary @binary <#gpu.select_object<#rocdl.target<chip = "gfx90a">>> [ |
| #gpu.object<#nvvm.target, "NVPTX">, |
| #gpu.object<#rocdl.target<chip = "gfx90a">, "AMDGPU"> |
| ] |
| llvm.func @foo() { |
| ... |
| // Launching a kernel inside the binary. |
| gpu.launch_func @binary::@func blocks in (%0, %0, %0) |
| threads in (%0, %0, %0) : i64 |
| dynamic_shared_memory_size %2 |
| args(%1 : i32, %1 : i32) |
| ... |
| } |
| // mlir-translate --mlir-to-llvmir: |
| @binary_bin_cst = internal constant [6 x i8] c"AMDGPU", align 8 |
@binary_func_kernel_name = private unnamed_addr constant [5 x i8] c"func\00", align 1
| ... |
| define void @foo() { |
| ... |
| %module = call ptr @mgpuModuleLoad(ptr @binary_bin_cst) |
| %kernel = call ptr @mgpuModuleGetFunction(ptr %module, ptr @binary_func_kernel_name) |
| call void @mgpuLaunchKernel(ptr %kernel, ...) ; Launch the kernel |
| ... |
| call void @mgpuModuleUnload(ptr %module) |
| ... |
| } |
| ... |
| ``` |
| |
| ### The binary operation |
From a semantic point of view, GPU binaries allow the implementation of many
concepts, from simple object files to fat binaries. By default, the binary
operation uses the `#gpu.select_object` offloading attribute; this attribute
embeds a single object in the binary as a global string. See the attribute docs
for more information.
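
For illustration, a minimal sketch (symbol names and object contents are
hypothetical; it assumes `#gpu.select_object` also accepts an object index,
with the first object as the default): the two forms below embed the same
object:

```
// Implicitly uses the default offloading attribute.
gpu.binary @kernels [#gpu.object<#nvvm.target, "...">]

// Explicitly selects the first object by index.
gpu.binary @kernels_explicit <#gpu.select_object<0>> [#gpu.object<#nvvm.target, "...">]
```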
| |
| ## Operations |
| |
| [include "Dialects/GPUOps.md"] |