Triton Learning One Kernel at a Time: Vector Addition

Machine Learning


a little optimization can be very helpful. Models like GPT4 Fee More than $100 million trainingit makes a 1% efficiency gain value Over a million dollars. A powerful way to optimize the efficiency of machine learning models is to create some of the components. Directly on the GPU. If you're like me, a quick reference to Cuda Kernels is enough to send chills to your spine.

Fortunately, Openai release Triton In 2021, new languages ​​and compilers will abstract much of the complexity of CUDA, allowing less experienced practitioners to write the core of their performance. Here is a notable example: I can't sleepPromises LLM Training Services 30x faster training and 60% less memory usagethank you for everything Replace layers written in Pytorch with Triton kernel.

In this tutorial series, you will learn the basics of GPU architecture and how to implement a high-performance Triton kernel! All code presented in this series is available at https://github.com/rpegoud/triton-kernels.

GPU Architecture Basics

In this section,nvidia) By the end of this article, you will need to start and write the first Triton kernel.

Starting with the smallest software unit, we can describe the hierarchy of execution units as follows:

  • thread:minimum Unit of workexecutes user-defined kernel code.
  • warp:minimum Scheduling Unitthey always consist of 32 parallel threads, each with their own instruction address counter and registration state. Warp thread Let's start together But that's true You can branch freely and Run independently.
  • Thread block:A group of warp that all threads can Cooperate through shared memory Synchronize barriers. Thread blocks must be able to execute Be independent In any order, parallel or sequentially. This independence enables thread blocks Scheduled in any order with any number of coresso GPU programs scale efficiently by the number of cores. For example, to synchronize memory accesses, threads in a block can be synchronized at a specific point in the kernel as needed.
  • Streaming Multiprocessor (SM): Responsible unit Run many warps in parallelowns shared memory and L1 cache (holds the latest global memory lines accessed by SM). SM has its own special Warp Scheduler It is distorted from a thread block that is ready to run.

On the hardware side, the smallest unit of work is CUDA Corephysical Arithmetic logic uniRun t(alu) Thread arithmetic operations (or a portion of it).

You can see this section by analogy CUDA Core As Individual workersa warp It's a Squadron of 32 workers The same instructions are given at once. They may or may not perform this task in the same way (branching), and they may potentially complete it at different times (independence). a Thread block It consists of Several teams sharing a common workspace (i.e., they share memories), workers from all teams in the workspace can wait for each other to have lunch at the same time. a Streaming Multiprocessor It's a Many teams work together to share tools and storage in the factory floor. lastly, GPU It's a The whole plantthere are many floors.

The hierarchy of Nvidia GPU architecture. Dotted rectangles represent memory blocks (made by the author)

Optimization Basics

When optimizing deep learning models, we are juggling with three main components:

  1. I'll calculate it: Time spent by GPU Computing Floating Point Operations (FLOPS).
  2. Memory: Time spent transferring tensors in the GPU.
  3. overhead: All other operations (Python interpreter, Pytorch Dispatch,…).

Keeping these components in mind will help you figure out the right way to resolve bottlenecks. For example, if most of the time is spent doing memory transfers, then increasing computing is useless. Ideally, most of the time should be spent computationally. More precisely, matrix multiplication optimizes the precise operating GPU.

This means minimizing the cost paid to move data from one of the CPU to the GPU (“).Data transfer cost”), from one node to another (”Network Cost”) or from cuda's global memory (drumcheap but slow) shared memory to Cuda (sramexpensive but fastest device memory). The latter is called Bandwidth Cost And for now it will be our main focus. Common strategies for reducing bandwidth costs include:

  1. Reuse Data loaded into shared memory in multiple steps. A major example of this is tiled matrix growth that we will cover in future posts.
  2. fusion Multiple operations on a single kernel (as starting all kernels means moving data from DRAM to SRAM). For example, you can merge matrix multiplication and activation functions. Generally, Operator integration It prevents many global memory reads/writes and provides two operators with the opportunity to converge, providing significant performance improvements.
Matrix multiplication is followed by relu activation without operator fusion. (Created by the author)

In this example, we perform matrix growth x@W Save the results in an intermediate variable a. Next, apply a relu In a Save the result in a variable y. This requires the GPU to read x and W Write the result in global memory aPlease read from a I'll finally write it again y. Instead, operator fusion performs matrix multiplication and applies relu to a single kernel, half the amount of reads and writes them to global memory.

Multiplication of fused matrix and activation of Relu. (Created by the author)

Triton

Here we write the first Triton kernel, a simple vector addition. First, let's explain how this operation is disassembled and executed on the GPU.

Consider summing the entries in two vectors X and YEach has seven elements (n_elements=7).

Tell the GPU to tackle this issue with 3 chunks of elements at once (BLOCK_SIZE=3). Therefore, to cover all seven elements of the input vector, the GPU launches three parallel “programs”, an independent instance of the kernel. pid:

  • Program 0 is assigned elements 0, 1, 2.
  • Program 1 is assigned elements 3, 4, 5.
  • Program 2 is assigned elements 6.

These programs then write the results back to the vector Z It is stored in global memory.

The important details are that the kernel does not receive the entire vector Xreceive a instead Pointer to the memory address of the first element, X[0]. To access the actual value of Xyou must manually load from global memory.

You can access the data for each block using the program ID. block_start = pid * BLOCK_SIZE. From there, computing can get the remaining element addresses of that block offsets = block_start + range(0, BLOCK_SIZE) Load them into memory.

However, only element 6 is assigned to program 2, but the offset is [6, 7, 8]. To avoid indexing errors, Triton can be defined to us mask Identify valid target elements here mask = offsets < n_elements.

Now you can safely load X and Y Add them together before writing the results back to the output variables Z In global memory in a similar way.

Block-by-block vector index. Slices of x, y, z are sent to independent thread blocks indexed with their own IDs. (Image by the author)

Let's take a closer look at the code. The Triton kernel is:

import triton
import triton.language as tl

@triton.jit
def add_kernel(
	x_ptr, # pointer to the first memory entry of x
	y_ptr, # pointer to the first memory entry of y
	output_ptr, # pointer to the first memory entry of the output
	n_elements, # dimension of x and y
	BLOCK_SIZE: tl.constexpr, # size of a single block
):
	# --- Compute offsets and mask ---
	pid = tl.program_id(axis=0) # block index
	block_start = pid * BLOCK_SIZE # start index for current block
	offsets = block_start + tl.arange(0, BLOCK_SIZE) # index range
	mask = offsets < n_elements # mask out-of-bound elements
	
	# --- Load variables from global memory ---
	x = tl.load(x_ptr + offsets, mask=mask)
	y = tl.load(y_ptr + offsets, mask=mask)

	# --- Operation ---
	output = x + y	
	
	# --- Save results to global memory ---
	tl.store(pointer=output_ptr + offsets, value=output, mask=mask)

Let's break down some of the Triton-specific syntax.

  • Firstly, the Triton kernel is always decorated @triton.jit.
  • Secondly, some arguments must be declared as static. This means that they are known at the time of calculation. This is necessary BLOCK_SIZE It is achieved by adding tl.constexpr Input annotations. Also, note that other variables are not appropriate Python variables, so you don't annotate other variables.
  • I'll use it tl.program_id To access the current block's ID, tl.arange The same goes for numpy's behavior np.arange.
  • Loading and saving variables is achieved by calling tl.load and tl.store Comes with an array of pointers. Please note that there are no return Statement, this role will be delegated tl.store.

To use the kernel, you need to write it now Pytorch level wrapper It provides a memory pointer and defines a Kernel Grid. In general, kernel grids are Number of thread blocks allocated to the kernel along each axis. In the previous example, we used a 1D grid of three thread blocks. grid = (3, ).

Default to handle different array sizes grid = (ceil(n_elements / BLOCK_SIZE), ).

def add(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
	"""PyTorch wrapper for `add_kernel`."""
	output = torch.zeros_like(x) # allocate memory for the output
	n_elements = output.numel()  # dimension of X and Y
	
	# cdiv = ceil div, computes the number of blocks to use
	grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
	# calling the kernel will automatically store `BLOCK_SIZE` in `meta`
	# and update `output`
	add_kernel[grid](X, Y, output, n_elements, BLOCK_SIZE=1024)
	
	return output

Below are two final notes about the wrapper:

You may have noticed that grid It is defined as a lambda function. This allows Triton to calculate and launch the number of thread blocks At startup. So calculate the grid size based on the block size stored in metaa dictionary of compile time constants exposed to the kernel.

The value of the kernel when calling output It will change internally so there is no need to reassign it output = add_kernel[…].
You can conclude this tutorial by ensuring that the kernel works properly.

x, y = torch.randn((2, 2048), device="cuda")

print(add(x, y))
>> tensor([ 1.8022, 0.6780, 2.8261, ..., 1.5445, 0.2563, -0.1846], device='cuda:0')

abs_difference = torch.abs((x + y) - add(x, y))
print(f"Max absolute difference: {torch.max(abs_difference)}")
>> Max absolute difference: 0.0

In this introduction, in the next post, we will learn to implement more interesting kernels such as multiplication of tiled matrixes and see how to integrate Triton kernels into Pytorch models. autograd.

Until next time! 👋

References and useful resources





Source link

Leave a Reply

Your email address will not be published. Required fields are marked *