A hackable compiler to generate efficient fused GPU kernels for AI models [P]

The modern machine learning (ML) compiler stack is formidable. TVM runs to more than 500,000 lines of C++, and PyTorch layers Dynamo, Inductor, and Triton on top of one another. I built a hackable ML compiler from scratch and am documenting the development process.

Currently, on an RTX 5090 GPU, the generated FP32 kernels run a geomean 1.11× faster than PyTorch eager mode and 1.20× faster than torch.compile. The compiled kernels are comparable on reduction and sequence-processing operations from models such as TinyLlama-128 and Qwen2.5-7B at a sequence length of 128.
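For readers unfamiliar with the term, a geomean speedup is the geometric mean of the per-workload ratios of baseline time to compiled time. The sketch below shows the calculation. The function name and the timings are placeholders invented for illustration. They are not measurements from this project.

```python
import math

def geomean_speedup(baseline_ms, candidate_ms):
    """Geometric mean of per-workload speedups (baseline time / candidate time)."""
    if len(baseline_ms) != len(candidate_ms) or not baseline_ms:
        raise ValueError("need two equal-length, non-empty timing lists")
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    # Averaging the logs is the n-th root of the product, without overflow.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Placeholder timings in milliseconds, invented purely for illustration.
eager = [2.0, 8.0]
compiled = [1.0, 16.0]
speedup = geomean_speedup(eager, compiled)  # ratios are 2.0 and 0.5
```

With these placeholder numbers, one workload is twice as fast and the other twice as slow, so the two ratios cancel in the geometric mean, whereas an arithmetic mean of the ratios would report a misleading 1.25×.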

The first part of this series walked end to end through the upper layers of the compiler pipeline, showing how it handles operations like RMSNorm. This second installment closes the gap by explaining Tile IR, Kernel IR, and the associated lowering rules in depth.

For more details, see Part 1 and the full article at A Principled ML Compiler Stack in 5,000 Lines of Python.

The article takes a sequence of operations written in loop-nest form (Loop IR) and shows how it is transformed into a GPU schedule. For instance, a simplified version of the RMSNorm layer is as follows:

python
v0 = reciprocal(2048)
for a0 in 0..32: # free
    for a1 in 0..2048:
        v1 = multiply(in2, in2)
        acc0 <- add(acc0, v1)
    v2 = multiply(acc0, v0)
    v3 = add(v2, 1e-06)
    v4 = rsqrt(v3)
    for a2 in 0..2048: # free
        v5 = multiply(in3, v4)
        v6 = multiply(v5, in4)
        merged_n0[0, a0, a2] = v6
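To make the semantics of this loop nest concrete, here is a minimal pure-Python transcription. It assumes, as the generated kernel further down suggests, that in2 and in3 read the input row and in4 reads the weight vector. The function name and the list-of-rows layout are illustrative and not part of the compiler.

```python
import math

def rms_norm_loop_nest(x, weight, eps=1e-06):
    """Direct Python transcription of the loop nest; x is a list of rows."""
    hidden = len(weight)
    v0 = 1.0 / hidden                        # reciprocal(2048) in the listing
    out = []
    for row in x:                            # a0: independent ("free") rows
        acc0 = 0.0
        for a1 in range(hidden):             # accumulate the sum of squares
            acc0 += row[a1] * row[a1]
        v4 = 1.0 / math.sqrt(acc0 * v0 + eps)  # rsqrt(mean of squares + eps)
        # a2: scale each element by the row statistic and the weight
        out.append([row[a2] * v4 * weight[a2] for a2 in range(hidden)])
    return out

# Tiny two-element rows instead of 2048, to keep the example readable.
rows = [[3.0, 4.0], [1.0, 1.0]]
result = rms_norm_loop_nest(rows, weight=[1.0, 1.0], eps=0.0)
```

The second row has a mean square of exactly 1, so it passes through unchanged. The first row is divided by the square root of 12.5.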

Each stage of the compilation process can be inspected with a command-line tool. For example, to see how stage_inputs moves inputs into shared memory for an RMSNorm layer, you could run:

bash
deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --ir tile -vv \
| awk '/^>>> t:007/,/^<<< t:007/'
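The awk range pattern in that command keeps every line from the one starting with `>>> t:007` through the one starting with `<<< t:007`, inclusive. If you would rather post-process the dump in Python, the same filter might be sketched as below. The helper name and the sample lines are hypothetical. Only the two marker prefixes are taken from the command above.

```python
def extract_pass_dump(lines, tag):
    """Keep lines from '>>> <tag>' through '<<< <tag>', inclusive, like the awk range."""
    start, end = f">>> {tag}", f"<<< {tag}"
    inside = False
    kept = []
    for line in lines:
        if not inside and line.startswith(start):
            inside = True              # opening marker found
        if inside:
            kept.append(line)
            if line.startswith(end):
                inside = False         # closing marker ends the range
    return kept

# Hypothetical dump lines; the real tool's output will differ.
sample = [
    ">>> t:006 earlier pass",
    "ignored line",
    "<<< t:006",
    ">>> t:007 some pass",
    "body line",
    "<<< t:007",
    ">>> t:008 later pass",
]
section = extract_pass_dump(sample, "t:007")
```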

The resulting CUDA kernel for the RMSNorm layer looks like this:

c
extern "C" __global__ __launch_bounds__(256) void k_rms_norm_reduce(const float* x, const float* p_weight, float* rms_norm) {
  float v0 = 1.0f / 2048.0f;
  int a1 = blockIdx.x;
  int a0 = threadIdx.x;
  int lane = threadIdx.x & 31;
  int warp = threadIdx.x >> 5;
  float acc0 = 0.0f;

  __shared__ float x_smem[2048];
  for (int x_smem_flat = a0; x_smem_flat < 2048; x_smem_flat += 256) {
    float x_smem_v = x[a1 * 2048 + x_smem_flat];
    x_smem[x_smem_flat] = x_smem_v;
  }
  __syncthreads();

  for (int a2 = a0; a2 < 2048; a2 += 256) {
    float in2 = x_smem[a2];
    float v1 = in2 * in2;
    acc0 += v1;
  }

  float acc0_w = acc0;
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 16);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 8);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 4);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 2);
  acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 1);

  __shared__ float acc0_smem[8];
  if (lane == 0) {
    acc0_smem[warp] = acc0_w;
  }
  __syncthreads();

  for (int s = 4; s > 0; s >>= 1) {
    if (warp < s) {
      acc0_smem[warp] = acc0_smem[warp] + acc0_smem[warp + s];
    }
    __syncthreads();
  }

  float acc0_b = acc0_smem[0];
  float v2 = acc0_b * v0;
  float v3 = v2 + 1e-06f;
  float v4 = rsqrtf(v3);
  for (int a2 = a0; a2 < 2048; a2 += 256) {
    float in3 = x_smem[a2];
    float in4 = p_weight[a2];
    float v5 = in3 * v4;
    float v6 = v5 * in4;
    rms_norm[a1 * 2048 + a2] = v6;
  }
}
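The reduction in the middle of the kernel happens in three steps: a thread-strided loop that leaves one partial sum per thread, a butterfly exchange within each warp via `__shfl_xor_sync`, and a tree over the eight per-warp totals in shared memory. The following is a CPU-side emulation in plain Python, intended only to make the data movement easier to follow. The function name is made up, and the sequential loops stand in for threads that run in parallel on the GPU.

```python
def block_sum_of_squares(values, threads=256, warp_size=32):
    """Emulate the kernel's three-step sum of squares for one row."""
    # Step 1: thread t accumulates elements t, t + threads, t + 2*threads, ...
    acc = [sum(v * v for v in values[t::threads]) for t in range(threads)]

    # Step 2: butterfly inside each warp. XOR with an offset below warp_size
    # only flips low lane bits, so the partner is always in the same warp.
    offset = warp_size // 2
    while offset > 0:
        acc = [acc[t] + acc[t ^ offset] for t in range(threads)]
        offset //= 2
    # Every lane of a warp now holds that warp's total.

    # Step 3: lane 0 of each warp publishes its total, then a tree combines them.
    smem = [acc[w * warp_size] for w in range(threads // warp_size)]
    s = len(smem) // 2
    while s > 0:
        for w in range(s):
            smem[w] += smem[w + s]
        s //= 2
    return smem[0]

total = block_sum_of_squares([1.0] * 2048)
```

The emulation assumes the number of warps is a power of two, which holds for the 256-thread, 8-warp launch used by the kernel. Five shuffle rounds (16, 8, 4, 2, 1) and three tree rounds (4, 2, 1) are enough to fold 256 partial sums into one.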

Key Takeaways

The hackable compiler enables efficient generation of GPU kernels for AI models.
It mimics the optimization steps a CUDA engineer would perform when optimizing kernels, including tileification, chunking, and staging input buffers into shared memory.
On an RTX 5090 in FP32, this approach yields a geomean 1.11× speedup over eager mode and 1.20× over torch.compile.

This work has implications for both researchers and practitioners who want to optimize their ML models for specific hardware architectures.
