An overview of the modern LLM compiler stack: writing an interactive and hackable compiler

By AI Maestro May 19, 2026 1 min read
Hey r/LocalLLaMA,

A production ML compiler stack is enormously complex: TVM alone is 500K+ lines of C++, PyTorch brings its own layers such as Dynamo, Inductor, and Triton, and then there are XLA, MLIR, Halide, and Mojo. This article aims to demystify the core concepts by building a small ML compiler from scratch in pure Python and raw CUDA.

After a month of work, the three-part series is complete:

  • Part 1: Walks an RMSNorm layer through the upper half of the pipeline: Torch IR (captured as FX graph), Tensor IR (decomposing ops into Elementwise / Reduction / IndexMap), Loop IR (a kernel written as a loop nest fused with other kernels), Tile IR (scheduling onto the GPU), Kernel IR (materializing schedules into hardware primitives), and CUDA (emitting source ready for nvcc).
  • Part 2: Explains how a loop nest becomes a GPU schedule: sixteen mechanical Tile-IR passes split the computation into blocks, map it to threads, stage inputs into shared memory, and so on.
  • Part 3: Finishes with autotuning, where an SP-MCTS search over the cross-product of rule parameters takes the place of heuristics.
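
To make the first lowering steps concrete, here is a minimal pure-Python sketch of RMSNorm in three forms: the direct formula, a decomposition into Elementwise and Reduction ops in the spirit of the Tensor IR, and an explicit loop nest in the spirit of the Loop IR. The helper names (`elementwise`, `reduction`, `rmsnorm_loop_nest`) and the `eps` default are assumptions made for illustration, not the series' actual IR classes, and IndexMap is left out.

```python
import math

def rmsnorm_reference(x, weight, eps=1e-6):
    """Direct formula: y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

# Tensor-IR-style view: the layer expressed with two generic primitives.
def elementwise(fn, *tensors):
    """Apply fn position by position across equally sized tensors."""
    return [fn(*vals) for vals in zip(*tensors)]

def reduction(fn, tensor, init):
    """Fold a tensor down to a single value."""
    acc = init
    for v in tensor:
        acc = fn(acc, v)
    return acc

def rmsnorm_decomposed(x, weight, eps=1e-6):
    squares = elementwise(lambda v: v * v, x)              # Elementwise
    total = reduction(lambda a, b: a + b, squares, 0.0)    # Reduction
    scale = 1.0 / math.sqrt(total / len(x) + eps)
    return elementwise(lambda v, w: v * scale * w, x, weight)  # Elementwise

# Loop-IR-style view: the same computation as explicit loops, with the
# squaring fused into the reduction loop so no temporary list is built.
def rmsnorm_loop_nest(x, weight, eps=1e-6):
    n = len(x)
    acc = 0.0
    for i in range(n):
        acc += x[i] * x[i]
    scale = 1.0 / math.sqrt(acc / n + eps)
    out = [0.0] * n
    for i in range(n):
        out[i] = x[i] * scale * weight[i]
    return out
```

All three functions compute the same values. The point of the decomposed form is that a compiler only has to understand a handful of primitives, and the point of the loop form is that fusion becomes a visible rewrite on loops.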

The entire pipeline is controlled via a single CLI:

  1. deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda
  2. deplodock run --bench -c "nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))"
  3. deplodock tune -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" -v
  4. Full model compilation with Qwen/Qwen2.5-7B
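
As a rough picture of what a tuning search over rule parameters can look like, the sketch below runs a small tree search with UCT selection over a cross-product of three parameters. Everything in it is an assumption for illustration: the parameter names and values, the synthetic cost function standing in for a real benchmark, and the plain UCT formula (without the extra terms that single-player MCTS variants add). It does not reflect how deplodock tune is implemented.

```python
import math
import random

# Hypothetical search space: names and values are illustrative only.
SPACE = [
    ("block_size", [32, 64, 128, 256]),
    ("unroll", [1, 2, 4]),
    ("vector_width", [1, 2, 4]),
]

def synthetic_cost(cfg):
    """Stand-in for a benchmark; lower is better, minimum at (128, 2, 4)."""
    return (abs(math.log2(cfg["block_size"]) - 7)
            + abs(cfg["unroll"] - 2)
            + abs(cfg["vector_width"] - 4) / 2)

class Node:
    def __init__(self, depth):
        self.depth = depth        # number of parameters already fixed
        self.children = {}        # chosen value -> Node
        self.visits = 0
        self.total_reward = 0.0

def uct_pick(node, c=1.4):
    """Return an unvisited value if one exists, else the best UCT score."""
    _, values = SPACE[node.depth]
    best_value, best_score = None, -math.inf
    for v in values:
        child = node.children.get(v)
        if child is None or child.visits == 0:
            return v
        score = (child.total_reward / child.visits
                 + c * math.sqrt(math.log(node.visits) / child.visits))
        if score > best_score:
            best_value, best_score = v, score
    return best_value

def search(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(0)
    best_cfg, best_cost = None, math.inf
    for _ in range(iterations):
        node, path, choices = root, [root], []
        # Selection and expansion: walk down, adding at most one new node.
        while node.depth < len(SPACE):
            v = uct_pick(node)
            choices.append(v)
            is_new = v not in node.children
            node = node.children.setdefault(v, Node(node.depth + 1))
            path.append(node)
            if is_new:
                break
        # Rollout: fill the remaining parameters at random.
        for _, values in SPACE[len(choices):]:
            choices.append(rng.choice(values))
        cfg = {name: v for (name, _), v in zip(SPACE, choices)}
        cost = synthetic_cost(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
        # Backpropagation: cheaper configurations earn a higher reward.
        reward = 1.0 / (1.0 + cost)
        for visited in path:
            visited.visits += 1
            visited.total_reward += reward
    return best_cfg, best_cost
```

Calling `search()` returns the cheapest configuration seen and its cost. In a real tuner the call to `synthetic_cost` would be replaced by compiling and timing a kernel, which is why spending iterations on promising subtrees matters more than it does on this toy cost.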

The three parts are self-contained enough that you can skip ahead if you are interested in only one layer:

  • IR Hierarchy: From PyTorch to Emitted CUDA
  • Tile IR: Scheduling Loops onto a GPU
  • Autotuning: A Search Loop Over Tile-IR Rewrites
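
For readers heading straight to the Tile IR part, the following pure-Python simulation shows the kind of schedule such passes aim for on a sum-of-squares reduction: one block per row, a strided loop per thread, per-thread partials in a buffer standing in for shared memory, and a tree reduction at the end. The function names and the power-of-two thread count are assumptions for this sketch. It models the structure only, not the series' actual passes or any real parallelism.

```python
def sum_squares_tiled(x, num_threads=4):
    """Simulate one GPU block reducing sum(x_i^2) with num_threads threads."""
    if num_threads < 1 or num_threads & (num_threads - 1):
        raise ValueError("num_threads must be a power of two")
    n = len(x)
    shared = [0.0] * num_threads      # stands in for shared memory

    # Thread mapping: thread `tid` handles elements tid, tid + T, tid + 2T, ...
    for tid in range(num_threads):
        acc = 0.0                     # per-thread register accumulator
        for i in range(tid, n, num_threads):
            acc += x[i] * x[i]
        shared[tid] = acc

    # Tree reduction over the partials; on a GPU each step would be
    # followed by a barrier so all threads see the updated buffer.
    stride = num_threads // 2
    while stride > 0:
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
    return shared[0]

def launch(rows, num_threads=4):
    """One simulated block per row, mirroring a grid of len(rows) blocks."""
    return [sum_squares_tiled(row, num_threads) for row in rows]
```

For example, `launch([[1.0, 2.0], [3.0, 4.0]], num_threads=2)` reduces each row independently. The strided access pattern is chosen so that neighbouring threads touch neighbouring elements, which is the layout that coalesces well on real hardware.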

Key Takeaways

  • The series builds a small ML compiler from scratch in pure Python and raw CUDA, following one layer through every IR stage from Torch IR to emitted CUDA.
  • Autotuning searches the cross-product of rule parameters with SP-MCTS instead of relying on heuristics.
  • A single CLI lets you inspect each IR, benchmark compiled code, and run the tuner.

For more details, check out the repository.
