Hey r/LocalLLaMA,
A production ML compiler stack is enormously complex: TVM (500K+ lines of C++), PyTorch’s stack of Dynamo, Inductor, and Triton, plus XLA, MLIR, Halide, and Mojo. This series aims to demystify the core concepts by building a small ML compiler from scratch in pure Python and raw CUDA.
After a month of work, the three-part series is complete:
- Part 1: Walks an RMSNorm layer through the upper half of the pipeline: Torch IR (captured as FX graph), Tensor IR (decomposing ops into Elementwise / Reduction / IndexMap), Loop IR (a kernel written as a loop nest fused with other kernels), Tile IR (scheduling onto the GPU), Kernel IR (materializing schedules into hardware primitives), and CUDA (emitting source ready for nvcc).
- Part 2: Explains how a loop nest becomes a GPU schedule. Sixteen mechanical Tile-IR passes split computations into blocks, map them to threads, stage inputs into shared memory, and so on.
- Part 3: Finishes with autotuning using SP-MCTS over the cross-product of rule parameters instead of heuristics.
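
To make the Part 1 decomposition a bit more concrete, here is a minimal pure-Python sketch of RMSNorm written twice: once as separate Elementwise and Reduction steps, and once as a single fused loop nest. This is not the series' actual IR. The function names, the list-based "tensors", the explicit `weight` argument, and `eps=1e-6` are all assumptions made for illustration.

```python
import math


def rmsnorm_decomposed(x, weight, eps=1e-6):
    """RMSNorm over one row as separate Elementwise / Reduction steps (illustrative)."""
    squares = [v * v for v in x]              # Elementwise: square each input
    mean_sq = sum(squares) / len(x)           # Reduction: mean over the row
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # Elementwise on a scalar: rsqrt
    return [v * inv_rms * w for v, w in zip(x, weight)]  # Elementwise: scale


def rmsnorm_loop_nest(rows, weight, eps=1e-6):
    """The same computation as one fused loop nest (illustrative)."""
    out = []
    for row in rows:          # outer loop: rows are independent
        acc = 0.0
        for v in row:         # reduction loop: sum of squares
            acc += v * v
        inv_rms = 1.0 / math.sqrt(acc / len(row) + eps)
        out.append([v * inv_rms * w for v, w in zip(row, weight)])
    return out


rows = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
weight = [1.0, 1.0, 1.0, 1.0]
fused = rmsnorm_loop_nest(rows, weight)
unfused = [rmsnorm_decomposed(row, weight) for row in rows]
```

Both versions compute the same values. The fused one avoids materializing the intermediate `squares` list, which is roughly the kind of saving a loop-level fusion is after.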
The entire pipeline is controlled via a single CLI:

    deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda
    deplodock run --bench -c "nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))"
    deplodock tune -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" -v

- Full model compilation with Qwen/Qwen2.5-7B
The three parts are self-contained enough that you can skip ahead if you are interested in just one layer:
- IR Hierarchy: From PyTorch to Emitted CUDA
- Tile IR: Scheduling Loops onto a GPU
- Autotuning: A Search Loop Over Tile-IR Rewrites
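
For a feel of what "scheduling loops onto a GPU" means, the sketch below simulates in plain Python the kind of split Part 2 describes: one block per row, threads striding across the row, and a shared buffer holding per-thread partial sums. It is an illustration under my own assumptions, not one of the sixteen Tile-IR passes. `rmsnorm_tiled`, `rmsnorm_reference`, `threads_per_block`, and `eps=1e-6` are hypothetical names and values.

```python
import math


def rmsnorm_reference(rows, weight, eps=1e-6):
    """Straightforward RMSNorm used only to check the tiled version."""
    out = []
    for row in rows:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms * w for v, w in zip(row, weight)])
    return out


def rmsnorm_tiled(rows, weight, threads_per_block=4, eps=1e-6):
    """Simulate one block per row with threads striding over the row (illustrative)."""
    # The tree reduction below assumes a power-of-two thread count.
    assert (threads_per_block & (threads_per_block - 1)) == 0
    n = len(weight)
    out = []
    for row in rows:                                # grid: one block per row
        shared = [0.0] * threads_per_block          # stand-in for shared memory
        for tid in range(threads_per_block):        # threads within the block
            for i in range(tid, n, threads_per_block):  # strided slice of the reduction
                shared[tid] += row[i] * row[i]
        stride = threads_per_block // 2
        while stride > 0:                           # tree reduction of partial sums
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            stride //= 2
        inv_rms = 1.0 / math.sqrt(shared[0] / n + eps)
        out.append([row[i] * inv_rms * weight[i] for i in range(n)])
    return out


rows = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [0.5] * 8]
weight = [1.0] * 8
tiled = rmsnorm_tiled(rows, weight, threads_per_block=4)
reference = rmsnorm_reference(rows, weight)
```

On a real GPU the two inner `tid` loops run in parallel with a barrier between reduction steps. Here they run sequentially, which gives the same result but obviously not the same performance.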
Key Takeaways
- The series builds a small ML compiler from scratch in pure Python and raw CUDA as a hands-on way to understand the core concepts.
- Autotuning uses SP-MCTS to search the cross-product of rule parameters instead of relying on heuristics.
- A single CLI lets you inspect each IR level, benchmark, and tune the compilation pipeline.
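
As a rough picture of what "a search over the cross-product of rule parameters" can look like, here is a small tree search where each tree level picks one parameter value. Note this is a plain UCT-style Monte Carlo tree search, not the SP-MCTS variant the series uses, and everything in it is assumed for illustration: the parameter names in `SPACE`, the synthetic `toy_cost` function (a real tuner would compile and time a kernel instead), and the exploration constant.

```python
import math
import random

# Hypothetical tuning knobs; the post does not list the real rule parameters.
SPACE = {
    "block_size": [64, 128, 256, 512],
    "rows_per_block": [1, 2, 4],
    "vector_width": [1, 2, 4],
}
NAMES = list(SPACE)


def toy_cost(cfg):
    """Synthetic stand-in for a measured kernel time (lower is better)."""
    return (abs(cfg["block_size"] - 256) / 256
            + abs(cfg["rows_per_block"] - 2) / 2
            + abs(cfg["vector_width"] - 4) / 4)


class Node:
    def __init__(self):
        self.visits = 0
        self.total = 0.0    # sum of rewards backed up through this node
        self.children = {}  # parameter value -> child Node


def ucb(parent, child, c):
    """Mean reward plus an exploration bonus for rarely visited children."""
    return child.total / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)


def search(iterations=200, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node()
    best_cfg, best_cost = None, float("inf")
    for _ in range(iterations):
        node, path, cfg = root, [root], {}
        for name in NAMES:  # one tree level per parameter
            untried = [v for v in SPACE[name] if v not in node.children]
            if untried:     # expansion: try an unseen value first
                value = rng.choice(untried)
                node.children[value] = Node()
            else:           # selection: child with the best UCB score
                value = max(SPACE[name], key=lambda v: ucb(node, node.children[v], c))
            node = node.children[value]
            path.append(node)
            cfg[name] = value
        cost = toy_cost(cfg)  # a real tuner would benchmark the kernel here
        if cost < best_cost:
            best_cfg, best_cost = dict(cfg), cost
        for visited in path:  # backpropagation: reward is the negated cost
            visited.visits += 1
            visited.total -= cost
    return best_cfg, best_cost


best_cfg, best_cost = search()
```

With only 36 configurations you could just enumerate them all. The tree search starts to pay off when the cross-product is too large to benchmark exhaustively.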
For more details, check out the repository.




