An overview of the modern LLM compiler stack: writing an interactive and hackable compiler

What it means for makers and artists:

Production ML compiler stacks are large and hard to read end to end. TVM alone is 500K+ lines of C++, PyTorch layers Dynamo, Inductor, and Triton on top of one another, and XLA, MLIR, Halide, and Mojo each bring their own design. This article aims to demystify the core concepts by building a small ML compiler from scratch in pure Python and raw CUDA.

After a month of work, the three-part series is complete:

Part 1: Walks an RMSNorm layer through the full IR hierarchy, with a focus on its upper half: Torch IR (captured as an FX graph), Tensor IR (ops decomposed into Elementwise / Reduction / IndexMap), Loop IR (each kernel written as a loop nest and fused with other kernels), Tile IR (scheduling onto the GPU), Kernel IR (schedules materialized into hardware primitives), and CUDA (emitted source ready for nvcc).
Part 2: Explains how a loop nest becomes a GPU schedule. Sixteen mechanical Tile-IR passes split computations into blocks, map them to threads, stage inputs into shared memory, and so on.
Part 3: Finishes with autotuning: instead of relying on heuristics, an SP-MCTS search runs over the cross-product of rule parameters.
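To make the Tensor IR and Loop IR steps from Part 1 more concrete, here is a minimal pure-Python sketch of RMSNorm written first as a Reduction followed by an Elementwise op, and then as one explicit loop nest. This is not code from the series: the function names are hypothetical, the `eps` default of 1e-6 is an assumption, and plain lists stand in for tensors.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """One row of RMSNorm as two steps: a Reduction, then an Elementwise op.

    Hypothetical helper for illustration; eps=1e-6 is an assumed default.
    """
    # Reduction: mean of squares over the row.
    mean_sq = sum(v * v for v in x) / len(x)
    # Elementwise: scale every element by 1/RMS and by its weight.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rmsnorm_fused_loops(rows, weight, eps=1e-6):
    """The same computation as an explicit loop nest over a batch of rows."""
    out = []
    for row in rows:                   # outer loop: one row per iteration
        acc = 0.0
        for v in row:                  # inner loop 1: the reduction
            acc += v * v
        inv_rms = 1.0 / math.sqrt(acc / len(row) + eps)
        out_row = [0.0] * len(row)
        for i in range(len(row)):      # inner loop 2: the elementwise scale
            out_row[i] = row[i] * inv_rms * weight[i]
        out.append(out_row)
    return out
```

Both functions compute the same values. The second form is closer to what a loop-level IR has to represent: loop bounds, an accumulator, and the dependency that forces the reduction to finish before the scaling loop can start.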

The entire pipeline is controlled via a single CLI:

deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda
deplodock run --bench -c "nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))"
deplodock tune -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" -v
Full model compilation with Qwen/Qwen2.5-7B

The three parts are self-contained enough that you can skip ahead if interested in one layer:

– IR Hierarchy, From PyTorch to Emitted CUDA
– Tile IR, Scheduling Loops onto a GPU
– Autotuning, A Search Loop Over Tile-IR Rewrites
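As a taste of what the Tile IR part deals with, the sketch below shows the basic loop split that underlies mapping work to blocks and threads. It is an illustration in plain Python, not one of the series' passes: the function names and the `block_size` parameter are hypothetical, and the two inner loops only simulate what a GPU would run in parallel.

```python
def elementwise_loop(x, scale):
    """Original loop: one iteration per element."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = x[i] * scale
    return out


def elementwise_split(x, scale, block_size=4):
    """The same loop after splitting i into (block, thread) indices."""
    n = len(x)
    num_blocks = (n + block_size - 1) // block_size  # ceil division
    out = [0.0] * n
    for block in range(num_blocks):       # would map to blockIdx.x on a GPU
        for thread in range(block_size):  # would map to threadIdx.x on a GPU
            i = block * block_size + thread
            if i < n:                     # guard for the ragged last block
                out[i] = x[i] * scale
    return out
```

The split does not change the result. It only changes how the iteration space is indexed, which is why a rewrite like this can be applied mechanically.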

Key Takeaways

The series builds a small ML compiler from scratch in pure Python and raw CUDA to explain the core concepts behind production stacks.
Autotuning uses SP-MCTS to search the cross-product of rule parameters rather than relying on heuristics.
A single CLI, deplodock, drives the whole pipeline: compile --ir inspects any IR level, run --bench benchmarks, and tune runs the autotuner.
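To show what "the cross-product of rule parameters" means as a search space, here is a small sketch that enumerates every combination and keeps the cheapest one. It is deliberately an exhaustive baseline, not SP-MCTS, and everything in it is an assumption made for illustration: the parameter names, their values, and the toy cost function, which stands in for a real benchmark on hardware.

```python
import itertools

# Hypothetical tunable parameters; not the series' actual rule set.
SEARCH_SPACE = {
    "block_size": [64, 128, 256, 512],
    "items_per_thread": [1, 2, 4],
    "use_shared_memory": [False, True],
}


def toy_cost(config, n=2048):
    """Made-up stand-in for a measured runtime; lower is better."""
    work_per_block = config["block_size"] * config["items_per_thread"]
    blocks = -(-n // work_per_block)        # ceil division
    padding = blocks * work_per_block - n   # idle lanes in the last block
    cost = blocks * 10 + padding * 0.1 + config["items_per_thread"] * 3
    if config["use_shared_memory"]:
        cost *= 0.8
    return cost


def exhaustive_search(space, cost_fn):
    """Evaluate every point in the cross-product and return the best one."""
    names = list(space)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        cost = cost_fn(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost
```

Even this tiny space has 4 × 3 × 2 = 24 configurations, and each added parameter multiplies the count. That growth is the usual motivation for a guided search that evaluates only part of the space.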

For more details, check out the repository.
