“`html
NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend
The Compilation Pipeline |
Single-Source Compilation and the Host/Device Split |
What You Can Write in a Kernel
How This Makes a Change
NVIDIA AI researchers have recently released cuda-oxide, an experimental compiler designed to allow developers to write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels in safe Rust code. The project compiles Rust directly into PTX — the assembly-like intermediate representation used by NVIDIA GPUs — without necessitating any foreign language interactions or C/C++ glue code.
Writing GPU kernels today typically involves writing C++ and using CUDA’s programming model directly, or relying on Python-level abstractions like Triton that generate CUDA under the hood. The Rust ecosystem for GPU development has seen various projects attempting to bridge this gap — such as Rust-GPU targeting SPIR-V for Vulkan/graphics compute, rust-cuda providing a backend for NVVM IR, CubeCL using an embedded DSL with a JIT runtime capable of cross-compiling to CUDA or ROCm/WGPU, and std::offload utilizing LLVM’s implicit offloading path.
CUDA-oxide occupies a unique position in this landscape. Its design centers on bringing CUDA into Rust — allowing kernel authors to write GPU code naturally using safe Rust constructs like __global__ functions, rather than writing generic Rust functions that happen to run on a GPU. By contrast, rust-cuda focuses more on bringing Rust to NVIDIA GPUs: leveraging Rust’s features like async/await, parts of the standard library running on-device, and an abstracted programming model over CUDA concepts.
The NVlabs team notes its collaboration with rust-cuda maintainers and considers both projects complementary. The core of cuda-oxide is a custom rustc codegen backend — the layer responsible for generating machine code in Rust. Instead of emitting native CPU code, this backend runs a separate pipeline specifically tailored for device code:
- Rust Source → rustc frontend →
rustc_public(Stable MIR) →dialect-mir→mem2reg→dialect-llvm→ LLVM IR (.ll) → PTX (.ptx)
The pipeline includes several key components:
- Why
rustc_public?: The raw internal MIR representation in Rust changes between nightly versions without stability guarantees. cuda-oxide usesrustc_public, also known as Stable MIR, which is Rust’s official versioned and stable API over the compiler’s internals. This allows the backend to read MIR without breaking on each nightly update. - What is Pliron?: The middle stages use Pliron, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of upstream MLIR means the entire compiler can be built with
cargo— no C++ toolchain, no CMake, and no tablegen. cuda-oxide defines three custom Pliron dialects:dialect-mir,dialect-llvm, anddialect-nvvm. - What does
llcdo?: After thedialect-llvmprinter serializes the IR into a textual.llfile, an external tool likellc(LLVM’s static compiler with NVPTX backend) compiles it to PTX assembly. This is the final stage outside pure Rust. The resulting.ptxfile is written next to the host binary — for example,target/debug/vecadd.ptx— and loaded by the CUDA driver at runtime.
You as a developer can observe each stage with:
cargo oxide pipeline vecadd
This command prints the full trace from Rust MIR through each dialect down to PTX output, allowing for debugging and inspection.
The Compilation Pipeline
At the heart of cuda-oxide is a custom rustc codegen backend — the layer in Rust responsible for generating machine code. Instead of emitting native CPU code, this backend runs a distinct pipeline specifically designed for device code:
- Rust Source → rustc frontend →
rustc_public(Stable MIR) →dialect-mir→mem2reg→dialect-llvm→ LLVM IR (.ll) → PTX (.ptx)
The pipeline includes several critical elements:
- Why
rustc_public?: The raw internal MIR representation in Rust changes between nightly versions without stability guarantees. cuda-oxide usesrustc_public, also known as Stable MIR, which is Rust’s official versioned and stable API over the compiler’s internals. This allows the backend to read MIR without breaking on each nightly update. - What is Pliron?: The middle stages use a Pliron, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of the upstream MLIR means the entire compiler can be built with
cargo— no C++ toolchain, no CMake, and no tablegen. cuda-oxide defines three custom Pliron dialects:dialect-mir,dialect-llvm, anddialect-nvvm. - What does
llcdo?: After thedialect-llvmprinter serializes the IR into a textual.llfile, an external tool likellc(LLVM’s static compiler with NVPTX backend) compiles it to PTX assembly. This is the final stage outside pure Rust. The resulting.ptxfile is written next to the host binary — for example,target/debug/vecadd.ptx— and loaded by the CUDA driver at runtime.
You as a developer can observe each stage with:
cargo oxide pipeline vecadd
This command prints the full trace from Rust MIR through each dialect down to PTX output, allowing for debugging and inspection.
Single-Source Compilation and the Host/Device Split
The host and device code live in the same .rs source file. cargo oxide sets -Z codegen-backend=librustc_codegen_cuda.so, which routes code generation through cuda-oxide’s backend. The backend then scans compiled code for monomorphized functions whose names carry the reserved cuda_oxide_kernel_<hash> prefix — the namespace that the #[kernel] proc macro creates. Functions matching this prefix go through the cuda-oxide pipeline to produce PTX; all other host code is delegated to rustc’s standard LLVM backend.
cargo oxide run vecadd
cargo oxide debug vecadd --tui # Debug with cuda-gdb
The device code from library dependencies is compiled lazily: the backend reads their Stable MIR from .rlib metadata on demand, only compiling functions a kernel actually calls.
What You Can Write in a Kernel
CUDA-oxide supports a meaningful subset of Rust in GPU kernel functions, marked with the #[kernel] attribute macro. This includes:
- Generic functions with monomorphization: e.g.,
fn scale<T: Copy>(...), which is compiled to a concrete PTX kernel per type used at the call site. - Closures with captures: closures passed from the host are scalarized and passed as PTX kernel parameters automatically.
- User-defined structs and enums: standard Rust data structures work inside kernels.
- Pattern matching: constructs like
match,if let, and related ones work in device code. - Full GPU intrinsics: the
cuda-devicecrate provides wrappers for thread indexing, warp operations (e.g.,shfl_sync,ballot_sync), shared memory, barriers, TMA (Tensor Memory Accelerator), Thread Block Clusters, and scoped atomics (6 types × 3 scopes × 5 orderings).
An important GPU-specific compiler detail: rustc’s JumpThreading MIR optimization — which duplicates function calls into both branches of an if-statement — is disabled for device code in cuda-oxide. On CPUs this is a safe optimization, but on GPUs it breaks barrier semantics: all threads in a block must converge at the same bar.sync instruction, and duplicating it across branches violates that requirement. Additionally, sync primitives are marked convergent in the emitted LLVM IR so that LLVM’s optimization passes cannot move or duplicate them across control flow.
How to Use NVIDIA Star Elastic
To use cuda-oxide, follow these steps:
- Prerequisites: Verify your system meets the specific version requirements for each dependency. Currently, cuda-oxide is Linux-only, tested on Ubuntu 24.Source Read original →
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




