“`html

NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend

NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend

How This Makes a Change

NVIDIA AI researchers have recently released cuda-oxide, an experimental compiler designed to allow developers to write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels in safe Rust code. The project compiles Rust directly into PTX, the assembly-like intermediate representation used by NVIDIA GPUs, without necessitating any foreign language interactions or C/C++ glue code.

Writing GPU kernels today typically involves writing C++ and using CUDA’s programming model directly, or relying on Python-level abstractions like Triton that generate CUDA under the hood. The Rust ecosystem for GPU development has seen various projects attempting to bridge this gap, such as Rust-GPU targeting SPIR-V for Vulkan/graphics compute, rust-cuda providing a backend for NVVM IR, CubeCL using an embedded DSL with a JIT runtime capable of cross-compiling to CUDA or ROCm/WGPU, and std::offload utilizing LLVM’s implicit offloading path.

CUDA-oxide occupies a unique position in this landscape. Its design centers on bringing CUDA into Rust, allowing kernel authors to write GPU code naturally using safe Rust constructs like __global__ functions, rather than writing generic Rust functions that happen to run on a GPU. By contrast, rust-cuda focuses more on bringing Rust to NVIDIA GPUs: leveraging Rust’s features like async/await, parts of the standard library running on-device, and an abstracted programming model over CUDA concepts.

The NVlabs team notes its collaboration with rust-cuda maintainers and considers both projects complementary. The core of cuda-oxide is a custom rustc codegen backend, the layer responsible for generating machine code in Rust. Instead of emitting native CPU code, this backend runs a separate pipeline specifically tailored for device code:

Rust Source → rustc frontend → rustc_public (Stable MIR) → dialect-mir → mem2reg → dialect-llvm → LLVM IR (.ll) → PTX (.ptx)

The pipeline includes several key components:

Why rustc_public?: The raw internal MIR representation in Rust changes between nightly versions without stability guarantees. cuda-oxide uses rustc_public, also known as Stable MIR, which is Rust’s official versioned and stable API over the compiler’s internals. This allows the backend to read MIR without breaking on each nightly update.
What is Pliron?: The middle stages use Pliron, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of upstream MLIR means the entire compiler can be built with cargo, no C++ toolchain, no CMake, and no tablegen. cuda-oxide defines three custom Pliron dialects: dialect-mir, dialect-llvm, and dialect-nvvm.
What does llc do?: After the dialect-llvm printer serializes the IR into a textual .ll file, an external tool like llc (LLVM’s static compiler with NVPTX backend) compiles it to PTX assembly. This is the final stage outside pure Rust. The resulting .ptx file is written next to the host binary, for example, target/debug/vecadd.ptx, and loaded by the CUDA driver at runtime.

You as a developer can observe each stage with:

cargo oxide pipeline vecadd

This command prints the full trace from Rust MIR through each dialect down to PTX output, allowing for debugging and inspection.

The Compilation Pipeline

At the heart of cuda-oxide is a custom rustc codegen backend, the layer in Rust responsible for generating machine code. Instead of emitting native CPU code, this backend runs a distinct pipeline specifically designed for device code:

Rust Source → rustc frontend → rustc_public (Stable MIR) → dialect-mir → mem2reg → dialect-llvm → LLVM IR (.ll) → PTX (.ptx)

The pipeline includes several critical elements:

Why rustc_public?: The raw internal MIR representation in Rust changes between nightly versions without stability guarantees. cuda-oxide uses rustc_public, also known as Stable MIR, which is Rust’s official versioned and stable API over the compiler’s internals. This allows the backend to read MIR without breaking on each nightly update.
What is Pliron?: The middle stages use a Pliron, a Rust-native MLIR-like IR framework written entirely in Rust. Choosing Pliron instead of the upstream MLIR means the entire compiler can be built with cargo, no C++ toolchain, no CMake, and no tablegen. cuda-oxide defines three custom Pliron dialects: dialect-mir, dialect-llvm, and dialect-nvvm.
What does llc do?: After the dialect-llvm printer serializes the IR into a textual .ll file, an external tool like llc (LLVM’s static compiler with NVPTX backend) compiles it to PTX assembly. This is the final stage outside pure Rust. The resulting .ptx file is written next to the host binary, for example, target/debug/vecadd.ptx, and loaded by the CUDA driver at runtime.

You as a developer can observe each stage with:

cargo oxide pipeline vecadd

This command prints the full trace from Rust MIR through each dialect down to PTX output, allowing for debugging and inspection.

Single-Source Compilation and the Host/Device Split

The host and device code live in the same .rs source file. cargo oxide sets -Z codegen-backend=librustc_codegen_cuda.so, which routes code generation through cuda-oxide’s backend. The backend then scans compiled code for monomorphized functions whose names carry the reserved cuda_oxide_kernel_<hash> prefix, the namespace that the #[kernel] proc macro creates. Functions matching this prefix go through the cuda-oxide pipeline to produce PTX; all other host code is delegated to rustc’s standard LLVM backend.

cargo oxide run vecadd
cargo oxide debug vecadd --tui    # Debug with cuda-gdb

The device code from library dependencies is compiled lazily: the backend reads their Stable MIR from .rlib metadata on demand, only compiling functions a kernel actually calls.

What You Can Write in a Kernel

CUDA-oxide supports a meaningful subset of Rust in GPU kernel functions, marked with the #[kernel] attribute macro. This includes:

Generic functions with monomorphization: e.g., fn scale<T: Copy>(...), which is compiled to a concrete PTX kernel per type used at the call site.
Closures with captures: closures passed from the host are scalarized and passed as PTX kernel parameters automatically.
User-defined structs and enums: standard Rust data structures work inside kernels.
Pattern matching: constructs like match, if let, and related ones work in device code.
Full GPU intrinsics: the cuda-device crate provides wrappers for thread indexing, warp operations (e.g., shfl_sync, ballot_sync), shared memory, barriers, TMA (Tensor Memory Accelerator), Thread Block Clusters, and scoped atomics (6 types × 3 scopes × 5 orderings).

An important GPU-specific compiler detail: rustc’s JumpThreading MIR optimization, which duplicates function calls into both branches of an if-statement, is disabled for device code in cuda-oxide. On CPUs this is a safe optimization, but on GPUs it breaks barrier semantics: all threads in a block must converge at the same bar.sync instruction, and duplicating it across branches violates that requirement. Additionally, sync primitives are marked convergent in the emitted LLVM IR so that LLVM’s optimization passes cannot move or duplicate them across control flow.

How to Use NVIDIA Star Elastic

To use cuda-oxide, follow these steps:

Prerequisites: Verify your system meets the specific version requirements for each dependency. Currently, cuda-oxide is Linux-only, tested on Ubuntu 24.
Source Read original →
Related reading
The SignalThe Signal: Edition 02Read this edition →Every Friday: the one AI story that actually mattered, plus the tools worth your time.
AI Maestro is an independent British AI publication. We test what we recommend, and we write it the way we would say it. More about us

NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX

NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend

How This Makes a Change

The Compilation Pipeline

Single-Source Compilation and the Host/Device Split

What You Can Write in a Kernel

How to Use NVIDIA Star Elastic

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

Kimi K3 vs DeepSeek…

Fine-Tuning Qwen3 with LoRA…

Dave Eggers told OpenAI…

How This Makes a Change

The Compilation Pipeline

Single-Source Compilation and the Host/Device Split

What You Can Write in a Kernel

How to Use NVIDIA Star Elastic

Related articles

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

Kimi K3 vs DeepSeek…

Fine-Tuning Qwen3 with LoRA…

Dave Eggers told OpenAI…