GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume 43.6% of the forward pass and 32% of end-to-end training time. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for up to 47% of total execution time. Researchers from UC Berkeley's UCCL project have released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.
The Problem: Host-Driven Communication
The standard model for multi-GPU communication is host-driven: the CPU runs the control path and calls into a library like NCCL or NVSHMEM. The library issues the collective operation (an AllReduce, an AllGather, and so on) across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.
The research team identifies two problems with this approach:
(1) CPUs are not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core performance, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead (a cudaLaunchKernel call, a CPU-side "all writes done" check, an inter-stream event) shows up directly as pipeline bubbles.
(2) Host-driven systems overlap compute and communication at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.
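To put the first point in numbers, the sketch below converts a host-side stall into peak FLOPs left unused, using only the rack figures quoted above. The 5-microsecond stall is an assumed value chosen for illustration, and the helper name `flops_forgone` is hypothetical; real launch and synchronization latencies vary by system.

```python
# Rack-level figures quoted in the article (GB300 NVL72, FP8 peak).
RACK_FP8_FLOPS = 720e15   # 720 PFLOP/s across the rack
GPUS_PER_RACK = 72


def flops_forgone(stall_seconds: float, gpus: int = 1) -> float:
    """Peak FP8 FLOPs left unused while `gpus` GPUs sit idle for `stall_seconds`."""
    per_gpu_flops = RACK_FP8_FLOPS / GPUS_PER_RACK  # 10 PFLOP/s per GPU
    return per_gpu_flops * stall_seconds * gpus


# Assumption for illustration only: a 5-microsecond host-side stall.
ASSUMED_STALL_S = 5e-6
one_gpu = flops_forgone(ASSUMED_STALL_S)
whole_rack = flops_forgone(ASSUMED_STALL_S, GPUS_PER_RACK)
print(f"one GPU: {one_gpu:.2e} FLOPs, whole rack: {whole_rack:.2e} FLOPs")
```

Under that assumption, each stall costs a single GPU about 5 × 10^10 peak FP8 operations, which is why overheads that were once negligible now appear as bubbles.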
The alternative is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the same kernel as the compute. Most existing fused kernel libraries operate within a single node, or a single GPU. mKernel targets the multi-node case.
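The difference between kernel-boundary overlap and tile-level overlap can be illustrated with a toy timing model. It assumes a producer/consumer dependency (a tile must be computed before it can be sent), uniform per-tile costs, and no contention between compute and communication. The function names and cost values are hypothetical, and this is not a model of mKernel's actual scheduling.

```python
def serialized_time(n_tiles: int, compute: float, comm: float) -> float:
    """Dependent compute and communication run as two back-to-back kernels."""
    return n_tiles * compute + n_tiles * comm


def tile_pipelined_time(n_tiles: int, compute: float, comm: float) -> float:
    """Each tile is sent as soon as it is computed and the link is free."""
    compute_done = 0.0  # time at which the latest tile finished computing
    comm_done = 0.0     # time at which the latest tile finished sending
    for _ in range(n_tiles):
        compute_done += compute
        # A send starts once its tile exists and the previous send has ended.
        comm_done = max(comm_done, compute_done) + comm
    return comm_done


# Assumed costs: 8 tiles, 1 time unit each for compute and communication.
print(serialized_time(8, 1.0, 1.0), tile_pipelined_time(8, 1.0, 1.0))
```

With eight tiles and equal costs, the serialized schedule takes 16 time units while the tile-level pipeline takes 9, because only the first compute step and the last send are left exposed.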
What mKernel Does
mKernel is a library of persistent CUDA kernels. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.
Multi-GPU + multi-node, in one kernel: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.
Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.
Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The number of SMs dedicated to each role is tunable per shape.
GPU-driven networking built on libibverbs: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and support heterogeneous networking devices.
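The role self-assignment described above can be sketched on the host side as a lookup from block index to role. This is a minimal illustration, not mKernel's code: the contiguous-range layout, the function name `make_role_lookup`, and the 112/8/6/6 split are assumptions. Only the four role names come from the project description, and the total of 132 assumes the SM count of a Hopper SXM GPU.

```python
from bisect import bisect_right
from itertools import accumulate

# Role names as listed in the project description.
ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def make_role_lookup(sm_counts):
    """Return a function mapping a CTA (block) index to its role.

    `sm_counts` maps each role to the number of CTAs dedicated to it.
    Roles occupy contiguous index ranges in the order given by ROLES
    (an assumed layout, chosen for simplicity).
    """
    bounds = list(accumulate(sm_counts[role] for role in ROLES))

    def role_of(block_idx):
        if not 0 <= block_idx < bounds[-1]:
            raise ValueError(f"block index {block_idx} out of range")
        # The first cumulative bound strictly above block_idx picks the role.
        return ROLES[bisect_right(bounds, block_idx)]

    return role_of


# Hypothetical per-shape split over 132 CTAs (one per SM, assumed).
role_of = make_role_lookup(
    {"compute": 112, "intra-comm": 8, "inter-send": 6, "inter-reduce": 6}
)
print(role_of(0), role_of(112), role_of(131))
```

In a persistent kernel the equivalent branch would run once per CTA at startup, after which each CTA stays in its role loop for the lifetime of the kernel. Tuning per shape then amounts to changing the counts.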
The Five Fused Kernels
| Kernel | What it fuses | Description |
|---|---|---|
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of A. While ranks gather peers' shards over NVLink/RDMA, the local GEMM consumes tiles as soon as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces partial outputs across all ranks in one launch. Output tiles are pushed into the reduction tree the instant they're produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they land, with no staging buffer round-trip. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk. Compute and the ring send/recv run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |
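The Ring Attention row relies on the fact that attention can be accumulated one KV chunk at a time, in any order, using the online-softmax rescaling that FlashAttention is built on. The sketch below shows that property for a single query vector in plain Python. The function names and the tiny data set are made up for illustration, and the concurrency between the ring send/recv and compute is not modeled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax over all keys at once, then a weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def ring_attention(q, kv_chunks, rank):
    """Consume KV chunks in the order `rank` would see them on a ring."""
    world = len(kv_chunks)
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(kv_chunks[0][1][0])
    for step in range(world):
        keys, values = kv_chunks[(rank - step) % world]  # chunk held at this step
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 before the first chunk
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, values))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up data: one query and three ranks, each owning one (keys, values) chunk.
q = [0.5, -1.0]
kv_chunks = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[2.0, 1.0]], [[5.0, 6.0]]),
    ([[-1.0, 0.5], [0.5, 0.5]], [[7.0, 8.0], [9.0, 10.0]]),
]
print(ring_attention(q, kv_chunks, rank=1))
```

Because every rank reaches the same result regardless of where it starts on the ring, the next chunk can be in flight while the current one is being consumed.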
Evaluation Setup
The research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in their inter-node fabric:
| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
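A quick calculation from the NIC column shows that both testbeds have the same nominal aggregate line rate per node, so the comparison mainly isolates the transport. The helper names below are hypothetical, and the figures are nominal link speeds from the table, not measured throughput.

```python
GPUS_PER_NODE = 8


def node_line_rate_gbps(nic_count: int, gbps_per_nic: int) -> int:
    """Nominal aggregate inter-node line rate of one node, in Gb/s."""
    return nic_count * gbps_per_nic


def per_gpu_gigabytes_per_s(node_gbps: int, gpus: int = GPUS_PER_NODE) -> float:
    """Even share of the node's line rate per GPU, converted to GB/s."""
    return node_gbps / gpus / 8  # 8 bits per byte


efa = node_line_rate_gbps(16, 200)  # AWS EFA testbed
cx7 = node_line_rate_gbps(8, 400)   # ConnectX-7 testbed
print(efa, cx7, per_gpu_gigabytes_per_s(efa))
```

Both configurations come to 3,200 Gb/s per node, or a nominal 50 GB/s per GPU if the links are shared evenly.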
mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scale is still in progress.
Backends and Requirements
mKernel supports two networking backends:
| Backend | Macro | Transport | Where it runs |
|---|---|---|---|
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |
Both backends share the same host-side API and the same on-GPU kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA).
Requirements: NVIDIA Hopper GPUs (the default build targets sm_90a), CUDA 12.9, and Python with PyTorch. The CX7 backend requires the libibverbs development headers and libraries. The EFA backend requires an AWS EFA installation with libfabric, libibverbs, efadv, and the EFA headers, located under EFA_HOME=/opt/amazon/efa by default.
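The backend choices above can be collected into a small lookup, for example in a wrapper script around a build. The sketch is purely illustrative: `backend_settings` is a hypothetical helper and not part of mKernel, and it records only the macro names, header names, and default path stated above, without assuming anything about the project's actual build commands.

```python
# Values copied from the backend table and requirements above.
BACKENDS = {
    "cx7": {"macro": "-DINTERNODE_BACKEND_IBVERBS", "session_header": "session.h"},
    "efa": {"macro": "-DINTERNODE_BACKEND_EFA", "session_header": "session_efa.h"},
}


def backend_settings(name: str, efa_home: str = "/opt/amazon/efa") -> dict:
    """Return the compile macro and session header for a backend (hypothetical helper)."""
    key = name.lower()
    if key not in BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {sorted(BACKENDS)}")
    settings = dict(BACKENDS[key])  # copy so callers cannot mutate the table
    if key == "efa":
        settings["EFA_HOME"] = efa_home  # default install prefix noted above
    return settings


print(backend_settings("EFA"))
```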