Unlocking asynchronicity in continuous batching
This is the second post in a series on efficient large language model (LLM) inference. The first post covered continuous batching from first principles and introduced some concepts we build upon here: key-value (KV) cache, FlashAttention, attention masks, etc.
An H200 costs around $5 an hour on Inference Endpoints. That’s cheap for an hour, but run it for a full day and you are already paying $120. At that price, you want your GPU used to its fullest.
We have seen that Continuous Batching improves GPU utilization by scheduling tightly packed batches, so no compute is wasted on padding. But there is a second source of waste that continuous batching does not address: by default it is synchronous. This means the CPU and GPU take turns: while the GPU computes, the CPU waits. And while the CPU prepares the next batch, the GPU waits. In a loop running hundreds of steps per second, these idle gaps add up, and as we will show, they can account for nearly a quarter of total runtime. To ensure the GPU is busy computing 100% of the time, we need to get rid of those gaps.
Synchronous batching
This is how naive synchronous batching works:
- The CPU prepares a new batch by selecting which requests to include, updating the KV cache table, evicting finished requests, and admitting new ones. Once done, it transfers the prepared inputs to the GPU.
- The GPU runs its forward pass and samples (i.e., chooses) a new token for each request. The results come back to the CPU so it knows which token each request just produced, and the cycle repeats. A minimal sketch of this loop is shown below.
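To make the alternation concrete, here is a minimal, self-contained sketch of such a loop in PyTorch, with toy stand-ins (a linear layer instead of an LLM, argmax instead of real sampling, no scheduler). It illustrates the pattern only and is not the transformers implementation:

```python
import torch

# Toy stand-ins: a linear layer for the model, a random tensor for the batch.
model = torch.nn.Linear(4096, 4096, device="cuda")
batch_cpu = torch.randn(32, 4096, pin_memory=True)

with torch.no_grad():
    for step in range(8):
        # CPU: scheduling, KV-cache bookkeeping, evicting/admitting requests would go here
        batch_gpu = batch_cpu.to("cuda")       # H2D transfer
        logits = model(batch_gpu)              # GPU forward pass
        next_tokens = logits.argmax(dim=-1)    # stand-in for sampling
        next_tokens_cpu = next_tokens.cpu()    # D2H transfer: the CPU blocks here
        # CPU: update each request's state with next_tokens_cpu, build the next batch
```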
The crucial detail is what happens after the GPU finishes computing: it goes idle. The next batch cannot start until the CPU has gone through its update step: sampling the output tokens, updating request states, re-scheduling the batch. This is the core inefficiency of synchronous batching: the CPU and GPU take turns. While the GPU is computing, the CPU is idle. While the CPU is updating, the GPU is idle. At no point are both doing useful work at the same time. For a single forward pass this might seem like a small price to pay, but in a continuous batching loop running hundreds of steps per second, these idle gaps accumulate into real throughput loss.
To showcase this, we profiled the time spent on CPU and GPU when generating 8K tokens with a batch size of 32 using an 8B model. The timeline alternates between green (GPU active, CPU idle) and red (CPU active, GPU idle): the two never overlap. Total generation time is 300.6 seconds, with 24.0% of that spent with an idle GPU waiting for the CPU to finish. Nearly a quarter of all generation time is wasted, from the point of view of the GPU. This is the pessimistic way of viewing things.
The optimistic way is that generation time would drop from 300 to 228 seconds (a free 24% reduction!) if we could eliminate the CPU overhead entirely. This requires no new kernels and no model changes, just careful coordination of the hardware. Fundamentally, the idea is simple: we need to figure out how to run batch preparation for batch N+1 while batch N is computing. But this simple idea hides a few technical difficulties:
- How can we launch something on the GPU and get back control to the CPU?
- How can we make sure data is ready, for either CPU or GPU tasks, by the time each task is launched?
- How can we prepare batch N+1 if it is based on the predictions of batch N?
We answer these questions below and build asynchronous batching from scratch as part of continuous batching in the transformers library. Feel free to check the code and compare!
Creating concurrency
Our end goal is to have concurrent execution of CPU and GPU operations. We need a way to categorize our operations so we can let the machine know which tasks can run concurrently. We can achieve this using CUDA streams.
What is a CUDA stream?
To understand how CUDA orders its operations, we need to talk about CUDA streams. A stream is an ordered queue of GPU operations (kernel launches, memory copies, synchronization barriers): operations submitted to a stream execute in the order they were submitted. Every GPU operation is scheduled inside a stream. Operations within the same stream are sequential: the GPU will not start the next one until the previous one has completed. Operations in different streams are independent of each other and can run concurrently.
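In PyTorch, streams are exposed through torch.cuda.Stream and the torch.cuda.stream context manager. A small standalone illustration, unrelated to the batching code itself:

```python
import torch

# Two independent streams: operations within a stream run in submission order,
# but operations on different streams may overlap on the GPU.
stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

with torch.cuda.stream(stream_a):
    out_a = x @ x        # enqueued on stream_a
with torch.cuda.stream(stream_b):
    out_b = y @ y        # enqueued on stream_b, independent of stream_a

torch.cuda.synchronize() # wait for all streams before using the results
```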
Default and non-default streams
If you have never explicitly used CUDA streams in PyTorch, you might be surprised they exist at all. A typical PyTorch script never mentions them, and it does not feel like GPU operations are asynchronous: the CPU seems to wait for the GPU to finish before moving on. That feeling is accurate, and it comes from the default stream.
When you call a PyTorch operation without specifying a stream, it lands on the default stream. The default stream has one special property: it is synchronizing. If an operation is scheduled on the default stream, it waits for all other streams to be flushed, i.e., all work on the GPU has to be over before a single operation on the default stream can start. The reverse is also true: any operation, regardless of its stream, waits for the default stream to be flushed before it launches.
So if you transfer the result of a default-stream operation back to the CPU, even with a transfer that is supposed to be non-blocking for the CPU, the CPU will still block until all GPU operations have finished, because those operations were scheduled on the default stream. This effectively destroys any effort to build concurrency.
That’s why we need to use non-default streams. Enqueuing a kernel launch or a non-blocking memory copy returns control to the CPU immediately. The GPU will run the operation in the background, but the CPU does not wait. This answers our first question: to get back CPU control after launching GPU work, we use a non-default stream.
For the rest of this post, we will assume all memory transfers from one device to the other are non-blocking. We will therefore have to synchronize them ourselves.
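For illustration, this is roughly what a non-blocking transfer plus explicit synchronization looks like in PyTorch; the tensor shapes and names below are arbitrary:

```python
import torch

copy_stream = torch.cuda.Stream()

# Asynchronous H2D copies require pinned (page-locked) host memory.
host_batch = torch.randn(32, 4096, pin_memory=True)
device_batch = torch.empty(32, 4096, device="cuda")

with torch.cuda.stream(copy_stream):
    # Enqueue the copy; control returns to the CPU immediately.
    device_batch.copy_(host_batch, non_blocking=True)

# The CPU is free to do other work here (e.g. prepare the next batch)
# while the copy runs in the background.

copy_stream.synchronize()  # explicit synchronization before using device_batch
```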
Back to Continuous Batching
We established that no GPU operation should land on the default stream. But the question remains: if we are not using the default stream, what streams should we use? Let us go back to the synchronous batching figure:
- Transfer of inputs from CPU to GPU
- Compute on the GPU
- Transfer of outputs from the GPU to the CPU
This means we need three streams: one for compute, one for CPU-to-GPU transfers, and one for GPU-to-CPU transfers. The transfers are independent, so there is no reason to serialize them, and each one gets its own stream.
A note on nomenclature: when talking about CPUs and GPUs, the convention used throughout the CUDA documentation is to call the CPU the host and the GPU the device. We will use that convention from now on. CPU-to-GPU transfers are called host-to-device (H2D) transfers, and GPU-to-CPU transfers are called device-to-host (D2H) transfers. Hence, the three streams are the H2D stream, the compute stream, and the D2H stream.
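In PyTorch, creating these streams is straightforward; the variable names below are just our own convention:

```python
import torch

# One stream per kind of work, following the nomenclature above.
h2d_stream = torch.cuda.Stream()      # host-to-device input transfers
compute_stream = torch.cuda.Stream()  # forward pass and sampling
d2h_stream = torch.cuda.Stream()      # device-to-host output transfers
```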
Let us now try to use streams to asynchronously launch a batch on the GPU and get back CPU control. From the CPU, we do the following:
- Prepare the batch input data on the CPU (no stream, CPU-only operations)
- Transfer it to the GPU (using the H2D stream)
- Run compute on the GPU (using the compute stream)
- Retrieve the batch outputs (using the D2H stream)
- Take a look at the results (no stream)
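Here is a deliberately naive sketch of those five steps in PyTorch, again with toy stand-ins for the batch and the model. Note that it contains no synchronization between the streams; that is exactly the bug we are about to discuss:

```python
import torch

h2d_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
d2h_stream = torch.cuda.Stream()

# 1. Prepare inputs on the CPU (pinned memory so the H2D copy can be asynchronous)
inputs_cpu = torch.randn(32, 4096, pin_memory=True)
inputs_gpu = torch.empty(32, 4096, device="cuda")
weights = torch.randn(4096, 4096, device="cuda")     # stand-in for the model
outputs_cpu = torch.empty(32, 4096, pin_memory=True)

# 2. H2D transfer on its own stream
with torch.cuda.stream(h2d_stream):
    inputs_gpu.copy_(inputs_cpu, non_blocking=True)

# 3. Compute on the compute stream: nothing forces it to wait for the copy
with torch.cuda.stream(compute_stream):
    outputs_gpu = inputs_gpu @ weights

# 4. D2H transfer on its own stream: nothing forces it to wait for compute
with torch.cuda.stream(d2h_stream):
    outputs_cpu.copy_(outputs_gpu, non_blocking=True)

# 5. The CPU reaches this line almost instantly, and outputs_cpu is likely garbage.
print(outputs_cpu[0, :5])
```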
If we do this using only CUDA streams, the results are available almost instantly but they are incorrect. To understand why, let us look at what happened:
The GPU launched all three operations nearly simultaneously. The compute stream did not wait for the H2D transfer to complete, so the forward pass ran on whatever was already sitting in GPU memory. The D2H stream did not wait for compute to finish, so it transferred results that had not been computed yet. Step 5 returned instantly because nothing was blocking the CPU: there was no default stream to synchronize against.
The operations are all running correctly in isolation. The problem is that we never told the streams to wait for each other. We know that compute must start after H2D completes, and that D2H must start after compute completes, but we did not enforce that ordering. We need a mechanism to say “do not start this operation until that one is done” across stream boundaries.
Enforcing synchronization
To enforce synchronization between the streams, we are going to use CUDA events.
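In PyTorch, an event is recorded on one stream and waited on by another. As a sketch of where this is heading, here is the previous snippet with the missing ordering added (reusing its tensors and streams); this illustrates the mechanism, not the actual transformers code:

```python
h2d_done = torch.cuda.Event()
compute_done = torch.cuda.Event()

with torch.cuda.stream(h2d_stream):
    inputs_gpu.copy_(inputs_cpu, non_blocking=True)
    h2d_done.record()                    # mark "inputs are on the GPU"

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(h2d_done)  # do not start before the H2D copy is done
    outputs_gpu = inputs_gpu @ weights
    compute_done.record()                # mark "outputs are computed"

with torch.cuda.stream(d2h_stream):
    d2h_stream.wait_event(compute_done)  # do not start before compute is done
    outputs_cpu.copy_(outputs_gpu, non_blocking=True)

d2h_stream.synchronize()                 # CPU waits only when it actually needs the data
print(outputs_cpu[0, :5])                # now holds the actual result
```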
Key Takeaways
- Asynchronous batching can significantly improve GPU utilization by overlapping CPU work (preparing the next batch) with GPU work (computing the current one).
- CUDA streams let independent queues of GPU work (compute, host-to-device copies, device-to-host copies) run concurrently, an overlap that is impossible when everything lands on the default stream.
- To achieve asynchronous batching within continuous batching, we use CUDA events to enforce ordering between streams, so that each operation starts only once the work it depends on has completed.
Originally published at huggingface.co.




