Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

By AI Maestro June 11, 2026 6 min read

In the opening instalment of this series, we dissected torch.add(torch.matmul(x, w), b) to teach you how to interpret PyTorch profiler traces. We also covered essential concepts including the CPU dispatch chain, launch overhead, the distinction between overhead-bound and compute-bound regimes, and the internal mechanics of torch.compile.

Now we take the next step. We swap the manual matrix multiplication and addition pair for an nn.Linear layer (with bias=True), a fundamental building block of most deep learning models. We then stack three of these layers, separated by an activation function, to construct a Multilayer Perceptron (MLP) block.

The code for this article is available here: 02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py. As before, keep the files open in a separate tab to follow along.

We use an NVIDIA A100-SXM4-80GB GPU for these experiments. Setting up a GPU on the Hugging Face infrastructure is straightforward, allowing you to experiment with Dev Mode Spaces or run the scripts via the Hugging Face Jobs pipeline.

Before diving in, let us recap two recurring concepts:

  • A GPU kernel is a program that executes in parallel across thousands of GPU threads.
  • The CPU schedules and launches these kernels. The majority of PyTorch overhead visible in a profiler trace stems from this scheduling work.

Transitioning from matmul-add to Linear

The nn.Linear module wraps the same matrix multiplication and addition operations profiled in Part 1. The distinction is that it owns its weight and bias as parameters and exposes a forward method familiar to PyTorch users.

# bias=True makes the layer perform the same multiplication and addition we saw in Part 1 of the series
linear_layer = nn.Linear(in_dim, out_dim, bias=True)
y = linear_layer(x)

The operation can be expressed as:

y = x @ w.T + b

Where x is the input, w is the weight, and b is the bias. Let us run 02_linear.py and examine the profile.

uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64

uvx trace-util traces -b traces

trace-util is a utility that syncs your traces to a Hugging Face bucket and prints Perfetto URLs in your terminal.

Figure 1 displays the profiler trace of a forward pass through the linear layer. We trace the forward call with a schedule setup similar to the previous traces, using wait=1, warmup=1, and active=3. This results in three Profile Steps appearing in both the CPU and GPU lanes.
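To see why wait=1, warmup=1, and active=3 yield three recorded steps, it can help to query the schedule directly. The sketch below only evaluates torch.profiler.schedule for the first five step indices; it does not run the profiler, and it is not taken from 02_linear.py.

```python
from torch.profiler import schedule

# Same values as quoted in the text.
sched = schedule(wait=1, warmup=1, active=3)

# A schedule is a callable that maps a step index to a ProfilerAction.
actions = [sched(step).name for step in range(5)]
for step, name in enumerate(actions):
    print(f"step {step}: {name}")
```

Step 0 is skipped, step 1 warms the profiler up, and steps 2 to 4 are recorded, with the last one also triggering the trace export.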

What is the transpose doing?

Zooming into the profiler trace as shown in Figure 2, we observe an aten::t (transpose) operation preceding the aten::addmm (multiplication and addition) operation. nn.Linear stores its weight with shape (out_dim, in_dim), so it transposes the weight before multiplying it with the input, and that transpose shows up as aten::t.

A crucial detail is that aten::t does not copy or reorganise data; it merely rewrites tensor metadata (shape and stride) on the CPU to represent the transposed matrix. It does not launch a kernel on the GPU. You can verify this by inspecting the GPU lane in the trace or checking the aten::t row in the profiler table for the time consumed on CUDA.
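Outside the profiler, the same claim can be checked from Python by comparing storage pointers. This is a minimal sketch on a CPU tensor with the dimensions used in this article; the variable names are illustrative.

```python
import torch
from torch import nn

layer = nn.Linear(32, 64, bias=True)
w = layer.weight   # stored as (out_dim, in_dim) = (64, 32)
wt = w.t()         # dispatches aten::t

# If no data was copied, both tensors start at the same memory address.
same_storage = wt.data_ptr() == w.data_ptr()

print(tuple(w.shape), w.stride())
print(tuple(wt.shape), wt.stride())
print("shares storage:", same_storage)
```

Only the shape and stride differ between the two tensors; the pointer is unchanged, which is consistent with no kernel being launched for the transpose.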

Why are there no separate mul and add kernels?

There is no aten::add (the bias addition) in the dispatch chain of the linear layer, as seen in Figure 3. This is because the bias addition has been folded into the matrix multiplication kernel, a technique known as an epilogue.

An epilogue is a small computation a GEMM (GEneral Matrix Multiply) kernel performs at the very end, just before writing results back to HBM (High Bandwidth Memory, the GPU’s main memory). Adding a bias, applying an activation, or scaling by a constant are classic examples. The purpose of an epilogue is to avoid loading or writing to HBM a second time, as memory traffic makes an operation expensive.
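A back-of-the-envelope count makes the saving concrete. The sketch below assumes float32 values and the shapes from the command above (batch 1024, out_dim 64), ignores caches, and counts only traffic for the output and the bias, since the reads of x and the weight are the same in both cases.

```python
# Rough HBM traffic model for the output side of y = x @ w.T + b.
batch, out_dim = 1024, 64
bytes_per_elem = 4  # float32 (assumption)

out_bytes = batch * out_dim * bytes_per_elem
bias_bytes = out_dim * bytes_per_elem

# Unfused: the matmul kernel writes its result, then a separate add kernel
# reads that result back, reads the bias, and writes the sum.
unfused = out_bytes + (out_bytes + bias_bytes + out_bytes)

# Fused epilogue: the bias is read once and the final result is written once.
fused = bias_bytes + out_bytes

print(f"unfused: {unfused} bytes, fused: {fused} bytes")
```

Under these assumptions the fused path moves roughly a third of the output-side bytes, and it also saves one kernel launch.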

nn.Linear calls torch.nn.functional.linear, which in turn invokes aten::linear. aten::linear inspects the inputs, detects that a bias was passed, and dispatches aten::addmm(bias, x, weight.t()) instead of performing a matmul and an add separately. The transposed weight passed here is the view produced by aten::t. addmm computes:

out = x @ weight.T + bias

cuBLAS provides a GEMM variant with a built-in bias add, and that is the kernel aten::addmm selects. The add never appears as a separate kernel because it is part of the matmul kernel's writeback, which is exactly what an epilogue is.
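The equivalence between the three formulations can be checked numerically. This is a small sketch on CPU tensors with randomly generated inputs; it confirms that the results match, not which kernel runs on a GPU.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1024, 32)
weight = torch.randn(64, 32)   # (out_dim, in_dim), as nn.Linear stores it
bias = torch.randn(64)

via_linear = F.linear(x, weight, bias)
via_addmm = torch.addmm(bias, x, weight.t())                 # one fused op
via_two_ops = torch.add(torch.matmul(x, weight.t()), bias)   # Part 1 style

linear_matches_addmm = torch.allclose(via_linear, via_addmm, atol=1e-4)
linear_matches_two_ops = torch.allclose(via_linear, via_two_ops, atol=1e-4)
print(linear_matches_addmm, linear_matches_two_ops)
```

The tolerance allows for small floating-point differences between the fused and unfused paths.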

This is a moment to note something subtle. The kernel seen in Part 1 under --compile (addmm) is the same kernel eager nn.Linear already uses. There is nothing left for torch.compile to fuse here, which we will verify next.

Can --compile help a single Linear?

Let us compile the forward pass and review the profiler trace (visualised in the next section).

uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64 --compile

uvx trace-util traces -b traces

Comparing the eager and compiled traces for a single nn.Linear's forward reveals:

  • The same cuBLAS GEMM kernel on the GPU.
  • The same aten::addmm operation on the CPU.
  • A few extra rows on the CPU lane unique to the compile pass.

This is worth internalising. A common reflex is to reach for torch.compile whenever a model feels sluggish. For a single GEMM-with-bias, compile has very little to do. This is not a bug; it is simply that compile requires more than one operation to perform any fusing. Let us prove this by examining an MLP.

Where did the transpose go? Kernel layouts and pre-ops

A careful reader comparing the eager and compiled traces will notice that the eager CPU dispatch chain contains more steps than the compiled one.

Table 1: Eager dispatch chain where aten::linear traverses aten::t (transpose) followed by aten::addmm

The eager CPU dispatch chain inside aten::linear consists of aten::t followed by aten::addmm. To understand what aten::t actually does, we must briefly explore strides and views.

A tensor stores data as one flat, contiguous run of numbers in memory. The shape and stride are metadata sitting atop that run, telling PyTorch how to traverse it: a stride of (s0, s1) means “step s0 elements to move one row, step s1 to move one column”. Changing the metadata creates a different view of the same raw data without copying:

>>> M = torch.tensor([[0, 1],
...                   [2, 3],
...                   [4, 5]])
>>> M.shape, M.stride()
(torch.Size([3, 2]), (2, 1))  # two steps per row, one step per column
>>> T = M.t()
>>> T.shape, T.stride()
(torch.Size([2, 3]), (1, 2))  # shape and stride swapped, data untouched
>>> T
tensor([[0, 2, 4],
        [1, 3, 5]])
>>> T.flatten()
tensor([0, 2, 4, 1, 3, 5])

M.t() did not move a single number. It returned a new view with swapped strides, so reading the view row by row visits the original buffer [0, 1, 2, 3, 4, 5] in the order 0, 2, 4, 1, 3, 5. The underlying data is identical; only the metadata differs. (T.flatten() has to materialise a copy to return the values in that order; T itself does not.)
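The stride rule itself fits in a few lines of plain Python. The sketch below is a toy model of a 2-D view, not PyTorch's implementation: element (i, j) lives at offset i * s0 + j * s1 in the flat buffer, and the helper name read is hypothetical.

```python
# The flat storage behind M from the example above.
buffer = [0, 1, 2, 3, 4, 5]

def read(shape, stride):
    """Materialise a 2-D view of `buffer` from its shape and stride."""
    rows, cols = shape
    s0, s1 = stride
    return [[buffer[i * s0 + j * s1] for j in range(cols)] for i in range(rows)]

m_view = read((3, 2), (2, 1))   # M: shape (3, 2), stride (2, 1)
t_view = read((2, 3), (1, 2))   # M.t(): shape and stride swapped

print(m_view)
print(t_view)
```

Both calls read the same six numbers; only the shape and stride passed in differ.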

This is exactly what aten::t does inside the linear layer: it does not allocate a new tensor or copy any data, but produces a view of the weight with rewritten strides.

As seen in Figure 5, compile did not remove a GPU kernel; it removed the CPU overhead of dispatching that view. Inductor traced through the view chain at compile time, computed the resulting strides once, and emitted a direct aten::addmm call with those strides hard-coded. A few microseconds of CPU work disappear while the GPU performs identical math.
