How to Speed Up Transformer Training Using NVIDIA Apex (FusedAdam, FusedLayerNorm) and Native torch.amp

For creators and engineers building generative models, the bottleneck is rarely the creative spark; it is the time spent waiting for training loops to converge. This guide dissects NVIDIA Apex and isolates the components that still deliver measurable performance gains in modern GPU workflows. Rather than treating Apex as a monolithic library, we separate its legacy elements from the high-performance kernels that actually matter. We confirm that a CUDA GPU is available, compile the necessary extensions, and check which fused operations the environment actually supports. After setup, we benchmark FusedAdam against PyTorch AdamW, compare FusedLayerNorm and FusedRMSNorm against the standard implementations, and contrast Apex's legacy mixed-precision module (apex.amp) with native torch.amp. Finally, we run a full Transformer training experiment to measure the real-world impact on throughput when switching from vanilla FP32 to a fused Apex-plus-AMP pipeline.

import os, sys, time, subprocess, importlib
import torch
assert torch.cuda.is_available(), (
   "No CUDA GPU found. In Colab: Runtime > Change runtime type > Hardware accelerator = GPU"
)
DEV = torch.device("cuda")
print(f"[env] torch {torch.__version__} | CUDA {torch.version.cuda} | GPU {torch.cuda.get_device_name(0)}")
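def _module_present(name: str) -> bool:
   # Refresh the import system's caches so a module installed earlier in
   # this same process (e.g. by the pip build below) can be discovered.
   importlib.invalidate_caches()
   try:
       importlib.import_module(name)
       return True
   except Exception:
       return False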
def _build_apex():
   print("[apex] building from source with CUDA + C++ extensions "
         "(~10-20 min on first run; grab a coffee)...")
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", "ninja", "packaging"], check=True)
   if not os.path.isdir("apex"):
       subprocess.run(["git", "clone", "--depth", "1",
                       "https://github.com/NVIDIA/apex"], check=True)
   env = os.environ.copy()
   env["APEX_CPP_EXT"]        = "1"
   env["APEX_CUDA_EXT"]       = "1"
   env["MAX_JOBS"]            = "4"
   env["NVCC_APPEND_FLAGS"]   = "--threads 4"
   cmd = [sys.executable, "-m", "pip", "install", "-v",
          "--no-build-isolation", "--no-cache-dir", "./apex"]
   proc = subprocess.run(cmd, env=env)
   if proc.returncode != 0:
       print("[apex] CUDA build failed -> falling back to PYTHON-ONLY install "
             "(fused kernels will be unavailable, tutorial still runs).")
       subprocess.run([sys.executable, "-m", "pip", "install", "-v",
                       "--no-build-isolation", "--no-cache-dir", "./apex"], check=False)
if not _module_present("amp_C"):
   _build_apex()
HAS_AMP_C  = _module_present("amp_C")
HAS_FLN    = _module_present("fused_layer_norm_cuda")
try:
   import apex
   from apex.optimizers import FusedAdam
   from apex.normalization import FusedLayerNorm
   try:
       from apex.normalization import FusedRMSNorm
       HAS_RMS = True
   except Exception:
       HAS_RMS = False
   from apex import amp
   APEX_OK = True
except Exception as e:
   print(f"[apex] import failed: {e}")
   APEX_OK = False
print("\n[capabilities]")
print(f"  apex importable    : {APEX_OK}")
print(f"  FusedAdam kernels  : {HAS_AMP_C}")
print(f"  FusedLayerNorm krnl: {HAS_FLN}")
print(f"  FusedRMSNorm       : {APEX_OK and HAS_RMS}")
print("=" * 78)
def bench(fn, iters=50, warmup=10):
   for _ in range(warmup):
       fn()
   torch.cuda.synchronize()
   t0 = time.perf_counter()
   for _ in range(iters):
       fn()
   torch.cuda.synchronize()
   return (time.perf_counter() - t0) / iters * 1e3

Initial Environment Check and Build

The script begins by validating the CUDA environment: it asserts that a GPU is available and prints the active PyTorch version, CUDA version, and GPU name. It then compiles NVIDIA Apex from source with the CUDA and C++ extensions explicitly enabled, since the fused kernels are not included in a Python-only install. If the CUDA build fails, the script falls back to a Python-only install so the rest of the tutorial still runs, with the fused benchmarks skipped. Finally, it probes for FusedAdam, FusedLayerNorm, FusedRMSNorm, and legacy AMP support, and defines a reusable benchmarking helper that synchronizes the GPU before and after the timed loop and returns milliseconds per iteration.
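Source builds of CUDA extensions tend to be sensitive to a mismatch between the CUDA version PyTorch was built against and the local toolkit that provides nvcc. The sketch below shows one way such a check could look before starting a long build. The helper names `parse_nvcc_release` and `cuda_major_matches` are hypothetical, and comparing only the major version is a simplifying assumption; a stricter check may be appropriate in some environments.

```python
import re
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_major_matches(torch_cuda: Optional[str], nvcc_output: str) -> bool:
    """True when PyTorch's CUDA major version equals the local toolkit's."""
    release = parse_nvcc_release(nvcc_output)
    if torch_cuda is None or release is None:
        return False  # CPU-only PyTorch build, or nvcc output not recognised
    return int(torch_cuda.split(".")[0]) == release[0]


# Illustrative nvcc output line; real output has several more lines.
sample = "Cuda compilation tools, release 12.1, V12.1.105"
print(parse_nvcc_release(sample))          # (12, 1)
print(cuda_major_matches("12.4", sample))  # True
print(cuda_major_matches("11.8", sample))  # False
```

In the tutorial's environment, the first argument would be `torch.version.cuda` and the second the captured output of `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.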

print("\n### SECTION A: FusedAdam vs AdamW ###")
def make_many_param_model(n_layers=60, dim=512):
   return torch.nn.Sequential(*[torch.nn.Linear(dim, dim) for _ in range(n_layers)]).to(DEV)
def opt_step_factory(optimizer, model, dim=512):
   x = torch.randn(64, dim, device=DEV)
   def step():
       optimizer.zero_grad(set_to_none=True)
       out = model(x).pow(2).mean()
       out.backward()
       optimizer.step()
   return step
m1 = make_many_param_model()
torch_adam = torch.optim.AdamW(m1.parameters(), lr=1e-3)
ms_torch = bench(opt_step_factory(torch_adam, m1))
print(f"  torch.optim.AdamW : {ms_torch:6.2f} ms / step")
if HAS_AMP_C and APEX_OK:
   m2 = make_many_param_model()
   m2.load_state_dict(m1.state_dict())
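   # torch.optim.AdamW defaults to weight_decay=1e-2, while FusedAdam defaults
   # to 0.0; pass it explicitly so both optimizers apply the same update rule.
   fused_adam = FusedAdam(m2.parameters(), lr=1e-3, weight_decay=1e-2)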
   ms_fused = bench(opt_step_factory(fused_adam, m2))
   print(f"  apex.FusedAdam    : {ms_fused:6.2f} ms / step   "
         f"(~{ms_torch/ms_fused:0.2f}x on optimizer-bound step)")
else:
   print("  apex.FusedAdam    : SKIPPED (cuda ext not built)")

Optimizer Performance Comparison

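To isolate optimizer overhead, we benchmark PyTorch AdamW against Apex FusedAdam on a deep stack of 60 linear layers (512 x 512 each), which gives the optimizer many separate parameter tensors to update. Both runs use the same zero-grad, forward, backward, and step sequence on the same batch shape, so the comparison reflects update speed rather than architectural differences. Because each timed step still includes the forward and backward passes, the reported ratio is a lower bound on the speedup of the optimizer update itself. We report the step duration and the resulting speedup to determine whether the fused multi-tensor optimizer offers a practical advantage on the current GPU runtime.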
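To make concrete what both optimizers compute on every step, here is a minimal pure-Python reference for a single AdamW update (Adam with decoupled weight decay) over a flat list of scalars. This is a sketch for illustration, not Apex's kernel; the function name `adamw_step_reference` and its argument names are hypothetical. A fused multi-tensor optimizer applies the same arithmetic across many parameter tensors in a small number of kernel launches instead of looping tensor by tensor.

```python
import math
from typing import List


def adamw_step_reference(params: List[float], grads: List[float],
                         exp_avg: List[float], exp_avg_sq: List[float],
                         step: int, lr: float = 1e-3,
                         beta1: float = 0.9, beta2: float = 0.999,
                         eps: float = 1e-8, weight_decay: float = 0.0) -> None:
    """Apply one AdamW update in place to scalar parameters (reference only)."""
    bias1 = 1.0 - beta1 ** step  # bias correction for the first moment
    bias2 = 1.0 - beta2 ** step  # bias correction for the second moment
    for i, g in enumerate(grads):
        # Decoupled weight decay: shrink the parameter directly.
        params[i] -= lr * weight_decay * params[i]
        # Exponential moving averages of the gradient and its square.
        exp_avg[i] = beta1 * exp_avg[i] + (1.0 - beta1) * g
        exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1.0 - beta2) * g * g
        # Bias-corrected update.
        denom = math.sqrt(exp_avg_sq[i] / bias2) + eps
        params[i] -= lr * (exp_avg[i] / bias1) / denom


params, grads = [1.0, -2.0], [0.5, -0.25]
m, v = [0.0, 0.0], [0.0, 0.0]
adamw_step_reference(params, grads, m, v, step=1)
print(params)  # each parameter moves by roughly lr against its gradient's sign
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by approximately `lr` regardless of gradient magnitude, which is a handy sanity check when comparing optimizer implementations.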

print("\n### SECTION B: FusedLayerNorm / FusedRMSNorm ###")
B, T, H = 32, 512, 1024
x = torch.randn(B, T, H, device=DEV, requires_grad=True)
torch_ln = torch.nn.LayerNorm(H).to(DEV)
def ln_torch():
   y = torch_ln(x); y.sum().backward()
ms_ln_torch = bench(ln_torch)
print(f"  nn.LayerNorm       : {ms_ln_torch:6.2f} ms / fwd+bwd")
if HAS_FLN and APEX_OK:
   fused_ln = FusedLayerNorm(H).to(DEV)
   with torch.no_grad():
       fused_ln.weight.copy_(torch_ln.weight); fused_ln.bias.copy_(torch_ln.bias)
   diff = (fused_ln(x.detach()) - torch_ln(x.detach())).abs().max().item()
   print(f"    max|fused - torch| = {diff:.2e}  (should be ~1e-3 or smaller)")
   def ln_fused():
       y = fused_ln(x); y.sum().backward()
   ms_ln_fused = bench(ln_fused)
   print(f"  apex.FusedLayerNorm: {ms_ln_fused:6.2f} ms / fwd+bwd   "
         f"(~{ms_ln_torch/ms_ln_fused:0.2f}x)")
   if HAS_RMS:
       fused_rms = FusedRMSNorm(H).to(DEV)
       def rms_fused():
           y = fused_rms(x); y.sum().backward()
       print(f"  apex.FusedRMSNorm  : {bench(rms_fused):6.2f} ms / fwd+bwd "
             f"(RMSNorm: