For creators and engineers building generative models, the bottleneck is rarely the creative spark; it is the time spent waiting for training loops to converge. This guide dissects NVIDIA Apex and isolates the components that still deliver measurable performance gains in modern GPU workflows. Rather than treating Apex as a monolithic library, we separate its legacy elements from the fused kernels that actually matter. We first verify the CUDA runtime, build Apex with its C++ and CUDA extensions, and confirm which fused operations are available. We then benchmark FusedAdam against PyTorch AdamW, compare FusedLayerNorm and FusedRMSNorm against the standard implementations, and contrast the legacy apex.amp mixed-precision API with the native torch.amp approach. Finally, we run a full Transformer training experiment to measure the real-world impact on throughput of switching from vanilla FP32 to a fused Apex-plus-AMP pipeline.
import os, sys, time, subprocess, importlib
import torch

assert torch.cuda.is_available(), (
    "No CUDA GPU found. In Colab: Runtime > Change runtime type > Hardware accelerator = GPU"
)
DEV = torch.device("cuda")
print(f"[env] torch {torch.__version__} | CUDA {torch.version.cuda} | GPU {torch.cuda.get_device_name(0)}")

def _module_present(name: str) -> bool:
    try:
        importlib.import_module(name)
        return True
    except Exception:
        return False

def _build_apex():
    print("[apex] building from source with CUDA + C++ extensions "
          "(~10-20 min on first run; grab a coffee)...")
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "ninja", "packaging"], check=True)
    if not os.path.isdir("apex"):
        subprocess.run(["git", "clone", "--depth", "1",
                        "https://github.com/NVIDIA/apex"], check=True)
    # These variables ask the Apex build to compile the C++ and CUDA extensions.
    env = os.environ.copy()
    env["APEX_CPP_EXT"] = "1"
    env["APEX_CUDA_EXT"] = "1"
    env["MAX_JOBS"] = "4"
    env["NVCC_APPEND_FLAGS"] = "--threads 4"
    cmd = [sys.executable, "-m", "pip", "install", "-v",
           "--no-build-isolation", "--no-cache-dir", "./apex"]
    proc = subprocess.run(cmd, env=env)
    if proc.returncode != 0:
        # Same command without the extension variables: installs the Python-only package.
        print("[apex] CUDA build failed -> falling back to PYTHON-ONLY install "
              "(fused kernels will be unavailable, tutorial still runs).")
        subprocess.run(cmd, check=False)

if not _module_present("amp_C"):
    _build_apex()
    # Let the import system see packages installed while this process was running.
    importlib.invalidate_caches()

# Probe the compiled extension modules directly; they exist only after a CUDA build.
HAS_AMP_C = _module_present("amp_C")
HAS_FLN = _module_present("fused_layer_norm_cuda")
HAS_RMS = False  # default so later checks are safe even if apex fails to import
try:
    import apex
    from apex.optimizers import FusedAdam
    from apex.normalization import FusedLayerNorm
    try:
        from apex.normalization import FusedRMSNorm
        HAS_RMS = True
    except Exception:
        HAS_RMS = False
    from apex import amp
    APEX_OK = True
except Exception as e:
    print(f"[apex] import failed: {e}")
    APEX_OK = False

print("\n[capabilities]")
print(f"  apex importable    : {APEX_OK}")
print(f"  FusedAdam kernels  : {HAS_AMP_C}")
print(f"  FusedLayerNorm krnl: {HAS_FLN}")
print(f"  FusedRMSNorm       : {APEX_OK and HAS_RMS}")
print("=" * 78)

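def bench(fn, iters=50, warmup=10):
    # Warm up first so one-time costs are excluded from the measurement.
    for _ in range(warmup):
        fn()
    # CUDA kernels launch asynchronously; synchronize so the timer covers finished work.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # mean milliseconds per call

Initial Environment Check and Build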
The process begins by validating the CUDA environment, confirming GPU availability, and displaying the active PyTorch, CUDA, and hardware details. We then compile NVIDIA Apex from source, explicitly enabling CUDA and C++ extensions to unlock the fused kernels, rather than relying on a limited Python-only package. The script also probes for the availability of FusedAdam, FusedLayerNorm, FusedRMSNorm, and legacy AMP support, while establishing a reusable benchmarking utility for the subsequent tests.
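The benchmarking utility follows a common pattern: untimed warmup, a synchronization point, then a timed loop. As a minimal sketch of that pattern, the helper `bench_stats` below is hypothetical and not part of Apex or PyTorch. It takes the synchronization function as a parameter (an assumption made so the example runs without a GPU) and reports the median alongside the mean, since the median is less sensitive to occasional slow iterations.

```python
import statistics
import time


def bench_stats(fn, iters=50, warmup=10, sync=lambda: None):
    """Time fn() and return (mean_ms, median_ms).

    Hypothetical helper for illustration. Pass sync=torch.cuda.synchronize
    when fn launches CUDA kernels, so each sample covers finished GPU work
    rather than just the kernel launch.
    """
    for _ in range(warmup):      # untimed: absorbs one-time setup costs
        fn()
    sync()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        sync()                   # wait for queued work before stopping the clock
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples), statistics.median(samples)


if __name__ == "__main__":
    mean_ms, median_ms = bench_stats(lambda: sum(range(10_000)))
    print(f"mean {mean_ms:.3f} ms | median {median_ms:.3f} ms")
```

Synchronizing after every iteration adds a small per-sample overhead compared with synchronizing once around the whole loop, so this variant trades a little accuracy on very short operations for per-iteration statistics.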
print("\n### SECTION A: FusedAdam vs AdamW ###")

def make_many_param_model(n_layers=60, dim=512):
    return torch.nn.Sequential(*[torch.nn.Linear(dim, dim) for _ in range(n_layers)]).to(DEV)

def opt_step_factory(optimizer, model, dim=512):
    x = torch.randn(64, dim, device=DEV)
    def step():
        optimizer.zero_grad(set_to_none=True)
        out = model(x).pow(2).mean()
        out.backward()
        optimizer.step()
    return step

m1 = make_many_param_model()
torch_adam = torch.optim.AdamW(m1.parameters(), lr=1e-3)
ms_torch = bench(opt_step_factory(torch_adam, m1))
print(f"  torch.optim.AdamW : {ms_torch:6.2f} ms / step")

if HAS_AMP_C and APEX_OK:
    m2 = make_many_param_model()
    m2.load_state_dict(m1.state_dict())
    fused_adam = FusedAdam(m2.parameters(), lr=1e-3)
    ms_fused = bench(opt_step_factory(fused_adam, m2))
    print(f"  apex.FusedAdam    : {ms_fused:6.2f} ms / step "
          f"(~{ms_torch/ms_fused:0.2f}x on optimizer-bound step)")
else:
    print("  apex.FusedAdam    : SKIPPED (cuda ext not built)")

Optimizer Performance Comparison
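To isolate optimizer overhead, we benchmark PyTorch AdamW against Apex FusedAdam on a deep stack of 60 linear layers (120 parameter tensors), which gives the optimizer many separate tensors to update. Both optimizers run the same step pattern (zero the gradients, forward, backward, update), so the comparison reflects step time rather than architectural differences. Because the timed step includes the forward and backward passes, the reported ratio understates the optimizer-only difference. We report the per-step duration and the resulting speedup to determine whether the fused multi-tensor optimizer offers a practical advantage in the current GPU runtime.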
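To make concrete what each optimizer step computes, here is a minimal pure-Python sketch of one AdamW update applied across a list of scalar parameters. It is a reference for the arithmetic only, not Apex's or PyTorch's implementation; the function name `adamw_step` and the scalar-list representation are assumptions made for illustration. A fused multi-tensor optimizer performs the same per-element arithmetic but batches many parameter tensors into a few kernel launches instead of handling them one at a time.

```python
import math


def adamw_step(params, grads, exp_avg, exp_avg_sq, step,
               lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update in place to lists of scalar parameters.

    Hypothetical reference helper: the state lists are updated in place and
    `step` is the 1-based count of updates, including this one.
    """
    beta1, beta2 = betas
    bias1 = 1.0 - beta1 ** step
    bias2 = 1.0 - beta2 ** step
    for i, g in enumerate(grads):
        params[i] *= 1.0 - lr * weight_decay           # decoupled weight decay
        exp_avg[i] = beta1 * exp_avg[i] + (1.0 - beta1) * g
        exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1.0 - beta2) * g * g
        m_hat = exp_avg[i] / bias1                     # bias-corrected first moment
        v_hat = exp_avg_sq[i] / bias2                  # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


if __name__ == "__main__":
    updated = adamw_step([1.0, -2.0], [0.5, 0.25], [0.0, 0.0], [0.0, 0.0], step=1)
    print(updated)
```

The Python loop over `grads` plays the role of the per-tensor loop in an unfused optimizer: each iteration is cheap, so with many small tensors the per-iteration overhead can dominate, which is the cost that fusing targets.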
print("\n### SECTION B: FusedLayerNorm / FusedRMSNorm ###")
B, T, H = 32, 512, 1024
x = torch.randn(B, T, H, device=DEV, requires_grad=True)
torch_ln = torch.nn.LayerNorm(H).to(DEV)

def ln_torch():
    y = torch_ln(x)
    y.sum().backward()

ms_ln_torch = bench(ln_torch)
print(f"  nn.LayerNorm       : {ms_ln_torch:6.2f} ms / fwd+bwd")

if HAS_FLN and APEX_OK:
    fused_ln = FusedLayerNorm(H).to(DEV)
    # Copy the affine parameters so both layers compute the same function.
    with torch.no_grad():
        fused_ln.weight.copy_(torch_ln.weight)
        fused_ln.bias.copy_(torch_ln.bias)
    diff = (fused_ln(x.detach()) - torch_ln(x.detach())).abs().max().item()
    print(f"  max|fused - torch| = {diff:.2e} (should be ~1e-3 or smaller)")

    def ln_fused():
        y = fused_ln(x)
        y.sum().backward()

    ms_ln_fused = bench(ln_fused)
    print(f"  apex.FusedLayerNorm: {ms_ln_fused:6.2f} ms / fwd+bwd "
          f"(~{ms_ln_torch/ms_ln_fused:0.2f}x)")

    if HAS_RMS:
        fused_rms = FusedRMSNorm(H).to(DEV)

        def rms_fused():
            y = fused_rms(x)
            y.sum().backward()

        print(f"  apex.FusedRMSNorm  : {bench(rms_fused):6.2f} ms / fwd+bwd")
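For reference, the two normalizations benchmarked in this section differ only in whether the mean is subtracted. The sketch below computes both over a plain Python list. The function names `layer_norm` and `rms_norm` are illustrative, the learnable weight and bias are omitted, and the epsilon of 1e-5 is an assumption chosen to match a common default.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; the mean is not subtracted."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0, 4.0]
    print("layer_norm:", [round(v, 4) for v in layer_norm(row)])
    print("rms_norm  :", [round(v, 4) for v in rms_norm(row)])
```

RMSNorm needs one reduction over the hidden dimension where LayerNorm needs two (mean and variance), and on zero-mean input the two produce the same result.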




