Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning
In this tutorial, we explore OpenMythos, a library for building recurrent-depth transformers. We build an end-to-end workflow in Google Colab that constructs models with either MLA (Multi-head Latent Attention) or GQA (Grouped-Query Attention).
Setup and Imports
We first try to import OpenMythos. If that fails, we install it from PyPI, and fall back to installing from GitHub if the package still cannot be imported. We then import the Python, PyTorch, NumPy, and plotting libraries needed for model building, training, and visualization, set a fixed random seed for reproducibility, and select CUDA when available.
```python
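import subprocess, sys


def pip(*args):
    """Quietly install packages into the current interpreter's environment."""
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args], check=False)


# Use an existing install if present, then try PyPI, then GitHub as a last resort.
try:
    import open_mythos  # noqa: F401
except Exception:
    pip("open-mythos")
    try:
        import open_mythos  # noqa: F401
    except Exception:
        pip("git+https://github.com/kyegomez/OpenMythos.git")

import math, random, time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from open_mythos.main import OpenMythos, MythosConfig

# Fix all random seeds so runs are reproducible.
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device} | Torch: {torch.__version__}")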
```
Model Building and Parameter Counting
We define a reusable model factory that builds small OpenMythos models with either MLA or GQA attention. We compare both variants by checking their parameter counts and the spectral radius of the recurrent injection matrix, which should be less than 1 for stability.
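To illustrate why a spectral radius below 1 matters, here is a minimal pure-Python sketch. It assumes a simplified linear recurrence `h = a * h + b` with a diagonal matrix, for which the spectral radius is just the largest absolute diagonal entry. The helper names `spectral_radius_diag` and `iterate_diagonal` are illustrative and not part of OpenMythos, and the library's actual recurrent update may contain more than this linear part.

```python
def spectral_radius_diag(a):
    """Spectral radius of a diagonal matrix given as a list of its diagonal entries."""
    return max(abs(x) for x in a)


def iterate_diagonal(a, b, steps):
    """Apply h <- a * h + b elementwise `steps` times, starting from zeros.

    Returns the largest absolute entry of the final state.
    """
    h = [0.0] * len(a)
    for _ in range(steps):
        h = [ai * hi + bi for ai, hi, bi in zip(a, h, b)]
    return max(abs(x) for x in h)


b = [1.0, 1.0, 1.0]
stable_a = [0.9, -0.5, 0.3]     # largest |entry| is 0.9, below 1
unstable_a = [1.1, -0.5, 0.3]   # largest |entry| is 1.1, above 1

stable_norm = iterate_diagonal(stable_a, b, steps=200)
unstable_norm = iterate_diagonal(unstable_a, b, steps=200)

print(f"rho = {spectral_radius_diag(stable_a):.1f} -> max |h| after 200 steps: {stable_norm:.4f}")
print(f"rho = {spectral_radius_diag(unstable_a):.1f} -> max |h| after 200 steps: {unstable_norm:.3e}")
```

With every entry below 1 in magnitude, each component settles at its fixed point `b / (1 - a)`. A single entry above 1 makes the state grow geometrically with the number of loops, which is the failure mode the spectral radius check guards against.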
```python
def build_model(attn_type: str = "mla", max_loop_iters: int = 8) -> tuple:
    """Build a small OpenMythos model. Two attention variants supported.

    MLA: Multi-head Latent Attention (compressed KV cache, DeepSeek-V2 style)
    GQA: Grouped-Query Attention (fewer KV heads than Q heads)
    """
    # Settings shared by both attention variants.
    base = dict(
        vocab_size=64,
        dim=128,
        n_heads=4,
        max_seq_len=32,
        max_loop_iters=max_loop_iters,
        prelude_layers=1,
        coda_layers=1,
        n_experts=4,
        n_shared_experts=1,
        n_experts_per_tok=2,
        expert_dim=64,
        lora_rank=8,
        attn_type=attn_type,
    )
    if attn_type == "gqa":
        cfg = MythosConfig(**base, n_kv_heads=2)
    else:
        cfg = MythosConfig(
            **base, n_kv_heads=4,
            kv_lora_rank=32, q_lora_rank=32,
            qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
        )
    model = OpenMythos(cfg).to(device)
    return model, cfg


model_mla, cfg_mla = build_model("mla")
model_gqa, cfg_gqa = build_model("gqa")


def n_params(m):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in m.parameters())


print(f"\n[MLA] params: {n_params(model_mla):>10,}")
print(f"[GQA] params: {n_params(model_gqa):>10,}")


def spectral_radius(model):
    """Largest absolute eigenvalue of the recurrent injection matrix A."""
    A = model.recurrent.injection.get_A().detach().cpu()
    if A.dim() == 1:
        # Diagonal parameterization: the eigenvalues are the entries themselves.
        rho = A.abs().max().item()
    else:
        rho = torch.linalg.eigvals(A.float()).abs().max().item()
    return rho


print(f"\nρ(A) MLA: {spectral_radius(model_mla):.4f} (must be < 1)")
print(f"ρ(A) GQA: {spectral_radius(model_gqa):.4f} (must be < 1)")

# Smoke test: one forward pass and a short generation on random token ids.
ids = torch.randint(0, cfg_mla.vocab_size, (2, 16), device=device)
with torch.no_grad():
    logits = model_mla(ids, n_loops=4)
    gen = model_mla.generate(ids, max_new_tokens=4, n_loops=8)

print(f"\nForward logits shape: {tuple(logits.shape)}")
print(f"Generation shape: {tuple(gen.shape)}")
```
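The docstring above describes MLA as compressing the KV cache and GQA as using fewer KV heads. A rough way to compare them is to count the values cached per token per layer, using the configuration numbers from `build_model`. This is a back-of-the-envelope sketch under two assumptions: that the GQA head dimension is `dim / n_heads`, and that MLA caches what the DeepSeek-V2 design caches (the compressed KV latent plus one shared RoPE key). Whether OpenMythos lays out its cache exactly this way is not confirmed here, and the function names are illustrative.

```python
def kv_cache_per_token(n_kv_heads: int, head_dim: int) -> int:
    """Values cached per token per layer when full keys and values are stored."""
    return 2 * n_kv_heads * head_dim  # one key and one value vector per KV head


def mla_cache_per_token(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """Values cached per token per layer under the DeepSeek-V2 MLA scheme (assumed)."""
    return kv_lora_rank + qk_rope_head_dim  # compressed latent + shared RoPE key


dim, n_heads = 128, 4
head_dim = dim // n_heads  # assumption: 32 dimensions per head

mha_cache = kv_cache_per_token(n_kv_heads=4, head_dim=head_dim)  # baseline: one KV head per query head
gqa_cache = kv_cache_per_token(n_kv_heads=2, head_dim=head_dim)  # GQA config above
mla_cache = mla_cache_per_token(kv_lora_rank=32, qk_rope_head_dim=16)  # MLA config above

print(f"MHA baseline: {mha_cache} values per token per layer")  # 256
print(f"GQA:          {gqa_cache} values per token per layer")  # 128
print(f"MLA:          {mla_cache} values per token per layer")  # 48
```

Under these assumptions, GQA halves the cache relative to a full multi-head baseline, while MLA stores a low-rank latent whose size is independent of the number of heads.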
Synthetic Compositional Reasoning Task
We create a synthetic compositional task in which the model predicts the sum of a chain of digit tokens modulo 7. We define the token scheme, sequence structure, and a dataset class that generates random digit-chain examples. We then build training and test loaders with chains of 2 to 5 digits, plus an out-of-distribution loader with chains of 6 to 8 digits, so we can evaluate both in-distribution performance and depth extrapolation.
```python
# Special tokens; digit d is encoded as DIGIT_BASE + d.
PAD, START, EQ = 0, 1, 2
DIGIT_BASE = 10
M = 7  # modulus: digits are drawn from 0..M-1 and the target is the sum mod M
SEQ_LEN = cfg_mla.max_seq_len
MIN_LEN, MAX_LEN = 2, 5


def make_example(chain_len: int):
    """Return (padded token sequence, target token) for one random digit chain."""
    digits = [random.randint(0, M - 1) for _ in range(chain_len)]
    target = sum(digits) % M
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    return toks[:SEQ_LEN], DIGIT_BASE + target


class ChainDataset(Dataset):
    """n random examples with chain lengths drawn uniformly from [lo, hi]."""

    def __init__(self, n, lo, hi):
        self.items = [make_example(random.randint(lo, hi)) for _ in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        x, y = self.items[i]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)


train_loader = DataLoader(ChainDataset(3000, MIN_LEN, MAX_LEN), batch_size=64, shuffle=True)
test_loader = DataLoader(ChainDataset(400, MIN_LEN, MAX_LEN), batch_size=64)
# Out-of-distribution: chains longer than any seen during training.
ood_loader = DataLoader(ChainDataset(400, MAX_LEN + 1, MAX_LEN + 3), batch_size=64)
```
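To make the token scheme concrete, here is a small worked example that encodes one fixed chain by hand. It mirrors `make_example` but takes the digits as an argument instead of sampling them. The helper name `encode_chain` is illustrative, and `SEQ_LEN` is hard-coded to 32 to match `max_seq_len` in the configuration so the snippet runs on its own.

```python
PAD, START, EQ = 0, 1, 2
DIGIT_BASE = 10
M = 7
SEQ_LEN = 32  # matches max_seq_len in the model config


def encode_chain(digits):
    """Encode a list of digits as START, digit tokens, EQ, then PAD up to SEQ_LEN."""
    target = sum(digits) % M
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    return toks[:SEQ_LEN], DIGIT_BASE + target


tokens, label = encode_chain([3, 5, 6])
print(tokens[:6])  # [1, 13, 15, 16, 2, 0]
print(label)       # 10, because (3 + 5 + 6) % 7 == 0 and 0 maps to token 10
```

Digits 0 to 6 map to tokens 10 to 16, so the target is always one of seven classes inside the 64-token vocabulary.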
Key Takeaways
- The MLA model has fewer parameters than the GQA model.
- We check the spectral radius of the recurrent injection matrix for both models; it should be less than 1 for stability.
- We trained an MLA model with a fixed number of recurrent loops using AdamW, tracked the training loss over epochs, and observed how the loop count affects inference-time accuracy.
- On the synthetic digit-chain task, the MLA model can perform well even on longer, out-of-distribution chain lengths.