Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning

By AI Maestro May 22, 2026 4 min read

In this tutorial, we explore OpenMythos, a library for building recurrent-depth transformers. We create an end-to-end workflow in Google Colab that builds small models with either MLA (Multi-head Latent Attention) or GQA (Grouped-Query Attention).
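
To make the difference between the two attention variants concrete, the sketch below counts how many values each one would cache per token per layer. It is an illustration under stated assumptions, not OpenMythos code: the helper names are hypothetical, the GQA head size is assumed to be `dim // n_heads`, and the MLA count follows the DeepSeek-V2 description (one compressed KV latent plus one shared RoPE key), which OpenMythos may or may not implement identically. The numbers plugged in are the small dimensions this tutorial uses in its model factory.

```python
def gqa_cache_per_token(dim: int, n_heads: int, n_kv_heads: int) -> int:
    """Values cached per token per layer: one key and one value per KV head."""
    head_dim = dim // n_heads  # assumption: standard per-head width
    return 2 * n_kv_heads * head_dim


def mla_cache_per_token(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """DeepSeek-V2 style: cache the compressed KV latent plus one shared RoPE key."""
    return kv_lora_rank + qk_rope_head_dim


# Hypothetical comparison using the tutorial's small config values.
mha = gqa_cache_per_token(dim=128, n_heads=4, n_kv_heads=4)  # full multi-head baseline
gqa = gqa_cache_per_token(dim=128, n_heads=4, n_kv_heads=2)
mla = mla_cache_per_token(kv_lora_rank=32, qk_rope_head_dim=16)
print(f"MHA: {mha}  GQA: {gqa}  MLA: {mla}")
```

Under these assumptions, GQA halves the cache relative to full multi-head attention by sharing each KV head across two query heads, while MLA shrinks it further by storing a low-rank latent instead of full keys and values.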

Setup and Imports

We first try to import OpenMythos. If the import fails, we install it from PyPI, and if that also fails we fall back to the GitHub repository. We then import the Python, PyTorch, NumPy, and plotting libraries needed for model building, training, and visualization. We set a fixed random seed for reproducibility and use CUDA when available.

```python
import subprocess, sys

def pip(*args):
    # check=False: a failed install falls through to the next option instead of raising.
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args], check=False)

try:
    import open_mythos  # noqa: F401
except Exception:
    pip("open-mythos")  # first attempt: PyPI
    try:
        import open_mythos  # noqa: F401
    except Exception:
        pip("git+https://github.com/kyegomez/OpenMythos.git")  # fallback: GitHub

import math, random, time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from open_mythos.main import OpenMythos, MythosConfig

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)
# Use CUDA when available; `device` is referenced by the later cells.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device} | Torch: {torch.__version__}")
```

Model Building and Parameter Counting

We define a reusable model factory that builds small OpenMythos models with either MLA or GQA attention. We compare the two variants by printing their parameter counts and the spectral radius of the recurrent injection matrix, which should be less than 1 for stability. Finally, we run a quick forward pass and a short generation call to confirm the output shapes.

```python
def build_model(attn_type: str = "mla", max_loop_iters: int = 8) -> tuple:
    """Build a small OpenMythos model. Two attention variants supported.
    MLA: Multi-head Latent Attention (compressed KV cache, DeepSeek-V2 style)
    GQA: Grouped-Query Attention (fewer KV heads than Q heads)
    """
    # Settings shared by both attention variants.
    base = dict(
        vocab_size=64,
        dim=128,
        n_heads=4,
        max_seq_len=32,
        max_loop_iters=max_loop_iters,
        prelude_layers=1,
        coda_layers=1,
        n_experts=4,
        n_shared_experts=1,
        n_experts_per_tok=2,
        expert_dim=64,
        lora_rank=8,
        attn_type=attn_type,
    )
    if attn_type == "gqa":
        cfg = MythosConfig(**base, n_kv_heads=2)
    else:
        cfg = MythosConfig(
            **base, n_kv_heads=4,
            kv_lora_rank=32, q_lora_rank=32,
            qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
        )
    model = OpenMythos(cfg).to(device)
    return model, cfg

model_mla, cfg_mla = build_model("mla")
model_gqa, cfg_gqa = build_model("gqa")

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(f"\n[MLA] params: {n_params(model_mla):>10,}")
print(f"[GQA] params: {n_params(model_gqa):>10,}")

def spectral_radius(model):
    A = model.recurrent.injection.get_A().detach().cpu()
    if A.dim() == 1:
        # Diagonal A stored as a vector: the eigenvalues are its entries.
        rho = A.abs().max().item()
    else:
        rho = torch.linalg.eigvals(A.float()).abs().max().item()
    return rho

print(f"\nρ(A) MLA: {spectral_radius(model_mla):.4f} (must be < 1)")
print(f"ρ(A) GQA: {spectral_radius(model_gqa):.4f} (must be < 1)")

# Smoke test: one forward pass and a short generation on random token ids.
ids = torch.randint(0, cfg_mla.vocab_size, (2, 16), device=device)
with torch.no_grad():
    logits = model_mla(ids, n_loops=4)
    gen = model_mla.generate(ids, max_new_tokens=4, n_loops=8)

print(f"\nForward logits shape: {tuple(logits.shape)}")
print(f"Generation shape: {tuple(gen.shape)}")
```
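
The stability condition can be illustrated without the library. The sketch below assumes the recurrent update has the simple linear form h ← a·h + b·e with a diagonal A (the 1-D case handled by `spectral_radius` above). This is a simplification and not necessarily the exact update rule OpenMythos uses; `iterate`, `max_abs`, and the example vectors are hypothetical.

```python
def iterate(a_diag, b, e, n_loops):
    """Run h <- a*h + b*e elementwise for n_loops steps, starting from h = 0."""
    h = [0.0] * len(a_diag)
    for _ in range(n_loops):
        h = [a * hi + b * ei for a, hi, ei in zip(a_diag, h, e)]
    return h


def max_abs(v):
    """Largest absolute entry of a vector."""
    return max(abs(x) for x in v)


e = [1.0, -2.0, 0.5]          # fixed injected input, re-applied every loop
stable = [0.9, -0.5, 0.3]     # spectral radius 0.9 < 1
unstable = [1.1, -0.5, 0.3]   # spectral radius 1.1 > 1

for n in (8, 64, 256):
    s = max_abs(iterate(stable, 1.0, e, n))
    u = max_abs(iterate(unstable, 1.0, e, n))
    print(f"loops={n:>3}  stable max|h|={s:.4f}  unstable max|h|={u:.3e}")
```

When every |a_i| < 1, each state entry settles at the fixed point b·e_i / (1 − a_i) however many loops run, which is what makes it reasonable to vary the loop count at inference. A single entry with magnitude above 1 makes the state grow geometrically.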

Synthetic Compositional Reasoning Task

We create a synthetic compositional task in which the model predicts the sum of a chain of digit tokens modulo 7. We define the token scheme, sequence structure, and dataset class to generate random digit-chain examples. We then build training and test loaders with chains of 2 to 5 digits, plus an out-of-distribution loader with chains of 6 to 8 digits, so we can evaluate both in-distribution performance and depth extrapolation.

```python
PAD, START, EQ = 0, 1, 2
DIGIT_BASE = 10
M = 7
SEQ_LEN = cfg_mla.max_seq_len
MIN_LEN, MAX_LEN = 2, 5

def make_example(chain_len: int):
    digits = [random.randint(0, M-1) for _ in range(chain_len)]
    target = sum(digits) % M
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    return toks[:SEQ_LEN], DIGIT_BASE + target

class ChainDataset(Dataset):
    def __init__(self, n, lo, hi):
        self.items = [make_example(random.randint(lo, hi)) for _ in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        x, y = self.items[i]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

train_loader = DataLoader(ChainDataset(3000, MIN_LEN, MAX_LEN), batch_size=64, shuffle=True)
test_loader = DataLoader(ChainDataset(400, MIN_LEN, MAX_LEN), batch_size=64)
ood_loader = DataLoader(ChainDataset(400, MAX_LEN+1, MAX_LEN+3), batch_size=64)
```
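
A worked example helps when reading the token scheme. The sketch below re-implements the encoding deterministically, with fixed digits instead of `random`, using a hypothetical `encode_chain` helper and a matching `decode_target`. Neither helper is part of the tutorial code; the constants mirror the cell above, with `SEQ_LEN` hard-coded to the configured `max_seq_len` of 32.

```python
PAD, START, EQ = 0, 1, 2
DIGIT_BASE = 10   # digit d is stored as token DIGIT_BASE + d
M = 7             # modulus of the sum
SEQ_LEN = 32      # matches max_seq_len in the model config


def encode_chain(digits):
    """Return (padded token list, target token) for a fixed list of digits."""
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    return toks[:SEQ_LEN], DIGIT_BASE + sum(digits) % M


def decode_target(token):
    """Map a target token back to its residue in 0..M-1."""
    return token - DIGIT_BASE


tokens, target = encode_chain([3, 5, 6])
print(tokens[:6])                      # [1, 13, 15, 16, 2, 0]
print(target, decode_target(target))   # 10 0, since (3 + 5 + 6) % 7 == 0
```

Because digit tokens occupy 10 to 16 and the vocabulary has 64 entries, the special tokens 0 to 2 never collide with digits or targets.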

Key Takeaways

  • The MLA model has fewer parameters than the GQA model.
  • The spectral radius of the recurrent injection matrix is checked for both models; it should be less than 1 for the recurrence to stay stable.
  • We trained the MLA model with a fixed number of recurrent loops, optimized it with AdamW, and tracked the training loss over epochs. We also observed how the loop count used at inference affects accuracy.
  • On the synthetic digit-chain task, the MLA model performed well even on chains longer than those seen in training (out-of-distribution).
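
The loop-count observation comes from evaluating one model at several inference-time loop counts. The sketch below shows only the shape of such a sweep. Both `sweep_loops` and `toy_predict` are hypothetical: `toy_predict` is a hand-written stand-in that folds in one digit per loop, not the trained OpenMythos model, so the accuracies it prints illustrate the mechanics and are not measurements. In the tutorial's setting, the predictor would presumably wrap a call such as `model(ids, n_loops=n)`.

```python
M = 7


def toy_predict(digits, n_loops):
    """Stand-in predictor: each loop adds one more digit to a running sum mod M."""
    acc = 0
    for d in digits[:n_loops]:
        acc = (acc + d) % M
    return acc


def sweep_loops(predict, examples, loop_counts):
    """Return {n_loops: accuracy} for a predictor called as predict(x, n_loops)."""
    results = {}
    for n in loop_counts:
        correct = sum(predict(x, n) == y for x, y in examples)
        results[n] = correct / len(examples)
    return results


# Fixed chains of length 2, 4, and 6, paired with their true targets.
chains = [[3, 5], [1, 2, 3, 5], [6, 6, 1, 2, 5, 3]]
examples = [(c, sum(c) % M) for c in chains]

for n, acc in sweep_loops(toy_predict, examples, [1, 2, 4, 8]).items():
    print(f"n_loops={n}: accuracy={acc:.2f}")
```

Keeping the predictor as a plain callable means the same sweep can be pointed at the test and out-of-distribution sets without changing the loop.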
