Build Recurrent-Depth Transformers with OpenMythos for MLA, GQA, Sparse MoE, and Loop-Scaled Reasoning


In this tutorial, we explore OpenMythos, a library for building recurrent-depth transformers. We create an end-to-end workflow in Google Colab that builds models with either MLA (Multi-head Latent Attention) or GQA (Grouped-Query Attention).

Setup and Imports

We first try to install OpenMythos from PyPI and fall back to the GitHub repository if that fails. We then import the Python, PyTorch, NumPy, and plotting libraries needed for model building, training, and visualization. We set a fixed random seed for reproducibility and use CUDA when available.

```python
import subprocess, sys

def pip(*args):
    # Quiet pip install; check=False so a failed attempt does not raise.
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args], check=False)

# Try PyPI first, then fall back to the GitHub repository.
try:
    import open_mythos  # noqa: F401
except Exception:
    pip("open-mythos")
    try:
        import open_mythos  # noqa: F401
    except Exception:
        pip("git+https://github.com/kyegomez/OpenMythos.git")

import math, random, time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt

from open_mythos.main import OpenMythos, MythosConfig

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# Select the device before it is used below and in the model factory.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device} | Torch: {torch.__version__}")
```


Model Building and Parameter Counting

We define a reusable model factory that builds small OpenMythos models with either MLA or GQA attention. We compare both variants by checking their parameter counts and the spectral radius of the recurrent injection matrix, which should be less than 1 for stability.
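As a hedged illustration of why a spectral radius below 1 matters, consider a toy linear recurrence h ← a·h + b with a diagonal matrix, whose spectral radius is simply the largest |a|. This is a simplified sketch, not OpenMythos's actual update rule; `iterate_diagonal`, `max_abs`, and the numeric values are hypothetical.

```python
def iterate_diagonal(a_diag, b, n_loops):
    """Run h <- a * h + b elementwise for n_loops steps, starting from h = 0."""
    h = [0.0] * len(a_diag)
    for _ in range(n_loops):
        h = [a * x + bi for a, x, bi in zip(a_diag, h, b)]
    return h

def max_abs(values):
    """Largest absolute entry of a list of floats."""
    return max(abs(v) for v in values)

b = [1.0, 1.0]
stable = [0.9, -0.5]      # diagonal entries, spectral radius 0.9 < 1
unstable = [1.1, -0.5]    # diagonal entries, spectral radius 1.1 > 1

# Compare the state magnitude as the number of loops grows.
for n in (8, 64, 512):
    print(n,
          round(max_abs(iterate_diagonal(stable, b, n)), 3),
          max_abs(iterate_diagonal(unstable, b, n)))
```

With every |a| below 1, the state settles near b / (1 - a) no matter how many loops run. Once any |a| exceeds 1, the state grows geometrically with the loop count, which is why the tutorial checks this quantity before running many recurrent iterations.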

```python
def build_model(attn_type: str = "mla", max_loop_iters: int = 8) -> tuple:
    """Build a small OpenMythos model. Two attention variants supported.

    MLA: Multi-head Latent Attention (compressed KV cache, DeepSeek-V2 style)
    GQA: Grouped-Query Attention (fewer KV heads than Q heads)
    """
    # Settings shared by both attention variants.
    base = dict(
        vocab_size        = 64,
        dim               = 128,
        n_heads           = 4,
        max_seq_len       = 32,
        max_loop_iters    = max_loop_iters,
        prelude_layers    = 1,
        coda_layers       = 1,
        n_experts         = 4,
        n_shared_experts  = 1,
        n_experts_per_tok = 2,
        expert_dim        = 64,
        lora_rank         = 8,
        attn_type         = attn_type,
    )
    if attn_type == "gqa":
        cfg = MythosConfig(**base, n_kv_heads=2)
    else:
        cfg = MythosConfig(
            **base, n_kv_heads=4,
            kv_lora_rank=32, q_lora_rank=32,
            qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
        )
    model = OpenMythos(cfg).to(device)
    return model, cfg

model_mla, cfg_mla = build_model("mla")
model_gqa, cfg_gqa = build_model("gqa")

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(f"\n[MLA] params: {n_params(model_mla):>10,}")
print(f"[GQA] params: {n_params(model_gqa):>10,}")

def spectral_radius(model):
    A = model.recurrent.injection.get_A().detach().cpu()
    if A.dim() == 1:
        # 1-D A is treated as a diagonal: the radius is the largest |entry|.
        rho = A.abs().max().item()
    else:
        rho = torch.linalg.eigvals(A.float()).abs().max().item()
    return rho

print(f"\nρ(A) MLA: {spectral_radius(model_mla):.4f}   (must be < 1)")
print(f"ρ(A) GQA: {spectral_radius(model_gqa):.4f}   (must be < 1)")

# Smoke test: one forward pass and a short generation.
ids = torch.randint(0, cfg_mla.vocab_size, (2, 16), device=device)
with torch.no_grad():
    logits = model_mla(ids, n_loops=4)
    gen = model_mla.generate(ids, max_new_tokens=4, n_loops=8)

print(f"\nForward logits shape:  {tuple(logits.shape)}")
print(f"Generation shape:      {tuple(gen.shape)}")
```
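To make the difference between the two variants more concrete, the following sketch plugs the configuration values from the factory above into the usual GQA and DeepSeek-V2-style MLA bookkeeping. It is an illustration under assumptions, not OpenMythos's internals: it assumes the GQA head dimension is `dim // n_heads` and that MLA caches one compressed KV latent plus one shared RoPE key per token. The three helper functions are hypothetical names introduced here.

```python
def gqa_kv_head_for_query(q_head, n_heads, n_kv_heads):
    """Map a query head index to the KV head it shares under GQA."""
    group_size = n_heads // n_kv_heads
    return q_head // group_size

def gqa_cache_per_token(dim, n_heads, n_kv_heads):
    """Values cached per token per layer: one K and one V vector per KV head.

    Assumes head_dim = dim // n_heads.
    """
    head_dim = dim // n_heads
    return 2 * n_kv_heads * head_dim

def mla_cache_per_token(kv_lora_rank, qk_rope_head_dim):
    """DeepSeek-V2-style MLA: cache the compressed KV latent plus a shared RoPE key.

    Assumes the cache layout described in the lead-in.
    """
    return kv_lora_rank + qk_rope_head_dim

n_heads, n_kv_heads, dim = 4, 2, 128

# Query heads 0-1 share KV head 0, query heads 2-3 share KV head 1.
print([gqa_kv_head_for_query(q, n_heads, n_kv_heads) for q in range(n_heads)])
print(gqa_cache_per_token(dim, n_heads, n_kv_heads))            # 2 * 2 * 32 = 128
print(mla_cache_per_token(kv_lora_rank=32, qk_rope_head_dim=16))  # 32 + 16 = 48
```

Under these assumptions, GQA saves cache by sharing KV heads across query heads, while MLA saves it by storing a low-rank latent instead of full keys and values.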


Synthetic Compositional Reasoning Task

We create a synthetic compositional task where the model predicts the sum of a chain of digit tokens modulo 7. We define the token scheme, sequence structure, and dataset class to generate random digit-chain examples. We then build training and test loaders with chains of 2 to 5 digits, plus an out-of-distribution loader with chains of 6 to 8 digits, so we can evaluate both in-distribution performance and depth extrapolation.

```python
# Token scheme: special tokens, then digit d is encoded as DIGIT_BASE + d.
PAD, START, EQ = 0, 1, 2
DIGIT_BASE     = 10
M              = 7
SEQ_LEN        = cfg_mla.max_seq_len
MIN_LEN, MAX_LEN = 2, 5

def make_example(chain_len: int):
    digits = [random.randint(0, M-1) for _ in range(chain_len)]
    target = sum(digits) % M
    # Sequence layout: START d1 d2 ... dk EQ PAD PAD ...
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    return toks[:SEQ_LEN], DIGIT_BASE + target

class ChainDataset(Dataset):
    def __init__(self, n, lo, hi):
        self.items = [make_example(random.randint(lo, hi)) for _ in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        x, y = self.items[i]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

train_loader = DataLoader(ChainDataset(3000, MIN_LEN, MAX_LEN), batch_size=64, shuffle=True)
test_loader  = DataLoader(ChainDataset(400,  MIN_LEN, MAX_LEN), batch_size=64)
# Out-of-distribution: chains longer than any seen in training.
ood_loader   = DataLoader(ChainDataset(400,  MAX_LEN+1, MAX_LEN+3), batch_size=64)
```
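A worked example may make the token scheme above easier to follow. The sketch below is a deterministic variant of `make_example` that encodes a given digit list instead of sampling one. The helper name `encode_chain` is hypothetical, and `SEQ_LEN` is hard-coded to 32 to match `max_seq_len` in the configuration so that the block runs on its own.

```python
PAD, START, EQ = 0, 1, 2
DIGIT_BASE = 10
M = 7
SEQ_LEN = 32  # matches max_seq_len in the model configuration

def encode_chain(digits):
    """Encode a fixed list of digits the same way make_example encodes sampled ones."""
    target = sum(digits) % M
    toks = [START] + [DIGIT_BASE + d for d in digits] + [EQ]
    toks = toks + [PAD] * (SEQ_LEN - len(toks))
    return toks[:SEQ_LEN], DIGIT_BASE + target

toks, label = encode_chain([3, 5, 6])
print(toks[:6])  # [1, 13, 15, 16, 2, 0]
print(label)     # 10, because (3 + 5 + 6) % 7 == 0 and digit 0 maps to token 10
```

The longest out-of-distribution chain has 8 digits, which takes 10 tokens including START and EQ, so every example fits within the 32-token sequence length.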


Key Takeaways

The MLA model has fewer parameters than the GQA model in this configuration.
The spectral radius of the recurrent injection matrix is checked for both models and should be less than 1 for stability.
We train an MLA model with a fixed number of recurrent loops, optimize it with AdamW, track the training loss over epochs, and observe how the loop count affects inference-time accuracy.
On the synthetic digit-chain task, the MLA model can perform well even at longer, out-of-distribution chain lengths.
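The loop-count comparison mentioned in the takeaways can be organized with a small harness like the one below. This is only a sketch: `accuracy_by_loops` and `toy_predict` are hypothetical names, and `toy_predict` is a hand-written stand-in that adds one digit per loop. It is not a trained model and its numbers are not results from the tutorial. In practice, `predict_fn` would wrap a call to the model with the chosen `n_loops`.

```python
def accuracy_by_loops(predict_fn, examples, loop_counts):
    """Return {n_loops: accuracy} for a predictor called as predict_fn(tokens, n_loops)."""
    results = {}
    for n in loop_counts:
        correct = sum(predict_fn(toks, n) == label for toks, label in examples)
        results[n] = correct / len(examples)
    return results

DIGIT_BASE, M, EQ = 10, 7, 2

def toy_predict(tokens, n_loops):
    """Toy stand-in (NOT a model): sums only the first n_loops digits of the chain."""
    digits = [t - DIGIT_BASE for t in tokens[1:tokens.index(EQ)]]
    return DIGIT_BASE + sum(digits[:n_loops]) % M

# Two hand-built examples: digits [3, 5] -> label 11, digits [1, 2, 4, 6] -> label 16.
examples = [([1, 13, 15, 2], 11), ([1, 11, 12, 14, 16, 2], 16)]
print(accuracy_by_loops(toy_predict, examples, loop_counts=(1, 2, 4, 8)))
```

Keeping the predictor behind a plain callable means the same harness can be reused for the in-distribution and out-of-distribution sets without changing the evaluation code.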

