FlashLM v9.7

Back with an update. Some of you saw v10 FSP, the one where I found that Future Sentence Prediction gave a 2.5x PPL improvement. Well, I’ve run 20+ more experiments since then trying to get the model to actually understand what it’s saying. Spoiler: lower PPL does not mean more coherent.

Quick clarification on naming: CPUFlow is my cumsum based CPU native architecture (v1 through v9.7). FlashLM is the broader project including attention experiments (v10 FSP), ternary models (v5 Thunderbolt), etc. All trained from scratch on free tier CPUs.

The finding that surprised me:

My best perplexity model ever (CPUFlow v8, val PPL 9.30) produces complete gibberish. My baseline (CPUFlow v5-LN, val PPL 11.94) generates partially coherent children’s stories with named characters and narrative structure. CPUFlow v9.7 (val PPL 10.23) is the best of both worlds: partially coherent generation with better PPL than the v5-LN baseline. But to be clear: no FlashLM model achieves true coherence. They all lose it ~100 tokens in.
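For anyone less familiar with the metric, here is a minimal sketch of how perplexity is usually computed, assuming the standard definition (exp of the mean per-token negative log-likelihood in nats). The function name and the toy inputs are made up for illustration and are not taken from the FlashLM repo.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy case: a model that gives every observed token probability 1/10.
uniform = [math.log(1 / 10)] * 50
print(round(perplexity(uniform), 2))  # 10.0

# Under this definition, a val PPL maps back to a mean loss via log():
print(round(math.log(9.30), 2), round(math.log(11.94), 2))
```

The metric is an average over individual next-token predictions, so nothing in it rewards keeping a story consistent across sentences. That fits with a model winning on PPL while still producing gibberish.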

Results (all on TinyStories, 2h, 4 free CPU cores):

Version	Series	Architecture	Params	Val PPL	Coherent?
v5	FlashLM	Ternary recurrence	29.7M	1.36	No
v7.4	FlashLM	Gated DeltaNet + SWA	6.6M	2.33	No
v10 FSP	FlashLM	Attention + FSP	3.74M	10.24	Partial
v8	CPUFlow	FSP + hard slot routing (M=32)	2.0M	9.30	No
v9.7	CPUFlow	cumsum + RAM Net (no routing loss)	2.47M	10.23	Partial
v5-LN	CPUFlow	Fused cumsum + LayerNorm + FSP	2.0M	11.94	Partial
v9	CPUFlow	cumsum + RAM Net + contrastive routing	2.48M	9.73	No

What happened between v10 FSP and now:

After the FSP breakthrough I went down a rabbit hole trying to add entity tracking, making the model remember "who’s who" in a story. I tried six different mechanisms:

Softmax memory bank (v7): gates collapsed on cold start and stayed at 0.12. A warm start fixed the gates, but softmax still blended everything together.
Hard argmax routing (v8): each token routes to exactly one slot. Best PPL ever (9.30) but totally incoherent. The discrete routing broke the continuous context.
Supervised slot routing (v8.5): gave the model ground truth entity labels as supervision. Mode collapse: everything routes to slot 24.
Product Softmax addressing (v9): 3 sub-softmaxes of size 8 give 8^3 = 512 virtual slots, with Top 8 sparse selection. Nice math, but the addresses still collapsed.
Contrastive entity routing (v9.5): explicit push-apart loss on entity addresses. The pull from CE overwhelmed the push from the contrastive loss.
Two-phase contrastive training (v9.6): freeze the backbone, train memory first, then train everything. Same collapse.
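To make the Product Softmax idea concrete, here is a rough sketch in plain Python. It assumes three independent softmaxes over 8 logits each, a virtual slot weight equal to the product of its three factor probabilities, and renormalization over the kept Top 8. The function names, the slot indexing scheme, and the renormalization step are illustrative assumptions, not the actual v9 code.

```python
import math
from itertools import product

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def product_softmax_topk(logits, n_factors=3, factor_size=8, top_k=8):
    """logits: flat list of n_factors * factor_size address scores.
    Returns top_k (slot_index, weight) pairs, weights renormalized to sum to 1."""
    assert len(logits) == n_factors * factor_size
    factors = [softmax(logits[i * factor_size:(i + 1) * factor_size])
               for i in range(n_factors)]
    scored = []
    # Enumerate all factor_size ** n_factors virtual slots (8^3 = 512 here).
    for combo in product(range(factor_size), repeat=n_factors):
        weight, slot = 1.0, 0
        for f, idx in enumerate(combo):
            weight *= factors[f][idx]         # joint weight = product of factors
            slot = slot * factor_size + idx   # base-8 digits -> flat slot index
        scored.append((slot, weight))
    scored.sort(key=lambda sw: sw[1], reverse=True)
    top = scored[:top_k]
    norm = sum(w for _, w in top)
    return [(s, w / norm) for s, w in top]

# Toy address: factor 0 prefers index 2, factor 1 index 5, factor 2 index 1.
logits = [0.0] * 24
logits[2] = logits[8 + 5] = logits[16 + 1] = 3.0
top = product_softmax_topk(logits)
print(top[0][0])  # 169 = 2*64 + 5*8 + 1
```

The appeal is that 24 logits address 512 slots. The catch is that if each sub-softmax settles on the same index for every token, every token lands on the same few slots, and that kind of collapse is what the list above keeps running into.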

Turns out there’s a reason. Feng & Steinhardt (2024) showed you need ~160M parameters before entity specific addressing even becomes possible. At 2.5M params, the binding threshold is a brick wall. Six different mechanisms, same fundamental limit.

What actually worked, v9.7:

I gave up on entity tracking and just added RAM Net sparse memory as a dumb capacity expansion. The architecture is CPUFlow v5-LN’s cumsum backbone + a memory sidepath (512 slots, Product Softmax addressing, Top 8 sparse read/write). Direct addition, no gate, no routing loss. Just extra parameters.
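Here is a toy sketch of what "direct addition, no gate" means for the memory sidepath, read side only. It assumes the Top 8 selection has already produced (slot, weight) pairs, and it skips both the write path and the projection that maps the memory read back to the scan width. All names are hypothetical.

```python
def sparse_read(memory, top_slots):
    """memory: list of slot vectors. top_slots: (slot_index, weight) pairs.
    Only the selected slots are touched; the rest of the table is ignored."""
    dim = len(memory[0])
    out = [0.0] * dim
    for slot, weight in top_slots:
        for j in range(dim):
            out[j] += weight * memory[slot][j]
    return out

def merge_direct(scan_out, mem_out):
    """Direct addition: no gate and no learned mixing coefficient."""
    return [s + m for s, m in zip(scan_out, mem_out)]

# 512 slots of width 4; slot i holds the value i in every channel.
memory = [[float(i)] * 4 for i in range(512)]
top_slots = [(3, 0.75), (10, 0.25)]   # two slots instead of eight, for brevity
mem_out = sparse_read(memory, top_slots)
merged = merge_direct([1.0] * 4, mem_out)
print(mem_out[0], merged[0])  # 4.75 5.75
```

With no gate in the path, nothing can shrink the memory contribution toward zero the way the v7 gates did when they stuck at 0.12. The memory output always reaches the residual stream at full scale.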

Architecture:

embed + CumStepPos → [RAMScanBlock × 6] → LayerNorm → tied output + FSP

RAMScanBlock:
    x_n = LayerNorm(x)
    h = W_proj(x_n)    # fused: d → 3k
    query, key, value = chunk(h, 3)
    key = sigmoid(key); value = tanh(value)
    scan_out = W_m(query * cumsum(key*value) / cumsum(key))
    addr = W_addr(x_n) → Product Softmax → Top 8 of 512 virtual slots
    mem_out = sparse_read_write(addr, x_n)
    merged = scan_out + W_mem_proj(mem_out)
    x = x + W_out(merged)
    x = x + ff_down(relu(ff_up(LayerNorm(x))))
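The cumsum line is the core of the backbone, so here is a small pure-Python sketch of just that step, before the W_m projection. It assumes the scan runs causally along the sequence and independently per channel. The `eps` term in the denominator is an added safeguard and is not part of the formula above. The function names are illustrative.

```python
import math

def sigmoid(v):
    # Numerically stable form for large negative inputs.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def cumsum_scan(query, key_logits, value_logits, eps=1e-6):
    """Inputs: lists of length T, each element a list of k floats.
    out[t][c] = query[t][c] * sum_{s<=t}(key*value) / sum_{s<=t}(key)."""
    k = len(query[0])
    num = [0.0] * k   # running sum of key * value
    den = [0.0] * k   # running sum of key
    out = []
    for q_t, kl_t, vl_t in zip(query, key_logits, value_logits):
        row = []
        for c in range(k):
            g = sigmoid(kl_t[c])      # key acts as a positive weight in (0, 1)
            v = math.tanh(vl_t[c])    # value is bounded to (-1, 1)
            num[c] += g * v
            den[c] += g
            row.append(q_t[c] * num[c] / (den[c] + eps))
        out.append(row)
    return out

# With query = 1 and equal keys, the scan is a running mean of tanh(value).
query = [[1.0], [1.0], [1.0]]
keys = [[0.0], [0.0], [0.0]]
values = [[0.0], [1.0], [-2.0]]
out = cumsum_scan(query, keys, values)
print([round(r[0], 3) for r in out])
```

Each position needs only two running sums per channel, so the cost is linear in sequence length with no attention matrix, which is presumably why it suits CPU training. The ratio is a key-weighted average of bounded values, so the state cannot blow up over long sequences.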

Sample outputs (CPUFlow v9.7, val PPL 10.23):

Prompt: "Lily and Tim went to the park. They"

…They saw many kids playing near the back house. They went up to a tree and gave them to their dad. They were very happy. After a while, they saw a big pile of ants. It was not a normal day. They did not want to play hide behind. Tim and his friends were scared, but they did not want to go home. They got in and played in the big bush. They learned that they should always listen too.

Prompt: "There was a little girl named Lily. She loved to play with her friends. One day"

…she put her shoes in the park. In the park, Lily saw a big lock on the ground. She wanted to open it. She tried to open the key, but it was too small. She tried to unlock the door open, but she could not. Lily tried to open the door, but it was too tight. She pulled and walked up, up the church, and eventually, the lock was locked. She was very confused. Her mom came to help.

Named characters, pronoun tracking works ("Lily… She"), some story structure. But "she tried to open the key" is nonsense, and the story drifts from park to church with no transition. Coherence breaks down after ~100 tokens. Being honest, this is partially coherent, not truly coherent. No FlashLM model achieves narrative coherence.

Model Weights:

All CPUFlow weights are on HuggingFace (MIT license):

CPUFlow v9.7 (best partially coherent): https://huggingface.co/changcheng967/cpuflow-v97-memory
CPUFlow v8 (best PPL, incoherent): https://huggingface.co/changcheng967/cpuflow-v8-discrete
CPUFlow v5-LN (partially coherent baseline): https://huggingface.co/changcheng967/cpuflow-v5-ln
CPUFlow v9 (RAM Net, incoherent): https://huggingface.co/changcheng967/cpuflow-v9-ram
FlashLM v10 FSP (attention + FSP): https://huggingface.co/changcheng967/flashlm-v10-fsp

Links:

GitHub: https://github.com/changcheng967/FlashLM
Website: https://changcheng967.github.io/FlashLM/

Happy to answer questions about the architecture, the entity tracking failures, or CPU training in general.

submitted by /u/Own-Albatross868
