I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]


I recently built a variant of Mamba1 that I call SM1. It uses `d_state=1` and runs on Blackwell in pure PyTorch. The key change is that it replaces Mamba1's selective scan with two native PyTorch operations, `torch.cumprod` and `torch.cumsum`, applied to a scalar state:

L = torch.cumprod(dA, dim=1)
h = L * (h0.unsqueeze(1) + torch.cumsum(dBx / L.clamp(min=1e-6), dim=1))
y = h * C

This is the exact closed-form solution to the `d_state=1` recurrence, as long as the `clamp` floor of `1e-6` is not hit. Mamba1 with `d_state=16` uses a more complex selective scan; with `d_state=1` each channel's state is a single scalar, so the intermediate state dimension (`S`) disappears entirely. The inference state is correspondingly small: about 14,080 floats, or 56 KB per model. There is no key-value cache, and the cost per generated token is constant.
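To make the closed-form claim concrete, here is a minimal sketch that compares it against the step-by-step recurrence `h_t = dA_t * h_{t-1} + dBx_t`, `y_t = h_t * C_t` on a single channel. It is deliberately written in plain Python so it runs without a GPU or PyTorch: `itertools.accumulate` stands in for `torch.cumprod` and `torch.cumsum`, and `max(l, eps)` stands in for `L.clamp(min=1e-6)`. The function names, sequence length, and random inputs are assumptions made for illustration, not part of SM1.

```python
from itertools import accumulate
import operator
import random

def scan_closed_form(dA, dBx, C, h0, eps=1e-6):
    """Closed-form d_state=1 scan for one channel (lists of floats)."""
    # L[t] = dA[0] * ... * dA[t]  (stand-in for torch.cumprod)
    L = list(accumulate(dA, operator.mul))
    # Running sum of dBx[t] / L[t]  (stand-in for torch.cumsum)
    S = list(accumulate(b / max(l, eps) for b, l in zip(dBx, L)))
    h = [l * (h0 + s) for l, s in zip(L, S)]
    return [ht * ct for ht, ct in zip(h, C)]

def scan_recurrence(dA, dBx, C, h0):
    """Reference: apply h_t = dA_t * h_{t-1} + dBx_t one step at a time."""
    h, ys = h0, []
    for a, b, c in zip(dA, dBx, C):
        h = a * h + b
        ys.append(h * c)
    return ys

random.seed(0)
T = 32  # hypothetical sequence length
dA = [random.uniform(0.8, 1.0) for _ in range(T)]  # decay factors in (0, 1]
dBx = [random.gauss(0.0, 1.0) for _ in range(T)]
C = [random.gauss(0.0, 1.0) for _ in range(T)]
h0 = 0.5

y_fast = scan_closed_form(dA, dBx, C, h0)
y_ref = scan_recurrence(dA, dBx, C, h0)
max_err = max(abs(a - b) for a, b in zip(y_fast, y_ref))
print(f"max abs difference: {max_err:.3e}")
```

With these inputs the cumulative product stays well above `1e-6`, so the two paths agree to floating-point rounding. On long sequences with strong decay the cumulative product can fall below the clamp floor, and at that point the closed form is expected to drift from the recurrence.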

The primary advantage of SM1 is efficiency. It cuts scan memory by roughly 16x compared to Mamba1 with `d_state=16`, which makes it possible to train a 130-million-parameter model on a consumer GPU such as an RTX 5060 Ti with 16 GB of memory.
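The memory figures can be checked with simple arithmetic. The sketch below assumes float32 (4 bytes per value) and assumes the scan intermediate scales as `batch * seq_len * d_inner * d_state`, which is the shape of the hidden states in a naive, unfused selective scan. The batch size, sequence length, and channel width are hypothetical; only the 14,080-float figure and the two `d_state` values come from the post.

```python
BYTES_PER_FLOAT32 = 4

def scan_intermediate_elements(batch, seq_len, d_inner, d_state):
    """Element count of a (batch, seq_len, d_inner, d_state) scan intermediate."""
    return batch * seq_len * d_inner * d_state

# Inference state size quoted in the post, assuming float32 storage.
inference_state_floats = 14_080
inference_state_bytes = inference_state_floats * BYTES_PER_FLOAT32
print(inference_state_bytes)  # 56320 bytes, i.e. about 56 KB

# Hypothetical training shapes, chosen only to illustrate the ratio.
batch, seq_len, d_inner = 8, 2048, 1536
mamba1_elems = scan_intermediate_elements(batch, seq_len, d_inner, d_state=16)
sm1_elems = scan_intermediate_elements(batch, seq_len, d_inner, d_state=1)
ratio = mamba1_elems / sm1_elems
print(ratio)  # 16.0
```

The ratio does not depend on the chosen shapes: every other dimension cancels, so the saving on this intermediate is exactly the factor by which `d_state` shrinks.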

The broader point is that the configuration should fit the task. Here, dropping the extra state dimensions (from `d_state=16` to `d_state=1`) replaces the selective scan with two standard PyTorch operations and substantially reduces memory use.

