Makers and artists relying on generative tools are facing a potential revolution in how they work. A new approach to large language models promises to slash the cost of running these systems while allowing them to process significantly more data at once. This shift could mean that tasks once reserved for expensive, cloud-based supercomputers—like analysing hundreds of documents or entire codebases—become accessible on standard hardware.
A bold claim meets independent scrutiny
Subquadratic, an AI startup based in Miami, recently emerged from stealth mode with a provocative assertion: it has solved a mathematical bottleneck that has constrained large language models for nearly ten years. Initially, the company offered few details, leading to widespread doubt. However, it has since begun releasing data from independent evaluations, and the results suggest the claims warrant serious attention.
The new architecture, dubbed SubQ, is described as faster, cheaper, and far more energy-efficient than current market leaders. Crucially, it claims to handle up to 12 times as much text simultaneously. This capability enables the processing of massive datasets, such as hundreds of documents or full code repositories, without the usual performance degradation.
Despite this efficiency, Subquadratic insists that SubQ matches the performance of top-tier models from Google DeepMind, OpenAI, and Anthropic on critical tasks like coding. However, the company has not yet made the model widely available for public testing, which naturally fuels skepticism.
The reaction from the technical community was immediate and harsh. Dan McAteer, an AI engineer, summarised the sentiment on X: “SubQ is either the biggest breakthrough since the Transformer … or it’s AI Theranos.”
A month later, the company has published further documentation and released results from tests conducted by the third-party firm Appen. “We expected healthy skepticism,” says Alex Whedon, the startup’s co-founder and chief technology officer. “In hindsight, releasing the third-party benchmarks alongside the initial announcement would have preempted much of the skepticism, which is why we’re taking the time to make sure any future results are fully verified before putting them out.”
Appen’s director of generative AI research, Jeanine Sinanan-Singh, noted that seeing the results validated by an external party was “really exciting.” She added that while shocking claims are often met with disbelief, the independent data confirmed the architecture’s potential to be a game changer for speed and efficiency.
The math behind the bottleneck
To understand the significance of Subquadratic’s work, one must look at how standard LLMs function. Most models rely on a neural network architecture called a transformer, which uses a mechanism known as dense attention. As outlined in the foundational 2017 paper titled “Attention Is All You Need,” this process requires the model to calculate relationships between every single token in a text.
The computational cost grows quadratically with text length. For a document of 10,000 words, the system must perform nearly 50 million individual multiplications. This is why LLMs are notoriously power-hungry. As Justin Dangel, the firm’s CEO, explains, summarising a book like The Great Gatsby requires the model to consider the first word against the last, and every other possible combination in between.
Double the number of words, and the computational load roughly quadruples. Subquadratic argues that this quadratic expansion is the primary barrier to scaling models for complex, data-heavy tasks.
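The scaling described above can be checked with a few lines of arithmetic. This is a minimal sketch that treats each word as one token (a simplification, since real tokenizers split text differently) and counts only token-to-token comparisons, not the full cost of each comparison. The function names are illustrative and do not come from Subquadratic.

```python
def pair_count(n_tokens: int) -> int:
    """Unique token pairs (i, j) with i < j: the 'every word against every other' count."""
    return n_tokens * (n_tokens - 1) // 2


def full_score_count(n_tokens: int) -> int:
    """Entries in the full n x n attention score matrix used by dense attention."""
    return n_tokens * n_tokens


for n in (10_000, 20_000):
    print(f"{n:>6} tokens: {pair_count(n):>12,} unique pairs, "
          f"{full_score_count(n):>12,} score-matrix entries")
```

For 10,000 tokens the unique-pair count is 49,995,000, which matches the "nearly 50 million" figure, and doubling the length to 20,000 tokens multiplies the score-matrix size by exactly four.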
Sparse attention as the solution
The company’s solution involves abandoning dense attention in favour of sparse attention. Instead of multiplying every token against every other token, the system selects only a subset of relationships to calculate. The premise is that not all word relationships within a text are semantically significant.
“Sparse attention says not all of those relationships are important, because they’re not,” says Whedon. “If you’re reading a book, you’re not going to look at the first and second words, first and third—that’s insane.”
While sparse attention is not a new concept, previous attempts have struggled to capture meaning as effectively as dense attention. Subquadratic claims to have cracked the code by using a dynamic selection mechanism that chooses which tokens matter on the fly, rather than relying on fixed patterns. “Language is too sophisticated for that,” Whedon notes, highlighting that their approach adapts to the specific input.
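To make the contrast between dense and input-dependent sparse attention concrete, here is a small NumPy sketch. It is not SubQ's method, which the company has not described in detail. It uses a generic top-k selection as a stand-in, and the function names and the `top_k` parameter are assumptions chosen for illustration. Note that this toy version still computes every score before choosing which to keep, so it shows the selection idea but not the speedup. A practical sparse system has to pick the important tokens without doing the full quadratic work first.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; exp(-inf) evaluates to 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def dense_attention(q, k, v):
    """Standard scaled dot-product attention: every query scores every key."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def topk_sparse_attention(q, k, v, top_k):
    """Illustrative sparse variant: each query attends only to its top_k keys.

    The kept keys depend on the input (dynamic selection), not on a fixed pattern.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Column indices of the top_k largest scores in each row (unordered).
    keep = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    masked = np.full_like(scores, -np.inf)
    kept_scores = np.take_along_axis(scores, keep, axis=-1)
    np.put_along_axis(masked, keep, kept_scores, axis=-1)
    return softmax(masked) @ v


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
print(dense_attention(q, k, v).shape)             # one output vector per query
print(topk_sparse_attention(q, k, v, 2).shape)    # same shape, only 2 keys used per query
```

When `top_k` equals the sequence length, the sparse function reduces to dense attention, which is a convenient sanity check for the sketch.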
Performance and cost metrics
Appen’s evaluation supports several of the company’s assertions. In a theoretical speed test against FlashAttention, a widely used optimized implementation of exact (dense) attention, SubQ was found to be 56 times faster.
On LiveCodeBench, a rigorous test of coding capabilities, SubQ scored 89.7%, placing it in the same tier as other top models. Sinanan-Singh confirmed that the model continues to provide frontier-level performance in coding tasks.
The cost difference is stark, though harder to verify independently due to limited access. Dangel states that running Anthropic’s Opus 4.6 model through a specific retrieval test cost $2,600. In contrast, running the same task with SubQ cost just $8.
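The size of that gap is easier to judge as a ratio. The snippet below uses only the two dollar figures quoted by Dangel. The variable names are illustrative, and the numbers remain the company's own unverified claim.

```python
opus_cost_usd = 2600.0   # cost quoted for the retrieval test on Opus 4.6
subq_cost_usd = 8.0      # cost quoted for the same test on SubQ

cost_ratio = opus_cost_usd / subq_cost_usd
savings_pct = 100 * (1 - subq_cost_usd / opus_cost_usd)

print(f"SubQ is {cost_ratio:.0f}x cheaper on this test, a {savings_pct:.1f}% reduction")
```

On these figures the claimed saving is a factor of 325, or roughly 99.7%.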
Context window capacity is another major advantage. While most leading models today handle contexts of around one million tokens, SubQ supports windows up to 12 million tokens long. In a demonstration, SubQ successfully reasoned over information from 400 documents in seconds, whereas Perplexity failed to load the full dataset.
Appen also conducted a needle-in-a-haystack test to measure retrieval accuracy. SubQ scored 98% with context windows of six and 12 million tokens, sustaining near-perfect long-context retrieval at scales rarely tested.
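For readers unfamiliar with the format, a needle-in-a-haystack test hides one short fact inside a long stretch of irrelevant text and then asks the model to retrieve it. The sketch below shows the general shape of such a harness. Appen's actual protocol is not described in the source, so the filler text, the needle wording, the `depth` parameter, and the `stub_model` function (a plain string search standing in for a real LLM call) are all assumptions made for illustration.

```python
import random


def build_haystack(n_filler, needle, depth, seed=0):
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    rng = random.Random(seed)
    filler = [f"Filler sentence number {rng.randint(0, 10**6)}." for _ in range(n_filler)]
    pos = int(depth * n_filler)
    return " ".join(filler[:pos] + [needle] + filler[pos:])


def stub_model(context, question):
    """Hypothetical stand-in for an LLM: finds the code by string search."""
    marker = "The secret code is "
    start = context.find(marker)
    if start == -1:
        return ""
    start += len(marker)
    return context[start:context.find(".", start)]


def retrieval_accuracy(model, depths, n_filler=1000):
    """Fraction of trials in which the model returns the hidden code exactly."""
    hits = 0
    for i, depth in enumerate(depths):
        code = f"code-{i:04d}"
        context = build_haystack(n_filler, f"The secret code is {code}.", depth, seed=i)
        if model(context, "What is the secret code?").strip() == code:
            hits += 1
    return hits / len(depths)


print(retrieval_accuracy(stub_model, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

In a real evaluation, `stub_model` would be replaced by a call to the model under test, and the haystack would be scaled up to millions of tokens, with accuracy reported across many needle positions.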
Remaining questions and limitations
Despite impressive benchmarks, critics argue that testing under specific conditions does not fully represent real-world versatility. Subquadratic is positioning SubQ specifically for coding and large-scale data retrieval. While tens of thousands of users have signed up for early access, including over 500 enterprise customers, the waitlist remains long due to the company’s limited resources.
One significant technical caveat is that SubQ reused the weights from the Chinese open-source model Qwen to bootstrap its system, rather than training from scratch. Although this is common practice, it contradicts the claim of having fully reinvented LLM architecture. Will Depue, an independent AI researcher, noted that while the technology may be real, the evidence does not yet justify the stronger claim of solving the quadratic attention bottleneck.
Whedon defends the approach, stating that to build a competitive model against giants like OpenAI, new ideas are the only option. Until more developers can test the model in diverse scenarios, a degree of healthy skepticism is justified.
Key takeaways
- Subquadratic claims to have solved the quadratic attention bottleneck, enabling LLMs to process 12x more text with significantly lower costs and energy usage.
- Independent benchmarks by Appen validate high performance in coding (89.7% on LiveCodeBench) and near-perfect retrieval accuracy (98%) over massive 12 million token contexts.
- Despite the promising benchmark results, the model is not yet widely available, and its reliance on pre-trained weights from Qwen invites scrutiny regarding the extent of its architectural innovation.