Open-source AI model beats GPT and Claude on Bridgewater finance tests
Bridgewater Associates and Thinking Machines Lab claim a fine-tuned open-weight model achieves 84.7 percent accuracy on financial document analysis, compared to 78.2 percent for the best commercial models tested.
The Qwen3-235B model, adjusted using internal expert knowledge, costs nearly 14 times less to operate than the alternatives. These figures come from the firms’ own internal evaluation.
Investors are overwhelmed by daily news, corporate filings, and emails. Bridgewater’s AIA Labs and Thinking Machines Lab say reading these documents is not the core task. The real work involves a constant stream of small, repeated judgment calls about what actually matters. Researchers wanted to automate this triage.
They defined six tasks based on an investor’s daily routine. One example is deciding if a financial article is relevant to an executive. Another is determining if a central bank document signals future interest rate changes. Investors make these calls almost without effort, yet they struggle to articulate the reasoning behind them.
The report offers a specific example. A headline about Donald Trump’s claim to Greenland is flagged as irrelevant, while a threat of new tariffs on China is highly relevant. Both stories touch on geopolitics and finance.
Frontier models failed in these tests. Variants of Gemini, Claude, and GPT hit only about 50 percent accuracy with a basic prompt. Expert-written instructions and a three-tier rating system pushed accuracy into the mid-70s. This still fell short of the 80 percent threshold the authors set for trustworthy deployment.
Newer models deliver little extra accuracy per dollar. GPT 5.4 costs 43 percent more than GPT 5.2 but is only marginally more accurate.
The real value lives inside investors’ heads
The solution was fine-tuning: continuing to train an open-weight model on proprietary examples. The key ingredient was the judgment of Bridgewater investors. Initially, cheap outside contractors labelled the documents, but many of those labels were wrong.
To avoid having expensive professionals review everything, the researchers used a workaround. A first model learned from the flawed labels and re-evaluated the same documents. Wherever the model and the original label disagreed, there was likely an error. Only those disputed cases went to investors for correction.
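The workaround can be illustrated with a minimal sketch. The report does not publish its implementation, so everything here is an assumption made for illustration: a toy nearest-centroid classifier stands in for the first fine-tuned model, each document is reduced to a single hypothetical numeric feature, and the function names are invented.

```python
from statistics import mean


def fit_centroids(features, labels):
    """Fit a toy 'first model' on the possibly wrong labels.

    Stand-in for the fine-tuned model: one mean feature value per label.
    """
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(centroids, x):
    """Return the label whose centroid lies closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def disputed_indices(features, noisy_labels):
    """Re-evaluate the same documents and flag model/label disagreements."""
    centroids = fit_centroids(features, noisy_labels)
    return [
        i
        for i, (x, y) in enumerate(zip(features, noisy_labels))
        if predict(centroids, x) != y
    ]


# Hypothetical data: one numeric feature per document, contractor labels
# with two deliberate mistakes at positions 6 and 7.
features = [0.1, 0.2, 0.15, 0.9, 0.85, 0.8, 0.12, 0.88]
noisy_labels = [
    "irrelevant", "irrelevant", "irrelevant",
    "relevant", "relevant", "relevant",
    "relevant", "irrelevant",
]

# Only these documents would be sent to an expert for correction.
review_queue = disputed_indices(features, noisy_labels)
print(review_queue)
```

In this toy run only two of the eight documents reach the review queue. A disagreement is a hint rather than proof of a mislabel, which is why the disputed cases still go to a human.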
The fine-tuned model is based on the open-weight Qwen3-235B and was trained on the Tinker platform from Thinking Machines Lab. In the team’s own evaluation, it hit 84.7 percent accuracy versus 78.2 percent for the best frontier model tested. It also cost nearly 14 times less to run.
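To put the reported accuracy figures in perspective, the short calculation below restates them as error rates. It uses only the two accuracy numbers quoted above; the variable names are illustrative, and the result inherits the caveat that the figures come from the firms’ own evaluation.

```python
# Reported accuracies, in percent, from the firms' own evaluation.
tuned_acc = 84.7      # fine-tuned Qwen3-235B
frontier_acc = 78.2   # best frontier model tested

# Absolute gap in percentage points.
gap_points = tuned_acc - frontier_acc

# The same result expressed as error rates.
tuned_err = 100 - tuned_acc
frontier_err = 100 - frontier_acc
relative_error_reduction = (frontier_err - tuned_err) / frontier_err

print(f"Gap: {gap_points:.1f} percentage points")
print(f"Error rate: {frontier_err:.1f}% -> {tuned_err:.1f}%")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```

Seen this way, the 6.5-point gap means the fine-tuned model makes roughly 30 percent fewer mistakes than the best frontier model in this evaluation.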
This is not a truly independent comparison, of course. Both companies have a clear interest in selling their product.
Still, the finding beyond the numbers is worth noting. It shows once again that big labs like OpenAI have not absorbed all the data out there. Huge pools of proprietary corporate data, and of human expertise that no model has been trained on, still exist, and they hold real room for improvement. That is especially true where companies deliberately keep their most valuable data private.
Anyone who hands that data to a frontier lab risks competing against a product built on top of it. Fine-tuning open models through tools like Tinker gives companies an alternative. They keep the weights, the data, and, depending on the setup, the GPUs themselves.




