I run an AI-based fact-checking platform and I refuse to let the LLM produce the verdict. Here's why.

After a year building a production fact-checking system, I keep defending one design decision: the LLM in our pipeline never produces a numeric score or a true/false verdict. Instead, it extracts structured factual flags from source material, and a deterministic Python scoring layer turns those into a verdict tier.

This is uncomfortable to explain because everyone assumes that “AI-powered fact-checking” means the AI gives the verdict. It would be cleaner if I let the LLM say something like ‘this claim is 73% likely false’ and called it a day. Here’s why I won’t.

LLM scoring instability is real and underdocumented. Run the same prompt with the same model on the same claim five times, and you might get verdicts ranging from ‘mostly false’ to ‘partially true’, depending on sampling temperature and the order of sources in the context window. This is fine for creative writing. It’s catastrophic when a journalist needs to defend their decision to publish or kill a story. Saying “Our scoring varies by 30% based on stochastic sampling” isn’t something you can put in front of an editorial board.
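One way to make this instability visible is to call the verdict function several times on the same claim and tally the labels. The sketch below is only an illustration: `verdict_spread`, `is_stable`, and `flaky_verdict` are hypothetical names, and `flaky_verdict` is a scripted stand-in for a sampled LLM call, not a real model.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def verdict_spread(get_verdict: Callable[[str], str], claim: str, runs: int = 5) -> Counter:
    """Call a verdict function `runs` times on one claim and tally the labels."""
    return Counter(get_verdict(claim) for _ in range(runs))


def is_stable(spread: Counter) -> bool:
    """A verdict is stable only if every run returned the same label."""
    return len(spread) == 1


# Hypothetical stand-in for a sampled LLM call: it cycles through scripted
# labels so the example is reproducible. A real call would go here instead.
_scripted = cycle(["mostly false", "partially true", "mostly false"])


def flaky_verdict(claim: str) -> str:
    return next(_scripted)


spread = verdict_spread(flaky_verdict, "Example claim", runs=5)
print(spread)             # tally of labels across the five runs
print(is_stable(spread))  # False: more than one distinct label came back
```

Anything other than a single label in the tally is the kind of spread an editorial board would ask about.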

LLM verdicts are also unauditable. When the LLM says ‘false’, there is no way to point at which sources mattered, what signals pushed the score, or how weights were applied. The reasoning chain is opaque even with chain-of-thought prompting because the chain itself is generated probabilistically and may rationalize after the fact rather than reflect the actual computation. Journalists I’ve spoken with don’t want a confident AI verdict; they want a verifiable one. These are different things.

The split I landed on is this: the LLM handles extraction, and Python handles scoring. Given a source document and a claim, the LLM can flag ‘this source confirms X’, ‘this source contradicts Y’, or ‘this source is silent on Z’ with reasonable consistency. These flags are structured (booleans or short categorical labels), not numeric scores. The Python scoring layer then applies pre-defined weights based on source credibility (independently computed from MBFC, NewsGuard, RSF, and Wikidata cross-referencing) to produce a verdict tier. The weights are documented. The scoring rules are deterministic. The same input always produces the same output. Anyone can audit which sources contributed how much to a given verdict.
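To make the split concrete, here is a minimal sketch of what a deterministic scoring layer could look like. It is not our production code: the flag labels, the credibility values, the 0.5 thresholds, and the tier names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical mapping from extracted flag to a signed direction.
FLAG_SIGN = {"confirms": 1.0, "contradicts": -1.0, "silent": 0.0}


@dataclass(frozen=True)
class SourceFlag:
    source: str         # name of the source document or outlet
    flag: str           # "confirms", "contradicts", or "silent" (from the LLM)
    credibility: float  # 0..1, computed upstream and independently of the LLM


def score_claim(flags: List[SourceFlag]) -> Tuple[str, List[Tuple[str, float]]]:
    """Return (verdict tier, per-source contributions). No randomness involved."""
    # An unknown flag raises KeyError, so out-of-schema output fails loudly.
    contributions = [(f.source, FLAG_SIGN[f.flag] * f.credibility) for f in flags]
    total_weight = sum(f.credibility for f in flags if f.flag != "silent")
    if total_weight == 0:
        return "insufficient evidence", contributions
    score = sum(value for _, value in contributions) / total_weight
    # Thresholds are illustrative assumptions, not documented values.
    if score >= 0.5:
        tier = "supported"
    elif score <= -0.5:
        tier = "contradicted"
    else:
        tier = "mixed"
    return tier, contributions


flags = [
    SourceFlag("Outlet A", "contradicts", 0.9),
    SourceFlag("Outlet B", "contradicts", 0.7),
    SourceFlag("Outlet C", "confirms", 0.2),
]
tier, contributions = score_claim(flags)
print(tier)           # contradicted
print(contributions)  # the audit trail: each source and its signed weight
```

The contributions list is the audit trail: it shows which source pushed the verdict in which direction and by how much, and rerunning on the same flags gives the same tier.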

The trade-off is real. The system is less flexible than letting the LLM ‘reason’ freely, especially for edge cases where the claim doesn’t fit the categorical extraction schema. The scoring weights themselves are a design choice with assumptions baked in, and changing them requires deliberate engineering rather than retraining. But these are honest constraints visible to the user, not hidden non-determinism dressed up as objectivity.
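One way to keep that constraint visible rather than hidden is to validate the LLM's output against the schema and set aside whatever doesn't fit, instead of coercing it into a label. A rough sketch, assuming a hypothetical three-label schema and a dict-per-source output format:

```python
from typing import Dict, List, Tuple

# Hypothetical schema: the only flag labels the scoring layer accepts.
ALLOWED_FLAGS = frozenset({"confirms", "contradicts", "silent"})


def partition_flags(raw_flags: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split raw extraction output into schema-valid items and rejected items."""
    valid, rejected = [], []
    for item in raw_flags:
        flag = item.get("flag")
        source = item.get("source")
        if isinstance(flag, str) and flag in ALLOWED_FLAGS and isinstance(source, str):
            valid.append(item)
        else:
            # Out-of-schema output is kept for human review, never guessed at.
            rejected.append(item)
    return valid, rejected


raw = [
    {"source": "Outlet A", "flag": "confirms"},
    {"source": "Outlet B", "flag": "sort of agrees"},  # not in the schema
    {"flag": "silent"},                                # missing source
]
valid, rejected = partition_flags(raw)
print(len(valid), len(rejected))  # 1 2
```

Rejected items can go to a human reviewer, so an edge case shows up as an explicit gap rather than a silently forced label.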

I think this matters beyond fact-checking. Any high-stakes domain where AI is being used to produce decisions (credit scoring, hiring filters, medical triage, legal triage) faces a fundamental choice: let the LLM produce the score and hope nobody notices the stochasticity, or constrain the LLM to extraction and put the decision logic somewhere auditable. The industry mostly does the first because it ships faster. I think the second approach is the only one defensible long-term, especially under the EU AI Act, which is going to start requiring decision explainability in production systems within the next 18 months.

Curious if anyone here is building similar deterministic-on-top-of-LLM architectures in other domains, or if there are counter-arguments I’m missing. The “let the LLM decide” school has obvious advantages I’m probably under-weighting.

Key Takeaways

The LLM should never produce a numeric score or true/false verdict for high-stakes decisions.
Structured flags scored by deterministic rules are more auditable and transparent than verdicts or scores generated directly by the LLM.
The trade-off between flexibility (letting the LLM reason freely) and auditability is real, with deterministic scoring on top of LLM extraction being preferable for high-stakes applications.
The EU AI Act will require production systems to be explainable. This shift may favor deterministic architectures over models that generate arbitrary scores.
