Executive summary
AI hallucination is intrinsic to how language models generate text, not a bug that the next model release will fix. A language model produces plausible continuations, but it does not know which value-creation, decarbonization, or compliance claims are safe to act on, and no amount of prompt engineering or model upgrade will give it that ability. Operating partners, CFOs, and ESG teams are now reading AI-generated summaries of sustainability data, and some of those summaries are wrong in a particular way: source-adjacent claims that look exactly like findings. The fix sits in the system orchestration around the model: extract evidence before narrative, validate claims against deterministic rules, use a second model to challenge unsupported conclusions, and route material claims to humans. AI hallucination is not a prompt problem; it is a systems orchestration problem.
The CFO of a mid-market portfolio company is preparing the year-end update for the bank syndicate. She runs the draft sustainability section through her AI tool to check it before it goes out. The summary that comes back reads: "The company met its sustainability-linked loan (SLL) KPI for 2026, with an 18 percent reduction in Scope 3 emissions versus the 2023 baseline, triggering the 10 basis point margin step-down provided for in the loan agreement."
The line is wrong. The 18 percent reduction in the underlying emissions report refers to Scope 1 and Scope 2 emissions, not Scope 3, and the SLL covenant in the loan agreement specifically requires Scope 3 reduction, which the company has not measured at all in 2026 because the Scope 3 inventory project is still in scoping. What the AI did was see "18 percent reduction" and "SLL margin step" near each other in two different documents and assemble a sentence that sounded right.
The AI did not lie. It found the pieces that read like they belonged together and connected them. Source-adjacent: close enough to real information that a CFO scanning the draft has no reason to pause, far enough from the truth that the conclusion is wrong, and material enough that an incorrect compliance certificate to the lender would expose the company to misrepresentation risk under the loan agreement.
This is the difference between summarization and analysis. A summarizer assembles plausible sentences from source material, while an analytical system checks whether each of those claims is actually supported by evidence in the underlying documents. Most tools sold as "AI analysis" are summarizers in a different wrapper.
Why AI hallucinates
In the previous article, we opened the box on how agent systems are structured: specialized agents, deterministic workflows, tools that ground reasoning in data. If you missed it, the short version is that an AI agent is a workflow of specialized models with deterministic glue between them, not a single model doing everything. Architecture gives you modularity, auditability, and focused context, but none of that changes what happens inside any one of the models when it generates text. The model is not retrieving facts. It is predicting tokens.
A language model does not look up whether the company hit its Scope 3 SLL KPI. Token prediction is pattern matching. "SLL margin step-down," "Scope 3 reduction," and "vs baseline" are common collocations in sustainability finance reporting, so the model predicts that they belong together. The fact that the underlying source data says Scope 1 and 2, not Scope 3, lives somewhere outside the model's optimization function and never enters the prediction.
There is no rule inside the model that distinguishes a measured reduction from a misattributed one, and there will not be one in the next version either. Hallucination is not a defect of any specific model. It is a property of how language models generate text: the same mechanism that produces useful continuations is the mechanism that produces wrong ones, and you cannot subtract one from the other inside the model. A bigger model trained on more data still predicts tokens, and a more elaborate prompt still constrains a token predictor from the outside. The engineering problem is therefore not to fix the model but to orchestrate a system around the model that catches its errors before they enter a decision.
A Nature paper published on 22 April 2026 frames the problem differently but points the same way: next-word prediction biases the model toward hallucination, and accuracy-based evaluations reward guessing over admitting uncertainty. Prompt engineering reduces fabrication at the margins, but it does not turn a probabilistic generator into an evidence system. You cannot ask a token predictor to stop producing plausible-but-wrong continuations unless the system around it changes what counts as an acceptable answer.
According to Google Cloud's grounding guidance, retrieval-augmented generation reduces fabrication by connecting the model to source-of-truth systems rather than by changing the model's nature. A 2025 systematic review in AI makes the same point from the research side, with the caveat that retrieval helps when the sources are precise and current and tied to claim-level verification, and it can fail when retrieval returns noisy or conflicting context. A Knowledge-Based Systems paper from April 2026 extends the argument for agents and notes that observability traces and verified planning histories can themselves become grounding layers if you build them in. Grounding helps. Evaluation tells you whether it actually helped in your specific system.
| Technique | What it does | Evidence base | Limitation |
|---|---|---|---|
| RAG / grounding | Feeds source documents into the model's context | Google Cloud and hallucination-mitigation reviews describe it as a grounding layer | Retrieval quality determines output quality |
| Claim-level validation | Checks whether each claim has evidence | Research reviews emphasize verification and abstention policies | Requires domain-specific rules |
| Human review | Routes high-risk outputs to people | LangChain reports human review as a common evaluation method among teams using evals | Does not scale to every line |
| Evaluator-optimizer | Uses a second model to challenge the first | Anthropic describes this as a production workflow pattern | Evaluator can share the same blind spots |
Every system will hallucinate. The architecture decides whether the hallucinations stay visible long enough to be caught, or whether they make it into a board pack, a compliance certificate, or a CSRD-aligned disclosure before anyone notices.
Evidence-first architecture: extract before generate
Most AI systems receive a question and produce an answer in one pass. The model reads the documents, reasons about the question, and writes a response all at once. Generation and retrieval happen in a single step, which means the model can fill gaps in the evidence with plausible language and never tell you that it has done so. The CFO's AI tool in the opener was running this kind of pipeline: documents in, answer out, no separation between what the source actually said and what the model inferred.
A two-pass architecture changes the order. In the first pass, the system extracts: it pulls specific claims, figures, and evidence from the source documents, with each item tagged to a page number, a verbatim quote, and a source type (emissions report, loan agreement, board minute). No interpretation, no synthesis. The second pass generates analysis based only on what the first pass produced, so if the extraction tagged "18% reduction" as Scope 1+2 from the emissions report and tagged the SLL covenant as Scope 3 KPI from the loan agreement, the generation step cannot wire them into a single covenant compliance claim, because the evidence set does not contain a matched pair.
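To make the separation concrete, here is a minimal sketch of the kind of structured record a first pass might produce. The names (`EvidenceItem`, `SourceType`) and fields are illustrative, not a production schema; the point is that the second pass receives only records like these, never the raw documents.

```python
from dataclasses import dataclass
from enum import Enum

class SourceType(Enum):
    EMISSIONS_REPORT = "emissions_report"
    LOAN_AGREEMENT = "loan_agreement"
    BOARD_MINUTE = "board_minute"

@dataclass(frozen=True)
class EvidenceItem:
    claim: str               # e.g. "18% reduction vs 2023 baseline"
    scope: str               # e.g. "scope_1_2" or "scope_3"
    verbatim_quote: str      # exact text lifted from the source
    page: int                # page number in the source document
    source: SourceType

def generate_analysis(evidence: list[EvidenceItem]) -> str:
    """Pass 2: write analysis from the structured evidence set only.
    With "18% reduction" tagged scope_1_2 and the SLL KPI tagged
    scope_3, no matched pair exists to wire into a compliance claim."""
    ...
```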
We built our pipeline at Axion Lab around this principle, because the alternative is silent failure: the model produces a confident answer that the surrounding system has no way to challenge. The extraction pass produces a structured evidence set that the generation pass consumes, and the gap between what was extracted and what was inferred becomes an explicit, auditable property of the system rather than a hidden one. This does not eliminate hallucination. Extraction can miss relevant information, especially when it lives in non-canonical formats, and generation can still misinterpret what it received. But the failure mode shifts from "the system confidently stated something false" to "the system's evidence set was incomplete," and the second problem can be diagnosed, fixed, and audited. It is also a problem that a senior practitioner can inspect by reading the output, because the gaps are visible.
This case study shows how it played out at a real portfolio company, where an extractive pipeline and a single-pass model reached two materially different conclusions from the same set of documents.
The trust stack: three checks that run on the evidence
Extraction sets the ground truth that everything else works from. Three layers of checking run on top of it, each catching a different class of failure, and none of them sufficient on its own.
The first layer is deterministic validation, which is code rather than AI. Rules check every output against fixed conditions before it moves forward. A useful rule from the SLL example would read: any covenant compliance claim must cite the specific KPI scope from the loan agreement and match it to the measured scope in the emissions report. If the two scopes do not match, the claim is rejected automatically and flagged for analyst review. That is a five-line rule, applied in milliseconds at near-zero marginal cost, and it would have caught the Scope 3 fabrication in the opener before the line ever reached the bank syndicate. Rules cannot reason about cases that fall outside their definitions, but they can apply unambiguous standards consistently, which is something language models cannot do reliably.
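The scope-match rule really is about five lines. A sketch in Python, with the scope labels assumed for illustration:

```python
def sll_scope_rule(covenant_scope: str, measured_scope: str) -> bool:
    """Layer 1 rule: a covenant compliance claim is valid only if the
    KPI scope named in the loan agreement matches the measured scope
    in the emissions report. False means reject automatically and
    flag for analyst review."""
    return covenant_scope == measured_scope

# The opener's fabricated claim fails deterministically:
assert sll_scope_rule("scope_3", "scope_1_2") is False
```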
The second layer is AI verification, what Anthropic calls the evaluator-optimizer pattern. A separate model reviews the first model's output with the explicit job of challenging it rather than rephrasing it. The evaluator receives the original sources alongside the generated analysis and looks for contradictions, unsupported claims, and logical gaps between the claim and the evidence. In the SLL example, the first model writes "18% Scope 3 reduction triggered margin step-down," and the evaluator reads the loan agreement and the emissions report, identifies that the 18% number is Scope 1+2 while the covenant is Scope 3, and flags the claim as a scope mismatch. This is real risk reduction rather than proof. Both models share architectural similarities and training data, which means they can produce correlated errors that survive the second check.
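A sketch of the wiring, assuming a generic `call_model` function that wraps whatever LLM API is in use; the prompt text and the `SUPPORTED` convention are illustrative:

```python
EVALUATOR_PROMPT = """You are checking an analysis against its sources.
Sources:
{sources}

Analysis under review:
{analysis}

List every claim the sources contradict or do not support.
If every claim is supported, reply with exactly: SUPPORTED"""

def evaluate(analysis: str, sources: str, call_model) -> tuple[bool, str]:
    """Layer 2: a second model whose only job is to challenge the first.
    call_model is assumed to be a str -> str wrapper around an LLM API."""
    verdict = call_model(EVALUATOR_PROMPT.format(sources=sources,
                                                 analysis=analysis))
    return verdict.strip() == "SUPPORTED", verdict
```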
The third layer is human oversight, but not the line-by-line kind. That defeats the purpose of automation, and analysts who are asked to review everything end up reviewing nothing carefully. The system runs with autonomy that scales to the stake: routine outputs that fall well within rule tolerance proceed without review, medium-confidence outputs are flagged for spot checks, and material claims, the kind that involve covenant compliance, regulatory positions, ESG certifications, or value-creation claims above a defined EBITDA-impact threshold, require human verification before the report is issued.
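The routing itself can be deterministic. A sketch with illustrative topic names and thresholds; each firm would set its own:

```python
MATERIAL_TOPICS = {"covenant_compliance", "regulatory_position",
                   "esg_certification"}
EBITDA_THRESHOLD = 1_000_000   # illustrative stand-in for "a defined
                               # EBITDA-impact threshold"

def route(topic: str, ebitda_impact: float, confidence: float) -> str:
    """Layer 3: autonomy scaled to the stake, not line-by-line review."""
    if topic in MATERIAL_TOPICS or ebitda_impact > EBITDA_THRESHOLD:
        return "human_verification"   # blocked until a person signs off
    if confidence < 0.8:
        return "spot_check"           # sampled by an analyst
    return "auto_proceed"             # well within rule tolerance
```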
Human oversight fails when it lives as a policy rather than as a workflow. In April 2026, The Guardian reported that Sullivan & Cromwell apologized to a federal judge after AI-generated errors entered a court filing, and the firm's letter said its AI policies were not followed and a secondary review process did not catch inaccurate citations. Different domain, same failure mode: review existed on paper, but the workflow let unsupported claims through. According to LangChain's State of Agent Engineering, 57.3% of respondents had AI agents running in production while only 52.4% reported offline evaluations on test sets, and the gap between deployment and evaluation is where the damage happens. Shipping an AI agent is one thing. Knowing when its output quality changes is another, and that is a workflow question rather than a memo question.
| Layer | What it does | Catches | Example from the opener |
|---|---|---|---|
| 1. Deterministic validation | Rule-based checks on every output | Format errors, missing citations, scope mismatches, unreconciled claims | Rule: SLL covenant scope must match measured scope. Catches Scope 1+2 vs Scope 3 mismatch. |
| 2. AI verification | Second model challenges first model's output | Unsupported claims, logical contradictions, source misinterpretation | Evaluator reads loan agreement and emissions report, finds scope mismatch, flags claim. |
| 3. Human oversight | Review scaled to risk and confidence | Edge cases, context-dependent judgment, novel situations | CFO reviews any covenant compliance claim before lender disclosure. |
No single layer is sufficient. Rules cannot adapt to cases the rule-writer did not anticipate, a second model trained on similar data can replicate the first model's blind spots, and humans asked to review everything end up reviewing nothing carefully. The three checks together force a hallucination to survive more independent tests than any one of them would impose alone.
The stack still fails. Extraction misses non-canonical formats: handwritten meeting notes, scanned PDFs without OCR, photos of documents from a site visit. Deterministic rules go stale faster than they get updated, especially when the portfolio shifts or the regulatory standard changes. The evaluator and the primary model share training data and therefore share blind spots, and correlated errors can survive the second check because both models think the same way. The stack does not catch everything, but it surfaces more failures before they reach a decision, and surfaced failures are the kind operators can actually fix.
[Figure: The trust stack: three checks on the evidence set]
Domain expertise as the moat
The substance that gives Layer 1 its bite is the rule library, and the rule library is the part that does not come with the foundation model. The model is a commodity that anyone can access. The differentiation is in what you encode as constraints around the model: the domain expertise that prevents it from producing claims a senior practitioner would question.
Take the SLL example. A claim of covenant compliance looks like one number but expands, when you decompose it, into several conditions: the specific KPI named in the loan agreement, the measured metric in the emissions report under the right scope, the methodology used to calculate that metric, the baseline year against which it is measured, and the materiality of any difference between the certified figure and the actual measurement. When you encode each of these conditions as a rule, the AI cannot write "the company met its SLL KPI" unless every sub-condition is met and evidenced in the extraction set. If any element is missing or mismatched, the claim is flagged as unsubstantiated, regardless of how confident the language model sounded.
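Encoded as rules, the decomposition might look like this; the condition names are illustrative:

```python
# Sub-conditions that "the company met its SLL KPI" expands into.
SLL_CONDITIONS = (
    "kpi_named_in_loan_agreement",
    "measured_metric_in_kpi_scope",
    "calculation_methodology_documented",
    "baseline_year_matches_agreement",
    "certified_figure_within_materiality_of_actual",
)

def compliance_claim_allowed(evidenced: set[str]) -> bool:
    """The generation step may write the compliance claim only when
    every sub-condition is evidenced in the extraction set; any gap
    flags the claim as unsubstantiated instead."""
    return all(condition in evidenced for condition in SLL_CONDITIONS)
```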
The same approach maps onto decarbonization milestones. An emissions reduction claim has the same structure as the SLL claim: a measured baseline, primary data showing change since that baseline, attribution to a specific intervention rather than a co-occurring event, and reconciliation to either an audited inventory or a third-party assurance opinion. ESG teams routinely take credit for cost or emissions reductions that procurement, weather, or production volume drove, and a model trained on press releases will reproduce those misattributions. A rule that requires every reduction claim above one percentage point of group emissions to trace to a measurement methodology and a baseline year catches the misattribution before it appears in the report.
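That rule is equally mechanical. A sketch, with the one-percentage-point threshold taken from the text and the field names assumed:

```python
def reduction_claim_allowed(share_of_group_emissions: float,
                            has_methodology: bool,
                            has_baseline_year: bool) -> bool:
    """Any reduction claim above one percentage point of group
    emissions must trace to a measurement methodology and a baseline
    year, or it is held back as a possible misattribution."""
    if share_of_group_emissions <= 0.01:
        return True
    return has_methodology and has_baseline_year
```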
This is where institutional knowledge compounds into a moat. A firm that has spent twenty years evaluating sustainability data already knows which assertions require verification, which metrics carry weight, and which programs are routinely mis-attributed. Translating that knowledge into machine-readable rules is the work that separates a useful system from a fluent one, and it is also the work that does not transfer to a new entrant who has the same model access but no rule library.
What this means for AI tool evaluation
The evaluation problem is getting harder, not easier. According to Stanford HAI's 2026 AI Index, capability benchmarks are saturating while responsible-AI reporting remains uneven, and documented AI incidents rose from 233 in 2024 to 362. Leaderboard performance is not a proxy for reliability on your actual workflow. Operating partners, CFOs, and ESG teams should ask any vendor how they measure unsupported claims on the kind of sustainability data the portfolio company produces, and they should ask for the evaluation reports.
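The question to put to a vendor is measurable. A minimal harness for unsupported-claim rate might look like the sketch below, where `extract_claims` and `is_supported` stand in for whatever claim-splitting and verification the vendor actually runs; the value of the number comes from measuring it on your own documents rather than a public benchmark:

```python
def unsupported_claim_rate(outputs, extract_claims, is_supported) -> float:
    """Share of generated claims that lack support in the sources.
    extract_claims: output -> list of atomic claims (vendor-specific).
    is_supported: claim -> bool, checked against the source documents."""
    claims = [c for output in outputs for c in extract_claims(output)]
    if not claims:
        return 0.0
    return sum(not is_supported(c) for c in claims) / len(claims)
```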
Operating partners, CFOs, and ESG teams are the people who carry the consequences when an AI claim survives into an EBITDA waterfall, a transition plan, a CSRD disclosure, or a covenant compliance certificate to a lender. Going back to the opener: in an evidence-first system with the trust stack running on top of it, the same AI tool reading the same documents would produce something different. The output would look like: "SLL covenant compliance cannot be confirmed. The loan agreement defines the KPI on Scope 3 emissions; the measured 18 percent reduction versus the 2023 baseline covers Scope 1 and 2, and the evidence set contains no 2026 Scope 3 measurement. Flagged for analyst review before any lender disclosure."
The documents and the model did not change. What changed was the trust stack on top of the model and the rule library that gave Layer 1 something to check against. Stop calling this a hallucination problem. Hallucination is native to how language models work and will not be subtracted by the next model release; what is fixable is the orchestration around the model, and that is where the engineering problem actually sits.
Knowing how to evaluate which AI tools have this engineering, and which ones are a model wrapped in an interface, is a different question.


