Why GPT-5.4 and Other LLMs Fail at Basic Fact Retrieval

A May 2026 New Stack report found a 67% variance in basic fact-retrieval among frontier models like GPT-5.4, Claude, and Gemini. The tech press treats this as an academic benchmark problem. For corporate controllers, it is an undocumented risk sitting inside the general ledger.

Finance teams spent the past year deploying LLM-backed tools to automate unstructured data extraction for contract review, invoice processing, and audit sampling. The operational premise: "multi-model routing" or "LLM consensus" guarantees accuracy. If three models agree on a vendor's liability cap, the extracted data must be clean.

The New Stack data undercuts that premise. If frontier models diverge on basic fact retrieval as widely as the reported 67% variance suggests, automated consensus systems break down. They either flood the close with false-positive exception alerts or, worse, silently write hallucinated or averaged figures directly into the ERP.
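To make the two failure modes concrete, here is a minimal Python sketch of a consensus step that refuses to average. The function name `reconcile_extractions`, the model labels, and the dollar figures are hypothetical illustrations, not any vendor's API. Under those assumptions, it returns a value only when every model extracted the same figure, and otherwise hands back the raw per-model answers for review.

```python
from collections import Counter
from decimal import Decimal
from typing import NamedTuple, Optional


class ConsensusResult(NamedTuple):
    value: Optional[Decimal]  # agreed figure, or None when the models disagree
    unanimous: bool
    votes: dict               # raw per-model answers, kept for the audit trail


def reconcile_extractions(answers: dict) -> ConsensusResult:
    """Accept a figure only if every model extracted the same value.

    `answers` maps a (hypothetical) model label to the Decimal it extracted.
    Disagreement is never averaged or majority-voted; it is returned as-is.
    """
    if not answers:
        raise ValueError("no model answers supplied")
    counts = Counter(answers.values())
    if len(counts) == 1:
        return ConsensusResult(next(iter(counts)), True, dict(answers))
    return ConsensusResult(None, False, dict(answers))


# Hypothetical liability-cap extractions from three models.
agree = reconcile_extractions({
    "model_a": Decimal("500000"),
    "model_b": Decimal("500000"),
    "model_c": Decimal("500000"),
})
split = reconcile_extractions({
    "model_a": Decimal("500000"),
    "model_b": Decimal("5000000"),
    "model_c": Decimal("500000"),
})
```

In the `split` case a majority vote would post 500,000 and an average would post a figure no contract contains. Returning `None` with the raw votes keeps the disagreement visible. Note that unanimity only removes visible disagreement; it does not show that the agreed figure is actually in the contract.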

Follow the incentive. Why do SaaS vendors aggressively market AI extraction while structurally insulating themselves from the outputs?

Sage research shows enterprise SaaS vendors are drafting agreements that strictly delineate compliance responsibilities, specifically calling out SOX and ASC 606, and expressly disclaim liability for AI-generated inaccuracies. Simultaneously, Vouch reports these vendors are loading up on Tech Errors & Omissions (E&O) and specialized "AI Insurance" to limit out-of-pocket liabilities for algorithmic bugs that cause customer misstatements.

The vendor captures the valuation premium for shipping an AI feature. The finance team holds the bag for the material weakness.

Numeric hallucination in complex accounting is not hypothetical. Uniqus documented a 2026 ASC 842 lease accounting case where a fine-tuned GenAI system fabricated a non-existent "incremental borrowing rate" (IBR) to confidently classify a complex ground lease. For ASC 606 revenue recognition, Uniqus notes AI-assisted analysis of complex performance obligations carries a "high hallucination risk." The only structural mitigation: deploying Retrieval-Augmented Generation (RAG) architectures that strictly anchor model outputs to verified contract corpora.
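The anchoring idea can be illustrated with a small post-hoc grounding check. This is not a RAG pipeline, only a sketch of the principle behind it: a number the model reports should be traceable to the source passage it was given. The helper names `numbers_in` and `is_grounded` and the lease passage are invented for illustration.

```python
import re
from decimal import Decimal, InvalidOperation


def numbers_in(text: str) -> set:
    """Collect every numeric literal in `text` as a Decimal (commas stripped)."""
    found = set()
    for token in re.findall(r"\d[\d,]*(?:\.\d+)?", text):
        try:
            found.add(Decimal(token.replace(",", "")))
        except InvalidOperation:
            continue  # skip anything that does not parse as a number
    return found


def is_grounded(extracted: Decimal, source_passage: str) -> bool:
    """True only if the extracted figure literally appears in the passage."""
    return extracted in numbers_in(source_passage)


# Hypothetical ground-lease passage; it states no borrowing rate at all.
passage = (
    "Base rent is $1,200,000 per year, escalating 2.5% annually "
    "over a 99-year term."
)

escalator_ok = is_grounded(Decimal("2.5"), passage)
invented_rate_ok = is_grounded(Decimal("6.25"), passage)
```

A check like this would catch a rate that appears nowhere in the document, but it cannot tell whether a number that does appear was assigned to the right field, so it complements human review rather than replacing it.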

A strictly U.S.-centric read misses the compounding danger in cross-border finance. A U.S. controller might view an extraction error solely through a consolidated SOX lens. But multinational teams rely on these automated tools to parse multi-currency rebate triggers and localized pricing tiers across European and APAC subsidiaries. When an LLM fails to accurately retrieve a localized contract threshold, it doesn't just create a headquarters reconciliation headache. It creates statutory audit failures in jurisdictions where automated misstatements carry distinct, localized director liabilities.

Separate anomaly detection from deterministic factual extraction. Ledge notes purpose-built LLM matching engines successfully replace fragile manual spreadsheets by ingesting daily ERP and bank data to flag missing matches. Similarly, ResearchGate highlights the financial sector's deployment of AI-driven analytics to automatically detect financial misstatements.

Anomaly detection is probabilistic: it flags potential issues for human review. Contract extraction is deterministic: it writes a liability cap, payment term, or performance obligation into a system of record. Do not rely on probabilistic models for deterministic financial facts.

During internal audits, reconciling AI-extracted contract terms against actual billing data will fail if the extraction was flawed from the start. Controllers will face explosive manual reconciliation backlogs as human auditors pull original source documents to clear discrepancies, entirely negating the software's promised efficiency.
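A rough sketch of that reconciliation, assuming a simplified data model: the `ExtractedTerm` and `BillingLine` classes, the `find_mismatches` function, the tolerance, and the sample records are all hypothetical and stand in for whatever the ERP and CLM actually expose.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ExtractedTerm:
    contract_id: str
    unit_price: Decimal  # price the AI extracted from the contract


@dataclass(frozen=True)
class BillingLine:
    contract_id: str
    invoice_id: str
    unit_price: Decimal  # price actually billed


def find_mismatches(terms, billing, tolerance=Decimal("0.01")):
    """Return (invoice_id, reason) for every billing line that cannot be
    tied to an extracted contract term within `tolerance`."""
    by_contract = {t.contract_id: t for t in terms}
    exceptions = []
    for line in billing:
        term = by_contract.get(line.contract_id)
        if term is None:
            exceptions.append((line.invoice_id, "no extracted term"))
        elif abs(term.unit_price - line.unit_price) > tolerance:
            exceptions.append((line.invoice_id, "price differs from extracted term"))
    return exceptions


terms = [ExtractedTerm("C-1", Decimal("100.00"))]
billing = [
    BillingLine("C-1", "INV-1", Decimal("100.00")),
    BillingLine("C-1", "INV-2", Decimal("110.00")),
    BillingLine("C-2", "INV-3", Decimal("50.00")),
]
exceptions = find_mismatches(terms, billing)
```

The exception list says that the two records differ, not which one is wrong. Every entry still sends someone back to the original contract, which is where the backlog comes from.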

Controllers must immediately re-evaluate the buy/build decision for AI-driven data extraction. Shift the operating strategy from using AI for higher accuracy via consensus to restricting AI strictly to workflow routing.

Audit your exposure now:

Audit the Vendor Stack: Identify which AP automation and contract lifecycle management (CLM) vendors use multi-model validation for data extraction.
Review the Contracts: Locate the specific clauses where vendors disclaim liability for AI outputs and verify who bears the cost of a resulting financial restatement.
Change the Control: Mandate a strict human-in-the-loop control for any automated extraction of contract terms tied to material revenue or spend.
Demand Raw Data: Require vendors to output raw extraction confidence scores and exact source-document citations, rather than accepting pre-processed "consensus" answers that mask underlying model disagreements.
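The last two controls can be sketched as a single routing gate. Everything here is an assumption for illustration: the `Extraction` fields, the `route` function, the queue names, and the thresholds are hypothetical, and real materiality limits would come from the controller's own policy.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    field_name: str
    value: Decimal
    confidence: Optional[float]  # raw vendor score; None if the vendor withholds it
    citation: Optional[str]      # source-document reference; None if withheld


def route(extraction: Extraction,
          materiality_threshold: Decimal,
          min_confidence: float = 0.95) -> str:
    """Decide which queue an extracted term goes to. Nothing is posted here."""
    # No raw score or no citation: the output cannot be verified, so a human looks.
    if extraction.confidence is None or extraction.citation is None:
        return "human_review"
    # Material amounts always get a human, regardless of model confidence.
    if extraction.value >= materiality_threshold:
        return "human_review"
    if extraction.confidence < min_confidence:
        return "human_review"
    return "standard_queue"


threshold = Decimal("250000")  # hypothetical materiality limit
material = route(Extraction("liability_cap", Decimal("500000"), 0.99, "MSA s. 9.2"), threshold)
uncited = route(Extraction("rebate_trigger", Decimal("10000"), 0.99, None), threshold)
routine = route(Extraction("late_fee", Decimal("150"), 0.98, "Order Form p. 2"), threshold)
```

The gate only assigns work to a queue, which matches the narrower role of workflow routing. What happens to items in the standard queue remains a policy decision.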

Until foundation models can reliably agree on basic facts, they have no business autonomously writing terms to your general ledger.


Action Plan

Playbook

1) Audit your current AP automation and contract lifecycle management (CLM) vendors to identify if they use multi-model validation for data extraction.
2) Implement a strict 'human-in-the-loop' control for any automated extraction of contract terms tied to material revenue or spend.
3) Require vendors to output raw extraction confidence scores and source-document citations rather than pre-processed 'consensus' answers.

Risks If Ignored

Blindly trusting vendor claims of 'high accuracy through AI consensus' will lead to material misstatements and failed audits when the underlying models are fundamentally misaligned on basic fact retrieval.


Companies: Lenz, Wiser, Anthropic, Cornell University, The New Stack

People: Kosta Jordanov (Founder), Eddie Yang (Researcher), Dashun Wang (Researcher)

Standards: MMLU-Pro (Academic Community), GPQA (Academic Community)

Key Dates: May 21 (announcement); February 15, 2026 (historical); February (historical); coming months (projected)

Originally Reported By: The New Stack

thenewstack.io/frontier-llm-factcheck-disagreement

Affected Workflows

AI Benchmarks, Model Inference, Frontier Signal Lane

Research Sources (6)

SaaS and AI vendors are increasingly relying on Tech Errors & Omissions (E&O) and specialized 'AI Insurance' to limit out-of-pocket liabilities for algorithmic errors or software bugs that result in financial misstatements for their customers. Vouch
To limit liability for financial misstatements, enterprise SaaS vendors are structuring agreements that strictly delineate compliance responsibilities (such as SOX and ASC 606) and expressly disclaim liability for inaccuracies stemming from AI-generated outputs. Sage
Rather than a specific mathematical variance causing misstatements, the financial sector is actually deploying AI-driven SaaS analytics (e.g., machine learning and anomaly detection) to automatically detect and prevent financial misstatements, though model explainability remains a core governance challenge. ResearchGate
Instead of causing errors, purpose-built LLM matching engines are being deployed to prevent restatements. These AI platforms ingest daily data from ERPs and banks to flag missing matches and anomalies in real-time, replacing fragile manual spreadsheet processes that historically lead to lost trust and material weaknesses. Ledge
Under ASC 842 lease accounting, fine-tuned GenAI systems lacking strict RAG grounding are highly susceptible to numeric hallucinations. In a documented 2026 case, an AI fabricated a non-existent 'incremental borrowing rate' (IBR) to confidently but incorrectly classify a complex ground lease, highlighting the requirement for RAG to mitigate this risk. Uniqus
For ASC 606 revenue recognition, AI-assisted analysis of complex performance obligations (e.g., in pharmaceutical or software contracts) poses a 'high hallucination risk' due to the intense judgment required. RAG architectures strictly mitigate this by anchoring model outputs to verified contract corpora rather than relying on parametric memory. Uniqus

#AIBenchmarks #Fact-Checking #GPT-5.4 #LLMs #ModelInference

Written By

Priya Desai
