
Language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty, according to new research from OpenAI. The paper, which reframes hallucinations as a predictable outcome of how we measure AI performance, suggests the solution lies not in better technology but in fundamentally changing how the field evaluates success.
Key Points:
- Hallucinations emerge from statistical pressures in both pretraining and evaluation phases
- Current benchmarks create a "false dichotomy" between right and wrong, penalizing uncertainty
- The solution requires industry-wide coordination to change evaluation metrics
The research, published by researchers including OpenAI's Adam Tauman Kalai and Georgia Tech's Santosh Vempala, offers a surprisingly straightforward explanation for why even advanced models confidently generate false information. When the researchers asked various chatbots for the title of Kalai's PhD dissertation, they received multiple incorrect titles. When they asked for his birthday, they likewise received three different dates, all of them wrong.
This pattern of plausible but false statements delivered with confidence is what OpenAI defines as hallucination. Despite improvements in GPT-5, which the company says hallucinates significantly less often, especially when reasoning, OpenAI acknowledges that hallucinations remain a fundamental challenge for all large language models.
The paper traces hallucinations to two root causes. First, during pretraining, models learn patterns from vast amounts of text. Spelling and parentheses follow consistent patterns, so errors there disappear with scale. But what the researchers call "arbitrary facts", low-frequency details such as a pet's birthday, have no learnable pattern: you cannot deduce someone's birthday from statistical regularities in text, so the model can only guess.
The second, more tractable problem lies in how we evaluate these systems. Most evaluations measure model performance in a way that encourages guessing rather than honesty about uncertainty. The researchers draw an analogy to standardized testing: "If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero."
This creates what the paper calls an "epidemic" of misaligned incentives. Current benchmarks—from MMLU to SWE-bench—use binary scoring where both wrong answers and abstentions receive zero points. A model optimized for these metrics learns to guess rather than acknowledge uncertainty, because occasional lucky guesses outperform consistent abstention.
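A back-of-the-envelope calculation makes the incentive concrete. The sketch below is illustrative only; the function name, share of "sure" questions, and lucky-guess rate are assumptions, not figures from the paper. It compares the expected leaderboard score of a model that guesses when unsure against one that abstains, under binary grading.

```python
def expected_binary_score(p_correct_when_sure: float,
                          share_sure: float,
                          p_lucky_guess: float,
                          abstain_when_unsure: bool) -> float:
    """Expected score under binary grading: 1 point if right, 0 if wrong or blank."""
    sure_points = share_sure * p_correct_when_sure
    if abstain_when_unsure:
        unsure_points = 0.0  # abstention scores exactly the same as an error
    else:
        unsure_points = (1 - share_sure) * p_lucky_guess
    return sure_points + unsure_points

# Suppose a model is confident on 70% of questions (90% accurate there) and
# clueless on the rest. Even a 20% lucky-guess rate means guessing strictly
# beats abstaining on a binary-scored leaderboard.
print(expected_binary_score(0.9, 0.7, 0.2, abstain_when_unsure=True))   # 0.63
print(expected_binary_score(0.9, 0.7, 0.2, abstain_when_unsure=False))  # 0.69
```

Under these assumed numbers, honesty about uncertainty costs the model six points of accuracy, which is exactly the pressure the paper describes.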
OpenAI's proposed solution involves redesigning evaluations to explicitly penalize confident errors more than uncertainty—similar to how some standardized tests deduct points for wrong answers. They suggest adding confidence thresholds to evaluation instructions, making the acceptable level of uncertainty explicit rather than implicit.
For questions with a single "right answer," responses fall into three categories: accurate answers, errors, and abstentions, where the model does not hazard a guess. Abstaining, OpenAI notes, is part of humility, one of the company's core values. Yet most scoreboards rank models on accuracy alone, even though errors are worse than abstentions.
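To make the alternative concrete, here is a minimal sketch of a grader along the lines the paper proposes: correct answers earn a point, abstentions earn nothing, and errors carry a penalty. The t/(1-t) penalty for a confidence threshold t is one natural choice, since it makes guessing break even exactly at confidence t; the class and function names and the toy numbers are assumptions for illustration, not the paper's reference implementation.

```python
from dataclasses import dataclass

@dataclass
class Response:
    answer: str | None      # None means the model abstained ("I don't know")
    correct: bool = False   # graded against the reference answer

def threshold_score(responses: list[Response], t: float) -> float:
    """Score with confidence threshold t: +1 correct, 0 abstain, -t/(1-t) wrong.

    With this penalty, guessing has positive expected value only when the
    model's confidence exceeds t, so a calibrated model is rewarded for
    abstaining below the threshold instead of bluffing.
    """
    penalty = t / (1 - t)
    total = 0.0
    for r in responses:
        if r.answer is None:
            continue                      # abstention: zero points
        total += 1.0 if r.correct else -penalty
    return total / len(responses)

# Toy comparison at t = 0.75 (penalty = 3): a model that abstains on questions
# it would get wrong now outranks one that guesses and piles up confident errors.
guesser   = [Response("a", True)] * 6 + [Response("b", False)] * 4
abstainer = [Response("a", True)] * 6 + [Response(None)] * 4
print(threshold_score(guesser, 0.75), threshold_score(abstainer, 0.75))  # -0.6 0.6
```

The design point is that the ranking flips relative to plain accuracy: the two hypothetical models answer the same six questions correctly, but only the scheme that penalizes confident errors rewards the one that declines to guess.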
But implementation faces significant coordination challenges. The paper isn't merely suggesting adding new benchmarks; it's calling for modifications to the existing evaluations that dominate leaderboards and influence billions in research funding. Every major benchmark would need to adopt new scoring methods that reward calibrated uncertainty.
The research also pushes back against several common assumptions about hallucinations. The authors argue that hallucinations are not inevitable: a system could, in principle, always respond "I don't know" when uncertain. They also note that accuracy will never reach 100 percent, not because of technological limitations but because some questions are inherently unanswerable or computationally intractable.
What's particularly striking is how this framing shifts responsibility. Rather than treating hallucinations as a deep technical challenge requiring breakthrough innovations, the paper presents them as a predictable consequence of misaligned incentives that the field has collectively created and maintained.
As language models become integrated into critical applications—from healthcare to finance—the cost of hallucinations increases. Unless the field changes how it measures performance, AI systems will continue to "sound right" while sometimes being wrong.
For now, users must navigate a paradox: the very metrics that demonstrate AI's impressive capabilities also incentivize the behaviors that make these systems unreliable. Until that changes, we're left with remarkably capable systems that have been optimized, above all else, to never leave a question unanswered.