The First Token Knows: Single-Decode Confidence for Hallucination Detection
Abstract
First-token confidence (phi_first), computed from the model's initial answer-token distribution, matches or exceeds semantic self-consistency in detecting hallucinations at a fraction of the computational cost.
Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.
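The abstract defines phi_first as the normalized entropy of the top-K logits at the first content-bearing answer token. A minimal sketch of that computation is below; the choice of K = 10 and the exact mapping 1 − H/log(K) (so that 1 means fully confident and 0 means uniform over the top K) are assumptions, since the paper text here only specifies "normalized entropy of the top-K logits".

```python
import math

def phi_first(logits, k=10):
    """Sketch of phi_first: confidence from one token's logit vector.

    Takes the top-k logits, renormalizes them with a softmax, and maps
    the normalized entropy of that k-way distribution into [0, 1].
    k=10 and the 1 - H/log(k) mapping are illustrative assumptions.
    """
    top = sorted(logits, reverse=True)[:k]
    m = max(top)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(k)        # 1 = peaked, 0 = uniform over top-k
```

In practice the logit vector would come from a single greedy decode (e.g. the scores of the first non-boilerplate answer token), so no extra sampling is needed beyond the one generation.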
Community
Sharing our paper "The First Token Knows: Single-Decode Confidence for Hallucination Detection". A single greedy decode captures almost all the hallucination-detection signal that multi-sample self-consistency does — at ~1/11 the cost. We define ϕ_first, the normalized entropy of the top-K logits at the first content-bearing answer token, and benchmark it against semantic and surface-form self-consistency.
Key findings
- Mean AUROC: 0.820 (ϕ_first) vs. 0.793 (semantic agreement) vs. 0.791 (surface-form self-consistency)
- Across Llama-3.1-8B, Mistral-7B-v0.3, and Qwen2.5-7B on PopQA and TriviaQA (n=1000 each)
- Ensembling ϕ_first with semantic agreement adds only +0.02 AUROC — first-token confidence already carries most of the signal
Feedback welcome.
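The AUROC comparison and the +0.02 ensembling result in the key findings can be reproduced in spirit with a rank-based AUROC over per-question confidence scores. The labeling convention (1 = hallucination, 0 = correct) and the simple score-averaging ensemble below are assumptions for illustration, not the paper's exact protocol.

```python
def auroc(scores, labels):
    """Rank-based AUROC: probability that a correct answer (label 0)
    receives a higher confidence score than a hallucinated one (label 1),
    with ties counted as half. Labeling convention is an assumption."""
    pos = [s for s, y in zip(scores, labels) if y == 1]   # hallucinations
    neg = [s for s, y in zip(scores, labels) if y == 0]   # correct answers
    wins = sum((n > p) + 0.5 * (n == p) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble(a, b):
    """Naive ensemble: average two confidence signals per question
    (e.g. phi_first and semantic agreement). Averaging is an assumption."""
    return [(x + y) / 2 for x, y in zip(a, b)]
```

Comparing `auroc(phi_scores, labels)` against `auroc(ensemble(phi_scores, agreement_scores), labels)` is one way to run the subsumption-style check the paper describes: if the ensemble barely improves on phi_first alone, the agreement signal is largely redundant.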
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals (2026)
- Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain (2026)
- Sample Transform Cost-Based Training-Free Hallucination Detector for Large Language Models (2026)
- RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration (2026)
- Weakly Supervised Distillation of Hallucination Signals into Transformer Representations (2026)
- HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs (2026)
- Entropy Alone is Insufficient for Safe Selective Prediction in LLMs (2026)
Get this paper in your agent:
hf papers read 2605.05166
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash