# Project Verity-H v0.4

*Teaching AI to say "I don't know."*

When humans lack knowledge, they admit it: "I'm not sure", "I don't know", "let me check." LLMs don't. They fill gaps with plausible-sounding assumptions and present them as facts. Verity-H investigates whether a lightweight verification pipeline can enforce honest behavior: share what you know, flag what you don't, never silently guess.

The system lets an LLM answer a question, then verifies every claim against the provided evidence before the user sees it. Supported claims pass through. Unsupported claims get flagged. Contradictions get caught. The user sees what's verified vs. what's a guess, like talking to an honest colleague.
For the full architecture, research grounding, and design decisions, see DESIGN.md.
## Quick Start

```bash
# Clone
git clone https://huggingface.co/Sravanth18/verity-h-prototype
cd verity-h-prototype

# Setup
python -m venv .venv && source .venv/bin/activate
pip install -e ".[test]"

# Run tests (mock mode, no API key needed)
pytest
```
## Run Evaluation

```bash
# Set environment
export LLM_MODE=hf_api
export HF_API_KEY=your-key-here
export MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507
export LLM_CALL_DELAY=2

# Baselines
python -m src.baseline_runner --mode normal --output results/baseline_normal.jsonl
python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl

# Pipeline
python -m src.pipeline_runner --output results/verity_pipeline_v0.4.jsonl

# Batched (resumable if interrupted)
python run_pipeline_batched.py --delay 0.5 --output results/verity_pipeline_v0.4.jsonl

# Report
python -m src.report --normal results/baseline_normal.jsonl \
    --honesty results/baseline_honesty.jsonl \
    --pipeline results/verity_pipeline_v0.4.jsonl \
    --output results/report.md
```
## How It Works

```
Question + Evidence
        │
        ▼
1. Split evidence into spans (deterministic)
2. Draft answer (LLM call #1)
3. Extract + label claims (LLM call #2)
4. Post-process (deterministic):
   • Filter junk/meta claims
   • Fix mislabeled claims via span matching
   • Detect inferential claims (4-tier)
   • Detect contradictions (status-pair only; numeric/date logged for audit)
5. Gate decision (deterministic)
        │
        ▼
Final answer with transparency metadata
```
2 LLM calls per case. Everything else is deterministic and auditable.
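As a rough illustration of the deterministic steps (1 and 4), here is a minimal sketch. The function names and the splitting regex are hypothetical, not the repo's actual code:

```python
import re

def split_spans(evidence: str) -> list[str]:
    """Step 1: deterministic sentence-level split of the evidence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", evidence) if s.strip()]

def relabel(claim: str, label: str, spans: list[str]) -> str:
    """Step 4 (simplified): fix a mislabeled claim via exact span matching."""
    if label == "SUPPORTED" and not any(claim in s for s in spans):
        return "UNSUPPORTED"  # the LLM marked it supported, but no span backs it
    return label

spans = split_spans("The store opens at 9am. It closes at 5pm.")
print(relabel("It is closed on Sundays.", "SUPPORTED", spans))  # UNSUPPORTED
```

The real span matcher also does fuzzy and numeric matching (see `test_span_matcher.py`); exact substring matching is only the simplest case.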
## Pipeline Decisions

| Situation | Decision | What the user sees |
|---|---|---|
| All claims verified | `accept` | Clean answer from verified claims |
| Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
| Status-pair contradiction (open/closed, approved/rejected, etc.) | `contradiction` | Flags the conflict, shows both sides |
| No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
| Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
| Verifier failed to parse | `verifier_error` | Refuses to answer |
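The decision logic above can be sketched as a single deterministic function. This is a hypothetical ordering of the gate rules, not the repo's actual `gate` implementation:

```python
def decide(labels: list[str], has_evidence: bool = True,
           pressure: bool = False, parse_ok: bool = True) -> str:
    """Map verifier output to one of the six pipeline decisions."""
    if not parse_ok:
        return "verifier_error"   # verifier output could not be parsed
    if pressure:
        return "hypothesis"       # speculative question: guess with caveats
    if not has_evidence:
        return "needs_info"       # nothing to verify against
    if "CONTRADICTED" in labels:
        return "contradiction"    # status-pair conflict found in evidence
    if labels and all(l == "SUPPORTED" for l in labels):
        return "accept"
    return "partial"              # mix of verified and unverified claims

print(decide(["SUPPORTED", "UNSUPPORTED"]))  # partial
```

Because the gate is a pure function of labels and flags, every decision is reproducible and auditable after the fact.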
## Inference Detection (v0.3.1+)

The verifier catches claims the LLM wrongly marks as SUPPORTED:

| Tier | What it catches | Example |
|---|---|---|
| 1. Epistemic hedges | "suggests", "consistent with", "most likely" | "Symptoms are consistent with bacterial infection" |
| 2. Logical leaps | "therefore", "based on these findings" | "Therefore the patient has strep throat" |
| 3. Deontic/normative | "should", "recommended", "indicated" | "Antibiotics should be started" |
| 4. Speculative questions | Question asks for judgment/prediction | "Should we invest?" → the answer is inherently inferential |

Grounded in: CogniBench (arXiv:2505.20767), the GME modality taxonomy (arXiv:2106.08037), and the BioScope corpus.
## Results (Qwen3-4B, 30 cases, v0.2.1)

| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|---|---|---|---|
| Unsupported claim rate (↓) | 10% | 0% | 0% |
| Correct abstention (↑) | 70% | 100% | 100% |
| Grounded accept (↑) | 0% | 0% | 100% |
| Contradiction detection (↑) | 60% | 40% | 80% |
| Pressure hypothesis (↑) | 0% | 0% | 100% (v0.2.1) |
| False contradiction (↓) | 0% | 0% | 0% |
| Partial coverage (↑) | 0% | 0% | 100% |
| Latency p50 | 3,525 ms | 3,244 ms | 6,495 ms (v0.3, 2-call batch) |
See RESULTS_ARCHIVE.md for full version history.
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `LLM_MODE` | `mock` | `mock` / `api` (OpenAI) / `hf_api` (HuggingFace) |
| `HF_API_KEY` | (none) | HuggingFace API key (for `hf_api` mode) |
| `OPENAI_API_KEY` | (none) | OpenAI API key (for `api` mode) |
| `MODEL_NAME` | `Qwen/Qwen3-4B-Instruct-2507` | Model to use |
| `LLM_TEMPERATURE` | `0.0` | Sampling temperature |
| `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
| `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
| `LLM_MAX_CALLS_PER_MINUTE` | `30` | Per-minute rate limit |
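A config loader for these variables might look like the sketch below. The function name and dict keys are hypothetical; only the variable names and defaults come from the table:

```python
import os

def load_config() -> dict:
    """Read the documented environment variables, falling back to defaults."""
    return {
        "mode": os.getenv("LLM_MODE", "mock"),
        "model": os.getenv("MODEL_NAME", "Qwen/Qwen3-4B-Instruct-2507"),
        "temperature": float(os.getenv("LLM_TEMPERATURE", "0.0")),
        "max_tokens": int(os.getenv("LLM_MAX_TOKENS", "2048")),
        "call_delay": float(os.getenv("LLM_CALL_DELAY", "2")),
        "max_calls_per_minute": int(os.getenv("LLM_MAX_CALLS_PER_MINUTE", "30")),
    }
```

With nothing exported, this yields mock mode, which is why `pytest` runs without an API key.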
## Gold Cases

100 cases across 6 categories (development set):

| Category | Count | Tests |
|---|---|---|
| `grounded` | 17 | All claims in evidence → accept |
| `missing_info` | 14 | Evidence doesn't cover the question → abstain |
| `contradiction` | 15 | Conflicting facts in evidence → flag |
| `pressure` | 15 | Speculative question → hypothesis with caveats |
| `filler_trap` | 15 | Tempts the model to invent facts → abstain |
| `partial_answer` | 24 | Some facts available, some not → partial |

100 total cases, development set only. Not a held-out evaluation.
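For orientation, a single gold case in the JSONL files might look like the record below. The field names are illustrative only; check the repo's schemas (`test_schemas.py`) for the actual structure:

```python
import json

# Hypothetical shape of one gold case; field names are illustrative.
case = {
    "id": "contradiction_003",
    "category": "contradiction",
    "question": "Is the account currently open?",
    "evidence": "Record A says the account is open. Record B says it was closed in May.",
    "expected_decision": "contradiction",
}
line = json.dumps(case)  # one JSON object per line in a .jsonl file
```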
## Tests

209 tests covering all modules. Run with `pytest -v`.

```
tests/
├── test_calibration.py          # Table-format probe validation
├── test_claim_filter.py         # Slot-aware relevance filtering
├── test_constants.py            # Shared stop words
├── test_contradiction_checks.py # Status-pair contradictions + possible_conflict audit
├── test_evidence_spans.py       # Abbreviation-aware splitting
├── test_gate.py                 # All gate rules + edge cases
├── test_inference_detector.py   # All 4 tiers + exact failure cases
├── test_metrics.py              # Pipeline + baseline metrics
├── test_schemas.py              # Pydantic validation
├── test_span_matcher.py         # Substring/fuzzy/numeric matching
└── test_verifier.py             # Batch table parser + integration
```
## What This Does NOT Do
- No internet search or retrieval (RAG)
- No vector databases
- No fine-tuning
- No UI or deployment
- No GPU required
This is a research harness, not a product.
## Known Limitations (v0.4)

The v0.4 baseline intentionally trades some detection for zero false positives and maintainable code.

| # | Limitation | Why | Mitigation |
|---|---|---|---|
| 1 | Numeric contradictions not caught deterministically | Money/percentage/count/date conflicts produce too many false positives (e.g., revenue target vs. actual revenue). | Relies on the verifier LLM. If the LLM misses one, the contradiction is not flagged. |
| 2 | Semantic relevance not enforced | "How fast can the car go?" with only engine specs supported → accept. v0.3.2 had a 20-entry synonym-table guard, but it was too rule-heavy for a baseline. | Acceptable for v0.4. Future: semantic similarity check (not a synonym table). |
| 3 | 100 cases = dev set only | The deterministic rules were tuned against failures on this set. Results are directional, not publication-grade. | Create a held-out 50-case test set for unbiased validation. |
| 4 | Inference detector is regex-based | Covers common hedges but cannot catch all inferential reasoning. | Grounded in CogniBench + GME + BioScope; handles the most common cases. |
| 5 | Single evidence document | No multi-document consensus or evidence weighting. | Designed for single-pass evaluation. |
## Next Steps

- Simplify to v0.4 baseline: status-pair contradictions only, no frame detector
- Remove slot-mismatch guard (semantic relevance is a known limitation)
- 209 tests pass, zero false contradictions
- Run v0.4 eval on full 100-case development set
- Test on multiple models (1B, 4B, 70B+) to prove model independence
- Create held-out 50-case test set for unbiased evaluation
- Confidence calibration analysis
See DESIGN.md for the full architecture document.