# Project Verity-H v0.4

*Teaching AI to say "I don't know."*

When humans lack knowledge, they admit it: "I'm not sure", "I don't know", "let me check." LLMs don't. They fill gaps with plausible-sounding assumptions and present them as facts. Verity-H investigates whether a lightweight verification pipeline can enforce honest behavior: share what you know, flag what you don't, never silently guess.

The system lets an LLM answer a question, then verifies every claim against the provided evidence before the user sees it. Supported claims pass through. Unsupported claims get flagged. Contradictions get caught. The user sees what's verified vs. what's a guess, like talking to an honest colleague.
For the full architecture, research grounding, and design decisions, see DESIGN.md.
## Quick Start

```bash
# Clone
git clone https://huggingface.co/Sravanth18/verity-h-prototype
cd verity-h-prototype

# Setup
python -m venv .venv && source .venv/bin/activate
pip install -e ".[test]"

# Run tests (mock mode, no API key needed)
pytest
```
## Run Evaluation

```bash
# Set environment
export LLM_MODE=hf_api
export HF_API_KEY=your-key-here
export MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507
export LLM_CALL_DELAY=2

# Baselines
python -m src.baseline_runner --mode normal --output results/baseline_normal.jsonl
python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl

# Pipeline
python -m src.pipeline_runner --output results/verity_pipeline_v0.4.jsonl

# Batched (resumable if interrupted)
python run_pipeline_batched.py --delay 0.5 --output results/verity_pipeline_v0.4.jsonl

# Report
python -m src.report --normal results/baseline_normal.jsonl \
    --honesty results/baseline_honesty.jsonl \
    --pipeline results/verity_pipeline_v0.4.jsonl \
    --output results/report.md
```
## How It Works

```
Question + Evidence
        │
        ▼
1. Split evidence into spans (deterministic)
2. Draft answer (LLM call #1)
3. Extract + label claims (LLM call #2)
4. Post-process (deterministic):
   • Filter junk/meta claims
   • Fix mislabeled claims via span matching
   • Detect inferential claims (4-tier)
   • Detect contradictions (status-pair only; numeric/date logged for audit)
5. Gate decision (deterministic)
        │
        ▼
Final answer with transparency metadata
```
2 LLM calls per case. Everything else is deterministic and auditable.
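As a rough illustration of the deterministic steps (1 and 4), here is a minimal sketch. The function names and the splitting regex are hypothetical, not the repo's actual code:

```python
import re

def split_spans(evidence: str) -> list[str]:
    """Step 1: deterministic sentence-level split of the evidence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", evidence) if s.strip()]

def relabel(claim: str, label: str, spans: list[str]) -> str:
    """Step 4 (simplified): fix a mislabeled claim via exact span matching."""
    if label == "SUPPORTED" and not any(claim in s for s in spans):
        return "UNSUPPORTED"  # the LLM marked it supported, but no span backs it
    return label

spans = split_spans("The store opens at 9am. It closes at 5pm.")
print(relabel("It is closed on Sundays.", "SUPPORTED", spans))  # UNSUPPORTED
```

The real span matcher also does fuzzy and numeric matching (see `test_span_matcher.py`); exact substring matching is only the simplest case.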
## Pipeline Decisions

| Situation | Decision | What the user sees |
|---|---|---|
| All claims verified | `accept` | Clean answer from verified claims |
| Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
| Status-pair contradiction (open/closed, approved/rejected, etc.) | `contradiction` | Flags the conflict, shows both sides |
| No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
| Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
| Verifier failed to parse | `verifier_error` | Refuses to answer |
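The decision logic above can be sketched as a single deterministic function. This is a hypothetical ordering of the gate rules, not the repo's actual `gate` implementation:

```python
def decide(labels: list[str], has_evidence: bool = True,
           pressure: bool = False, parse_ok: bool = True) -> str:
    """Map verifier output to one of the six pipeline decisions."""
    if not parse_ok:
        return "verifier_error"   # verifier output could not be parsed
    if pressure:
        return "hypothesis"       # speculative question: guess with caveats
    if not has_evidence:
        return "needs_info"       # nothing to verify against
    if "CONTRADICTED" in labels:
        return "contradiction"    # status-pair conflict found in evidence
    if labels and all(l == "SUPPORTED" for l in labels):
        return "accept"
    return "partial"              # mix of verified and unverified claims

print(decide(["SUPPORTED", "UNSUPPORTED"]))  # partial
```

Because the gate is a pure function of labels and flags, every decision is reproducible and auditable after the fact.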
## Inference Detection (v0.3.1+)

The verifier catches claims the LLM wrongly marks as SUPPORTED:

| Tier | What it catches | Example |
|---|---|---|
| 1. Epistemic hedges | "suggests", "consistent with", "most likely" | "Symptoms are consistent with bacterial infection" |
| 2. Logical leaps | "therefore", "based on these findings" | "Therefore the patient has strep throat" |
| 3. Deontic/normative | "should", "recommended", "indicated" | "Antibiotics should be started" |
| 4. Speculative questions | Question asks for judgment/prediction | "Should we invest?" → the answer is inherently inferential |

Grounded in: CogniBench (arXiv:2505.20767), the GME modality taxonomy (arXiv:2106.08037), and the BioScope corpus.
## Results (Qwen3-4B, 30 cases, v0.2.1)

| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|---|---|---|---|
| Unsupported claim rate (↓) | 10% | 0% | 0% |
| Correct abstention (↑) | 70% | 100% | 100% |
| Grounded accept (↑) | 0% | 0% | 100% |
| Contradiction detection (↑) | 60% | 40% | 80% |
| Pressure hypothesis (↑) | 0% | 0% | 100% (v0.2.1) |
| False contradiction (↓) | 0% | 0% | 0% |
| Partial coverage (↑) | 0% | 0% | 100% |
| Latency p50 | 3,525 ms | 3,244 ms | 6,495 ms (v0.3, 2-call batch) |
See RESULTS_ARCHIVE.md for full version history.
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `LLM_MODE` | `mock` | `mock` / `api` (OpenAI) / `hf_api` (HuggingFace) |
| `HF_API_KEY` | (none) | HuggingFace API key (for `hf_api` mode) |
| `OPENAI_API_KEY` | (none) | OpenAI API key (for `api` mode) |
| `MODEL_NAME` | `Qwen/Qwen3-4B-Instruct-2507` | Model to use |
| `LLM_TEMPERATURE` | `0.0` | Sampling temperature |
| `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
| `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
| `LLM_MAX_CALLS_PER_MINUTE` | `30` | Per-minute rate limit |
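A config loader for these variables might look like the sketch below. The function name and dict keys are hypothetical; only the variable names and defaults come from the table:

```python
import os

def load_config() -> dict:
    """Read the documented environment variables, falling back to defaults."""
    return {
        "mode": os.getenv("LLM_MODE", "mock"),
        "model": os.getenv("MODEL_NAME", "Qwen/Qwen3-4B-Instruct-2507"),
        "temperature": float(os.getenv("LLM_TEMPERATURE", "0.0")),
        "max_tokens": int(os.getenv("LLM_MAX_TOKENS", "2048")),
        "call_delay": float(os.getenv("LLM_CALL_DELAY", "2")),
        "max_calls_per_minute": int(os.getenv("LLM_MAX_CALLS_PER_MINUTE", "30")),
    }
```

With nothing exported, this yields mock mode, which is why `pytest` runs without an API key.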
## Gold Cases

100 cases across 6 categories (development set):

| Category | Count | Tests |
|---|---|---|
| `grounded` | 17 | All claims in evidence → accept |
| `missing_info` | 14 | Evidence doesn't cover the question → abstain |
| `contradiction` | 15 | Conflicting facts in evidence → flag |
| `pressure` | 15 | Speculative question → hypothesis with caveats |
| `filler_trap` | 15 | Tempts the model to invent facts → abstain |
| `partial_answer` | 24 | Some facts available, some not → partial |

100 total cases, development set only. Not a held-out evaluation.
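For orientation, a single gold case in the JSONL files might look like the record below. The field names are illustrative only; check the repo's schemas (`test_schemas.py`) for the actual structure:

```python
import json

# Hypothetical shape of one gold case; field names are illustrative.
case = {
    "id": "contradiction_003",
    "category": "contradiction",
    "question": "Is the account currently open?",
    "evidence": "Record A says the account is open. Record B says it was closed in May.",
    "expected_decision": "contradiction",
}
line = json.dumps(case)  # one JSON object per line in a .jsonl file
```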
## Tests

209 tests covering all modules. Run with `pytest -v`.

```
tests/
├── test_calibration.py          # Table-format probe validation
├── test_claim_filter.py         # Slot-aware relevance filtering
├── test_constants.py            # Shared stop words
├── test_contradiction_checks.py # Status-pair contradictions + possible_conflict audit
├── test_evidence_spans.py       # Abbreviation-aware splitting
├── test_gate.py                 # All gate rules + edge cases
├── test_inference_detector.py   # All 4 tiers + exact failure cases
├── test_metrics.py              # Pipeline + baseline metrics
├── test_schemas.py              # Pydantic validation
├── test_span_matcher.py         # Substring/fuzzy/numeric matching
└── test_verifier.py             # Batch table parser + integration
```
## What This Does NOT Do
- No internet search or retrieval (RAG)
- No vector databases
- No fine-tuning
- No UI or deployment
- No GPU required
This is a research harness, not a product.
## Known Limitations (v0.4)

The v0.4 baseline intentionally trades some detection for zero false positives and maintainable code.

| # | Limitation | Why | Mitigation |
|---|---|---|---|
| 1 | Numeric contradictions not caught deterministically | Money/percentage/count/date conflicts produce too many false positives (e.g., revenue target vs. actual revenue). | Relies on the verifier LLM. If the LLM misses one, the contradiction is not flagged. |
| 2 | Semantic relevance not enforced | "How fast can the car go?" with only engine specs supported → accept. v0.3.2 had a 20-entry synonym-table guard, but it was too rule-heavy for a baseline. | Acceptable for v0.4. Future: semantic similarity check (not a synonym table). |
| 3 | 100 cases = dev set only | The deterministic rules were tuned against failures on this set. Results are directional, not publication-grade. | Create a held-out 50-case test set for unbiased validation. |
| 4 | Inference detector is regex-based | Covers common hedges but cannot catch all inferential reasoning. | Grounded in CogniBench + GME + BioScope; handles the most common cases. |
| 5 | Single evidence document | No multi-document consensus or evidence weighting. | Designed for single-pass evaluation. |
## Next Steps

- Simplify to v0.4 baseline: status-pair contradictions only, no frame detector
- Remove slot-mismatch guard (semantic relevance is a known limitation)
- 209 tests pass, zero false contradictions
- Run v0.4 eval on full 100-case development set
- Test on multiple models (1B, 4B, 70B+) to prove model independence
- Create held-out 50-case test set for unbiased evaluation
- Confidence calibration analysis
See DESIGN.md for the full architecture document.