
Project Verity-H v0.4

Teaching AI to say "I don't know."

When humans lack knowledge, they admit it: "I'm not sure", "I don't know", "let me check." LLMs don't. They fill gaps with plausible-sounding assumptions and present them as facts. Verity-H investigates whether a lightweight verification pipeline can enforce honest behavior: share what you know, flag what you don't, never silently guess.

The system lets an LLM answer a question, then verifies every claim against the provided evidence before the user sees it. Supported claims pass through. Unsupported claims get flagged. Contradictions get caught. The user sees what's verified vs. what's a guess, like talking to an honest colleague.

For the full architecture, research grounding, and design decisions, see DESIGN.md.


Quick Start

```bash
# Clone
git clone https://huggingface.co/Sravanth18/verity-h-prototype
cd verity-h-prototype

# Setup
python -m venv .venv && source .venv/bin/activate
pip install -e ".[test]"

# Run tests (mock mode, no API key needed)
pytest
```

Run Evaluation

```bash
# Set environment
export LLM_MODE=hf_api
export HF_API_KEY=your-key-here
export MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507
export LLM_CALL_DELAY=2

# Baselines
python -m src.baseline_runner --mode normal --output results/baseline_normal.jsonl
python -m src.baseline_runner --mode honesty --output results/baseline_honesty.jsonl

# Pipeline
python -m src.pipeline_runner --output results/verity_pipeline_v0.4.jsonl

# Batched (resumable if interrupted)
python run_pipeline_batched.py --delay 0.5 --output results/verity_pipeline_v0.4.jsonl

# Report
python -m src.report --normal results/baseline_normal.jsonl \
                     --honesty results/baseline_honesty.jsonl \
                     --pipeline results/verity_pipeline_v0.4.jsonl \
                     --output results/report.md
```
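The batched runner resumes after an interruption. A minimal sketch of how JSONL-based resumability can work; the `case_id` field and function names here are assumptions for illustration, not the project's actual code:

```python
import json
from pathlib import Path


def load_completed_ids(output_path: str) -> set:
    """Collect case IDs already written, so a rerun can skip them."""
    done = set()
    path = Path(output_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line)["case_id"])
    return done


def run_batched(cases: list, output_path: str, process) -> None:
    """Append one JSON line per case; previously finished cases are skipped."""
    done = load_completed_ids(output_path)
    with open(output_path, "a") as out:
        for case in cases:
            if case["case_id"] in done:
                continue
            result = process(case)  # the expensive LLM work happens here
            out.write(json.dumps({"case_id": case["case_id"], **result}) + "\n")
```

Because results are appended line by line, a crash mid-run loses at most the case in flight; rerunning with the same `--output` picks up where it left off.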

How It Works

```
Question + Evidence
       │
       ▼
  1. Split evidence into spans           (deterministic)
  2. Draft answer                        (LLM call #1)
  3. Extract + label claims              (LLM call #2)
  4. Post-process:                       (deterministic)
     • Filter junk/meta claims
     • Fix mislabeled claims via span matching
     • Detect inferential claims (4-tier)
     • Detect contradictions (status-pair only; numeric/date logged for audit)
  5. Gate decision                       (deterministic)
       │
       ▼
  Final answer with transparency metadata
```

2 LLM calls per case. Everything else is deterministic and auditable.
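Step 4's "fix mislabeled claims via span matching" can be sketched as a deterministic substring/fuzzy check. This is an illustrative version only; the project's actual matcher (covered by `test_span_matcher.py`) also handles numeric matching:

```python
import difflib


def relabel_by_span_match(claim: str, spans: list, threshold: float = 0.85) -> str:
    """Deterministic relabel: a claim counts as SUPPORTED only if it is a
    substring of some evidence span or fuzzy-matches one above `threshold`.
    Illustrative sketch; threshold and normalization are assumptions."""
    low = claim.lower().strip().rstrip(".")
    for span in spans:
        s = span.lower()
        if low in s:
            return "SUPPORTED"  # exact substring of an evidence span
        if difflib.SequenceMatcher(None, low, s).ratio() >= threshold:
            return "SUPPORTED"  # near-verbatim paraphrase of a span
    return "UNSUPPORTED"
```

The point of the deterministic pass is auditability: whatever the extractor LLM labeled, a claim only keeps SUPPORTED status if it can be traced back to an actual span.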

Pipeline Decisions

| Situation | Decision | What the user sees |
|---|---|---|
| All claims verified | `accept` | Clean answer from verified claims |
| Some claims unverified | `partial` | "What I can verify" + "What I cannot verify" |
| Status-pair contradiction (open/closed, approved/rejected, etc.) | `contradiction` | Flags the conflict, shows both sides |
| No evidence for the question | `needs_info` | "I don't have enough info" + what's needed |
| Speculative question (pressure=1) | `hypothesis` | Low-confidence guess with full caveats |
| Verifier failed to parse | `verifier_error` | Refuses to answer |
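A minimal sketch of how the gate step might map these situations to decisions. Rule ordering, parameter names, and the claim `status` field are assumptions for illustration, not the project's actual gate code:

```python
def gate(claims: list, has_evidence: bool = True,
         pressure: bool = False, verifier_ok: bool = True) -> str:
    """Map verification outcomes to a pipeline decision.
    Each claim dict is assumed to carry a 'status' of
    'supported', 'unsupported', or 'contradicted'."""
    if not verifier_ok:
        return "verifier_error"   # verifier output unparseable: refuse
    if pressure:
        return "hypothesis"       # speculative question: caveated guess
    if not has_evidence:
        return "needs_info"       # nothing to verify against
    statuses = {c["status"] for c in claims}
    if "contradicted" in statuses:
        return "contradiction"    # conflict takes priority over partial
    if "unsupported" in statuses:
        return "partial"          # split verified vs. unverified
    return "accept"               # everything traced to evidence
```

Keeping this step as a pure function of claim statuses is what makes the decision auditable: no LLM sits between verification and the final verdict.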

Inference Detection (v0.3.1+)

The verifier catches claims the LLM wrongly marks as SUPPORTED:

| Tier | What it catches | Example |
|---|---|---|
| 1. Epistemic hedges | "suggests", "consistent with", "most likely" | "Symptoms are consistent with bacterial infection" |
| 2. Logical leaps | "therefore", "based on these findings" | "Therefore the patient has strep throat" |
| 3. Deontic/normative | "should", "recommended", "indicated" | "Antibiotics should be started" |
| 4. Speculative questions | Question asks for judgment/prediction | "Should we invest?" → answer is inherently inferential |

Grounded in: CogniBench (arXiv:2505.20767), the GME modality taxonomy (arXiv:2106.08037), and the BioScope corpus.
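Tiers 1-3 can be approximated with regex cue lists over the claim text. The patterns below are illustrative only, not the project's actual detector; tier 4 keys off the question rather than the claim, so it is omitted here:

```python
import re

# Hypothetical cue patterns for tiers 1-3 (the real detector's patterns
# live in src and are exercised by test_inference_detector.py).
TIER_CUES = {
    1: r"\b(suggests?|consistent with|most likely|appears to)\b",  # epistemic hedges
    2: r"\b(therefore|thus|based on (these|the) findings)\b",      # logical leaps
    3: r"\b(should|recommended|indicated|must)\b",                 # deontic/normative
}


def inference_tier(claim: str):
    """Return the first tier whose cue pattern fires, or None if no cue is found.
    A non-None result means the claim is inferential and must not stay SUPPORTED."""
    low = claim.lower()
    for tier, pattern in TIER_CUES.items():
        if re.search(pattern, low):
            return tier
    return None
```

As the limitations section notes, a cue-list approach catches common hedges but not every inferential construction; that trade-off is deliberate for a deterministic baseline.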

Results (Qwen3-4B, 30 cases, v0.2.1)

| Metric | Baseline Normal | Baseline Honesty | Verity-H |
|---|---|---|---|
| Unsupported claim rate (↓) | 10% | 0% | 0% |
| Correct abstention (↑) | 70% | 100% | 100% |
| Grounded accept (↑) | 0% | 0% | 100% |
| Contradiction detection (↑) | 60% | 40% | 80% |
| Pressure hypothesis (↑) | 0% | 0% | 100% (v0.2.1) |
| False contradiction (↓) | 0% | 0% | 0% |
| Partial coverage (↑) | 0% | 0% | 100% |
| Latency p50 | 3,525 ms | 3,244 ms | 6,495 ms (v0.3, 2-call batch) |

See RESULTS_ARCHIVE.md for full version history.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `LLM_MODE` | `mock` | `mock` / `api` (OpenAI) / `hf_api` (Hugging Face) |
| `HF_API_KEY` | (none) | Hugging Face API key (for `hf_api` mode) |
| `OPENAI_API_KEY` | (none) | OpenAI API key (for `api` mode) |
| `MODEL_NAME` | `Qwen/Qwen3-4B-Instruct-2507` | Model to use |
| `LLM_TEMPERATURE` | `0.0` | Sampling temperature |
| `LLM_MAX_TOKENS` | `2048` | Max tokens per response |
| `LLM_CALL_DELAY` | `2` | Seconds between API calls (rate limiting) |
| `LLM_MAX_CALLS_PER_MINUTE` | `30` | Per-minute rate limit |
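A sketch of how these variables might be read, falling back to the documented defaults; this is illustrative, not necessarily the project's actual config loader:

```python
import os


def load_llm_config() -> dict:
    """Read the environment variables from the table above, applying the
    documented defaults when a variable is unset."""
    return {
        "mode": os.environ.get("LLM_MODE", "mock"),
        "model": os.environ.get("MODEL_NAME", "Qwen/Qwen3-4B-Instruct-2507"),
        "temperature": float(os.environ.get("LLM_TEMPERATURE", "0.0")),
        "max_tokens": int(os.environ.get("LLM_MAX_TOKENS", "2048")),
        "call_delay": float(os.environ.get("LLM_CALL_DELAY", "2")),
        "max_calls_per_minute": int(os.environ.get("LLM_MAX_CALLS_PER_MINUTE", "30")),
    }
```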

Gold Cases

100 cases across 6 categories (development set):

| Category | Count | Tests |
|---|---|---|
| `grounded` | 17 | All claims in evidence → accept |
| `missing_info` | 14 | Evidence doesn't cover the question → abstain |
| `contradiction` | 15 | Conflicting facts in evidence → flag |
| `pressure` | 15 | Speculative question → hypothesis with caveats |
| `filler_trap` | 15 | Tempts the model to invent facts → abstain |
| `partial_answer` | 24 | Some facts available, some not → partial |

100 total cases, development set only. Not a held-out evaluation.
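A hypothetical shape for a single gold case, with a minimal structural check. Field names here are illustrative guesses; the real schema is Pydantic-based (see `test_schemas.py`):

```python
# Categories come from the table above; field names are assumptions.
CATEGORIES = {"grounded", "missing_info", "contradiction",
              "pressure", "filler_trap", "partial_answer"}

example_case = {
    "case_id": "grounded_001",
    "category": "grounded",
    "question": "When does the store open?",
    "evidence": "The store opens at 9 am on weekdays.",
    "expected_decision": "accept",
}


def validate_case(case: dict) -> bool:
    """Minimal structural check: known category plus required string fields."""
    required = ("case_id", "category", "question", "evidence", "expected_decision")
    return (case.get("category") in CATEGORIES
            and all(isinstance(case.get(k), str) for k in required))
```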

Tests

209 tests covering all modules. Run with `pytest -v`.

```
tests/
├── test_calibration.py          # Table-format probe validation
├── test_claim_filter.py         # Slot-aware relevance filtering
├── test_constants.py            # Shared stop words
├── test_contradiction_checks.py # Status-pair contradictions + possible_conflict audit
├── test_evidence_spans.py       # Abbreviation-aware splitting
├── test_gate.py                 # All gate rules + edge cases
├── test_inference_detector.py   # All 4 tiers + exact failure cases
├── test_metrics.py              # Pipeline + baseline metrics
├── test_schemas.py              # Pydantic validation
├── test_span_matcher.py         # Substring/fuzzy/numeric matching
└── test_verifier.py             # Batch table parser + integration
```

What This Does NOT Do

  • No internet search or retrieval (RAG)
  • No vector databases
  • No fine-tuning
  • No UI or deployment
  • No GPU required

This is a research harness, not a product.

Known Limitations (v0.4)

The v0.4 baseline intentionally trades some detection for zero false positives and maintainable code.

| # | Limitation | Why | Mitigation |
|---|---|---|---|
| 1 | Numeric contradictions not caught deterministically | Money/percentage/count/date conflicts produce too many false positives (e.g., revenue target vs. actual revenue). | Relies on the verifier LLM; if the LLM misses one, the contradiction is not flagged. |
| 2 | Semantic relevance not enforced | "How fast can the car go?" with only engine specs supported → accept. v0.3.2 had a 20-entry synonym-table guard, but it was too rule-heavy for a baseline. | Acceptable for v0.4. Future: semantic similarity check (not a synonym table). |
| 3 | 100 cases = dev set only | The deterministic rules were tuned against failures on this set, so results are directional, not publication-grade. | Create a held-out 50-case test set for unbiased validation. |
| 4 | Inference detector is regex-based | Covers common hedges but cannot catch all inferential reasoning. | Grounded in CogniBench + GME + BioScope; handles the most common cases. |
| 5 | Single evidence document | No multi-document consensus or evidence weighting. | Designed for single-pass evaluation. |
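Limitation 1 can be made concrete with a toy example. The naive rule below ("two spans with different numbers conflict") is the kind of deterministic check v0.4 deliberately omits; the function and its regex are illustrative, not project code:

```python
import re


def naive_numeric_conflict(span_a: str, span_b: str) -> bool:
    """Naive deterministic rule: if both spans contain numbers and the
    number sets differ, call it a contradiction. This over-triggers."""
    nums_a = re.findall(r"\$?\d[\d,.]*", span_a)
    nums_b = re.findall(r"\$?\d[\d,.]*", span_b)
    return bool(nums_a and nums_b and set(nums_a) != set(nums_b))


# False positive: a target and an actual differ numerically without conflicting.
target = "The revenue target for Q3 was $5M."
actual = "Actual Q3 revenue came in at $4.2M."
```

Here `naive_numeric_conflict(target, actual)` fires even though a target and an actual are not contradictory facts, which is why v0.4 logs numeric/date mismatches for audit instead of flagging them deterministically.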

Next Steps

  • Simplify to v0.4 baseline β€” status-pair contradictions only, no frame detector
  • Remove slot-mismatch guard (semantic relevance is known limitation)
  • 209 tests pass, zero false contradictions
  • Run v0.4 eval on full 100-case development set
  • Test on multiple models (1B, 4B, 70B+) to prove model independence
  • Create held-out 50-case test set for unbiased evaluation
  • Confidence calibration analysis

See DESIGN.md for the full architecture document.
