Model Card for EVHost

Status: Pre-publication release. Author list, contact email, and paper link are placeholders and will be updated upon paper acceptance. See CITATION.md for the current placeholder citation.

This is the model card for the EVHost fusion classifier that pairs with the Evo2 evolutionary language model for viral host prediction. The checkpoint file evhost_best.pt contains the trained FusionClassifier weights and all hyperparameters required for inference.

Model Details

Model name: EVHost (fusion classifier)
Architecture: FusionClassifier — a multi-layer perceptron that fuses a 1920-dim Evo2 embedding (post-projection) with 211-dim hand-crafted genomic features (CUB, dinucleotide, CPB, AA frequency, host adaptation, zoonotic) through a 512-dim hidden layer.
Parameters: ~2.5M trainable (fusion head only; Evo2 backbone is frozen at inference and not part of this checkpoint)
Checkpoint size: ~500 MB
Framework: PyTorch 2.7.0
License: MIT

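The 211-dim genomic-feature vector is the post-CPB-compression representation fed to the fusion MLP. Before CPB compression, the hand-crafted blocks are CUB 64 + dinuc 16 + CPB 256 + AA 20 + bridge-dinuc 16 + adaptation 24 + zoonotic 7 = 403 dims; adding the 512-dim Evo2 projection gives 915. The 211 figure is consistent with compressing the 256-dim CPB block to 64 dims (403 - 256 + 64 = 211). The fused dimensionality is reported as 1149; see src/evhost/models/fusion.py for the implementation and the authoritative dimensions.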
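The block widths listed in this card can be tallied with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys and the helper `genomic_dim` are hypothetical names, and the 64-dim compressed CPB width is inferred from the arithmetic rather than stated in the card, so the value stored in the checkpoint under `cpb_compressed_dim` should be treated as authoritative.

```python
# Block widths as listed in this card (before CPB compression).
FEATURE_BLOCKS = {
    "cub": 64,
    "dinuc": 16,
    "cpb": 256,
    "aa": 20,
    "bridge_dinuc": 16,
    "adaptation": 24,
    "zoonotic": 7,
}
EVO_PROJECTION_DIM = 512  # Evo2 projection width listed in this card


def genomic_dim(blocks, cpb_compressed_dim=None):
    """Total hand-crafted feature width, optionally with the CPB block compressed."""
    total = sum(blocks.values())
    if cpb_compressed_dim is not None:
        total = total - blocks["cpb"] + cpb_compressed_dim
    return total


pre_compression = genomic_dim(FEATURE_BLOCKS)
# ASSUMPTION: 64 is inferred from 211 = 403 - 256 + 64; read the real value
# from checkpoint["cpb_compressed_dim"].
post_compression = genomic_dim(FEATURE_BLOCKS, cpb_compressed_dim=64)

print(pre_compression)                       # 403
print(post_compression)                      # 211
print(pre_compression + EVO_PROJECTION_DIM)  # 915
```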

Intended Use

Primary use

Research: predict the human-host probability of a viral genome sequence given pre-computed Evo2 embeddings and genomic features. Designed for viral surveillance, zoonotic-risk screening, and pandemic-preparedness research.

Out-of-scope use

Not a clinical diagnostic tool. Do not use for patient-level decision-making or pathogen identification in clinical workflows.
Not validated for non-viral sequences (bacteria, plasmids, host genomes).
Not validated for sequences substantially diverged from NCBI/VHDB reference taxa used during training (coronaviruses, influenza, rabies, etc.).

How to Use

Option 1 — Direct download (curl/wget)

mkdir -p models
curl -L "https://huggingface.co/Adorably/EVHost/resolve/main/evhost_best.pt" -o models/evhost_best.pt

Option 2 — Python via `huggingface_hub`

from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="Adorably/EVHost",
    filename="evhost_best.pt",
    local_dir="models",
)
checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)

Loading and running inference

import torch
from evhost.models import FusionClassifier
checkpoint = torch.load("models/evhost_best.pt", map_location="cpu", weights_only=False)
model = FusionClassifier(
    d_evo=checkpoint["d_evo"],
    k_bio=checkpoint["k_bio"],
    d_bio=checkpoint["d_bio"],
    fusion_hidden=checkpoint["fusion_hidden"],
    evo_reduced_dim=checkpoint["evo_reduced_dim"],
    cpb_dim=checkpoint["cpb_dim"],
    cpb_compressed_dim=checkpoint["cpb_compressed_dim"],
).to("cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
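# Inputs (prepare these beforehand; they are not defined in this snippet):
#   `embedding`: 1920-dim Evo2 embedding
#   `cpb`:       256-dim codon-pair-bias (CPB) vector, compressed inside the model
#   `non_cpb`:   the remaining hand-crafted features concatenated (all blocks except CPB);
#                together with the compressed CPB block they form the 211-dim genomic vector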
with torch.no_grad():
    logit = model(embedding, cpb, non_cpb)
    probability = torch.sigmoid(logit).item()

For a full end-to-end demo (FASTA → embedding + features → prediction), see the GitHub repository: examples/simple_prediction.py and scripts/predict_host.py.
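Before constructing the model, it can help to confirm that the loaded checkpoint carries every key the loading snippet reads. The sketch below uses only the key names that appear in that snippet; the helper `missing_checkpoint_keys` is a hypothetical name, not part of the `evhost` package.

```python
# Keys read by the loading snippet in this card.
REQUIRED_KEYS = (
    "d_evo", "k_bio", "d_bio", "fusion_hidden",
    "evo_reduced_dim", "cpb_dim", "cpb_compressed_dim",
    "model_state_dict",
)


def missing_checkpoint_keys(checkpoint):
    """Return the required keys that are absent from a loaded checkpoint dict."""
    return [key for key in REQUIRED_KEYS if key not in checkpoint]


# Stand-in dict for illustration; in practice pass the result of torch.load(...).
example = {key: None for key in REQUIRED_KEYS if key != "cpb_dim"}
print(missing_checkpoint_keys(example))  # ['cpb_dim']
```

An empty list means the checkpoint has everything the `FusionClassifier(...)` call and `load_state_dict` need.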

Training Data

Source corpora: NCBI Virus + VHDB + BERT-Infect.
Total sequences: 128,761 viral genome records across 26 families.
Labels: 64,680 labeled-positive (Homo sapiens host records); 64,081 unlabeled (all other hosts).
Observed positive fraction: 50.2%; estimated true positive class prior π̂ = 0.62 (95% CI 0.59–0.65).
Split: date-based — <2015 train, 2015–2018 validation, >2018 prediction. Sarbecovirus sequences were excluded from training and used only for prediction.
Data redistribution: the original training data is not redistributed with this checkpoint. Users must obtain it from the source databases per their respective licenses.
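The date-based split described above can be written as a small rule. This is a sketch under assumptions: the function name `assign_split` and its arguments are hypothetical, and which date field (collection or release year) the authors used is not stated in this card.

```python
def assign_split(year, is_sarbecovirus=False):
    """Map a record to 'train', 'validation', or 'prediction'.

    Rule from this card: <2015 train, 2015-2018 validation, >2018 prediction,
    with Sarbecovirus records held out of training and used only for prediction.
    ASSUMPTION: `year` is whichever date field the authors split on.
    """
    if is_sarbecovirus or year > 2018:
        return "prediction"
    if year < 2015:
        return "train"
    return "validation"


print(assign_split(2012))                        # train
print(assign_split(2016))                        # validation
print(assign_split(2021))                        # prediction
print(assign_split(2012, is_sarbecovirus=True))  # prediction
```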

Training Procedure

Backbone: Evo2 (1B-parameter), fine-tuned on viral sequences for ~20 hours on a single NVIDIA H100 (4,688 optimization steps, sequence length 8,192, effective batch size 32, peak LR 1.5×10⁻⁵, min LR 1.5×10⁻⁶, 400 warm-up steps). The fine-tuned Evo2 backbone is not included in this checkpoint — see the Evo2 project for backbone weights.
Fusion head: trained for 40 epochs (final 15 used for checkpoint selection) with batch size 64, learning rates 3×10⁻⁵ (Evo2 projection) / 1×10⁻⁴ (fusion head) / 3×10⁻⁴ (BioMLP), weight decay 5×10⁻³, gradient clip 0.65.
Objective: positive-unlabeled (PU) learning with nnPU correction (β = 0.1). An architecture-matched binary-cross-entropy variant (EVHost(BC)) is also trained for direct comparison with baselines.
Reproducibility: see the GitHub repository — training scripts under scripts/train/, configuration under configs/.
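For readers unfamiliar with the nnPU objective, the following framework-free sketch shows the non-negative risk estimator of Kiryo et al. (2017) on plain Python lists of logits. It is not the repository's training code: the sigmoid surrogate loss, the step discount `gamma`, and the function names are assumptions made for illustration, while the class prior 0.62 and β = 0.1 are the values quoted in this card. In training, the same expression would be computed on tensors so that the returned objective can be backpropagated.

```python
import math


def sigmoid_loss(logit, label):
    """Sigmoid surrogate loss 1 / (1 + exp(label * logit)) for label in {+1, -1}.

    Written with tanh so that large logits do not overflow.
    """
    return 0.5 * (1.0 - math.tanh(0.5 * label * logit))


def mean(values):
    return sum(values) / len(values)


def nnpu_risk(pos_logits, unl_logits, prior, beta=0.0, gamma=1.0):
    """Return (reported_risk, objective_to_minimize) for one mini-batch.

    pos_logits: logits of labeled-positive examples
    unl_logits: logits of unlabeled examples
    prior:      class prior pi (fraction of true positives in the population)
    """
    r_p_pos = mean([sigmoid_loss(z, +1) for z in pos_logits])  # positives as positive
    r_p_neg = mean([sigmoid_loss(z, -1) for z in pos_logits])  # positives as negative
    r_u_neg = mean([sigmoid_loss(z, -1) for z in unl_logits])  # unlabeled as negative

    negative_part = r_u_neg - prior * r_p_neg
    reported_risk = prior * r_p_pos + max(0.0, negative_part)

    if negative_part < -beta:
        # Negative-class risk estimate went too far below zero (overfitting signal):
        # step against it instead of minimizing the full risk.
        objective = -gamma * negative_part
    else:
        objective = prior * r_p_pos + negative_part
    return reported_risk, objective


risk, objective = nnpu_risk([2.0, 3.0], [-1.0, 0.5, 2.5], prior=0.62, beta=0.1)
print(risk >= 0.0)  # True
```

The `max(0, ...)` clamp is what distinguishes nnPU from the unbiased PU estimator: it keeps the estimated negative-class risk from going negative when a flexible model overfits the labeled positives.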

Evaluation

Performance below is on the post-2018 held-out prediction set, using host-isolate labels as the reference. Recall / precision / F1 / ROC-AUC values are taken from Figure 2 of the paper; species-level detection rates are taken from Figure 4a.

Sequence-level metrics

Model	F1	Recall	Precision	ROC-AUC
EVHost (PU)	0.789	0.994	0.665	0.953
EVHost (BC)	0.901	0.934	0.870	0.958
BERT-Infect (DNABERT)	0.891	0.956	0.834	—
BERT-Infect (VIBE)	0.813	0.955	0.706	—
DeePaC-vir	0.724	0.663	0.798	—
Zoonotic rank	0.928	0.955	0.903	—
BLAST	0.844	0.942	0.764	—
kNN	0.893	0.929	0.860	—

Notes:

EVHost(PU) trades precision for recall by design. Many of its "false positives" under incomplete host-isolate labels map to species with independent human-host evidence in VHDB, so the precision value understates true performance.
EVHost(BC) achieves the second-highest F1 on this set (0.901, behind Zoonotic rank at 0.928), showing that the fusion architecture is competitive even without PU learning.
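The F1 column is the harmonic mean of the precision and recall columns, so individual rows can be spot-checked. A minimal sketch (the helper name `f1_from_pr` is hypothetical), using the EVHost (BC) and Zoonotic rank rows of the table:

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_from_pr(0.870, 0.934), 3))  # 0.901  (EVHost (BC))
print(round(f1_from_pr(0.903, 0.955), 3))  # 0.928  (Zoonotic rank)
```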

Limitations and Biases

Host label noise. Host-isolate labels from NCBI/VHDB are incomplete and biased toward well-studied zoonoses. Both EVHost variants may under-predict human host probability for viruses with sparse surveillance records (especially non-mammalian reservoirs).
Taxonomic coverage. Trained on 26 viral families; performance on families outside this set (e.g., bacteriophages, plant viruses) is unvalidated.
Length dependence. Evo2 embeddings are computed at 8,192-token context. Sequences substantially shorter or longer are truncated/aggregated by the upstream embedding script, which may degrade performance.
Temporal drift. The post-2018 test set is used for headline metrics, but viral evolution outpaces any fixed model. Periodic re-evaluation is required.
Confirmation bias from PU learning. The PU-trained variant optimistically assumes the unlabeled set is mostly negative; for surveillance use, prefer the (BC) variant if false-positive cost is high.
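To make the length dependence concrete, the sketch below splits a genome of a given length into consecutive windows of at most 8,192 positions. It illustrates one possible windowing scheme only: the helper `window_spans` is hypothetical, one token per nucleotide is assumed, and the upstream embedding script may truncate or aggregate differently.

```python
def window_spans(sequence_length, window=8192):
    """Return (start, end) index pairs covering a sequence in non-overlapping windows.

    ASSUMPTION: one token per nucleotide and no overlap between windows.
    """
    return [
        (start, min(start + window, sequence_length))
        for start in range(0, sequence_length, window)
    ]


print(window_spans(20000))  # [(0, 8192), (8192, 16384), (16384, 20000)]
print(window_spans(100))    # [(0, 100)]
```

A genome much longer than the context yields many windows whose embeddings must be aggregated, while a very short one fills only a fraction of a single window; both cases differ from the bulk of the training distribution.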

How to Cite

See CITATION.md in the repository for the current (placeholder) citation. The canonical citation will be added upon paper acceptance.

License

MIT — see the LICENSE file in the repository.

Acknowledgments

Evo2 model by the Arc Institute.
NCBI Virus and VHDB for viral genome and host metadata.
BERT-Infect dataset authors.
