FLES-2 v32 — Sparse Lexical Embeddings via Two-Pass Distillation
NDCG@10: 0.3119 | MRR: 0.5291 | NNZ: 420 (at eval) | Zero loss spikes
A sparse retrieval encoder that transforms text into interpretable, indexable sparse vectors using BERT's MLM predictions. Trained with a novel two-pass distillation methodology: ranking-heavy first (α=0.7), vocabulary-heavy second (α=0.3, low LR).
Model Description
FLES-2 v32 produces sparse vectors over a 30,522-dimensional vocabulary space (BERT WordPiece). Each dimension corresponds to a vocabulary term, and the weight indicates how strongly that term is predicted for the input text. The result is a bag-of-expanded-terms representation that can be indexed with standard inverted indices.
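To make the inverted-index claim concrete, here is a minimal sketch of indexing and searching term-weight dictionaries. The function names and the toy vectors are hypothetical illustrations, not part of the FLES-2 package or real model output, and dot-product scoring is an assumption about how these vectors would be ranked.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vec, top_k=10):
    """Score documents by the dot product of query and document weights."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        # Only documents sharing at least one term with the query are touched.
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical sparse vectors, not real model output.
docs = {
    "d1": {"machine": 1.5, "learning": 1.2, "model": 0.4},
    "d2": {"heart": 1.7, "disease": 1.3},
}
index = build_inverted_index(docs)
results = search(index, {"machine": 1.0, "learning": 1.0})
```

Only postings for the query's terms are visited, which is why lower NNZ generally translates into cheaper retrieval.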
Architecture
Text → BERT (bert-base-uncased) → MLM Head → log(1 + ReLU(logits)) → Max Pool → Sparse Vector
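The last two stages of the pipeline can be sketched in plain Python, without the BERT forward pass. The sketch assumes the MLM head yields one logit list per token and that padding tokens are excluded through an attention mask; the function names and the toy four-term vocabulary are hypothetical.

```python
import math

def sparse_activation(logits):
    """Apply log(1 + ReLU(x)) to one token's MLM logits."""
    return [math.log1p(max(x, 0.0)) for x in logits]

def encode_from_logits(token_logits, attention_mask):
    """Max-pool activated logits over non-padding tokens.

    token_logits: list of per-token lists, each of vocabulary length.
    attention_mask: list of 0/1 flags, one per token (assumed convention).
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # skip padding positions
        for j, value in enumerate(sparse_activation(logits)):
            if value > pooled[j]:
                pooled[j] = value
    return pooled

# Toy example: 3 tokens (the last is padding), vocabulary of 4 terms.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.0, -3.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, ignored
]
vec = encode_from_logits(toy_logits, [1, 1, 0])
```

ReLU zeroes every negative logit, which is where the sparsity comes from, and the logarithm dampens large activations so that no single term dominates.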
Training Methodology
Two-pass sparse self-distillation from a frozen teacher (mindoval/fles1-v12b):
- Pass 1 (f2-v15): Student=v7, Teacher=v12b, α=0.7 (ranking-dominant), lr=2e-5, 200K passages × 2 epochs
- Pass 2 (f2-v32): Student=f2-v15, Teacher=v12b, α=0.3 (vocabulary-dominant), lr=5e-6, 200K passages × 2 epochs
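One plausible reading of α, consistent with the "70% distillation, 30% ranking" note for pass 2, is a convex mix of a ranking loss and a distillation loss. The sketch below assumes that form; the function name and the loss values are hypothetical, and the actual training objective may include further terms such as the FLOPS regularizer.

```python
def combined_loss(ranking_loss, distill_loss, alpha):
    """Weight the ranking loss by alpha and the distillation loss by 1 - alpha."""
    return alpha * ranking_loss + (1.0 - alpha) * distill_loss

# Hypothetical per-batch loss values, chosen only to show the weighting.
ranking_loss, distill_loss = 2.0, 1.0
pass1 = combined_loss(ranking_loss, distill_loss, alpha=0.7)  # ranking-dominant
pass2 = combined_loss(ranking_loss, distill_loss, alpha=0.3)  # vocabulary-dominant
```

With the same underlying losses, pass 1 is driven mostly by the ranking term and pass 2 mostly by the teacher's term distribution.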
Key innovations:
- Teacher thresholding (t=0.3) prevents density transfer
- L1 FLOPS regularization (constant gradient, no density explosions)
- Epoch-level CLFR (Closed-Loop FLOPS Regulation) for sparsity control
- No loss spikes observed in either training pass
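The first two items can be illustrated with a small sketch. It assumes that teacher thresholding zeroes teacher weights below t, and that "L1 FLOPS" means summing the batch-mean absolute activation per vocabulary term. The squared variant is the usual FLOPS regularizer in SPLADE-style training and is shown only for contrast. All function names are hypothetical.

```python
def threshold_teacher(teacher_vec, t=0.3):
    """Zero out teacher weights below t so the low-weight tail is not distilled."""
    return [w if w >= t else 0.0 for w in teacher_vec]

def flops_l1(batch):
    """Sum over vocabulary terms of the mean absolute activation in the batch."""
    n = len(batch)
    vocab_size = len(batch[0])
    return sum(sum(abs(vec[j]) for vec in batch) / n for j in range(vocab_size))

def flops_l2(batch):
    """Squared variant of the FLOPS penalty, shown for contrast."""
    n = len(batch)
    vocab_size = len(batch[0])
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(vocab_size))

# Toy values, not real model output.
thresholded = threshold_teacher([1.2, 0.25, 0.0, 0.31])
batch = [[1.0, 0.0, 0.5], [3.0, 0.0, 0.5]]
l1_penalty = flops_l1(batch)
l2_penalty = flops_l2(batch)
```

Under this reading, the L1 penalty's gradient with respect to each active weight is a constant 1/n, whereas the squared penalty's gradient grows with the mean activation. That is consistent with the "constant gradient" remark in the list.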
Usage
from fles1_encoder import FLES1Encoder
# Load model
encoder = FLES1Encoder.from_pretrained("mindoval/fles2-v32")
# Encode text to sparse vector
sparse_vec = encoder.encode("What is machine learning?")
# Returns: {"machine": 1.82, "learning": 1.65, "artificial": 0.94, "intelligence": 0.87, ...}
# Batch encode
vectors = encoder.encode_batch(["query 1", "query 2"], batch_size=32)
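Because the documented return value is a term-to-weight dictionary, a vector can be inspected with ordinary dictionary operations. The helpers below are hypothetical and not part of the encoder package; the example dictionary reuses the four values shown in the usage comment, whereas a real vector would carry several hundred terms.

```python
def top_terms(sparse_vec, k=5):
    """Return the k highest-weighted (term, weight) pairs."""
    return sorted(sparse_vec.items(), key=lambda kv: kv[1], reverse=True)[:k]

def nnz(sparse_vec):
    """Count terms with a non-zero weight."""
    return sum(1 for w in sparse_vec.values() if w != 0.0)

# Values copied from the usage comment; a real vector has many more terms.
example = {"machine": 1.82, "learning": 1.65, "artificial": 0.94, "intelligence": 0.87}
strongest = top_terms(example, k=2)
count = nnz(example)
```

Listing the top terms is a quick way to check which expansion terms the model adds beyond the words in the input.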
Evaluation Results (nfcorpus)
| Metric | Score |
|---|---|
| NDCG@10 | 0.3119 |
| MRR | 0.5291 |
| Recall@100 | 0.2456 |
| Avg NNZ (non-zero terms) | 420 |
| Loss spikes during training | 0 |
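For readers unfamiliar with the ranking metrics in the table, the sketch below computes NDCG@k and reciprocal rank for a single query. It assumes linear gain (the relevance grade divided by a log2 rank discount); the evaluation tooling actually used for this card is not stated, so treat the exact formula as an assumption. MRR is the mean of the reciprocal rank over all queries.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(all_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(ranked_rels):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

# Toy query: the judged relevance grades for this query are [2, 1],
# and the system returned a non-relevant document first.
ranked = [0, 2, 1]
judged = [2, 1]
ndcg = ndcg_at_k(ranked, judged)
rr = reciprocal_rank(ranked)
```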
Training Details
| Parameter | Value |
|---|---|
| Base model | bert-base-uncased |
| Teacher | mindoval/fles1-v12b (frozen, thresholded at 0.3) |
| Student init (pass 2) | mindoval/fles2-v15 |
| Distillation loss | Sparse KL divergence |
| Alpha (pass 2) | 0.3 (70% distillation, 30% ranking) |
| Learning rate (pass 2) | 5e-6 |
| FLOPS regularization | L1, λ_d=0.00003 |
| Training data | 200K MS MARCO passages, 2 epochs |
| Total steps (pass 2) | 12,500 |
| Hardware | NVIDIA H100 NVL 80GB |
| Training time (pass 2) | ~3.9 hours |
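The table does not state a batch size, but one can be inferred from the other rows. The arithmetic below assumes each passage counts as one training example and each step consumes one batch with no gradient accumulation. The throughput figure uses the approximate 3.9-hour training time.

```python
passages = 200_000
epochs = 2
total_steps = 12_500
training_hours = 3.9  # approximate, from the table

# Examples consumed per optimizer step under the stated assumptions.
examples_per_step = passages * epochs / total_steps

# Rough average throughput over the whole run.
steps_per_second = total_steps / (training_hours * 3600)
```

Under these assumptions, the implied effective batch size is 32, at a little under one step per second.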
Comparison
| Model | NDCG@10 | NNZ | Method |
|---|---|---|---|
| FLES-2 v32 | 0.3119 | 420 | Two-pass sparse self-distillation |
| FLES-2 v15 | 0.3102 | 458 | Single-pass (α=0.7) |
| FLES-1 v14 | 0.3049 | 359 | L1 FLOPS + epoch CLFR |
| BM25 (Pyserini) | 0.325 | — | Unsupervised |
| SPLADE-cocondenser | 0.340 | 125 | L2 FLOPS + cross-encoder distillation |
Limitations
- Evaluated only on nfcorpus (medical domain). Performance on other BEIR datasets may vary.
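- A gap remains to BM25 (0.013 NDCG@10) and SPLADE-cocondenser (0.028 NDCG@10). Because the student is distilled from v12b, the teacher's quality acts as a ceiling.
- Sparse vectors are not smaller than dense ones: about 420 non-zero terms, each stored as an index-weight pair, versus 768 values for a typical dense embedding.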
Citation
@misc{fles2v32,
title={FLES-2: Two-Pass Sparse Self-Distillation for Learned Sparse Retrieval},
author={Tavarez, Golvis},
year={2026},
publisher={Mindoval, Inc.},
url={https://huggingface.co/mindoval/fles2-v32}
}
Acknowledgments
Built by Mindoval, Inc. Training compute provided by Microsoft Corporation (Azure ML, H100 GPUs).
License
Apache 2.0