FLES-2 v32 — Sparse Lexical Embeddings via Two-Pass Distillation

NDCG@10: 0.3119 | MRR: 0.5291 | NNZ: 420 (at eval) | Zero loss spikes

FLES-2 v32 is a sparse retrieval encoder that turns text into interpretable, indexable sparse vectors using BERT's MLM predictions. It is trained with a two-pass distillation methodology: a ranking-heavy first pass (α=0.7, lr=2e-5) followed by a vocabulary-heavy second pass (α=0.3, lr=5e-6).

Model Description

FLES-2 v32 produces sparse vectors over a 30,522-dimensional vocabulary space (BERT WordPiece). Each dimension corresponds to a vocabulary term, and the weight indicates how strongly that term is predicted for the input text. The result is a bag-of-expanded-terms representation that can be indexed with standard inverted indices.
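As a rough illustration of the inverted-index point, the sketch below stores term-to-weight vectors as postings lists and scores a query by sparse dot product. The function names `build_inverted_index` and `search`, the document IDs, and the weights are all made up for the example; this is not the indexing code used for the reported results.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def search(index, query_vec, top_k=10):
    """Score documents by sparse dot product, reading only the query's postings lists."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy vectors with invented weights (hypothetical, for illustration only).
docs = {
    "d1": {"machine": 1.5, "learning": 1.0},
    "d2": {"cooking": 2.0, "learning": 0.5},
}
index = build_inverted_index(docs)
results = search(index, {"machine": 1.0, "learning": 2.0})
# d1 scores 1.5*1.0 + 1.0*2.0 = 3.5, d2 scores 0.5*2.0 = 1.0
```

Only documents that share at least one term with the query are ever touched, which is why sparsity (NNZ) directly affects retrieval cost.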

Architecture

Text → BERT (bert-base-uncased) → MLM Head → log(1 + ReLU(logits)) → Max Pool → Sparse Vector
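The last two stages of this pipeline can be sketched in plain Python on a toy logit grid. This is a minimal sketch under assumptions: the four-term vocabulary and the logit values are invented, the helper names are hypothetical, and a real implementation would run on tensors and would presumably mask padding positions before pooling.

```python
import math


def sparse_activation(logits):
    """Apply log(1 + ReLU(x)) to a [seq_len][vocab_size] grid of MLM logits."""
    return [[math.log1p(max(x, 0.0)) for x in row] for row in logits]


def max_pool(token_weights):
    """Keep the maximum weight per vocabulary dimension across token positions."""
    return [max(column) for column in zip(*token_weights)]


def to_sparse(weights, id_to_term):
    """Drop zero entries and key the remaining weights by vocabulary term."""
    return {id_to_term[i]: w for i, w in enumerate(weights) if w > 0.0}


# Toy example: 3 token positions, 4 vocabulary terms (values are invented).
vocab = ["machine", "cooking", "the", "learning"]
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [-3.0, -0.5, 0.0, 1.5],
    [0.0, -2.0, 0.0, -1.0],
]
pooled = max_pool(sparse_activation(logits))
sparse_vec = to_sparse(pooled, vocab)
# "machine" keeps log(1 + 2.0), "learning" keeps log(1 + 1.5);
# "cooking" and "the" never receive a positive logit, so they are dropped.
```

The ReLU is what produces exact zeros, and the log dampens very large logits so that a single term cannot dominate the vector.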

Training Methodology

Two-pass sparse self-distillation from a frozen teacher (mindoval/fles1-v12b):

Pass 1 (f2-v15): Student=v7, Teacher=v12b, α=0.7 (ranking-dominant), lr=2e-5, 200K passages × 2 epochs
Pass 2 (f2-v32): Student=f2-v15, Teacher=v12b, α=0.3 (vocabulary-dominant), lr=5e-6, 200K passages × 2 epochs
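One way to read α is as the weight on the ranking loss, with the remainder going to the distillation loss. That reading is inferred from the labels "ranking-dominant" at α=0.7 and "70% distillation, 30% ranking" at α=0.3; the convex-combination form and the function name `combined_loss` are assumptions, and the actual training code may combine the terms differently.

```python
def combined_loss(ranking_loss, distill_loss, alpha):
    """Assumed objective: alpha weights ranking, (1 - alpha) weights distillation."""
    return alpha * ranking_loss + (1.0 - alpha) * distill_loss


# Same placeholder loss values in both passes, to show only the effect of alpha.
ranking_loss, distill_loss = 1.0, 2.0
pass1 = combined_loss(ranking_loss, distill_loss, alpha=0.7)  # about 1.3, ranking-dominant
pass2 = combined_loss(ranking_loss, distill_loss, alpha=0.3)  # about 1.7, distillation-dominant
```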

Key innovations:

Teacher thresholding (t=0.3), which keeps the teacher's density from transferring to the student
L1 FLOPS regularization (constant gradient, no density explosions)
Epoch-level CLFR (Closed-Loop FLOPS Regulation) for sparsity control
No loss spikes at any point in training
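The first two items can be sketched on term-to-weight dictionaries. This is a hedged illustration, not the training code: it assumes thresholding means dropping teacher weights at or below t, and that the L1 penalty is λ times the mean per-vector sum of weights. The helper names and example weights are hypothetical. The "constant gradient" remark follows from the L1 form, since the derivative of λ·w with respect to any positive weight is λ regardless of the weight's size, whereas a squared FLOPS penalty has a gradient that grows with the average activation.

```python
def threshold_teacher(teacher_vec, t=0.3):
    """Assumed thresholding rule: keep only teacher terms with weight above t."""
    return {term: w for term, w in teacher_vec.items() if w > t}


def l1_penalty(batch_vectors, lam):
    """Assumed L1 sparsity penalty: lam times the mean sum of |weight| per vector."""
    total = sum(sum(abs(w) for w in vec.values()) for vec in batch_vectors)
    return lam * total / len(batch_vectors)


# Invented teacher output with a low-weight tail.
teacher = {"machine": 1.8, "learning": 1.2, "the": 0.1, "of": 0.3}
targets = threshold_teacher(teacher, t=0.3)  # only "machine" and "learning" survive

# Invented student batch; lam matches the λ_d listed under Training Details.
student_batch = [{"a": 1.0, "b": 2.0}, {"c": 3.0}]
penalty = l1_penalty(student_batch, lam=0.00003)
```

With thresholded targets, the student is never asked to reproduce the teacher's many small weights, which is one plausible reading of how density transfer is avoided.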

Usage

from fles1_encoder import FLES1Encoder

# Load model
encoder = FLES1Encoder.from_pretrained("mindoval/fles2-v32")

# Encode text to sparse vector
sparse_vec = encoder.encode("What is machine learning?")
# Returns: {"machine": 1.82, "learning": 1.65, "artificial": 0.94, "intelligence": 0.87, ...}

# Batch encode
vectors = encoder.encode_batch(["query 1", "query 2"], batch_size=32)
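Assuming `encode` returns a plain term-to-weight dict, as the comment above suggests, the output can be inspected with ordinary Python. The helpers `top_terms` and `nnz` below are hypothetical conveniences, not part of the encoder package, and the example vector copies only the four terms shown in that comment.

```python
def top_terms(sparse_vec, k=5):
    """Return the k highest-weighted (term, weight) pairs, strongest first."""
    return sorted(sparse_vec.items(), key=lambda item: item[1], reverse=True)[:k]


def nnz(sparse_vec):
    """Count the non-zero terms in a sparse vector."""
    return sum(1 for w in sparse_vec.values() if w != 0.0)


# Stand-in for encoder.encode(...) output, truncated to the terms shown above.
example_vec = {"machine": 1.82, "learning": 1.65, "artificial": 0.94, "intelligence": 0.87}
strongest = top_terms(example_vec, k=2)
count = nnz(example_vec)
```

Looking at the top terms is a quick way to check which expansion terms (here "artificial" and "intelligence") the model added beyond the words in the input.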

Evaluation Results (nfcorpus)

Metric	Score
NDCG@10	0.3119
MRR	0.5291
Recall@100	0.2456
Avg NNZ (non-zero terms)	420
Loss spikes during training	0
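For readers unfamiliar with the ranking metrics, the sketch below computes NDCG@10 and reciprocal rank for a single query; averaging over all queries gives the reported NDCG@10 and MRR. It assumes linear gain (the relevance grade itself) in the DCG numerator, which is one common convention. The evaluation tooling behind the numbers above is not stated, so treat this as a definition aid rather than a reproduction script. Document IDs and grades are invented.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gain and log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """DCG of the system ranking divided by DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids, relevance):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


# One toy query: graded judgments and a ranking that puts a non-relevant doc first.
relevance = {"d1": 2, "d2": 1}
ranking = ["d3", "d1", "d2"]
ndcg = ndcg_at_k(ranking, relevance)      # about 0.67
rr = reciprocal_rank(ranking, relevance)  # first relevant doc is at rank 2
```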

Training Details

Parameter	Value
Base model	bert-base-uncased
Teacher	mindoval/fles1-v12b (frozen, thresholded at 0.3)
Student init (pass 2)	mindoval/fles2-v15
Distillation loss	Sparse KL divergence
Alpha (pass 2)	0.3 (70% distillation, 30% ranking)
Learning rate (pass 2)	5e-6
FLOPS regularization	L1, λ_d=0.00003
Training data	200K MS MARCO passages, 2 epochs
Total steps (pass 2)	12,500
Hardware	NVIDIA H100 NVL 80GB
Training time (pass 2)	~3.9 hours

Comparison

Model	NDCG@10	NNZ	Method
FLES-2 v32	0.3119	420	Two-pass sparse self-distillation
FLES-2 v15	0.3102	458	Single-pass (α=0.7)
FLES-1 v14	0.3049	359	L1 FLOPS + epoch CLFR
BM25 (Pyserini)	0.325	—	Unsupervised
SPLADE-cocondenser	0.340	125	L2 FLOPS + cross-encoder distillation

Limitations

Evaluated only on nfcorpus (medical domain). Performance on other BEIR datasets may vary.
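A gap in NDCG@10 remains to BM25 (0.013) and SPLADE-cocondenser (0.028). The teacher (v12b) sets the ceiling for the distilled student.
Sparse vectors are not compact: each text averages 420 non-zero terms, and every term needs both an ID and a weight, so storage can exceed that of a 768-dimensional dense vector.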
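The size comparison in the last point can be checked with back-of-envelope arithmetic. The byte widths below (4-byte term IDs, 4-byte float weights, 4-byte dense values) are assumptions; quantized weights or compressed postings would change the result.

```python
def sparse_bytes(nnz, id_bytes=4, weight_bytes=4):
    """Storage for a sparse vector kept as (term_id, weight) pairs."""
    return nnz * (id_bytes + weight_bytes)


def dense_bytes(dim, value_bytes=4):
    """Storage for a dense vector with one value per dimension."""
    return dim * value_bytes


sparse_size = sparse_bytes(420)  # 420 * 8 = 3360 bytes
dense_size = dense_bytes(768)    # 768 * 4 = 3072 bytes
```

Under these assumptions the sparse vector is slightly larger even though it has fewer entries, because each entry carries its term ID as well as its weight.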

Citation

@misc{fles2v32,
  title={FLES-2: Two-Pass Sparse Self-Distillation for Learned Sparse Retrieval},
  author={Tavarez, Golvis},
  year={2026},
  publisher={Mindoval, Inc.},
  url={https://huggingface.co/mindoval/fles2-v32}
}

Acknowledgments

Built by Mindoval, Inc. Training compute provided by Microsoft Corporation (Azure ML, H100 GPUs).

License

Apache 2.0
