FLES-2 v32 — Sparse Lexical Embeddings via Two-Pass Distillation

NDCG@10: 0.3119 | MRR: 0.5291 | NNZ: 420 (at eval) | Zero loss spikes

A sparse retrieval encoder that transforms text into interpretable, indexable sparse vectors using BERT's MLM predictions. Trained with a novel two-pass distillation methodology: ranking-heavy first (α=0.7), vocabulary-heavy second (α=0.3, low LR).

Model Description

FLES-2 v32 produces sparse vectors over a 30,522-dimensional vocabulary space (BERT WordPiece). Each dimension corresponds to a vocabulary term, and the weight indicates how strongly that term is predicted for the input text. The result is a bag-of-expanded-terms representation that can be indexed with standard inverted indices.
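
As a rough illustration of how such vectors fit a standard inverted index, the sketch below builds posting lists from term-weight dictionaries and scores a query by sparse dot product. The function names (`build_inverted_index`, `search`) and the toy weights are assumptions made for this example; they are not part of the FLES-2 package.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vec, top_k=10):
    """Score documents by the dot product of query and document weights."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Toy vectors (made-up weights, for illustration only).
docs = {
    "d1": {"machine": 1.5, "learning": 1.2, "model": 0.4},
    "d2": {"heart": 1.8, "disease": 1.1},
}
query = {"machine": 1.0, "learning": 1.0}
results = search(build_inverted_index(docs), query)
```

Only documents that share at least one non-zero term with the query are touched, which is why lower NNZ generally means cheaper retrieval.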

Architecture

Text → BERT (bert-base-uncased) → MLM Head → log(1 + ReLU(logits)) → Max Pool → Sparse Vector
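
The last two stages of this pipeline can be sketched in plain Python. This is a minimal illustration that assumes the MLM logits are already available as a tokens × vocabulary matrix; skipping padding positions via an attention mask is an assumption, since the card does not say how padding is handled, and `sparse_activation` is a hypothetical name.

```python
import math

def sparse_activation(logits, attention_mask):
    """Apply log(1 + ReLU(x)) per token, then max-pool over unmasked tokens.

    logits: list of per-token rows, each a list of vocabulary logits.
    attention_mask: list of 0/1 flags, one per token (0 = padding).
    """
    vocab_size = len(logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(logits, attention_mask):
        if not keep:
            continue  # assumption: padding tokens do not contribute
        for j, x in enumerate(row):
            value = math.log1p(max(x, 0.0))
            if value > pooled[j]:
                pooled[j] = value
    return pooled

# Toy input: 3 tokens (the last one is padding), vocabulary of 4 terms.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.0],
    [9.0, 9.0, 9.0, 9.0],
]
vector = sparse_activation(toy_logits, [1, 1, 0])
nonzero = {j: round(w, 4) for j, w in enumerate(vector) if w > 0}
```

Terms whose logits are never positive at any token end up with weight exactly zero, which is what makes the output sparse.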

Training Methodology

Two-pass sparse self-distillation from a frozen teacher (mindoval/fles1-v12b):

  1. Pass 1 (f2-v15): Student=v7, Teacher=v12b, α=0.7 (ranking-dominant), lr=2e-5, 200K passages × 2 epochs
  2. Pass 2 (f2-v32): Student=f2-v15, Teacher=v12b, α=0.3 (vocabulary-dominant), lr=5e-6, 200K passages × 2 epochs
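
The role of α can be sketched as a simple blend of the two objectives. The Training Details describe α=0.3 as "70% distillation, 30% ranking", so α is taken here to weight the ranking term; the exact linear form and the name `combined_loss` are assumptions, not the training code.

```python
def combined_loss(ranking_loss, distill_loss, alpha):
    """Blend the objectives: alpha weights ranking, (1 - alpha) weights distillation."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ranking_loss + (1.0 - alpha) * distill_loss

# Made-up per-batch loss values, used only to show the weighting.
pass1 = combined_loss(ranking_loss=1.0, distill_loss=2.0, alpha=0.7)  # ranking-dominant
pass2 = combined_loss(ranking_loss=1.0, distill_loss=2.0, alpha=0.3)  # vocabulary-dominant
```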

Key innovations:

  • Teacher thresholding (t=0.3) prevents density transfer
  • L1 FLOPS regularization (constant gradient, no density explosions)
  • Epoch-level CLFR (Closed-Loop FLOPS Regulation) for sparsity control
  • Zero loss spikes across all training (perfect stability)
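
Two of these ideas are simple enough to sketch. The code below is an illustration under stated assumptions, not the training implementation: it assumes thresholding keeps teacher weights at or above t, and that the L1 penalty is λ times the sum over terms of the batch-mean absolute weight. Both function names are hypothetical.

```python
def threshold_teacher(teacher_vec, t=0.3):
    """Drop teacher terms whose weight falls below t before distilling."""
    return {term: w for term, w in teacher_vec.items() if w >= t}

def l1_flops_penalty(batch_vectors, lam=3e-5):
    """L1-style penalty: lam times the sum over terms of the batch-mean |weight|."""
    if not batch_vectors:
        return 0.0
    totals = {}
    for vec in batch_vectors:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + abs(w)
    return lam * sum(totals.values()) / len(batch_vectors)

# Toy teacher output: weak terms are removed so the student does not copy them.
teacher = {"heart": 1.4, "cardiac": 0.6, "of": 0.29, "the": 0.1}
kept = threshold_teacher(teacher)

# Toy student batch of two sparse vectors.
penalty = l1_flops_penalty([{"a": 1.0, "b": 2.0}, {"a": 3.0}])
```

The gradient of an absolute value with respect to a weight has constant magnitude, whereas a squared penalty's gradient grows with the activation; this is the "constant gradient" property the list refers to.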

Usage

from fles1_encoder import FLES1Encoder

# Load model
encoder = FLES1Encoder.from_pretrained("mindoval/fles2-v32")

# Encode text to sparse vector
sparse_vec = encoder.encode("What is machine learning?")
# Returns: {"machine": 1.82, "learning": 1.65, "artificial": 0.94, "intelligence": 0.87, ...}

# Batch encode
vectors = encoder.encode_batch(["query 1", "query 2"], batch_size=32)
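
Because each dimension is a vocabulary term, a returned vector can be inspected directly. The helper below is a small sketch that assumes `encode` returns a plain term-to-weight dictionary, as the example output above suggests; `top_terms` is a hypothetical name, not part of the encoder API.

```python
def top_terms(sparse_vec, k=5):
    """Return the k highest-weighted (term, weight) pairs, ties broken alphabetically."""
    return sorted(sparse_vec.items(), key=lambda item: (-item[1], item[0]))[:k]

# Weights copied from the example output above; a real vector has many more terms.
example = {"machine": 1.82, "learning": 1.65, "artificial": 0.94, "intelligence": 0.87}
strongest = top_terms(example, k=2)
# strongest == [("machine", 1.82), ("learning", 1.65)]
```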

Evaluation Results (nfcorpus)

| Metric | Score |
| --- | --- |
| NDCG@10 | 0.3119 |
| MRR | 0.5291 |
| Recall@100 | 0.2456 |
| Avg NNZ (non-zero terms) | 420 |
| Loss spikes during training | 0 |
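
For readers unfamiliar with the two ranking metrics, here is a minimal per-query sketch. It assumes linear gain with a log2(rank + 1) discount for NDCG; the card does not name the evaluation tooling, so the reported numbers may come from a slightly different formulation. Reported scores would be the mean of these values over all queries.

```python
import math

def reciprocal_rank(ranked_ids, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if relevant.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant, k=10):
    """NDCG@k with linear gain and a log2(rank + 1) discount."""
    dcg = sum(
        relevant.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy judgments for one query (graded relevance, made up for illustration).
qrels = {"d1": 2, "d3": 1}
ranking = ["d2", "d1", "d3"]
rr = reciprocal_rank(ranking, qrels)
ndcg = ndcg_at_k(ranking, qrels)
```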

Training Details

| Parameter | Value |
| --- | --- |
| Base model | bert-base-uncased |
| Teacher | mindoval/fles1-v12b (frozen, thresholded at 0.3) |
| Student init (pass 2) | mindoval/fles2-v15 |
| Distillation loss | Sparse KL divergence |
| Alpha (pass 2) | 0.3 (70% distillation, 30% ranking) |
| Learning rate (pass 2) | 5e-6 |
| FLOPS regularization | L1, λ_d=0.00003 |
| Training data | 200K MS MARCO passages, 2 epochs |
| Total steps (pass 2) | 12,500 |
| Hardware | NVIDIA H100 NVL 80GB |
| Training time (pass 2) | ~3.9 hours |
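
The step count and data size together imply a batch size, which the card does not state explicitly. The arithmetic below assumes one optimizer step per batch, no gradient accumulation, and one training example per passage; if any of these does not hold, the implied value changes.

```python
passages = 200_000
epochs = 2
total_steps = 12_500

examples_seen = passages * epochs               # 400,000 examples over pass 2
implied_batch_size = examples_seen / total_steps
```

Under these assumptions the implied batch size works out to 32.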

Comparison

| Model | NDCG@10 | NNZ | Method |
| --- | --- | --- | --- |
| FLES-2 v32 | 0.3119 | 420 | Two-pass sparse self-distillation |
| FLES-2 v15 | 0.3102 | 458 | Single-pass (α=0.7) |
| FLES-1 v14 | 0.3049 | 359 | L1 FLOPS + epoch CLFR |
| BM25 (Pyserini) | 0.325 | n/a | Unsupervised |
| SPLADE-cocondenser | 0.340 | 125 | L2 FLOPS + cross-encoder distillation |

Limitations

  • Evaluated only on nfcorpus (medical domain). Performance on other BEIR datasets may vary.
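  • A gap in NDCG@10 remains to both BM25 (0.013) and SPLADE-cocondenser (0.028). Because the student is distilled from v12b, the teacher's quality sets the ceiling.
  • Sparse vectors are not necessarily smaller than dense ones: with about 420 non-zero terms, a vector stored as (term ID, weight) pairs holds roughly 840 values, compared with 768 for a 768-dimensional dense embedding.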

Citation

@misc{fles2v32,
  title={FLES-2: Two-Pass Sparse Self-Distillation for Learned Sparse Retrieval},
  author={Tavarez, Golvis},
  year={2026},
  publisher={Mindoval, Inc.},
  url={https://huggingface.co/mindoval/fles2-v32}
}

Acknowledgments

Built by Mindoval, Inc. Training compute provided by Microsoft Corporation (Azure ML, H100 GPUs).

License

Apache 2.0
