---
language:
- en
- es
- fr
- de
- it
- hi
- mr
- sa
- kn
- te
- ta
- ml
- zh
- ja
- ko
- ar
- bn
- gu
- or
- pa
- ru
- th
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
- multimodal
- multilingual
- document-retrieval
- 22-languages
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
datasets:
- Cognitive-Lab/nayanair-bench
model-index:
- name: ColNetraEmbed
  results:
  - task:
      type: image-text-retrieval
      name: Cross-Lingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Cross-Lingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.637
      name: NDCG@5
    - type: recall_at_10
      value: 0.700
      name: Recall@10
    - type: map_at_10
      value: 0.610
      name: MAP@10
    - type: mrr_at_10
      value: 0.610
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: Monolingual Document Retrieval
    dataset:
      type: Cognitive-Lab/nayanair-bench
      name: Nayana-IR Monolingual
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.670
      name: NDCG@5
    - type: recall_at_10
      value: 0.764
      name: Recall@10
    - type: map_at_10
      value: 0.645
      name: MAP@10
    - type: mrr_at_10
      value: 0.686
      name: MRR@10
  - task:
      type: image-text-retrieval
      name: English Document Retrieval
    dataset:
      type: vidore/vidore-benchmark
      name: ViDoRe v2
      split: test
    metrics:
    - type: ndcg_at_5
      value: 0.551
      name: NDCG@5
    - type: recall_at_10
      value: 0.664
      name: Recall@10
    - type: map_at_10
      value: 0.445
      name: MAP@10
    - type: mrr_at_10
      value: 0.445
      name: MRR@10
---

# ColNetraEmbed

[Paper](https://arxiv.org/abs/2512.03514)
[GitHub](https://github.com/adithya-s-k/colpali)
[Model](https://huggingface.co/Cognitive-Lab/ColNetraEmbed)
[Blog](https://www.cognitivelab.in/blog/introducing-netraembed)
[Demo](https://huggingface.co/spaces/AdithyaSK/NetraEmbed)
[Inference notebook](https://huggingface.co/Cognitive-Lab/ColNetraEmbed/blob/main/ColNetraEmbed_InferenceDemo.ipynb)
[Gradio demo notebook](https://huggingface.co/Cognitive-Lab/NetraEmbed/blob/main/NetraEmbed_Gradio_Demo_final.ipynb)

**ColNetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on the Gemma3 backbone and using ColBERT-style multi-vector representations.

## Model Description

ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).

- **Model Type:** Multilingual Multimodal Embedding Model with ColPali-style multi-vector representations
- **Architecture:** ColPali with Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search

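
As a rough illustration of the late-interaction (MaxSim) scoring described above, the following plain-Python sketch scores one query against two pages using toy 2-dimensional vectors. The helper names (`dot`, `maxsim_score`) and the toy data are assumptions made for illustration; this is not the library's implementation, which works on batched tensors of 128-dimensional embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score for one (query, document) pair.

    For each query token, keep the similarity of its best-matching document
    vector, then sum those maxima over all query tokens.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.5, 0.5]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)  # 2.0 1.0
```

Because every query token is matched independently, a page scores highly only when it contains a good match for each part of the query, which is what makes the matching fine-grained.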
## Paper

**[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```

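
The Quick Start reports only the single best image per query. When several candidates are wanted, one row of the score matrix can be sorted instead. The sketch below works on a plain list of floats so it runs without the model; the helper name `top_k_pages` and the example scores are hypothetical.

```python
from typing import List, Tuple


def top_k_pages(score_row: List[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs for the k highest-scoring pages."""
    ranked = sorted(enumerate(score_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for one query against four pages.
example_row = [12.4, 18.9, 7.3, 15.1]
print(top_k_pages(example_row, k=2))  # [(1, 18.9), (3, 15.1)]
```

With the Quick Start variables, a row in this form could be obtained with `scores[i].float().tolist()`.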
## Use Cases

- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-lingual Document Search:** Multilingual visual document retrieval

## Model Details

- **Base Model:** [Gemma3-4B-IT](https://huggingface.co/google/gemma-3-4b-it)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (Late Interaction)
- **Similarity Function:** MaxSim (Maximum Similarity)

## Performance

ColNetraEmbed is evaluated on [Nayana-IR Bench](https://huggingface.co/collections/Cognitive-Lab/nayanair-bench) (22 languages) and ViDoRe v2. Among the compared models it ranks first on both Nayana-IR splits and is mid-pack on ViDoRe v2 (NDCG@5 of 0.551 against a best of 0.592).

### Benchmark Results

**Nayana-IR Cross-Lingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.637** | **0.700** | **0.610** | **0.610** |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

**Nayana-IR Monolingual**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| **ColNetraEmbed** | **0.670** | **0.764** | **0.645** | **0.686** |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

**ViDoRe v2**

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|-------|:------:|:---------:|:------:|:------:|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| **ColNetraEmbed** | **0.551** | **0.664** | **0.445** | **0.445** |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

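
For readers unfamiliar with the column headers, the sketch below gives textbook per-query definitions of the four metrics under binary relevance. It is an assumption that the benchmarks use these exact variants; their evaluation code may handle graded relevance or ties differently, and the reported numbers are averages over all queries. Function names and the example ranking are hypothetical.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Average precision over the top k (MAP is its mean over queries)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Discounted cumulative gain in the top k, normalized by the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking for one query; pages p1 and p2 are the relevant ones.
ranked_pages = ["p3", "p1", "p7", "p2", "p9"]
relevant_pages = {"p1", "p2"}

print(recall_at_k(ranked_pages, relevant_pages, 10))  # 1.0
print(mrr_at_k(ranked_pages, relevant_pages, 10))     # 0.5
print(ap_at_k(ranked_pages, relevant_pages, 10))      # 0.5
print(round(ndcg_at_k(ranked_pages, relevant_pages, 5), 3))  # 0.651
```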
**Key Results:**

- **Strong multilingual performance** with ColBERT-style late interaction
- **124% relative improvement** in NDCG@5 over ColPali-v1.3 on Nayana-IR Cross-Lingual (0.637 vs. 0.284)
- Supports **22 languages** across diverse script families
- **Fine-grained matching** through token-level MaxSim scoring

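
The 124% figure matches the relative NDCG@5 gain over ColPali-v1.3 in the Nayana-IR Cross-Lingual table. The short check below reproduces that arithmetic; the function and variable names are illustrative, and the two scores are copied from the table.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# NDCG@5 on Nayana-IR Cross-Lingual, copied from the benchmark table.
colnetraembed_ndcg5 = 0.637
colpali_v13_ndcg5 = 0.284

gain = relative_improvement(colnetraembed_ndcg5, colpali_v13_ndcg5)
print(f"{gain:.0f}%")  # 124%
```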
**Comparison: Multi-vector vs Single-vector**

- ColNetraEmbed (multi-vector): More interpretable, with token-level attribution
- NetraEmbed (single-vector): Higher accuracy (0.716 vs. 0.637 NDCG@5 on Nayana-IR Cross-Lingual) and 250x more efficient storage

See our [paper](https://arxiv.org/abs/2512.03514) for comprehensive evaluation and architectural comparisons.

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

This work benefited from compute credits for training, inference, and evaluation provided by [Modal](https://modal.com), our compute sponsor. Dataset curation and synthesis were supported by the [Meta LLaMA Impact Grant](https://about.fb.com/news/2025/04/llama-impact-grant-recipients/?utm_source=AIatMeta&utm_medium=organic_social&utm_content=image&utm_campaign=llamacon) through our [Nayana initiative](https://www.cognitivelab.in/nayana). We thank Meta for its continued support of our research efforts at [CognitiveLab](https://www.cognitivelab.in).

Built on top of the ColPali framework and Gemma3 architecture.