Perception-moondream2

Perception-moondream2 is a Vision-Language Model (VLM) fine-tuned for dense urban traffic scene understanding. Built on the compact moondream2 architecture, it analyzes frames from CCTV and traffic cameras and generates detailed textual descriptions of traffic conditions.

Model Details

  • Base Model: vikhyatk/moondream2 (Revision: 2024-08-26)
  • Architecture: Vision Encoder + Phi-1.5 Text Decoder
  • Task: Dense Image Captioning & Visual Question Answering (VQA)
  • Language: English

Training Data

The model was fine-tuned on the Subh775/Traffic-Perception-VL dataset. This dataset consists of complex, real-world urban traffic scenes (such as bustling streets in Bengaluru, India).

Training focused on teaching the model to perceive and describe:

  • Vehicle Types & Colors: Identifying auto-rickshaws, scooters, motorcycles, and cars.
  • Traffic Density & Flow: Estimating congestion levels and movement.
  • Pedestrian Activity: Noting people walking on sidewalks or crossing streets.
  • Infrastructure: Recognizing road layouts, lanes, shops, signage, and greenery.

Intended Use Cases

  • Smart City Analytics: Automated monitoring of CCTV feeds to detect congestion or accidents.
  • Traffic Management: Generating real-time text logs of intersection activity.
  • Autonomous Driving Context: Providing dense contextual descriptions for self-driving datasets.
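
As an illustration of the text-log use case, the following is a minimal sketch of how a caption could be wrapped into a timestamped JSON line. The helper name `make_log_line`, the `camera_id` field, and the record layout are assumptions made for this example and are not part of the model or its dataset; the caption itself would come from the inference call shown under Usage.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_log_line(camera_id: str, caption: str, when: Optional[datetime] = None) -> str:
    """Serialize one caption as a JSON line (hypothetical record layout)."""
    # Default to the current UTC time when no timestamp is supplied.
    when = when or datetime.now(timezone.utc)
    record = {
        "timestamp": when.isoformat(),
        "camera_id": camera_id,
        "caption": caption.strip(),
    }
    # ensure_ascii=False keeps any non-ASCII characters in the caption readable.
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    # The caption string here is a placeholder, not real model output.
    print(make_log_line("cam-01", "Placeholder caption for one frame."))
```

Appending one such line per processed frame to a file gives a log that standard JSON Lines tooling can filter by camera or time range.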

Usage

Because this model uses Moondream2's custom modeling code, you must pass trust_remote_code=True when loading it with the transformers library.

Prerequisites

Make sure the required libraries are installed (the leading ! is notebook syntax; drop it in a regular shell):

!pip install transformers==4.44.2 "huggingface_hub<1.0" accelerate pillow einops

Load Tokenizer & Model

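import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "Subh775/Perception-moondream2"

# Use the GPU in half precision when available; otherwise fall back to CPU in float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # required: the architecture is defined by custom code
    torch_dtype=dtype,
    # device_map="auto" is intentionally omitted; the model is moved explicitly below
)
# Move the model to the selected device and switch to inference mode.
model = model.to(device)
model.eval()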

Inference

image_path = "/content/100130.jpg"  # replace with the path to your own image
image = Image.open(image_path).convert("RGB")

# Encode the image once; the encoding is then passed to answer_question.
enc_image = model.encode_image(image)

# Give explicit instructions and tell the model not to name a location,
# to counter the geographic bias it picked up during fine-tuning.
prompt = (
    "Describe this traffic scene in detail. Focus strictly on the vehicles, "
    "pedestrians, infrastructure, and traffic density. Do not mention Bengaluru, "
    "India, or any specific geographic locations."
)

answer = model.answer_question(enc_image, prompt, tokenizer)

# Safety net: strip location phrases that still slip through despite the prompt.
banned_phrases = ["in Bengaluru, India", "in Bengaluru", "Bengaluru, India,", "Bengaluru,"]
for banned in banned_phrases:
    answer = answer.replace(banned, "")

print(answer.strip())
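
Plain string replacement, as in the loop above, can leave double spaces or a space before punctuation where a phrase was removed. The following regex-based variant is one possible alternative, given as a sketch rather than part of the model's API. The helper name `scrub_location` is hypothetical, and the pattern covers only the single place name used in this card.

```python
import re

# Matches "Bengaluru", optionally preceded by "in"/"at"/"of" and
# optionally followed by ", India" and a trailing comma.
_LOCATION = re.compile(
    r"(?:\b(?:in|at|of)\s+)?\bBengaluru\b(?:,\s*India\b)?,?",
    re.IGNORECASE,
)


def scrub_location(text: str) -> str:
    """Remove the location mention and tidy the leftover spacing (hypothetical helper)."""
    text = _LOCATION.sub("", text)
    # Drop whitespace that now sits directly before punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Collapse runs of whitespace left behind by the removal.
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(scrub_location("A busy street in Bengaluru, India with many cars."))
```

A sentence that began with the place name will start in lower case after scrubbing, so re-capitalize the result if that matters for your logs.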