Perception-moondream2

Perception-moondream2 is a Vision-Language Model (VLM) fine-tuned for dense urban traffic scene understanding. Built on the compact moondream2 architecture, it analyzes frames from CCTV and traffic cameras and generates detailed textual descriptions of traffic conditions.

Model Details

Base Model: vikhyatk/moondream2 (Revision: 2024-08-26)
Architecture: Vision Encoder + Phi-1.5 Text Decoder
Task: Dense Image Captioning & Visual Question Answering (VQA)
Language: English

Training Data

The model was fine-tuned on the Subh775/Traffic-Perception-VL dataset. This dataset consists of complex, real-world urban traffic scenes (such as bustling streets in Bengaluru, India).

The training focused on teaching the model to accurately perceive and describe:

Vehicle Types & Colors: Identifying auto-rickshaws, scooters, motorcycles, and cars.
Traffic Density & Flow: Estimating congestion levels and movement.
Pedestrian Activity: Tracking people walking on sidewalks or crossing streets.
Infrastructure: Recognizing road layouts, lanes, shops, signage, and greenery.

Intended Use Cases

Smart City Analytics: Automated monitoring of CCTV feeds to detect congestion or accidents.
Traffic Management: Generating real-time text logs of intersection activity.
Autonomous Driving Context: Providing dense contextual descriptions for self-driving datasets.
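
As a rough illustration of the text-log use case above, the sketch below turns a sequence of frames into JSON-lines records. It is not part of the model's API: `log_frames` and the `describe` callable are hypothetical names, and `describe` stands in for whatever function wraps the model call in your setup.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Iterable, List


def log_frames(frames: Iterable[object], describe: Callable[[object], str]) -> List[str]:
    """Return one JSON-encoded log line per frame.

    `describe` is a hypothetical callable that maps a frame (for example a
    PIL image) to a caption string; it is an assumption of this sketch.
    """
    lines = []
    for index, frame in enumerate(frames):
        record = {
            "frame": index,
            "logged_at": datetime.now(timezone.utc).isoformat(),
            "description": describe(frame).strip(),
        }
        lines.append(json.dumps(record))
    return lines
```

With the model loaded, `describe` could be a small wrapper around `model.encode_image` and `model.answer_question`; the returned lines can then be appended to a file or sent to a logging backend.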

Usage

Because this model relies on the custom Moondream2 architecture, pass trust_remote_code=True when loading it with the transformers library. The flag lets transformers execute the modeling code shipped in the model repository.

Prerequisites

Install the required libraries. The leading ! runs the command from a notebook cell; omit it in a regular shell:

!pip install transformers==4.44.2 "huggingface_hub<1.0" accelerate pillow einops

Load Tokenizer & Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "Subh775/Perception-moondream2"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# trust_remote_code=True is required for the custom Moondream2 modeling code.
# device_map="auto" is deliberately not passed; the model is moved to the GPU explicitly below.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
# Move the model to the GPU (requires a CUDA device) and switch to inference mode.
model = model.to("cuda")
model.eval()
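
The loading code above assumes a CUDA GPU. For machines without one, a minimal sketch of a fallback is shown below. `select_runtime` is a hypothetical helper, and the choice of float32 on CPU is an assumption (half precision is often slow or unsupported for some CPU operations), not something the model card specifies.

```python
from typing import Tuple


def select_runtime(cuda_available: bool) -> Tuple[str, str]:
    """Pick a (device, dtype name) pair for loading the model.

    Hypothetical helper: float16 on GPU mirrors the loading code above;
    float32 on CPU is an assumption of this sketch.
    """
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


# Possible usage with the loading code (requires torch and transformers):
#   device, dtype_name = select_runtime(torch.cuda.is_available())
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id, trust_remote_code=True, torch_dtype=getattr(torch, dtype_name)
#   ).to(device)
```

Keeping the decision in a small pure function makes it easy to check without a GPU present.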

Inference

# Path to a local image (a Colab path is shown here); replace it with your own file.
image_path = "/content/100130.jpg"
image = Image.open(image_path).convert("RGB")

# Encode the image once; the encoding can be reused for several questions.
enc_image = model.encode_image(image)

# Give explicit instructions and tell the model not to name a location,
# to counter the geographic bias inherited from the training data.
prompt = (
    "Describe this traffic scene in detail. Focus strictly on the vehicles, "
    "pedestrians, infrastructure, and traffic density. Do not mention Bengaluru, "
    "India, or any specific geographic locations."
)

answer = model.answer_question(enc_image, prompt, tokenizer)

# Fallback: strip location phrases the model may still emit despite the prompt.
banned_phrases = ["in Bengaluru, India", "in Bengaluru", "Bengaluru, India,", "Bengaluru,"]
for banned in banned_phrases:
    answer = answer.replace(banned, "")

print(answer.strip())
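
Plain `str.replace` can leave stray spaces or commas behind: removing "in Bengaluru, India" from "a street in Bengaluru, India, with cars" yields "a street , with cars". A minimal sketch of a regex-based alternative follows. The `strip_location` helper and its pattern are illustrative assumptions, not part of the model's code, and they cover only the phrases listed above.

```python
import re

# Hypothetical pattern: optional "in ", the city name, optional ", India",
# and an optional trailing comma, plus any whitespace just before the phrase.
_LOCATION = re.compile(r"\s*\b(?:in\s+)?Bengaluru(?:,\s*India)?,?", re.IGNORECASE)


def strip_location(text: str) -> str:
    """Remove the location phrase and tidy the spacing left behind."""
    cleaned = _LOCATION.sub("", text)
    cleaned = re.sub(r"\s+([,.;:])", r"\1", cleaned)  # no space before punctuation
    cleaned = re.sub(r"\s{2,}", " ", cleaned)         # collapse repeated whitespace
    return cleaned.strip()
```

In the inference code, `print(strip_location(answer))` could replace the loop over `banned_phrases`. A sentence that begins with the location will lose its first words, so the result may need its capitalization checked.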

Model size: 2B params (Safetensors)
Tensor type: F16
