V-JEPA 2 + Qwen3.5-27B Video Understanding

An event-based video understanding pipeline that aligns a V-JEPA 2 vision encoder with the Qwen3.5-27B LLM.

Training data: ~600 YouTube Shorts with Gemini 2.0 Flash auto-generated summaries (custom dataset, not publicly released).

Key Results

  • ~250x fewer tokens than frame-based approaches (8–15 tokens per video)
  • 80% domain accuracy on video summarization (Experiment 5)
  • 47.9% text recognition accuracy with V-JEPA LoRA (vs 1.2% baseline)
  • ~22GB VRAM for inference with GGUF quantization

Checkpoints

File                                | Description                                  | Use Case
exp5_projection/proj_epoch5.pt      | Projection layer (3-layer MLP, ~215M params) | Video summarization
exp6_projection/proj_lora_epoch5.pt | Projection layer trained jointly with LoRA   | Summarization + text recognition
exp6_vjepa_lora/                    | V-JEPA 2 LoRA adapter (r=16, alpha=32)       | Text recognition in videos

Architecture

Video β†’ V-JEPA 2 ViT-L (frozen/LoRA) β†’ frame mean pool β†’ [N_frames, 1024]
  β†’ event segmentation (cosine distance peak detection)
  β†’ event mean pool β†’ [N_events, 1024]
  β†’ Projection Layer (3-layer MLP) β†’ [N_events, 5120]
  β†’ Qwen3.5-27B (frozen) β†’ text generation
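The event-segmentation step above (splitting the frame sequence at peaks in cosine distance between consecutive V-JEPA features, then mean-pooling within each segment) can be sketched as follows. The peak-detection criterion and the `min_prominence` threshold are assumptions for illustration; the actual pipeline may use different heuristics.

```python
import numpy as np

def segment_events(frame_feats, min_prominence=0.1):
    """Split [N_frames, D] features into events at cosine-distance peaks.

    Hypothetical thresholds; the released pipeline may differ.
    """
    # Cosine distance between consecutive frame embeddings
    a, b = frame_feats[:-1], frame_feats[1:]
    sims = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    dists = 1.0 - sims
    # Local maxima that stand out from the mean by at least `min_prominence`
    boundaries = [
        i + 1
        for i in range(1, len(dists) - 1)
        if dists[i] > dists[i - 1]
        and dists[i] > dists[i + 1]
        and dists[i] > dists.mean() + min_prominence
    ]
    # Mean-pool frames within each event segment -> [N_events, D]
    edges = [0] + boundaries + [len(frame_feats)]
    return np.stack([frame_feats[s:e].mean(0) for s, e in zip(edges[:-1], edges[1:])])
```

This is what compresses a video to 8–15 tokens: only one pooled vector per detected event reaches the LLM.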

Projection Layer Architecture

import torch.nn as nn

class ProjectionV2(nn.Module):
    """Maps pooled V-JEPA 2 event features (1024-d) into Qwen's embedding space (5120-d)."""

    def __init__(self, vjepa_dim=1024, llm_dim=5120):
        super().__init__()
        hidden = llm_dim * 2  # 10240
        self.proj = nn.Sequential(
            nn.Linear(vjepa_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, x):
        # x: [N_events, 1024] -> [N_events, 5120]
        return self.proj(x)

Usage

import torch
from transformers import AutoModel

# Load the frozen V-JEPA 2 ViT-L encoder
vjepa = AutoModel.from_pretrained("facebook/vjepa2-vitl-fpc64-256")

# Load the trained projection layer (Experiment 5: summarization)
proj = ProjectionV2(1024, 5120)
proj.load_state_dict(torch.load("exp5_projection/proj_epoch5.pt", map_location="cpu"))
proj.eval()

# For text recognition (Experiment 6), also load the V-JEPA LoRA adapter
# and the projection layer that was trained jointly with it
from peft import PeftModel
vjepa_lora = PeftModel.from_pretrained(vjepa, "exp6_vjepa_lora/")
proj_lora = ProjectionV2(1024, 5120)
proj_lora.load_state_dict(torch.load("exp6_projection/proj_lora_epoch5.pt", map_location="cpu"))
proj_lora.eval()
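To generate text, the projected event embeddings act as a soft-token prefix in front of the prompt's token embeddings. A minimal sketch of that glue, assuming the standard `inputs_embeds` pathway of a Hugging Face causal LM (the exact prompt template used in training is not documented here):

```python
import torch

def prepend_visual_prefix(event_embs, prompt_embs):
    """Concatenate projected event embeddings [B, N_events, D] in front of
    prompt token embeddings [B, T, D], so the LLM reads the video as a
    short prefix of soft tokens."""
    return torch.cat([event_embs, prompt_embs], dim=1)

# With a loaded Qwen model and tokenizer this would feed generation, e.g.:
#   ids = tokenizer("Summarize the video:", return_tensors="pt").input_ids
#   txt = llm.get_input_embeddings()(ids)
#   vis = proj(event_feats).unsqueeze(0)        # [1, N_events, 5120]
#   out = llm.generate(inputs_embeds=prepend_visual_prefix(vis, txt),
#                      max_new_tokens=128)
```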

Training Details

  • Vision Encoder: V-JEPA 2 ViT-L (326M params, frozen or LoRA r=16)
  • LLM: Qwen3.5-27B (frozen, bf16)
  • Projection: 3-layer MLP (~215M params, trainable)
  • Data: ~600 YouTube Shorts with Gemini 2.0 Flash auto-summaries
  • Training: 5 epochs, AdamW lr=1e-4, A100 80GB
  • Loss: next-token prediction (causal LM)
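The recipe above (frozen LLM, trainable projection, next-token prediction) can be sketched as a single training step. The label masking of the visual prefix with `-100` is an assumption consistent with standard Hugging Face causal-LM training; the released training code may differ.

```python
import torch

def make_prefix_labels(input_ids, n_visual):
    """Labels for next-token loss: ignore (-100) the visual prefix positions,
    supervise only the caption tokens."""
    prefix = torch.full((input_ids.size(0), n_visual), -100, dtype=torch.long)
    return torch.cat([prefix, input_ids], dim=1)

def training_step(llm, proj, optimizer, event_feats, input_ids):
    """One optimization step: only `proj` receives gradients; the LLM is frozen."""
    vis = proj(event_feats)                      # [B, N_events, 5120]
    txt = llm.get_input_embeddings()(input_ids)  # [B, T, 5120]
    embeds = torch.cat([vis, txt], dim=1)
    labels = make_prefix_labels(input_ids, vis.size(1))
    loss = llm(inputs_embeds=embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The optimizer would be constructed over `proj.parameters()` only (AdamW, lr=1e-4 per the details above), leaving all LLM and encoder weights untouched.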

Citation

If you use this work, please cite:

@misc{raen2026vjepa_video_understanding,
  title={Event-Based Video Understanding via V-JEPA--LLM Alignment: From Event Segmentation to Visual-Semantic Mapping},
  author={Raen2264},
  year={2026},
  doi={10.5281/zenodo.19143611},
  url={https://doi.org/10.5281/zenodo.19143611},
  note={Model checkpoints: https://huggingface.co/2264K/vjepa2-qwen3.5-video-understanding}
}