ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
Paper: arXiv:2412.09754
How to use fun-research/Video-LLaVA-Seg with Transformers:
# Load the model directly from the Hugging Face Hub
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("fun-research/Video-LLaVA-Seg", torch_dtype="auto")

This is the official baseline implementation for the ViCaS dataset, presented in the paper ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation.
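Once the weights are loaded, generation follows the usual Transformers pattern. The sketch below is illustrative only: it assumes the checkpoint ships a tokenizer loadable via AutoTokenizer and that a plain text prompt can be fed directly. The full video-and-segmentation pipeline requires the setup described in the GitHub repo linked below.

# Hedged generation sketch. Assumes the repo provides a compatible
# tokenizer; for the complete video + segmentation workflow, follow
# the Video-LLaVA-Seg GitHub instructions instead.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fun-research/Video-LLaVA-Seg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Describe the events in the video."  # placeholder text-only prompt
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))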
For details about setting up the model, refer to the Video-LLaVA-Seg GitHub repo.
For details about downloading the dataset and evaluating on the benchmark, refer to the ViCaS GitHub repo.