ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
Paper: arXiv:2412.09754
How to use fun-research/Video-LLaVA-Seg with Transformers:
# Load the model directly from the Hugging Face Hub
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("fun-research/Video-LLaVA-Seg", torch_dtype="auto")

This is the official baseline implementation for the ViCaS dataset, presented in the paper ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation.
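Once the weights are loaded, generation follows the usual Transformers pattern. The sketch below is illustrative only: it assumes the checkpoint ships a tokenizer loadable via AutoTokenizer and that a plain text prompt can be fed directly. The full video-and-segmentation pipeline requires the setup described in the GitHub repo linked below.

# Hedged generation sketch. Assumes the repo provides a compatible
# tokenizer; for the complete video + segmentation workflow, follow
# the Video-LLaVA-Seg GitHub instructions instead.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fun-research/Video-LLaVA-Seg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

prompt = "Describe the events in the video."  # placeholder text-only prompt
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))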
For details about setting up the model, refer to the Video-LLaVA-Seg GitHub repo.
For details about downloading the dataset and evaluating on the benchmark, refer to the ViCaS GitHub repo.