---
license: mit
language:
- en
base_model:
- Wan-AI/Wan2.2-T2V-A14B
pipeline_tag: image-text-to-video
---

# PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

## Usage with Diffusers

Install the required libraries:

```bash
pip install -U diffusers transformers accelerate
```

```python
import torch
from diffusers import DiffusionPipeline

# switch device_map to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "PLAN-Lab/PyraTok",
    dtype=torch.bfloat16,
    device_map="cuda",
)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
```
## Official Announcement

PyraTok has been officially accepted to CVPR 2026!
This repository contains the pretrained weights and model implementation for the Language-Aligned Pyramidal Tokenizer.
## Overview
PyraTok is a state-of-the-art video tokenizer that bridges the gap between video understanding and generation. Unlike traditional VAEs that operate at a single visual scale, PyraTok introduces a Language-aligned Pyramidal Quantization (LaPQ) module.
**Key Innovations:**
- Pyramidal Structure: Learns semantically structured discrete latents across multiple spatiotemporal resolutions.
- Language Alignment: Tightly couples visual tokens with language using a shared, large binary codebook (up to 48K tokens).
- Scalability: Robustly scales from standard resolutions to 4K/8K video processing.
- Unified Backbone: A single model that excels in Video QA, Zero-Shot Segmentation, and high-fidelity Text-to-Video generation.
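To make the pyramidal quantization idea concrete, here is a minimal sketch of multi-scale tokenization with a binary codebook. Everything in it (function names, the sign-based codeword lookup, the naive strided downsampling) is an illustrative assumption, not the actual PyraTok/LaPQ implementation: a "binary" codebook here means each d-dimensional latent is snapped to the nearest vector in {-1, +1}^d, so its index is simply the bit-packed sign pattern.

```python
import numpy as np

def binary_quantize(latents):
    """Snap each d-dim vector to {-1, +1}^d and return (codewords, indices)."""
    signs = np.where(latents >= 0, 1.0, -1.0)              # nearest binary codeword
    bits = (signs > 0).astype(np.int64)                    # sign bits per channel
    indices = bits.dot(2 ** np.arange(latents.shape[-1]))  # bit-pack into a code index
    return signs, indices

def pyramidal_quantize(feature_map, num_levels=3):
    """Quantize a (H, W, d) feature map at several spatial scales (fine to coarse)."""
    tokens = []
    x = feature_map
    for _ in range(num_levels):
        _, idx = binary_quantize(x)
        tokens.append(idx)            # one discrete token grid per pyramid level
        if min(x.shape[:2]) > 1:
            x = x[::2, ::2]           # naive 2x spatial downsampling
    return tokens

feats = np.random.randn(8, 8, 16)     # 16 channels -> 2**16 possible codes
levels = pyramidal_quantize(feats)
print([t.shape for t in levels])      # token grids at 8x8, 4x4, and 2x2
```

Because the sign pattern itself is the code index, the codebook never needs to be stored explicitly, which is one way a very large shared codebook (tens of thousands of entries) stays tractable.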
## Citation

```bibtex
@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```