CinemaCLIP-1.0.0

CinemaCLIP is a ViT-B-32-256 fine-tune specialized for understanding the visual language of cinema at the frame level. It is a hybrid CLIP model: in addition to its image and text encoders, it has 23 classifier heads that cover a comprehensive taxonomy built with domain experts. For more info, see our launch blog post.

This repository ships three serialized forms of the same model:

Torch (model.safetensors) — load via the cinemaclip Python package.
CoreML (ImageEncoder.mlmodel, ImageEncoder.mlpackage and TextEncoder.mlpackage) — for on-device Apple Neural Engine inference.
ONNX (ImageEncoder.onnx, TextEncoder.onnx, plus _fp16 variants) — for cross-platform inference.

Install

pip install cinemaclip            # core
pip install "cinemaclip[coreml]"  # CoreML export/inference
pip install "cinemaclip[onnx]"    # ONNX export/inference

Usage (PyTorch)

from PIL import Image
from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"]  # Classifier predictions
predictions["clip_image_embedding"]

# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True)   # [1, 512]

# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True)  # [1, 512]

The CinemaCLIP.predict_image method demonstrates how to get post-processed classifier outputs from the model. It is neither optimized nor production-ready, and should be treated as a reference implementation.

Usage (CoreML)

import coremltools as ct
from PIL import Image

img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")
# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})
embedding = out["clip_image_embedding"]    # [512]
probabilities = out["probabilities"]       # [101] — concat of 23 per-category outputs

# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")

Usage (ONNX)

from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T

img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),   # yields float tensor in [0, 1] — no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()

session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})

Output structure

probabilities is a flat [101] vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped CinemaNetSchema.json:

import json
with open("CinemaNetSchema.json") as f:  # close the file handle once parsed
    schema = json.load(f)
label_names = schema["probabilities_labels"]  # len == 101
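
Since the label list has one entry per position in probabilities, the two can be zipped to read off named scores. The following is a minimal sketch, assuming probs is any flat sequence of floats with the same length and order as label_names. The helper name top_labels and the toy labels are hypothetical and not part of the cinemaclip package.

```python
from typing import Sequence


def top_labels(
    probs: Sequence[float], label_names: Sequence[str], k: int = 5
) -> list[tuple[str, float]]:
    """Pair each probability with its label and return the k highest-scoring pairs."""
    if len(probs) != len(label_names):
        raise ValueError(
            f"expected {len(label_names)} probabilities, got {len(probs)}"
        )
    pairs = zip(label_names, (float(p) for p in probs))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for the real 101-entry vector and schema labels (hypothetical names).
demo_labels = ["label_a", "label_b", "label_c"]
demo_probs = [0.10, 0.85, 0.05]
print(top_labels(demo_probs, demo_labels, k=2))
```

A global ranking like this mixes scores from different heads, so it is mainly useful for quick inspection. Ranking within a single head's slice is likely more meaningful.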

The classifier heads are a mix of 3 types of classifiers:

Single label (softmax activation)
Multi label (sigmoid activation)
Binary (sigmoid activation)
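
Each head type calls for different decoding: a single-label head yields one winner (the argmax of its softmax slice), while multi-label and binary heads are thresholded per entry. The sketch below illustrates this under stated assumptions: the slice boundaries, the label names, the 0.5 threshold, and the idea that a binary head contributes a single value are all hypothetical. The real grouping should be read from CinemaNetSchema.json.

```python
from typing import Sequence


def decode_single_label(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Softmax head: return the label with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]


def decode_multi_label(
    probs: Sequence[float], labels: Sequence[str], threshold: float = 0.5
) -> list[str]:
    """Sigmoid head: return every label whose probability reaches the threshold."""
    return [label for p, label in zip(probs, labels) if p >= threshold]


def decode_binary(prob: float, threshold: float = 0.5) -> bool:
    """Binary head (assumed to be one sigmoid output): True if it reaches the threshold."""
    return prob >= threshold


# Toy flat vector standing in for the real [101] output (hypothetical layout).
flat = [0.7, 0.2, 0.1, 0.9, 0.3, 0.6, 0.8]
angle = decode_single_label(flat[0:3], ["low", "eye_level", "high"])
directions = decode_multi_label(flat[3:6], ["left", "right", "top"])
is_crowd = decode_binary(flat[6])
print(angle, directions, is_crowd)
```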

Evaluation

CinemaCLIP outperforms both the largest existing CLIP models (up to 28x larger) and the leading 4B VLMs we benchmarked against on cinematic understanding tasks.

Two inference modes are reported for CinemaCLIP:

Classifier — the shipped supervised heads on the CinemaCLIP image embedding.
0-shot — zero-shot text/image similarity using CinemaCLIP's own text encoder.
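
Zero-shot classification with CLIP-style models generally works by scoring an image embedding against one text embedding per candidate label using cosine similarity, then picking the best match. The sketch below shows that scoring step in dependency-free Python on toy 3-dimensional vectors. Real embeddings would be 512-dimensional and would come from encode_image and encode_text. The function names, the prompts, and the temperature of 100 are illustrative assumptions, not the settings used for the reported numbers.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_scores(
    image_emb: Sequence[float],
    text_embs: Sequence[Sequence[float]],
    temperature: float = 100.0,
) -> list[float]:
    """Softmax over temperature-scaled cosine similarities (one image, N prompts)."""
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_embs]
    logits = [temperature * s for s in sims]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical prompts and toy embeddings for illustration only.
prompts = ["a low angle shot", "a high angle shot"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores = zero_shot_scores(image_emb, text_embs)
best = prompts[scores.index(max(scores))]
print(best)
```

In practice the same computation is a single matrix product over normalized torch or NumPy arrays. The plain-Python form is used here only to keep the example self-contained.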

Category	CinemaCLIP 0-shot	CinemaCLIP Classifier	Qwen3.5-4B	Gemma4-4B	InternVL3.5-4B	Molmo2-4B	DFN ViT-H-14	MetaCLIP PE-bigG	OpenAI ViT-L-14	MobileCLIP-S1	DFN ViT-L-14	SigLIP2 SO400M	SigLIP2 ViT-gopt
Mean	83.2	87.7	57.6	56.7	55.3	55.3	45.9	45.2	44.8	44.2	39.0	38.7	36.5
Color Contrast	89.3	87.4	33.7	35.3	33.7	35.3	34.0	33.1	49.4	38.7	37.1	57.7	25.2
Color Key	86.8	95.7	78.1	78.1	80.3	64.3	58.2	50.2	53.2	59.4	48.3	22.8	52.6
Color Saturation	83.0	84.3	66.5	65.4	72.1	45.9	55.1	61.8	58.1	35.8	46.8	33.3	31.8
Color Theory	75.3	73.3	54.0	51.7	50.7	48.7	54.7	51.7	50.7	47.3	47.7	31.3	31.7
Color Tones	87.3	89.3	50.2	62.6	70.6	62.1	58.5	50.2	52.0	55.7	47.2	24.0	17.7
Lighting Cast	81.2	87.8	38.3	53.3	39.8	35.7	25.4	29.3	28.8	35.7	22.8	37.8	18.2
Lighting Contrast	91.6	93.2	29.8	39.1	38.7	46.1	35.3	35.5	32.6	39.0	39.4	48.4	37.6
Lighting Edge	80.4	93.6	22.8	38.8	31.2	40.4	22.4	31.6	41.6	34.0	21.2	26.0	25.6
Lighting Silhouette	88.2	92.0	80.9	63.0	48.9	48.8	66.6	67.1	67.4	58.4	43.5	46.2	78.9
Shot Angle	79.5	84.4	41.9	49.2	33.2	49.9	28.0	13.7	19.0	19.6	25.9	21.3	17.2
Shot Composition	94.0	97.0	46.0	54.5	55.7	60.5	27.8	24.3	21.3	22.0	25.2	31.4	11.4
Shot Dutch Angle	67.6	73.6	62.2	65.1	46.7	49.3	27.3	44.5	38.4	56.6	25.9	47.6	68.7
Shot Focus	59.1	71.8	19.9	26.6	26.3	25.1	32.9	31.2	24.4	31.3	37.3	48.2	12.6
Shot Framing	83.1	82.3	38.0	29.6	40.1	34.6	33.6	24.9	23.5	23.9	33.0	7.3	9.8
Shot Height	89.2	92.8	38.1	37.4	41.2	53.0	37.6	33.7	28.9	24.0	33.6	29.6	23.9
Shot Lens Size	73.3	76.7	49.6	28.0	43.6	46.6	32.1	28.0	34.5	30.1	25.7	30.1	17.6
Shot Location	86.5	92.9	81.0	82.2	81.5	79.2	73.0	68.4	68.0	75.6	66.1	65.0	46.7
Shot Symmetry	87.8	91.0	90.2	86.7	76.0	80.2	76.6	78.0	54.0	39.3	24.9	46.0	82.4
Shot Time of Day	75.7	87.6	75.1	66.1	70.7	70.7	68.1	69.6	60.3	73.7	71.2	48.5	42.7
Shot Type	80.7	86.7	81.3	61.2	57.0	57.4	52.8	40.4	36.5	35.7	56.7	46.5	29.7
Shot Type - Crowd	96.9	99.1	97.2	88.2	94.3	94.8	55.9	69.1	68.6	77.2	37.3	52.4	69.3
Shot Type - OTS	94.1	96.4	92.5	85.0	83.9	87.6	53.2	57.0	73.9	60.3	42.1	50.5	51.2

The shot.lighting.direction head ships with the classifier heads but is excluded from the table above because it is a multi-label classifier.

Citation

@misc{cinemaclip2026,
  title        = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
  author       = {Somani, Rahul and Marini, Anton and Stewart, Damian},
  year         = {2026},
  publisher    = {Hugging Face},
  doi          = {10.57967/hf/8539},
  howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}},
  note         = {Model weights and taxonomy}
}

Model size (Safetensors): 0.2B params
Tensor type: F32
Base model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K