OVIE — One View Is Enough!

Monocular Training for In-the-Wild Novel View Generation

OVIE is a novel view synthesis model that generates a new viewpoint of a scene from a single image and a target camera pose. Unlike most prior work, it is trained entirely on unpaired in-the-wild images — no multi-view supervision required.

Model architecture

OVIE is a convolutional encoder–decoder with a Vision Transformer (ViT) bottleneck conditioned on camera parameters via adaptive layer normalisation (AdaLN):

Encoder: cascaded downsampling ConvBlocks (3 scales)
Bottleneck: 12-layer ViT (hidden size 768, 12 heads) with AdaLN camera conditioning
Decoder: cascaded upsampling ConvBlocks (3 scales)
Camera conditioning: a 7-dimensional pose encoding (rotation + translation) projected into the ViT hidden space
Parameters: ~143M

Usage

import torch
from models.models import OVIEModel
from utils.pose_enc import extri_intri_to_pose_encoding
from torchvision.transforms import ToTensor
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model
model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device)
model.eval()
image_size = model.image_size  # 256

# Prepare input image
img_pil = Image.open("image.jpg").convert("RGB").resize((image_size, image_size))
img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device)

# Define target camera pose (3x4 extrinsics)
extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25],
                            [0.0, 1.0, 0.0,  0.5],
                            [0.0, 0.0, 1.0, -2.0]]], device=device)
dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device)

camera = extri_intri_to_pose_encoding(
    extrinsics=extrinsics.unsqueeze(0),
    intrinsics=dummy_intrinsics,
    image_size_hw=(image_size, image_size),
)
cam_token = camera[..., :7].squeeze(0)

# Generate novel view
with torch.no_grad():
    pred = model(x=img_tensor, cam_params=cam_token)
# pred: (1, 3, 256, 256) tensor in [0, 1]

See the repository for full installation instructions and example notebooks:

inference_huggingface.ipynb — loads directly from this Hub page
inference_local.ipynb — loads from a local checkpoint

Training

OVIE is trained on a diverse mix of in-the-wild internet images (ImageNet, Places365, OSV5M, OpenImages) with no multi-view pairs. Training uses a combination of L2 reconstruction loss, LPIPS perceptual loss, and an adversarial loss with a DINO-based discriminator. Camera poses are sampled synthetically from a distribution of plausible viewpoint changes.

Evaluation

The model is evaluated on DL3DV and Real Estate 10K (RE10K) using PSNR, SSIM, and LPIPS. See the paper for full quantitative results.

Citation

@misc{ovie2026,
      title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation},
      author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard},
      year={2026},
      eprint={2603.23488},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.23488},
}

Downloads last month: 422

Paper for kyutai/ovie

One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Paper • 2603.23488 • Published 27 days ago • 5