PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
Abstract
A large-scale UHR image-text dataset and evaluation benchmark are introduced to advance ultra-high-resolution text-to-image generation capabilities.
Text-to-Image (T2I) models have recently made notable progress at around 1K and 2K resolution. With growing demand for better visual experiences and the rapid development of imaging technology, the need for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline. It contains 95K images across diverse scenarios (each image has at least 100M pixels) with seven-dimensional annotations. Based on this large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images that covers visual quality and semantic alignment. Extensive experimental results on our benchmark, together with our exploration of training strategies, provide valuable insights for future breakthroughs.
Community
Hi!
Great work on L2P (and PixVerve)! You're doing truly groundbreaking work!
Context on how i2i currently works in FLUX.2:
In standard FLUX.2, image-to-image is done via sequence concatenation — the reference image is VAE-encoded into a latent, patchified into tokens, and simply concatenated into the same token sequence as the noise and text tokens: [text_tokens | reference_tokens | noise_tokens]. The noise tokens then attend to the reference tokens in self-attention. No separate i2i mechanism — the reference is just additional context tokens in the sequence.
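The concatenation layout described above can be illustrated with a toy sketch. This is not FLUX.2 code: `patchify`, `concat_segments`, the 4x4 single-channel grids, and the stand-in text embeddings are all hypothetical, chosen only to show how a reference image becomes extra context tokens placed between the text and noise tokens.

```python
def patchify(grid, patch):
    """Split a 2D grid (list of rows) into flattened patch tokens, row-major."""
    height, width = len(grid), len(grid[0])
    if height % patch or width % patch:
        raise ValueError("grid sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([grid[top + dy][left + dx]
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens


def concat_segments(segments):
    """Concatenate named token lists; return the sequence and each segment's span."""
    sequence, spans = [], {}
    for name, tokens in segments:
        spans[name] = (len(sequence), len(sequence) + len(tokens))
        sequence.extend(tokens)
    return sequence, spans


# Toy inputs (assumptions): a 4x4 "latent" for the reference, a 4x4 noise grid,
# and three stand-in text embeddings of the same width as a patch token.
reference_latent = [[row * 4 + col for col in range(4)] for row in range(4)]
noise_latent = [[0.0] * 4 for _ in range(4)]
text_tokens = [[1.0, 0.0, 0.0, 0.0] for _ in range(3)]

sequence, spans = concat_segments([
    ("text", text_tokens),
    ("reference", patchify(reference_latent, 2)),
    ("noise", patchify(noise_latent, 2)),
])
print(spans)
```

In this layout only the span labeled `noise` would be denoised; the `reference` span is plain context that every noise token can attend to in self-attention.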
My questions:
1. Distilled model: Can L2P's latent-to-pixel transform be applied to a distilled few-step FLUX.2 Klein 4B, or does it require the base model? Will the few-step (4-step) capability survive the transform, or does it need re-distillation afterward?
2. i2i support: Since L2P removes the VAE, would the reference image instead go through the same large-patch tokenizer as the noise (directly in pixel space) and be concatenated into the sequence the same way, i.e. [text | reference_patches | noise_patches]? In other words, is the attention/concatenation mechanism preserved, just with pixel-patch tokens instead of VAE-latent tokens?
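To make the second question concrete, here is a rough token-count sketch for a concatenated i2i sequence at about 100MP. Every number in it is an assumption for illustration: the 10240x10240 image size, the 512 text tokens, the latent stride of 16 (an 8x VAE followed by 2x2 patchification), and the 32-pixel patch size are hypothetical and are not taken from L2P, PixVerve, or FLUX.2.

```python
def token_count(height, width, stride):
    """Number of tokens when each token covers a stride x stride pixel area."""
    if height % stride or width % stride:
        raise ValueError("image sides must be divisible by the token stride")
    return (height // stride) * (width // stride)


HEIGHT = WIDTH = 10_240      # 104,857,600 pixels, i.e. just over 100MP (assumed size)
TEXT_TOKENS = 512            # hypothetical prompt length
LATENT_STRIDE = 8 * 2        # assumed: 8x VAE downsampling, then 2x2 patches
PIXEL_PATCH = 32             # hypothetical large pixel-space patch size


def i2i_sequence_length(stride, text_tokens=TEXT_TOKENS):
    """Length of [text | reference | noise] when both images share one size."""
    image_tokens = token_count(HEIGHT, WIDTH, stride)
    return text_tokens + 2 * image_tokens


latent_len = i2i_sequence_length(LATENT_STRIDE)
pixel_len = i2i_sequence_length(PIXEL_PATCH)
print(latent_len, pixel_len)
```

Under these assumptions the reference image doubles the image-token count in either setup, so whether the concatenation scheme is practical at this resolution depends heavily on the patch size.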
Right now, I'm trying to get rid of the tiles when editing high-resolution images! I want to provide better context and make sure the tiles don't look so out of place. And your approach to generating high-resolution images looks very promising!
I’ve already tried a few methods. For example, I used DC-AE with 128 512-channel blocks for maximum compression! But I think your approach is the best!
Thanks in advance for your reply!
Thank you for your interest.
Regarding Question 1:
Our approach is compatible with distilled models (the 1K-resolution version of L2P was trained on z-image-turbo). As with most fine-tuning methods, the model loses its few-step generation capability after fine-tuning. However, we believe the few-step capability can be maintained if fine-tuning is paired with trajectory imitation training.
Regarding Question 2:
We plan to explore L2P-FLUX.2 and its i2i applications. If there are new developments or conclusions, we will update the L2P/PixVerve repo. Please stay tuned.
Get this paper in your agent:
hf papers read 2605.20147
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash