Authors
- Dongjun Kim*
- Chieh-Hsin Lai
- Wei-Hsiang Liao
- Yuhta Takida
- Naoki Murata
- Toshimitsu Uesaka
- Yuki Mitsufuji
- Stefano Ermon*
* External authors
Venue
- NeurIPS 2024
Date
- 2024
PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
Dongjun Kim*
Stefano Ermon*
* External authors
NeurIPS 2024
2024
Abstract
To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. From the nature of progressively growing decoder, PaGoDA avoids re-training teacher/student models when we upsample the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion like LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
Related Publications
In typical multimodal contrastive learning, such as CLIP, encoders produce onepoint in the latent representation space for each input. However, one-point representation has difficulty in capturing the relationship and the similarity structure of a huge amount of instances in…
Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward model…
Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a ti…
JOIN US
Shape the Future of AI with Sony AI
We want to hear from those of you who have a strong desire
to shape the future of AI.