Authors
- Dongjun Kim*
- Chieh-Hsin Lai
- Wei-Hsiang Liao
- Yuhta Takida
- Naoki Murata
- Toshimitsu Uesaka
- Yuki Mitsufuji
- Stefano Ermon*
* External authors
Venue
- NeurIPS 2024
Date
- 2024
PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
Dongjun Kim*
Stefano Ermon*
* External authors
NeurIPS 2024
2024
Abstract
To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. From the nature of progressively growing decoder, PaGoDA avoids re-training teacher/student models when we upsample the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion like LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
Related Publications
Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and impr…
Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the ground truth nature, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require to …
Diffusion models have demonstrated exceptional performances in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenge…
JOIN US
Shape the Future of AI with Sony AI
We want to hear from those of you who have a strong desire
to shape the future of AI.