Venue

Date

Share

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Yuhta Takida

Yukara Ikemiya

Takashi Shibuya

Kazuki Shimada

Woosung Choi

Chieh-Hsin Lai

Naoki Murata

Toshimitsu Uesaka

Kengo Uchida

Yuki Mitsufuji

Wei-Hsiang Liao

TMLR-2024

2024

Abstract

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.



Related Publications

Diffusion-based Signal Refiner for Speech Enhancement and Separation

IEEE, 2026
Ryosuke Sawata, Masato Hirano*, Naoki Murata, Shusuke Takahashi*, Yuki Mitsufuji

Although recent speech processing technologies have achieved significant improvements in objective metrics, there still remains a gap in human perceptual quality. This paper proposes Diffiner, a novel solution that utilizes the powerful generative capability of diffusion mod…

PAVAS: Physics-Aware Video-to-Audio Synthesis

CVPR, 2026
Oh Hyun-Bin*, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh*, Yuki Mitsufuji

Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds…

MeanFlow Transformers with Representation Autoencoders

CVPR, 2026
Zheyuan Hu*, Chieh-Hsin Lai, Ge Wu*, Yuki Mitsufuji, Stefano Ermon*

MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE)…

  • HOME
  • Publications
  • HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.