SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

ICLR, 2024
Yuhta Takida, Masaaki Imaizumi*, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji

Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its d…

Manifold Preserving Guided Diffusion

ICLR, 2024
Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter*, Ruslan Salakhutdinov*, Stefano Ermon*

Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework th…

Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

ICLR, 2024
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon*

Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encomp…

Enhancing Semantic Communication with Deep Generative Models -- An ICASSP Special Session Overview

ICASSP, 2023
Eleonora Grassucci*, Yuki Mitsufuji, Ping Zhang*, Danilo Comminiello*

Semantic communication is poised to play a pivotal role in shaping the landscape of future AI-driven communication systems. Its challenge of extracting semantic information from the original complex content and regenerating semantically consistent data at the receiver, possi…

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

ICASSP, 2023
Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between re…

Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

ICASSP, 2023
Hao Shi*, Kazuki Shimada, Masato Hirano*, Takashi Shibuya, Yuichiro Koyama*, Zhi Zhong*, Shusuke Takahashi*, Tatsuya Kawahara*, Yuki Mitsufuji

Diffusion-based speech enhancement (SE) has been investigated recently, but its decoding is very time-consuming. One solution is to initialize the decoding process with the enhanced feature estimated by a predictive SE system. However, this two-stage method ignores the compl…

VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

ICASSP, 2023
Carlos Hernandez-Olivan*, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji

Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrin…

Zero- and Few-shot Sound Event Localization and Detection

ICASSP, 2023
Kazuki Shimada, Kengo Uchida, Yuichiro Koyama*, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji, Tatsuya Kawahara*

Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and tempor…

Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

ICASSP, 2023
Frank Cwitkowitz*, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama*, Wei-Hsiang Liao, Yuki Mitsufuji

In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several wor…

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

NeurIPS, 2023
Kazuki Shimada, Archontis Politis*, Parthasaarathy Sudarsanam*, Daniel Krause*, Kengo Uchida, Sharath Adavanne*, Aapo Hakala*, Yuichiro Koyama*, Naoya Takahashi, Shusuke Takahashi*, Tuomas Virtanen*, Yuki Mitsufuji

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper pro…

Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

ISMIR, 2023
Keisuke Toyama*, Taketo Akama*, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in the polyphonic piano content. In this case, we may rely on the capabilit…

Extending Audio Masked Autoencoders Toward Audio Restoration

WASPAA, 2023
Zhi Zhong*, Hao Shi*, Masato Hirano*, Kazuki Shimada, Kazuya Tateishi*, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji

Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced b…

The Whole Is Greater than the Sum of Its Parts: Improving DNN-based Music Source Separation

Ryosuke Sawata*, Naoya Takahashi, Stefan Uhlich*, Shusuke Takahashi*, Yuki Mitsufuji

This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) without increasing calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which…

Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

Interspeech, 2023
Ryosuke Sawata*, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi*, Yuki Mitsufuji

Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to impro…

GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

ICML, 2023
Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji

Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we prop…

FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation

ICML, 2023
Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon*

Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a partial differentia…

Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

ICASSP, 2023
Junghyun Koo*, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich*, Kyogu Lee*, Yuki Mitsufuji

We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio effects related information from a r…

DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

ICASSP, 2023
Kin Wai Cheuk, Ryosuke Sawata*, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi*, Dorien Herremans*, Yuki Mitsufuji

In this paper, we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative t…

Hierarchical Diffusion Models for Singing Voice Neural Vocoder

ICASSP, 2023
Naoya Takahashi, Mayank Kumar Singh*, Yuki Mitsufuji

Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, generating a high-quality singing voice remains challenging due to a wider variety of musical expressions in pitch, loudness, and pronunciations. In this work, we…

Unsupervised vocal dereverberation with diffusion-based generative models

ICASSP, 2023
Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui*, Yuki Mitsufuji

Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation in music falls into two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its…

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

ICLR, 2023
Hao-Wen Dong*, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley*, Taylor Berg-Kirkpatrick*

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query…

SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

ICML, 2022
Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura*, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi*, Toshiyuki Kumakura*, Yuki Mitsufuji

One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some…


December 13, 2023 | Events

Sony AI Reveals New Research Contributions at NeurIPS 2023

Sony Group Corporation and Sony AI have been active participants in the annual NeurIPS Conference for years, contributing pivotal research that has helped to propel the fields of artificial intelligence and machine learning forwar…

