
Masato Ishii

Profile

Masato Ishii is a Senior Research Scientist at Sony Research Inc. He received his Ph.D. from the University of Tokyo under the supervision of Professor Masashi Sugiyama. From 2010 to 2019, he worked as a researcher at NEC. Between 2017 and 2019, he served as a visiting researcher at RIKEN AIP. In 2019, he joined Sony Group Corporation and has been seconded to Sony Research Inc. since 2023. His research interests span from the fundamentals of machine learning to its applications in computer vision, with a current focus mainly on audio-visual generation.

Publications

TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

ICCV, 2025 | Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong*, Shusuke Takahashi*, Takashi Shibuya, Yuki Mitsufuji

Recent developments in conditional diffusion models still require heavy supervised fine-tuning for performing control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on th...

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Join…

IJCNN, 2025 | Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji

In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance align...

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

CVPR, 2025 | Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained wit...

Mining your own secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

ICLR, 2025 | Saurav Jha, Shiqi Yang*, Masato Ishii, Mengjie Zhao*, Christian Simon, Muhammad Jehanzeb Mirza, Dong Gong, Lina Yao, Shusuke Takahashi*, Yuki Mitsufuji

Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a ti...

Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

ICLR, 2025 | Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

In this study, we aim to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides single-modal models to cooperatively genera...