Sights on AI: Yuki Mitsufuji Shares Inspiration for AI Research into Music and Sound
Sony AI
August 14, 2024
The Sony AI team is a diverse group of individuals working to accomplish one common goal: accelerate the fundamental research and development of AI and enhance human imagination and creativity, particularly in the realm of entertainment. Each individual brings different experiences, along with a unique view of the technology, to this work. This insightful Q&A series, Sights on AI, highlights the career journeys of Sony AI’s leaders and offers their perspectives on a number of AI topics.
Peter Stone ・ Erica Kato Marcus ・ Tarek Besold ・ Yuki Mitsufuji
Yuki Mitsufuji is a Lead Research Scientist at Sony AI, overseeing music and sound research projects within the organization’s AI for Creators Flagship. In addition to his role at Sony AI, he is a Distinguished Engineer for Sony Group Corporation and the Head of Creative AI Lab for Sony R&D.
Yuki holds a PhD in Information Science & Technology from the University of Tokyo and has spent the last decade building his career in music and sound technology. His groundbreaking work has made him a pioneer in foundational music and sound research, including sound separation and generative models that can be applied to music, sound, and other modalities.
Today, Yuki’s research interests center on leveraging AI in areas such as sound separation, music restoration, and defensive technologies that help combat intellectual property infringement. He has a keen interest in exploring new ways AI can expand creators’ expression, including the science behind the perception of music and sound, opening doors to new styles and new dimensions in how we experience sound.
His team’s explorations aim to uncover novel approaches and tools to empower creators, offering them the ability to refine and shape their output in real time, thereby elevating their creative process and the artistic integrity of their work. In this blog, Yuki shares his inspiration for entering the fields of AI and entertainment, how Sony AI is thinking about research projects from a creator’s point of view, and why he believes AI can help aid in the future of music and sound creation.
What inspired you to enter the fields of AI and entertainment?
When I first entered the field as a computer scientist over twenty years ago, I had a deep interest in music. However, at that time, very few entertainment companies were investing in technology and AI research and development related to entertainment.
A pivotal moment came in 2013 when I attended the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) as an author. That year's conference covered topics ranging from speech processing to machine learning for signal processing and much more – and I was first exposed to AI in its current form. The event included a session with renowned computer scientist Geoffrey Hinton, during which he predicted that typical machine learning approaches of the time would be overtaken by new methods, particularly in image and speech recognition.
I drew great inspiration from this session, which spurred my journey into music and sound research. The first application in which my team used this new paradigm – deep learning – was in sound separation. At that time, no one believed sound separation could be used practically. Early on, we could pull some signals apart, but too much of the mixture still bled through. Because it didn’t work well, it wasn’t a tool creators wanted to use. Then, when we deployed deep learning, we observed significant improvement. We frequently brought our tools to creation studios for feedback, prompting us to keep iterating to meet the creators’ high standards and expectations. After about five years, we began receiving very positive feedback that the creators wanted to use our tool. In 2020, we released a successful project with sounds that had been manipulated by AI sound separation. The rest is history.
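For readers curious about the mechanics, the early deep-learning separators Yuki describes typically work on spectrograms: a network predicts a soft mask that keeps the target source and suppresses everything else. The sketch below is a minimal illustration of that mask-based approach in PyTorch – the MaskNet model, STFT settings, and input file are illustrative placeholders, not Sony AI’s actual system.

```python
import torch
import torchaudio

# Minimal sketch of mask-based source separation:
#   1) transform the mixture to a spectrogram,
#   2) let a network predict a soft mask for the target source,
#   3) apply the mask and invert back to a waveform.
# MaskNet, the STFT settings, and "mixture.wav" are illustrative placeholders,
# not Sony AI's actual models or data.

N_FFT, HOP = 2048, 512

class MaskNet(torch.nn.Module):
    """Toy mask estimator: predicts a [0, 1] mask per time-frequency bin."""
    def __init__(self, n_bins: int = N_FFT // 2 + 1):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_bins, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, n_bins), torch.nn.Sigmoid(),
        )

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (frames, bins) magnitude spectrogram of the mixture
        return self.net(mag)

def separate(mixture: torch.Tensor, model: MaskNet) -> torch.Tensor:
    window = torch.hann_window(N_FFT)
    spec = torch.stft(mixture, N_FFT, HOP, window=window, return_complex=True)
    mask = model(spec.abs().T).T            # (bins, frames), values in [0, 1]
    return torch.istft(spec * mask, N_FFT, HOP, window=window)

# In practice the model is trained on paired mixture/isolated-source data;
# here it is untrained and only illustrates the data flow.
waveform, sr = torchaudio.load("mixture.wav")
target = separate(waveform.mean(dim=0), MaskNet())
```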
You worked on the music restoration of the famous Canadian pianist Glenn Gould and the soundtrack restoration of Lawrence of Arabia. What did you do in these projects, and how have they helped shape the work you are currently doing at Sony AI?
As I mentioned previously, the beginning phases of our work in sound separation took a long time to get right.
One of our first big projects was a music restoration involving the famous Canadian classical pianist Glenn Gould. Kanji Ishimaru, a renowned Japanese actor and musician, dreamed of performing alongside Gould, whom he deeply admired. Unfortunately, this was thought to be impossible because Gould had passed away. However, there was a recording of Gould’s work that included both his playing and spoken word, and Ishimaru thought that if he could somehow extract the piano performance, he could “time travel” to collaborate with him. He was looking for a tool to achieve this, which led him to our team.
We started work, and the first trial was not very successful – the quality was unsatisfactory. After receiving feedback from Ishimaru and his audio engineer, we continued to iterate the model – I think it was three or more times – until we finally got to a place where they were really impressed with the sound quality. With the final extracted piano performance, Ishimaru recorded his voice, the audio engineer mixed it, and the final product was released. It was very difficult to achieve this quality of sound, but it was rewarding and exciting. We were able to make someone’s dream come true, and it was one of the first examples of using AI to bring together older recordings with new artists, highlighting the creative potential of AI in sound and music restoration and collaboration.
Around the same time, we also received a request from Sony Pictures to remaster the sound of the classic Academy Award-winning movie, Lawrence of Arabia. This was a nearly impossible task because old movies like this didn’t have separate recordings of the sound elements. We proposed using our sound separation tool and conveyed that we could extract sounds like dialogue and background sound effects. And we actually did it! When the film was re-released, they included a booklet where the audio engineer wrote some commentary on how the sound was created. The engineer shared our name with the comment, “Thanks to this tool, this project was successful,” which is the most rewarding recognition you can receive.
These first projects are a reminder of our aspiration to make creators’ dreams a reality and to demonstrate that AI can be deployed in the music industry in a responsible way. Training models ethically and ensuring that our work respects artists’ rights are priorities for Sony AI. This is our catalyst for working through the challenges of these intricate projects. As we explore new ways AI can enhance the audio experience for viewers and listeners, we want to continue providing tools that offer creators a new realm of possibilities beyond their imaginations.
You shared some of the complexities around your early work in music and sound. What makes AI research in the realm of music and sound, specifically, so complex?
Music is an inherently complex area for AI researchers. In image generation, for example, AI produces single snapshots with no constraints on timing. Music, on the other hand, is difficult because audio is much more complex. Not only does generation have to be coherent across time, but there is also complexity from timbre, scale, and structure, which are deeply rooted in human ingenuity. This makes our use of AI as an amplifier for these layers of human creation a thoughtful and meticulous process.
For example, classical music is very challenging to work on because the instruments are typically played together, as an orchestra, and the pieces are long. Additionally, many different pitches are played over the course of a piece. Then there is the challenge of recording individual instruments alone, which rarely happens because it is a time-consuming and very costly process. This makes it difficult to isolate and manipulate specific sounds without affecting the overall composition.
Jazz presents its own set of complexities, particularly due to the improvisation inherent in the genre – you will rarely see the same performance twice, as artists are always looking to create new versions of the pieces they play. The lack of available data encapsulating all of the different possible improvisational variations in recordings or live performances adds to the challenge.
The pop and rock genres are easier to work on because they are more modern and more heavily engineered in production, which helps in separating the different elements. The songs are also usually shorter, and many pop artists follow a structured approach to song creation. For instance, many pop artists have a vision for their song and work through the elements separately – writing the song, assigning different instruments to the score, and then recording vocals afterward.
Beyond these technical aspects, there are also challenges related to new music creation, such as crediting and copyright. Troves of content have existed for hundreds of years, and artists must ensure that they are truly creating new music and are properly licensing and crediting any elements that previous artists have created. AI can assist in this process, but it must be used thoughtfully to respect and protect the artistry and intellectual property of musicians.
How is Sony AI thinking about its research from the creators’ point of view and its impact on their work?
Our mission at Sony AI is to work hand-in-hand with creators to ensure the technology we develop genuinely enhances their work. We aim to create tools that resonate with their creative processes, ultimately pushing the boundaries of what is possible in music and sound.
Sound separation is a key focus of our research, not only as a restoration tool but as a new paradigm for thinking about music. Traditionally, sound separation was seen as a method for isolating and cleaning up audio tracks. However, our work has shown that it can fundamentally change how creators interact with music. By providing the ability to isolate and engineer individual elements within a piece, we empower artists – like Ishimaru – to explore new creative possibilities and achieve effects that were previously unattainable.
As we continue our research, we’re excited about the potential for real-time applications in music and sound. Our ultimate goal is to provide frameworks that creators can test, use, and provide feedback on. This iterative process is critical, as it ensures we are delivering the best possible tools to enhance their artistry.
One recent example of this from the team includes published research that achieved state-of-the-art results in deep generative modeling – Consistency Trajectory Models (CTM): Learning Probability Flow ODE Trajectory of Diffusion and SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer. The team often works first with text and image generation models, which are less complex and more readily available, before applying the findings to more demanding music and sound tasks. Because they address the real-time efficiency and granular control professional creators need, these new models can be applied to various other content types, such as audio, video, and 3D.
As a follow-up to the CTM work, we also recently introduced SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation, a new model that enables flexible transitioning between high-quality one-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with one-step samples before refining them through multi-step generation.
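The one-step versus multi-step trade-off described here mirrors the general sampling loop of consistency-style models: a single forward pass maps noise directly to a sample, and quality can be improved by repeatedly re-noising and denoising the result. The sketch below illustrates that generic loop – the model interface, noise range, and schedule are assumptions for illustration, not SoundCTM’s actual API.

```python
import math
import torch

# Generic multistep sampler for a consistency-style generative model.
# `model(x, sigma)` is assumed to map a noisy latent at noise level `sigma`
# directly to an estimate of the clean sample; the interface, noise range,
# and schedule below are illustrative assumptions, not SoundCTM's actual API.

SIGMA_MAX, SIGMA_MIN = 80.0, 0.002

@torch.no_grad()
def sample(model, shape, steps: int = 1, device: str = "cpu") -> torch.Tensor:
    # Start from pure noise at the highest noise level.
    x = torch.randn(shape, device=device) * SIGMA_MAX
    # Geometric grid of noise levels from high to low.
    sigmas = torch.logspace(math.log10(SIGMA_MAX), math.log10(SIGMA_MIN),
                            steps, device=device)
    estimate = model(x, sigmas[0])          # one-step result (steps == 1 stops here)
    for sigma in sigmas[1:]:
        # Re-noise the current estimate down to the next noise level,
        # then denoise again; each extra round trades compute for quality.
        noise = torch.randn_like(estimate)
        x = estimate + noise * (sigma ** 2 - SIGMA_MIN ** 2).sqrt()
        estimate = model(x, sigma)
    return estimate

# Usage: a fast one-step draft for interactive control, then a slower,
# higher-quality multi-step pass once the creator is happy with the draft.
# draft   = sample(consistency_model, (1, latent_dim), steps=1)
# refined = sample(consistency_model, (1, latent_dim), steps=4)
```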
While some areas of our research have been put into practice, such as sound separation techniques, much of the work is still in the realm of scientific exploration. In sound generation specifically, we’ve reached a point where it is an interesting engineering capability, but one that still needs to be validated as genuinely useful for artists. We see our role as analogous to designing and test-driving a new car. While we can design and model all we want, the true test comes when the car is actually driven; for us, this means when the tools are put into the hands of creators. When they enable professional creators to stretch their imaginations, allow a game designer to think about sound in a new way, and ultimately let artists protect and enhance their artistry – then we will have succeeded.
What is your point of view on how AI can aid in the future of music and sound creation?
AI has many practical applications to aid in the future of music and sound. One that is very prominent is the ability to help with downstream business tasks for large catalog holders. Efforts such as tagging, filtering, and even recommendations within catalogs are arduous, complicated, and time-consuming, especially as music catalogs grow.
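In practice, catalog tasks like these are often automated by computing an embedding for each track with a pretrained audio model and then tagging, filtering, and recommending by similarity search over those embeddings. A minimal sketch of that similarity-search step follows – the track names and embedding values are hypothetical placeholders.

```python
import numpy as np

# Each track is assumed to have been mapped to a fixed-size embedding by some
# pretrained audio model; the track names and embedding values here are
# hypothetical placeholders.
track_ids = ["track_a", "track_b", "track_c", "track_d"]
catalog_embeddings = np.random.rand(len(track_ids), 128)   # (tracks, dims)

def recommend(query_embedding: np.ndarray, top_k: int = 3) -> list:
    # Cosine similarity between the query and every track in the catalog.
    catalog = catalog_embeddings / np.linalg.norm(catalog_embeddings,
                                                  axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = catalog @ query
    best = np.argsort(scores)[::-1][:top_k]
    return [track_ids[i] for i in best]

# Tracks most similar to "track_a" – the same nearest-neighbor machinery can
# support tag propagation and duplicate filtering.
print(recommend(catalog_embeddings[0]))
```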
I also believe that AI will play a greater role in intellectual property (IP) protection. One example of this is audio watermarking. Watermarks should be embedded within signals, but this can be complex. With the technology available today, actions like voice transformation can destroy watermarks. Additionally, conventional watermarking often fails to hold up as it should: we frequently see individuals changing elements like timbre or applying time stretching, which ultimately destroys watermarks. AI can also be used for music matching, which can be an effective method for detecting musical similarities in the absence of watermarks – important given the sensitivities around AI in the music industry as a whole.
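To make that fragility concrete, the toy sketch below embeds a conventional spread-spectrum watermark directly in a waveform and detects it by correlation; a mild time stretch is enough to break the detection. The key, strength, and test signal are illustrative assumptions, and real watermarking systems are considerably more sophisticated.

```python
import numpy as np

# Toy spread-spectrum watermark: embed a low-amplitude pseudo-random sequence
# directly in the waveform and detect it by normalized correlation. The key,
# strength, and test tone are illustrative; real systems embed in a perceptual
# or frequency domain and are considerably more robust.

def embed(signal: np.ndarray, key: int, strength: float = 0.05) -> np.ndarray:
    watermark = np.random.default_rng(key).standard_normal(len(signal))
    return signal + strength * watermark      # strength exaggerated for a clear demo

def detect(signal: np.ndarray, key: int) -> float:
    watermark = np.random.default_rng(key).standard_normal(len(signal))
    # High for watermarked audio, near zero for unmarked or misaligned audio.
    return float(signal @ watermark /
                 (np.linalg.norm(signal) * np.linalg.norm(watermark)))

sr = 16000
audio = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)      # 1 s test tone
marked = embed(audio, key=42)
print(detect(marked, key=42))    # clearly above the unmarked baseline
print(detect(audio, key=42))     # near zero

# Even a mild time stretch (here, naive resampling by ~2%) misaligns the
# pseudo-random sequence, so the correlation collapses – the fragility
# described above.
stretched = np.interp(np.arange(0, len(marked), 1.02),
                      np.arange(len(marked)), marked)
print(detect(stretched, key=42))
```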
AI can be a powerful tool that sets up creators for great success, but we must remember that there have been generations of artists and musicians before AI whose rights and artistic contributions must be valued and respected. We must truly understand how we can use this technology to empower the next generation of artists to create independently and unearth new musical possibilities without infringing on the history or work of artists past and present.
For more information on Sony AI’s music and sound research, visit the organization’s AI for Creators page.