Sony AI’s Deep RL Team on Why the Hardest Problems Still Matter

Game AI | Gaming | Life at Sony AI | Robotics

Sony AI | August 26, 2025

From sentiment analysis to interactive robotics, AI tackles a range of challenges. While some tasks involve parsing patterns or generating outputs from static data, others require systems to make decisions, adapt in real time, and learn from experience. This is where reinforcement learning thrives: at the core of learning how to act, not just predict.

“If you care about AI, you can't avoid reinforcement learning because the problem it describes is everywhere in AI,” says James MacGlashan, Senior Staff Research Scientist on Sony AI’s Gaming and Interactive Agents team.

Put another way: RL isn’t just one of many tools; it’s the essential lens through which the hardest and most important AI problems will be solved.

At Sony AI, we are deeply committed to exploring and expanding the potential of reinforcement learning across diverse domains. With some of the field’s top talent, we’ve helped move RL from theory to reality, from academic curiosity to robust systems like GT Sophy, pushing the boundaries to solve real-world problems.

Read on to hear from members of our Reinforcement Learning Team about why this type of AI is integral to understanding learning, why LLMs are not the be-all and end-all, and why RL’s advancement is necessary for the future of robotics and beyond.

The Learning Method That Mirrors Life

Reinforcement learning isn’t just a branch of machine learning. It’s a way of framing intelligence itself: agents acting in an environment, learning through feedback, shaping their behavior over time. It mirrors how people and animals learn: by doing, by trying, by failing.
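
That framing is concrete enough to sketch in a few lines of code. Below is a minimal, illustrative agent-environment loop, assuming the open-source Gymnasium library and a purely random policy for the sake of the example (this is not Sony AI’s code, just the shape of the feedback loop every RL system is built around): the agent observes, acts, and receives a reward, and that feedback is all it has to shape its behavior.

```python
import gymnasium as gym

# Illustrative only: a "blank slate" agent acting at random in a toy environment.
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()       # act (here: pure, uninformed exploration)
    observation, reward, terminated, truncated, info = env.step(action)  # the world responds
    episode_return += reward                  # feedback that shapes behavior over time
    done = terminated or truncated

print(f"Episode return: {episode_return}")
env.close()
```

A real RL agent replaces the random action choice with a policy that is continually updated from the rewards it collects.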

Unlike supervised learning or generative models that learn from curated datasets, RL starts from scratch. An agent enters an unfamiliar world, doesn’t know what works, and must explore. Despite the recent hype around generative models and LLMs, RL remains the framework best suited to building agents that do more than predict.

“Everyone's glommed onto LLMs—but they miss the bigger questions: how do agents actually learn in real environments? Not just output text, but act in the world,” notes MacGlashan.

In fact, reinforcement learning has played a pivotal role in making LLMs more usable. “A form of RL has played a pivotal role in the relatively recent boom of large language models,” explains Peter Stone, Chief Scientist at Sony AI. “Specifically, reinforcement learning from human feedback, or RLHF, was one of the main differences between GPT-3 and GPT-3.5, which became known as ‘ChatGPT’ and took the world by storm in late 2022 and early 2023.”

Stone continues: “Until then, language models simply output the most likely string of words from their training data in a way that was not very controllable, and thus not aligned with typical human preferences. RLHF was used as a form of tuning these language models to output strings that are more likely to be preferred by their human users.”

The idea is simple but powerful: people rate the possible responses of a language model, and reinforcement learning tunes the model to favor the highly rated ones. “Some of the earliest work in RLHF, long before the rise of large language models, was actually introduced by some of our Sony AI team,” says Stone.

“I was the Ph.D. advisor of Brad Knox, who introduced the first RLHF system, known as TAMER, back in 2008. And then shortly thereafter, James MacGlashan and colleagues introduced a variant called COACH. Though we couldn't have predicted it at the time, the methodologies introduced back then for using human preferences as an RL reward signal have ended up being perfectly suited to the needs of aligning the outputs of large language models with human preferences.”

This is what attracted many researchers to the field in the first place.

“I fell in love with it because it’s a very simple model, but it reflects a lot of real-world complexity,” says Harm van Seijen, Staff Research Scientist on the Gaming and Interactive Agents team. “It includes fundamental aspects of human decision-making: trading off short-term versus long-term reward, exploring versus exploiting, and operating under uncertainty.”

Adding to this sentiment: “The beauty of reinforcement learning is that you don’t need to spell out every move – just reward the outcome,” says Varun Kompella, Staff Research Scientist on the Gaming and Interactive Agents team. “It’s like training a dog to fetch. The agent figures out how to do it by exploring.”

The Beauty (and Brutality) of Learning by Doing

Reinforcement learning is powerful because it mimics how we learn. It’s also difficult for the same reason.

MacGlashan explained that RL is more complex than other AI methods because agents must learn through trial and error in constantly changing environments, a process that still demands specialized expertise.

Even exploration, the starting point of learning, is a massive hurdle. Van Seijen pointed out that, unlike humans, most RL agents explore inefficiently—often through random trial and error.

The inability to generalize across environments remains one of RL’s most persistent challenges.

Robotics, especially, lays bare these challenges. Kompella recalls his first experience training a robot outside simulation: “I turned it on, it hit the table, and a tendon snapped. That was four months of repairs.”

These aren’t edge cases. They’re reflections of the real-world messiness RL is trying to master. As van Seijen explains, “Our agents need to be able to generalize across situations—to adapt like humans do, not retrain from scratch every time the environment changes.”

This is why we remain focused on RL’s core challenges: reducing the amount of experience needed to learn, stabilizing training, and resolving the long-standing trade-offs in generalization.

GT Sophy: What Happens When It Works

Sony AI’s GT Sophy racing agent is a case study in reinforcement learning done right. It didn’t learn to race by watching human drivers or mimicking behavior. It learned by doing, crashing, restarting.

“We didn’t want to model average drivers. We wanted to be better than the best,” says MacGlashan. “And for that, dataset fitting is off the table.”

Games are an ideal domain for reinforcement learning: they are high-fidelity, high-stakes, and ruthlessly honest. Importantly, they are safe environments in which to (initially) fail while gathering lots of experience quickly.

“And gamers are unforgiving,” says Kompella. “They’ll find the holes in your system in minutes. You can’t fake it.”

“With GT Sophy, we had an environment where we could scale experience and test our models under real-time constraints,” adds Stone. “It’s a powerful proving ground.”

As MacGlashan points out, “There are places where we still had weaknesses, but we built agents that could outperform the best human racers in many track and car combinations. That’s real.”

The project shows that RL is not just a lab curiosity. It proved that, when paired with the right environment, RL can deliver robust, world-class performance in a real product.

From Games to the Real World

But Sony AI’s vision extends far beyond gaming. Reinforcement learning is also core to the company’s work in robotics and interactive systems.

Calling robotics the “gold standard,” MacGlashan stressed that success requires safe navigation in real-world conditions, not just benchmark performance.

Yet robotics remains notoriously difficult for RL. As Kompella notes, “Until we have robots that can explore without damaging themselves, we’re limited in how we can train them. That’s why games are a good bridge: they’re safe, scalable, and simulate many of the same challenges.”

Even in games, RL faces the same core obstacles: exploration, generalization, and efficiency. As van Seijen points out, “You can train an agent for one environment, but slight changes can throw it off. That’s not acceptable for systems we want to rely on.”

What’s Holding RL Back (and What We’re Working On)

The field has made progress. Algorithms like PPO and SAC have dramatically reduced the amount of experience that is needed to learn. But problems like instability, brittleness, and slow training remain unsolved. “Unlike supervised learning, RL algorithms can easily break,” says MacGlashan. “They diverge. That’s not okay.”
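
To give a sense of how accessible these off-the-shelf algorithms have become, here is a hedged sketch of training a PPO agent on a toy control task, assuming the open-source Stable-Baselines3 and Gymnasium libraries purely for illustration (not a description of Sony AI’s tooling). The few lines below hide exactly the issues MacGlashan describes: on harder problems, the same loop can destabilize or diverge.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Illustrative sketch: an off-the-shelf PPO implementation on a toy control task.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)   # actor-critic policy, default hyperparameters
model.learn(total_timesteps=50_000)        # learn purely from trial-and-error experience

# Roll out the learned policy for one episode.
obs, info = env.reset(seed=0)
done, episode_return = False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(int(action))  # discrete action
    episode_return += reward
    done = terminated or truncated
print(f"Episode return: {episode_return}")
env.close()
```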

Van Seijen elaborates: “We want agents that don’t just memorize optimal policies—we want agents that adapt quickly. That can work with people. That can handle novelty.”

One area of interest is leveraging large language models to improve RL. “LLMs contain a lot of useful knowledge,” says van Seijen. “But we’re still figuring out how to integrate them meaningfully into the RL process. There’s potential, but the crossover hasn’t fully happened yet.”

As researchers, yes, we know RL is hard. Most current RL methods are brittle and harder to train, but the approach is also more honest. We know its potential remains unmatched, and we continue to unlock that potential through ongoing research.

Another challenge that’s becoming increasingly relevant is alignment. “One of the biggest appeals of RL is that it allows a person to specify a reward function, and to let the RL agent figure out on its own, through trial and error, how to optimize the reward function,” explains Peter Stone. “The premise is that it is easier for a person to tell an agent, such as a robot, what to do (via a reward function) than how to do it.”

But Stone also points out a crucial limitation: “It can be difficult to specify a reward function that leads to the behavior that one really wants. That is, it is difficult to align a reward function with a person's true preferences.”

A promising direction? Learning from preference feedback.

“It is typically relatively easy for a person to express preferences among possible behaviors,” Stone explains. “If one is shown two possible agent behaviors, it is usually fairly easy to indicate which of the behaviors is preferred.”

This preference-based feedback loop—rather than trying to encode complex value systems into rigid objectives—could be key to building RL systems that align more naturally with human values.

Stone and collaborators, including former students, have been exploring how to induce reward functions from such human preferences. “I think that this sort of reward function induction will be a necessary component for aligned RL systems,” he says, pointing to a 2023 TMLR paper as an example of this emerging direction.
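
As a rough sketch of what inducing a reward function from pairwise preferences can look like, the snippet below uses the standard Bradley-Terry formulation in PyTorch; this is an illustrative assumption on our part, not the specific method from the TMLR paper or Sony AI’s internal code. A small network is trained so that preferred behaviors score higher than rejected ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a behavior (a trajectory or response, encoded as a feature vector) with a scalar reward."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def preference_loss(reward_model: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the reward of preferred behaviors above that of rejected ones."""
    return -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()

# Illustrative training step on hypothetical preference data.
feature_dim = 32
model = RewardModel(feature_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

preferred = torch.randn(16, feature_dim)   # features of the behaviors people preferred
rejected = torch.randn(16, feature_dim)    # features of the behaviors they did not
loss = preference_loss(model, preferred, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The induced reward model can then stand in for a hand-written reward function when training an RL agent, or when fine-tuning a language model as in RLHF.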

Why Sony AI Is All In

Sony AI isn’t exploring RL just because it’s interesting; we believe RL is necessary for the kinds of intelligent systems the future demands. We’ve been committed to this approach since day one.

“I think we’re among the last generation of RL researchers prior to the deep RL revolution,” says MacGlashan. “We were here before it was trendy. We’ve seen it mature. And we know its potential.”

Peter Stone puts it clearly: “Reinforcement learning is a fundamental building block of intelligence. It’s how people learn, how animals adapt, and it’s how machines will need to learn if we want them to truly understand and navigate the world.”

And as AI systems become increasingly embedded in our lives, alignment matters more than ever.

“It’s not enough for an agent to optimize a reward signal—it needs to do so in a way that reflects human values,” says Stone. “That’s why our research is now also focused on how preferences can guide behavior—how humans can shape learning not just through code, but through feedback.”

Sony AI’s work on reward function induction and human preference modeling offers one path forward: teaching agents to adapt by understanding what people want—not just what’s easiest to optimize.

In other words: you don’t work on reinforcement learning because it’s easy. You work on it because it matters. And Sony AI is building the team, the tools, and the systems that will carry it forward—from games to robotics, from simulation to reality, and from research to impact. We’re solving long-term challenges, ones that will shape how AI works in the world.

Explore More from the Team Shaping the Future of RL

If you’re curious where reinforcement learning is headed next, take a deeper dive into Sony AI’s ongoing research, interviews with our RL team, and recent publications driving the field forward.

For further reading, check out our blog and robotics flagship page for more information and RL research highlights. And to learn more about Peter Stone and Pete Wurman, dive into our Sights on AI Series where Stone and Wurman sit down for an intimate conversation on RL and gaming AI:

- Sights on AI: Peter Stone Talks Reinforcement Learning

- Sights on AI: Pete Wurman Discusses Career in AI and How the Technology is Evolving the Gaming Landscape

- The Challenge to Create a Pandemic Simulator
