CVPR 2026: Sony AI's Latest in Computer Vision Research

Events | June 1, 2026


The Computer Vision and Pattern Recognition conference (CVPR) brings together work that defines how machines see, reconstruct, and generate the visual world. This year, Sony AI and its collaborators are presenting six papers at CVPR 2026 in Denver.

The research spans generative modeling, 3D scene understanding, video-to-audio synthesis, domain-adaptive perception, and visual token efficiency—each paper targeting a different constraint on building AI systems that hold up outside the lab, at the scale deployment demands.

CVPR '26 Tutorial on Diffusion Models

June 3, 2026 · Denver, Colorado · 8:05 AM MT

This tutorial, taught by Sony AI’s Chieh-Hsin (Jesse) Lai and Yuki Mitsufuji, covers the theoretical and empirical foundations of diffusion and flow-map models for fast sampling.

The tutorial's first half traces the field's origins across three lineages: variational (VAE → DDPM), score-based (energy models → Score SDE), and flow-based (normalizing flows → Flow Matching & Rectified Flow). From there it moves into distillation techniques for fast sampling (including DMD) and flow-map models like Consistency Model, Consistency Trajectory Model, and the newly released MeanFlow. Two live demos round out the section: one on distribution-based distillation, one on training a flow-map model from scratch.

The second half, led by Mitsufuji, zooms out to AI content creation and protection: diffusion memorization, attribution, and audio-visual generation via MMAudio.

The tutorial is grounded in the instructors' own textbook, The Principles of Diffusion Models. Slides and recordings will be available after the conference.

To learn more about MMAudio, visit:
Unlocking the Future of Video-to-Audio Synthesis: Inside the MMAudio Model - Sony AI

To read our interview with Chieh-Hsin (Jesse) Lai about The Principles of Diffusion Models, visit:
On Writing The Principles of Diffusion Models, A Q&A With Sony AI Researcher, Jesse Lai

Advancing Embodied AI with Foundation Models: Peter Stone to Keynote CVPR Workshop

Peter Stone, Chief Scientist at Sony AI, will be a keynote speaker at the upcoming 1st Workshop on Deployment of Foundation Models for Embodied AI (WDFM-EAI). This workshop aims to push the boundaries of how large foundation models can be efficiently and robustly deployed in autonomous systems, from self-driving cars to legged robots.

Building on the success of previous workshops focused on distilling foundation models for autonomous driving, WDFM-EAI expands the scope to uncover synergies across embodied AI domains. The workshop will dive into cutting-edge research on multimodal perception, real-time decision making with foundation models, vision-language-action models, world models for planning and control, model compression techniques, and ensuring safety in multimodal autonomous systems.

We're honored to contribute to this important dialogue on the future of foundation models in embodied AI. Peter's keynote will explore the vast potential and key challenges in deploying these powerful models in real-world autonomous systems to enable more intelligent, adaptive, and generalizable embodied agents.

Join us at WDFM-EAI to collaborate with leading researchers from academia and industry on this exciting frontier:

June 3, 2026 at CVPR in Denver, CO
Peter Stone will join other luminaries from Waabi, Tesla, Waymo, Wayve, Uber, NVIDIA, Physical Intelligence, XPENG & GM in this keynote presentation.

For more information, visit: https://wdfm-eai.github.io/CVPR26/

MeanFlow Transformers with Representation Autoencoders

Authors: Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, Stefano Ermon

Research paper: https://arxiv.org/abs/2511.13019

Project site: https://github.com/sony/mf-rae

Generating a high-quality image typically requires a model to take many small, iterative steps, akin to slowly developing a photograph in a darkroom. MeanFlow (MF) takes a different approach. Rather than learning each incremental transition, it learns to jump directly from noise to a finished image in a single step. The challenge has been making that leap both fast and stable.
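To make the contrast concrete, here is a minimal sketch on a toy one-dimensional ODE, dz/dt = z, whose exact solution is known. It is not the paper's model: the closed-form `average_velocity` function is a hypothetical stand-in for the network that MeanFlow would train, and the toy ODE is an assumption chosen only because its trajectory is curved. The sketch compares many small Euler steps against a single jump that uses the average velocity over the whole interval.

```python
import math

def velocity(z, t):
    """Instantaneous velocity of the toy ODE dz/dt = z (assumed for illustration)."""
    return z

def average_velocity(z_t, r, t):
    """Closed-form average velocity between times r and t for the toy ODE.

    Stand-in for a learned network: (z_t - z_r) / (t - r), with z_r = z_t * exp(r - t).
    """
    return z_t * (1.0 - math.exp(r - t)) / (t - r)

def euler_sample(z_1, steps):
    """Integrate from t=1 back to t=0 with many small Euler steps."""
    z, dt = z_1, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        z = z - dt * velocity(z, t)
    return z

def one_step_sample(z_1):
    """Jump from t=1 to t=0 in a single step using the average velocity."""
    return z_1 - (1.0 - 0.0) * average_velocity(z_1, 0.0, 1.0)

z_1 = 2.0
exact = z_1 * math.exp(-1.0)
one_step_error = abs(one_step_sample(z_1) - exact)
euler_error_10 = abs(euler_sample(z_1, 10) - exact)
```

In this toy setting the single jump is exact because the average velocity is known in closed form, while ten Euler steps still leave a visible error. In practice the average velocity has to be learned, which is where the training cost and stability questions discussed here come in.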

This paper addresses a core bottleneck in MeanFlow training: the computational cost of the decoder that translates generated representations back into pixels. In standard latent MF, that decoder (the SD-VAE) consumes approximately 73% of the total generation cost. The team replaces it with a Representation Autoencoder (RAE), which uses a frozen pre-trained vision encoder to supply semantically rich latent features paired with a lightweight decoder. The result reduces decoder cost by roughly 3×.

Training MF in this new latent space, however, introduces its own instability: gradients explode almost immediately regardless of initialization strategy. To address this, the authors introduce Consistency Mid-Training (CMT), which gives the model a trajectory-aware starting point by having it learn the long jumps that MF will ultimately need to make, guided by a pre-trained teacher's known trajectory rather than random initialization. As the paper describes, this approach learns a "trajectory-aware initialization by following the numerical PF-ODE trajectory of a pre-trained flow matching model."

The practical gains are substantial. On ImageNet 256, MF-RAE achieves a 1-step FID of 2.03, compared to 3.43 for vanilla MF, while reducing total training cost by 83% and sampling compute by 38%. For applications where fast, high-quality image generation matters, from real-time creative tools to synthetic data pipelines, this represents a meaningful step toward generation that is both cheaper to train and faster to run.

Read the paper →

Face Time Traveller: Travel Through Ages Without Losing Identity

Authors: Purbayan Kar, Ayush Ghadiya, Vishal Chudasama, Pankaj Wasnik (Sony Research India), C.V. Jawahar (IIIT Hyderabad)

Research paper: https://arxiv.org/abs/2602.22819
Project site: https://research.sri-media-analysis.com/face-time-traveller-cvpr26/

Face aging is the simulation of age progression or regression in a face image, and it is "vital in entertainment, forensics, and digital archiving." The creative demand is well established. In film and advertising, productions like The Irishman and public campaigns like the David Beckham malaria awareness project have demonstrated the appetite for realistic age transformation at scale. In gaming, face aging allows characters to evolve over time. In heritage and archival work, it supports visualizing historical figures across different life stages. Traditional approaches rely on costly, labor-intensive VFX pipelines to deliver these results. The researchers note that modern face aging models can achieve similar visual realism "at significantly lower time and cost without prosthetics or manual VFX" while "preserving the actor's identity across different lifespans." The research question FaceTT addresses is whether learned models can deliver that quality reliably, and whether they can do so without sacrificing identity in the process.

The challenge is harder than it appears. Face aging is "a complex and ill-posed problem" shaped by both intrinsic factors (biological and identity-related attributes that evolve naturally with age, such as skin texture and bone structure) and extrinsic ones (UV exposure, lifestyle, and environmental effects). Existing models that rely on simple numerical age representations overlook this interplay. A prompt like "Photo of a 60-year-old person" does not carry enough contextual grounding to drive realistic synthesis, and high-level conditions like hair loss or weight gain are not tied to low-level visual features unless the prompt supplies that detail. The result is artifacts, background inconsistency, and identity drift: the subject looks older but no longer quite like themselves.

FaceTT addresses this with three components. A Face-Attribute-Aware Prompt Refinement strategy constructs semantically rich prompts that explicitly encode both intrinsic and extrinsic aging cues, extracted from the input image using a vision-language model. This moves conditioning from a vague age label to a visually grounded description of how this particular person ages. A tuning-free Angular Inversion method then maps input faces into the diffusion model's latent space without iterative optimization, reducing computational overhead while maintaining reconstruction fidelity. Finally, an Adaptive Attention Control mechanism dynamically modulates cross-attention for semantic aging cues and self-attention for structural preservation, replacing the static, uniform attention strategies that cause background hallucination and fail to isolate age-relevant facial regions in prior methods.

FaceTT also introduces Cyclic Identity Similarity, a new evaluation protocol that measures identity preservation through cyclic transformations (aging a face forward and then back) to provide a reference-independent consistency measure. This addresses a known limitation of standard benchmarks, where paired ground-truth data across age gaps is scarce. Extensive experiments on benchmark datasets and in-the-wild images demonstrate FaceTT achieves superior identity retention, background preservation, and aging realism over state-of-the-art methods.
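The cyclic idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's protocol: the use of cosine similarity, the `age_model` and `embed` callables, and the dictionary-based toy "images" are all hypothetical stand-ins for a real aging model and a real face-identity encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cyclic_identity_similarity(image, source_age, target_age, age_model, embed):
    """Age forward to target_age, back to source_age, then compare identities."""
    aged = age_model(image, target_age)
    restored = age_model(aged, source_age)
    return cosine_similarity(embed(image), embed(restored))

# Toy stand-ins (assumptions): an "image" is a dict holding identity features and an age.
def perfect_ager(image, age):
    """Changes only the age and leaves identity features untouched."""
    return {"identity": list(image["identity"]), "age": age}

def drifting_ager(image, age):
    """Adds a fixed offset to the identity features on every edit."""
    return {"identity": [v + 0.5 for v in image["identity"]], "age": age}

def embed(image):
    """Hypothetical identity encoder: returns the stored identity features."""
    return image["identity"]

face = {"identity": [1.0, 0.0, 0.0], "age": 25}
score_perfect = cyclic_identity_similarity(face, 25, 60, perfect_ager, embed)
score_drift = cyclic_identity_similarity(face, 25, 60, drifting_ager, embed)
```

Because the comparison is between the input and its own round trip, no photograph of the same person at the target age is needed, which is the property the protocol relies on.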

Read the paper →

ORAL PRESENTATION
PAVAS: Physics-Aware Video-to-Audio Synthesis

Authors: Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji

Research paper: https://arxiv.org/abs/2512.08282
Project site: https://physics-aware-video-to-audio-synthesis.github.io/

When a hammer strikes metal or a ball bounces on a floor, a human listener instinctively expects the sound to match what they see: heavier objects land harder, faster motion produces sharper impacts. Current video-to-audio models often miss this connection. They can recognize that a hammer should produce a "metallic clang," but they struggle to modulate loudness or spectral sharpness based on the actual strength and dynamics of the impact. As the researchers explain, “We refer to this discrepancy as a lack of physical grounding—implicit modeling between visual dynamics and acoustic behavior.”

PAVAS addresses this by injecting explicit physical reasoning into a latent diffusion-based video-to-audio pipeline. The system uses a Physics Parameter Estimator (PPE) to extract object-level mass and velocity from video: a Vision-Language Model infers mass from visual and semantic context, while a segmentation-based dynamic 3D reconstruction module recovers motion trajectories for velocity estimation. A Physics-Driven Audio Adapter (Phy-Adapter) then feeds those estimates into the diffusion model as conditioning signals, so that generated audio reflects the underlying object dynamics rather than just appearance.

To evaluate this, the team curates VGG-Impact, a benchmark of 272 object-object interaction moments drawn from VGGSound, and introduces the Audio-Physics Correlation Coefficient (APCC) to measure how closely changes in kinetic energy match changes in spectral onset strength. Across both VGGSound and VGG-Impact, PAVAS achieves the strongest physical consistency among evaluated models while maintaining competitive perceptual quality, and a user study ranks it highest on all four subjective criteria including physical plausibility. As the authors describe, the work moves "beyond perceptual alignment toward physics-aware sound generation."
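As a rough intuition for what a metric like APCC rewards, the sketch below correlates kinetic energy with onset strength across a handful of impacts. The Pearson correlation is a stand-in chosen for illustration; the paper's exact definition may differ, and every number here (masses, speeds, onset strengths) is invented.

```python
import math

def kinetic_energy(mass, velocity):
    """Classical kinetic energy, 0.5 * m * v^2."""
    return 0.5 * mass * velocity ** 2

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical impacts: (mass in kg, speed in m/s).
impacts = [(0.1, 1.0), (0.1, 3.0), (1.0, 2.0), (2.0, 3.0)]
energies = [kinetic_energy(m, v) for m, v in impacts]

# Hypothetical onset strengths from two generators.
grounded_onsets = [0.2, 1.0, 4.1, 17.5]   # louder for more energetic impacts
ungrounded_onsets = [5.0, 5.1, 4.9, 5.0]  # roughly constant loudness

apcc_grounded = pearson(energies, grounded_onsets)
apcc_ungrounded = pearson(energies, ungrounded_onsets)
```

A generator whose loudness tracks impact energy scores close to 1 under this stand-in, while one that produces the same clang regardless of the motion scores near zero.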

Read the paper →

EW-DETR: Evolving World Object Detection via Incremental Low-Rank Detection Transformer

Authors: Munish Monga, Vishal Chudasama, Pankaj Wasnik (Sony Research India), C.V. Jawahar (IIIT Hyderabad)

Research paper: https://arxiv.org/abs/2602.20985

Most object detectors are built for a fixed world, but the real world doesn't stay fixed. The researchers frame the problem directly: "real-world deployment scenarios demand fundamentally different capabilities."

For example, an autonomous vehicle must continuously identify new object types (construction equipment, novel vehicle models), adapt from daytime to night to fog, and, critically, "recognize unseen objects as 'unknown' to avoid catastrophic failures." A warehouse robot must handle an evolving product inventory under varying lighting and seasonal conditions. In healthcare, privacy regulations restrict retaining patient data between tasks, which eliminates the replay-based strategies most existing detectors depend on. All three scenarios demand the same capabilities: incremental learning of new classes, adaptation to shifting visual domains, and calibrated detection of genuinely unknown objects, all without storing or revisiting any prior training data.

The failure mode of existing approaches is concrete. When deployed in evolving-world settings, detectors that lack these capabilities either misclassify unknown objects into known categories, producing overconfident, incorrect predictions, or absorb them into the background class, causing "missed detections of potentially critical novel objects." Neither failure is acceptable in safety-critical deployment.

EW-DETR formalizes this as Evolving World Object Detection (EWOD) and introduces a framework addressing all three dimensions simultaneously. Three modules augment a standard DETR-based detector.

Incremental LoRA Adapters use a dual-adapter design: one accumulates compressed knowledge from all prior tasks, a second captures the current task's updates. Their merging is governed by a data-aware coefficient; the core stability-plasticity trade-off the researchers identify here is that "an overly aggressive merge can quickly overwrite useful representations from past tasks," while an overly conservative one prevents adaptation to genuinely new domains. The coefficient is weighted by per-task sample ratios, giving under-represented domains stronger influence on the aggregate representation.
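The trade-off can be illustrated with a toy merge of two low-rank updates. This is a sketch under assumptions, not the paper's merging rule: the convex blend, the `alpha` argument, and the rank-1 toy matrices are hypothetical, and the data-aware computation of the coefficient is deliberately left out because its formula is not given here.

```python
def matmul(a, b):
    """Plain matrix product for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_delta(b_matrix, a_matrix):
    """Low-rank weight update: delta_W = B @ A."""
    return matmul(b_matrix, a_matrix)

def merge_deltas(previous, current, alpha):
    """Convex blend; alpha=0 keeps only past tasks, alpha=1 keeps only the new task."""
    return [[(1.0 - alpha) * p + alpha * c for p, c in zip(p_row, c_row)]
            for p_row, c_row in zip(previous, current)]

# Rank-1 adapters for a 2x2 weight matrix (toy values).
previous = lora_delta([[1.0], [0.0]], [[2.0, 0.0]])   # accumulated past-task update
current = lora_delta([[0.0], [1.0]], [[0.0, 4.0]])    # current-task update
merged = merge_deltas(previous, current, alpha=0.25)
```

Setting `alpha` close to 1 reproduces the "overly aggressive merge" that overwrites past tasks, and setting it close to 0 reproduces the overly conservative one that blocks adaptation.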

A Query-Norm Objectness Adapter then decouples semantic content from query magnitude in the transformer decoder, yielding class-agnostic representations that enable robust unknown detection under domain shifts without requiring auxiliary supervision or additional loss terms.

An Entropy-Aware Unknown Mixing module combines classification uncertainty with objectness evidence to identify high-objectness, high-uncertainty detections as unknowns rather than absorbing them into known categories or background.
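A minimal sketch of the intuition, assuming a simple product of objectness and normalized classification entropy. The `unknown_score` rule is hypothetical and is not the paper's formulation; it only shows why a detection needs both signals to be flagged.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

def unknown_score(class_logits, objectness):
    """Hypothetical mixing rule: high objectness and high uncertainty -> unknown."""
    return objectness * normalized_entropy(softmax(class_logits))

confident_known = unknown_score([8.0, 0.0, 0.0], objectness=0.9)
uncertain_object = unknown_score([1.0, 1.0, 1.0], objectness=0.9)
uncertain_background = unknown_score([1.0, 1.0, 1.0], objectness=0.05)
```

Only the second case, a clear object that the classifier cannot place, receives a high score. A confident known-class detection and an uncertain background patch both stay low.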

To evaluate what no existing single metric captures (retention, openness, and generalization together), the researchers introduce FOGS (Forgetting, Openness, Generalisation Score). On Pascal Series and Diverse Weather benchmarks, EW-DETR improves FOGS by 57.24% over prior methods and enables the state-of-the-art RF-DETR detector to operate effectively in evolving-world settings for the first time.

Read the paper →

C3G: Learning Compact 3D Representations with 2K Gaussians

Authors: Honggyu An, Jaewoo Jung, Mungyeom Kim, Chaehyun Kim, Minkyeong Jeon, Jisang Han, Kazumi Fukuda, Takuya Narihira, Hyunah Ko, Junsu Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim

Research paper: https://arxiv.org/abs/2512.04021

Project site: https://cvlab-kaist.github.io/C3G

When humans look at a room, we do not mentally reconstruct every surface pixel by pixel. We form a compact abstraction of the key objects, their rough spatial relationships, and the overall structure of the scene. Current feed-forward 3D Gaussian Splatting methods take the opposite approach: they predict one or more Gaussians per pixel, producing hundreds of thousands of primitives per scene. This creates excessive redundancy, causes multi-view misalignments, and makes it expensive to lift 2D semantic features into 3D for scene understanding.

C3G introduces a fundamentally different architecture. Rather than predicting Gaussians per pixel, it uses a small set of learnable query tokens (around 2,000) that attend across multi-view image features through a transformer. Each token learns to discover and represent a distinct region of the scene, producing a compact set of 3D Gaussians at only the essential spatial locations.

Caption: We propose a feed-forward framework for learning compact 3D representations from unposed images. Our approach estimates only 2K Gaussians that are allocated in meaningful regions to enable generalizable scene reconstruction and understanding.

Notably, this decomposition emerges purely from a photometric reconstruction objective: with no supervision on where Gaussians should be placed, the researchers find that "each token naturally learns to represent different regions," attending to spatially coherent areas across views and effectively discovering multi-view correspondences without explicit guidance.

The authors then exploit this emergent attention pattern for a second task. A companion feature decoder, C3G-F, reuses the learned attention weights from the Gaussian decoder to lift arbitrary 2D features (from encoders like DINOv2, DINOv3, or VGGT) into view-invariant 3D representations. Only the value projections need retraining.
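The reuse of attention weights can be pictured as follows. This is a simplified sketch, not the C3G-F implementation: the attention matrix is hand-written rather than learned, the value projection is omitted (treated as the identity), and the feature values are invented.

```python
def lift_features(attention, features_2d):
    """Weighted average of per-pixel features for each token.

    attention:   tokens x pixels, each row sums to 1 (reused, not retrained)
    features_2d: pixels x channels, from any 2D encoder
    """
    channels = len(features_2d[0])
    return [[sum(w * feat[c] for w, feat in zip(row, features_2d)) for c in range(channels)]
            for row in attention]

# Two tokens attending over four pixels (toy weights assumed for illustration).
attention = [
    [0.5, 0.5, 0.0, 0.0],   # token 0 covers the first two pixels
    [0.0, 0.0, 0.5, 0.5],   # token 1 covers the last two pixels
]

# Two different hypothetical 2D feature maps over the same four pixels.
color_features = [[1.0], [3.0], [10.0], [20.0]]
semantic_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

lifted_color = lift_features(attention, color_features)
lifted_semantic = lift_features(attention, semantic_features)
```

The same attention rows serve both feature maps, which is why a new 2D encoder can be attached without relearning where each token looks.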

On benchmarks for novel view synthesis, 3D open-vocabulary segmentation, and multi-view correspondence, C3G uses roughly 65× fewer Gaussians and 15× less memory than comparable methods while matching or exceeding their performance. The result suggests that a compact, geometrically meaningful representation is sufficient for both high-quality scene synthesis and understanding, without the redundancy current methods treat as necessary.

Read the paper →

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generation

Authors: Maitreya Patel (Arizona State University / Sony AI), Jingtao Li, Weiming Zhuang, Lingjuan Lyu (Sony AI), Yezhou Yang (Arizona State University)

Research paper: https://arxiv.org/abs/2604.24885
Project site: https://github.com/SonyResearch/VibeToken

Diffusion models dominate production image generation for a straightforward reason: they handle arbitrary resolutions and aspect ratios without modification. Autoregressive models, despite achieving competitive quality, do not. The reason traces back to the tokenizer. When a conventional tokenizer maps an image into discrete tokens, the token count grows with resolution, and autoregressive inference cost scales quadratically with sequence length. At 1024x1024, that means approximately 11 trillion FLOPs per forward pass for a standard model. The compute burden makes resolution-flexible AR generation practically infeasible.
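A back-of-the-envelope sketch shows how quickly the sequence length grows. The downsampling factor of 16 is an assumption for illustration, and the cost model counts only the quadratic attention term, so the ratios are indicative rather than a reproduction of the FLOP figures quoted here.

```python
def grid_token_count(height, width, patch=16):
    """Tokens produced by a 2D grid tokenizer with the given downsampling factor."""
    return (height // patch) * (width // patch)

def relative_attention_cost(num_tokens, reference_tokens):
    """Self-attention cost ratio under the quadratic-in-length assumption."""
    return (num_tokens / reference_tokens) ** 2

tokens_256 = grid_token_count(256, 256)      # 256
tokens_1024 = grid_token_count(1024, 1024)   # 4096
growth = relative_attention_cost(tokens_1024, tokens_256)  # 256.0

# With a fixed token budget the ratio stays at 1 regardless of resolution.
fixed_budget = relative_attention_cost(64, 64)  # 1.0
```

Quadrupling the side length multiplies the token count by 16 and the attention term by 256 under these assumptions, whereas a tokenizer with a fixed sequence length keeps that term flat.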

VibeToken addresses this at the source. It is a 1D Transformer-based image tokenizer designed to encode images of any resolution and aspect ratio into a short, user-controllable sequence of 32 to 256 tokens, independent of the input dimensions. Four architectural choices make this possible: dynamic grid positional embeddings that adapt to any input lattice without quality loss, an adaptive patch embedding that varies kernel size across resolutions, a decoder that targets any output resolution without a separate upsampler, and a training strategy that samples both resolution and token length jointly. Together, these allow the tokenizer to generalize to resolutions it never encountered during training, including 1024x1024 images, even though it was trained only at 512x512 and below.

Building on VibeToken, VibeToken-Gen is a class-conditioned autoregressive generator that inherits its tokenizer's resolution flexibility directly.

Because token length is fixed by the user rather than determined by image size, the inference compute stays constant at 179G FLOPs for any resolution. Against the diffusion-based state-of-the-art at 1024x1024, VibeToken-Gen achieves a better gFID of 3.94 versus 5.87, while generating images in 0.46 seconds compared to 1.08 seconds, and using 64 tokens versus 1,024. Relative to a fixed-resolution AR baseline at the same resolution, it is 63.4 times more efficient. As the researchers put it, the goal is for VibeToken to "help unlock the wide adoption of AR visual generative models in production use cases." The efficiency and flexibility results here go a long way toward closing that gap.

Read the paper →

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Authors: Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Research paper: https://arxiv.org/abs/2602.20981
Project site: https://echoesovertime.github.io

Video-to-audio generation holds "substantial promise for enhancing sound design workflows, particularly in domains such as film and gaming" — but existing models are built for short clips, typically 8 to 10 seconds, and fail when asked to generate audio for anything longer. The failure is architectural. Transformer-based V2A models rely on positional embeddings to establish temporal order in audio sequences, but those embeddings are fixed at training time. Apply them to a sequence longer than what the model trained on and performance degrades, sometimes sharply.

The researchers confirm this directly: on the UnAV100 benchmark, stretching a pretrained Transformer-based model across durations from 10 to 60 seconds caused 3 to 4 point drops in both distribution matching and multimodal alignment scores. Generating separate short clips for each segment is one workaround, but the paper notes that it "often results in fragmented audio experiences, marked by disjointed transitions, unaligned sound events, and degraded audio quality stemming from its limited grasp of long-form video context."

MMHNet takes a different approach. At its core is a Non-Causal Mamba-2 architecture, which processes sequences without positional embeddings. This removes the primary bottleneck for length generalization. Causal models, including standard Mamba, restrict information flow to one direction and are prone to modulation decay over long sequences, where conditioning signals weaken over time. Non-causal Mamba enables omnidirectional information flow and does not accumulate this decay, making it more stable for multimodal fusion across extended video inputs. Around this core, MMHNet adds a hierarchical routing framework that compresses redundant tokens before they reach the main network, using temporal routing to identify where sound events actually occur and multimodal routing to select tokens with the highest cross-modal alignment. The result is a system that processes long video efficiently without being retrained on long data.
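The token-compression side of this design can be pictured with a hard top-k selection. This is only a sketch: the routing in the paper is part of a learned hierarchical framework, whereas `route_tokens`, the frame labels, and the scores below are hypothetical.

```python
def route_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]

# Hypothetical per-frame tokens with a relevance score for each.
frames = ["silence", "door_slam", "silence", "footstep", "silence", "glass_break"]
sound_event_scores = [0.05, 0.90, 0.10, 0.70, 0.02, 0.95]

routed = route_tokens(frames, sound_event_scores, keep=3)
```

Half of the tokens are dropped before the main network sees them, and the ones that remain stay in temporal order, so the sequence the model processes grows more slowly than the video itself.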

On the UnAV100 and LongVale benchmarks, which test generation across durations from 10 seconds to over 7 minutes, MMHNet consistently outperforms prior methods including LoVA, V-AURA, and HunyuanVideo-Foley across distribution matching, audio quality, multimodal alignment, and temporal synchronization. The IB-score, which measures audio-visual alignment via cross-modal embeddings, exceeds HunyuanVideo-Foley by 3.9 points on UnAV100. On LongVale, where longer durations expose the hardest generalization failures, MMHNet's advantage on the DeSync score is 0.23 points over the second-best method. The underlying principle the work demonstrates is straightforward: training on short clips and testing on long ones is achievable in video-to-audio generation, without modification at inference, as long as the architecture is built to support it.

Read the paper →

UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Authors: Ziyao Wang (Sony AI / University of Maryland), Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li (University of Maryland), Lingjuan Lyu (Sony AI)

Research paper: https://arxiv.org/abs/2603.11320

Unified vision-language models handle both understanding and generation within a single autoregressive framework, encoding images into discrete tokens processed alongside text. The appeal is architectural simplicity and shared parameterization. The practical problem is token count. A standard tokenizer mapping a 512x512 image produces 1,024 visual tokens, and both understanding and generation must operate over that full length. The resulting memory footprint, training cost, and inference latency directly limit deployment in resource-constrained settings, and with it the prospect of unified models that are genuinely usable at scale.

The obvious fix is to compress the visual tokens. The obstacle is that compression treats understanding and generation differently. Understanding is relatively tolerant of token reduction; models can drop spatial redundancy and still recognize objects, answer questions, and produce captions. Generation is not. It depends on fine-grained, spatially consistent token sequences to reconstruct detail accurately. The researchers confirm this asymmetry explicitly: naive downsampling or uniform token pruning degrades generation performance by more than 15%, even when understanding holds up.

UniCompress resolves this with a plug-in compression framework that wraps an existing tokenizer without requiring the language model to be retrained. Three modules handle the work.

A global token extractor uses cross-attention to summarize scene-level semantics into a small set of learnable meta tokens, capturing layout and object relations that survive compression.
A pooling-based compressor reduces the spatial token grid by averaging within non-overlapping patches.
An autoregressive decompressor then reconstructs the full token sequence conditioned on both the compressed local tokens and the global meta tokens, which serve as semantic anchors for recovering fine-grained spatial detail.
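The pooling step is the simplest of the three to sketch. The example below is an illustration, not the paper's code: the 2x2 window, the two-dimensional toy embeddings, and the assumption that the grid divides evenly are all choices made here. The 32x32 grid matches the 1,024 tokens quoted above for a 512x512 image.

```python
def average_pool_tokens(grid, window=2):
    """Average token embeddings inside non-overlapping window x window patches.

    Assumes the grid height and width are divisible by `window`.
    """
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for r in range(0, rows, window):
        pooled_row = []
        for c in range(0, cols, window):
            patch = [grid[i][j] for i in range(r, r + window) for j in range(c, c + window)]
            pooled_row.append([sum(tok[d] for tok in patch) / len(patch) for d in range(dim)])
        pooled.append(pooled_row)
    return pooled

# Toy 32x32 grid whose embeddings simply store the (row, column) position.
grid = [[[float(r), float(c)] for c in range(32)] for r in range(32)]
compressed = average_pool_tokens(grid, window=2)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(compressed) * len(compressed[0])
```

Averaging discards the detail inside each patch, which is the information the autoregressive decompressor has to recover with help from the global meta tokens.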

The global tokens are what distinguish this approach from naive compression: without them, long-range spatial consistency breaks down; without the autoregressive decompressor, textures over-smooth and artifacts appear.

Applied across six representative unified model backbones including UniTok, VILA-U, VARGPT, and BAGEL, UniCompress reduces visual tokens by 4x while keeping understanding accuracy within 3 percentage points and generation FID within 5 points of uncompressed baselines. In wall-clock terms, generation inference time for UniTok drops from 32.25 minutes to 18.96 minutes, a reduction of roughly 41%. Training time falls by around 15% across configurations.

The framework integrates without modifying backbone architectures, which means it can be applied to existing systems without the expense of full model retraining, a practically meaningful property as unified models become more widely deployed.

Read the paper →

UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models

Authors: Yimu Wang (University of Waterloo), Weiming Zhuang, Chen Chen, Jiabo Huang, Jingtao Li, Lingjuan Lyu (Sony AI)

Research paper: https://arxiv.org/abs/2508.19498

The number of pretrained vision models publicly available on platforms like HuggingFace exceeded one million in 2024—a 32x increase in just two years. Each model encodes a distinct interpretation of the visual world, trained on different data, built with different architectures, optimized for different tasks. As the researchers put it, "their collective consensus is likely universal and generalizable to unseen data." The practical challenge is that no existing method can draw on all of it. Model merging requires identical architectures. Mixture-of-experts demands every model remain loaded in memory. Standard knowledge distillation assumes teachers and students share the same label space, a constraint that disqualifies most of what's available.

UNIFORM introduces a knowledge transfer framework with none of those constraints. It distinguishes between two types of teachers, a division that reflects a real difference in what publicly available models can offer. Predictive teachers share the target label space: they can look at an unlabeled image and predict a class the student model is trying to learn. Descriptive teachers were trained on entirely different categories and cannot name the target classes, but they learned rich visual feature representations in the process. Those representations carry useful knowledge about structure and appearance even without a label attached. Both types contribute to training a single student model through a pair of dedicated voting mechanisms that filter conflicting signals before they reach the student.

The core problem those mechanisms address is sign conflict. When features from different models occupy different latent spaces, a simple average can cancel them out. As the researchers explain, "the agreed features can be all 0 in an extreme case when all the teachers' features offset with each other." This isn't a corner case: a model trained on medical scans and one trained on bird species will represent the same image in fundamentally different ways. Simply averaging their outputs collapses useful information rather than combining it.

UNIFORM resolves this by mapping all teacher features into a shared space, then voting element-wise: features that disagree with the consensus direction are filtered before aggregation. Predictive teachers face a parallel problem at the prediction level: a CNN and a Vision Transformer trained on the same data can make different class predictions because they process images differently, with CNNs attending to local texture and transformers to global structure. Instead of treating all predictions equally and averaging them, a logit-level voting mechanism highlights the pseudo-class with the strongest teacher consensus, preventing inconsistent teacher predictions from confusing the student.
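The feature-level case can be sketched with three toy teachers. This is an illustration under assumptions rather than the paper's algorithm: the majority is decided by a simple count of signs per dimension, the features are assumed to be already projected into a shared space, and the values are invented to reproduce the cancellation described above.

```python
def plain_average(features):
    """Element-wise mean across teachers."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]

def sign_vote_average(features):
    """Per dimension, average only the values whose sign matches the majority sign."""
    merged = []
    for column in zip(*features):
        positives = [v for v in column if v > 0]
        negatives = [v for v in column if v < 0]
        winners = positives if len(positives) >= len(negatives) else negatives
        merged.append(sum(winners) / len(winners) if winners else 0.0)
    return merged

# Three hypothetical teachers, already projected into a shared 2-D space.
teacher_features = [
    [1.0, -2.0],
    [1.0, -2.0],
    [-2.0, 4.0],
]

averaged = plain_average(teacher_features)
voted = sign_vote_average(teacher_features)
```

The plain average collapses to zero in both dimensions because the third teacher offsets the other two, while the voted result keeps the direction that two of the three teachers agree on.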

Validated across 11 object recognition benchmarks with 104 public teacher models, UNIFORM consistently outperforms knowledge distillation baselines. Critically, it "exhibits remarkable scalability by benefiting from over one hundred teachers, while existing methods saturate at a much smaller scale," a finding that points toward a broader principle: the more heterogeneous the teacher pool, the more the voting mechanism matters.

Read the paper →

In Closing…

The questions these papers take on are not new:

How do we build generative models that are efficient enough to deploy, faithful enough to trust, and grounded enough in the physical world—and in human identity—to be useful?

How do we represent complex scenes compactly without losing what makes them meaningful?

How do we train perception systems that hold up when conditions change, classes evolve, and the unknown appears without warning?

How do we harness the collective knowledge already encoded in publicly available pretrained models, rather than starting from scratch?

What is changing is the sophistication of the answers.

Each of these papers advances a specific technical frontier. The combined picture is one of a field working through the constraints that matter most for real-world deployment. These include compute, data efficiency, physical plausibility, and robustness. Progress on these fronts is incremental by nature. The results here suggest the increments are meaningful.

Sony AI's researchers and collaborators will be presenting this work at CVPR 2026 at the Colorado Convention Center in Denver, from June 3–7, 2026. Each paper is linked above. We welcome conversations at the conference.

To dive into our CVPR work from prior years, visit: