
Sparo: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Ankit Vani*, Bac Nguyen, Samuel Lavoie*, Ranjay Krishna*, Aaron Courville*

* External authors

ECCV, 2024

Abstract

Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose Sparo, a read-out mechanism that partitions encodings into separately attended slots, each produced by a single attention head. Using Sparo with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using Sparo, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual Sparo concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of Sparo’s representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
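The read-out described in the abstract has a compact form: one learned query per slot, where each slot is produced by its own single attention head attending over the backbone's token encodings. The PyTorch sketch below is an illustrative reconstruction from that description alone, not the paper's reference implementation; the class and argument names (`SparoReadout`, `num_slots`, `slot_dim`) are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


class SparoReadout(nn.Module):
    """Sketch of a Sparo-style read-out: the encoding is partitioned into
    `num_slots` slots, each produced by a single attention head that
    attends over the transformer backbone's token representations."""

    def __init__(self, dim: int, num_slots: int, slot_dim: int):
        super().__init__()
        self.num_slots, self.slot_dim = num_slots, slot_dim
        # One learned query vector per slot (one attention head per slot).
        self.query = nn.Parameter(torch.randn(num_slots, dim) / dim ** 0.5)
        # Per-slot key/value projections, fused into single linear maps.
        self.to_k = nn.Linear(dim, num_slots * dim, bias=False)
        self.to_v = nn.Linear(dim, num_slots * slot_dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) token encodings from the backbone.
        b, t, d = tokens.shape
        k = self.to_k(tokens).view(b, t, self.num_slots, d)
        v = self.to_v(tokens).view(b, t, self.num_slots, self.slot_dim)
        # Per-slot attention logits over the token sequence: (batch, slots, seq).
        logits = torch.einsum('sd,btsd->bst', self.query, k) / d ** 0.5
        attn = logits.softmax(dim=-1)
        # Each slot is the attention-weighted sum of its own values.
        return torch.einsum('bst,btsv->bsv', attn, v)  # (batch, slots, slot_dim)


# Example: read out 8 slots of width 64 from ViT-style tokens.
readout = SparoReadout(dim=768, num_slots=8, slot_dim=64)
tokens = torch.randn(2, 197, 768)      # (batch, seq, dim)
encoding = readout(tokens).flatten(1)  # (2, 8 * 64) concatenated slots
```

Concatenating the slots yields the final encoding, and because each slot is a separately attended unit, individual slots can be masked or selected, which is the kind of intervention on learned concepts the abstract describes.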

Related Publications

SONA: Learning Conditional, Unconditional, and Matching-Aware Discriminator

ICLR, 2026
Yuhta Takida, Satoshi Hayakawa, Takashi Shibuya, Masaaki Imaizumi*, Naoki Murata, Bac Nguyen, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuki Mitsufuji

Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and c…

Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment

ICLR, 2026
Bac Nguyen, Yuhta Takida, Naoki Murata, Chieh-Hsin Lai, Toshimitsu Uesaka, Stefano Ermon*, Yuki Mitsufuji

Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), …

G2D2: Gradient-Guided Discrete Diffusion for Image Inverse Problem Solving

TMLR, 2025
Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon*, Yuki Mitsufuji

Recent literature has effectively leveraged diffusion models trained on continuous variables as priors for solving inverse problems. Notably, discrete diffusion models with discrete latent codes have shown strong performance, particularly in modalities suited for discrete co…

