Introducing FHIBE: A Consent-Driven Benchmark for AI Fairness Evaluation

How Sony AI’s Fair Human-Centric Image Benchmark sets a new standard for ethical, inclusive AI evaluation

AI Ethics

November 5, 2025

Why Fairness Needs a Better Approach

Building AI that works fairly across people, places, and contexts requires data that represents real human diversity: not snapshots scraped indiscriminately from the internet, but images collected with consent, fair pay, and dignity.

Bias in AI doesn’t just emerge randomly from the algorithms themselves.

It starts with the data used to build them.

From its inception, computer vision has relied on massive image sets collected indiscriminately or “scraped” from the internet, most often without consent, compensation, or consideration for the individuals represented.

If the goal is to ensure AI can “see” everyone across geography, ancestry, age, and gender, then fairness must begin with the data used to train and evaluate these systems.

Today’s AI systems are often trained and evaluated using non-consensual, demographically skewed datasets that risk privacy violations and reinforce bias. This issue is central: developers have relied on non-consensual datasets in the absence of ethical alternatives. These problems are especially acute in human-centric computer vision (HCCV), where models must make sense of faces, bodies, and expressions across diverse contexts.

“Even the basic first step of checking for bias is difficult because of the lack of publicly available, ethically sourced datasets for most computer vision tasks,” says Alice Xiang, Sony Group’s Global Head of AI Governance and lead researcher on FHIBE. “We wanted to enable developers to check for bias in AI without having to resort to problematic datasets.”

In practice, this absence of good options has sometimes pushed developers toward a kind of ethics nihilism: deciding not to check for bias at all because the tools to do so responsibly simply don't exist.

FHIBE—the Fair Human-Centric Image Benchmark—was created to answer that challenge.

The research behind FHIBE has just been published in Nature. With more than 10,000 images of nearly 2,000 participants from over 80 countries and areas, it provides developers and researchers with a powerful way to uncover bias before deployment across face detection, pose estimation, and vision-language tasks.

Beyond “What’s Fair” — Grappling With the How

AI ethics is rarely black-and-white. “People often think, ‘Just make it ethical, do what’s right,’” Xiang notes. “But ethics in AI is about balancing competing priorities: privacy, utility, diversity, and feasibility.”

Prior to joining Sony, Xiang had observed a growing problem across the field: even teams eager to evaluate fairness were constrained by the lack of datasets created with consent and diversity in mind. When she arrived at Sony AI, she saw an opportunity to do something about it. Across the industry, many groups were forced to rely on existing datasets that fell short on both representation and consent. Sony AI recognized the risks of this status quo and decided to invest in a new, more responsible path.

“We quickly learned that creating a dataset that was diverse enough to be meaningful, consent-driven, and global in scope was far harder than expected,” Xiang says. “Our early goal was modest—just a small dataset slightly better than the ones that were available. But as we worked with business units and peers across the industry, it became clear that no one had solved this problem. FHIBE had to grow into something much bigger.”

Building FHIBE: Harder Than Anyone Expected

Behind FHIBE lies years of relentless problem-solving.

The team developed clear requirements for what a responsibly created dataset should include—consent, fair pay, diversity, and rigorous annotation—and when they found nothing close, they had to build it themselves.

This meant translating theory into operations: deciding what data to collect, balancing richer annotations (better utility) against privacy considerations through careful approaches to anonymization, and navigating standards on consent and compensation.

Quality control was another challenge. The team built custom infrastructure to verify images, went back and forth with vendors, and eventually hired Quality Assurance (QA) specialists to inspect every batch, even after vendor QA. “We didn’t think we were asking for the moon,” Xiang recalls, “but what we wanted turned out to be far above the industry standard.”

They also introduced revocable consent from the start, designing systems so participants could withdraw their images at any time. “We knew FHIBE needed to be a living dataset,” Xiang says. “Diversity and stability matter, but so does every participant’s right to maintain control over their data.”

Even privacy protections via inpainting (for example, removing bystanders or license plates) were carefully reviewed. “We considered blurring, but chose inpainting to maximize privacy protection and preserve image context,” Xiang explains.

What Makes FHIBE Different

FHIBE isn’t just ethically sourced: it’s methodologically rigorous.

• Demographic + Phenotypic Detail: Participants self-reported attributes such as pronouns, ancestry, age group, hairstyle, makeup, and headwear.
• Environmental Context: Images include metadata on lighting, weather, and scene type, which are vital for testing real-world model performance.
• Precision Annotations: Bounding boxes, keypoints, and segmentation masks enable detailed benchmarking across multiple computer vision tasks (a rough usage sketch follows this list).
• Evaluation-Only Design: FHIBE is a bias auditing dataset. This ensures it is used to measure fairness, not reinforce bias.
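
To make the value of these annotations concrete, here is a minimal sketch of disaggregated evaluation: computing a face-detection rate per self-reported subgroup. The record fields (pronouns, ancestry, age group) mirror attributes named above, but the schema, sample data, and helper names are illustrative assumptions, not FHIBE's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Sample:
    # Hypothetical per-person record; field names are assumptions for this sketch.
    image_id: str
    pronouns: str    # self-reported, e.g. "she/her"
    ancestry: str    # self-reported
    age_group: str   # e.g. "30-39"
    detected: bool   # whether the face detector under test found this person

def detection_rate_by_group(samples: list[Sample], key: str) -> dict[str, float]:
    """Fraction of people detected, broken out by one self-reported attribute."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for s in samples:
        group = getattr(s, key)
        totals[group] += 1
        hits[group] += int(s.detected)
    return {group: hits[group] / totals[group] for group in totals}

# Example: compare subgroups and surface the largest gap (made-up records).
samples = [
    Sample("img_001", "she/her", "East Asian", "30-39", True),
    Sample("img_002", "he/him", "Sub-Saharan African", "18-29", False),
]
rates = detection_rate_by_group(samples, key="ancestry")
gap = max(rates.values()) - min(rates.values())
```

The same pattern extends to keypoint error for pose estimation or mask overlap for segmentation; the point is that self-reported attributes make the breakdown possible at all.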

This rigor revealed problems that would otherwise go unseen: models that misgendered based on hairstyle norms, linked African ancestry with rural environments, and even generated toxic language for certain gendered prompts.

“Without a benchmark like FHIBE, you simply wouldn’t know these problems exist,” Xiang notes.

A Tool for Developers—Not Just Policymakers

FHIBE is not merely a checkbox: it’s a tool for developers and policymakers alike.

“Too often, fairness benchmarks are framed as regulatory exercises,” Xiang says. “We built FHIBE so model developers can use it upstream—before deployment—to take responsibility for fairness. It’s also great if policymakers use FHIBE to audit AI models, too.”

The dataset supports:

• Bias diagnosis for task-specific models (pose estimation, face detection)
• Intersectional audits of foundation models (e.g., CLIP, BLIP-2), as sketched below
• Comparative analysis across demographic and environmental variables
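
As one sketch of what an intersectional audit of a foundation model could look like, the snippet below runs zero-shot classification with CLIP (via Hugging Face transformers) and tallies its top prediction per pronoun-by-ancestry intersection. The prompt set, record fields, and aggregation are illustrative assumptions, not FHIBE's published evaluation protocol.

```python
from collections import Counter, defaultdict
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative prompt set; a real audit would choose prompts tied to the harms under study.
PROMPTS = ["a photo of a doctor", "a photo of a nurse", "a photo of a teacher"]

def top_prompt(image_path: str) -> str:
    """Return the prompt CLIP ranks highest for a single image."""
    image = Image.open(image_path)
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(PROMPTS))
    return PROMPTS[logits.argmax().item()]

def intersectional_counts(records: list[dict]) -> dict:
    """Tally CLIP's top prompt per (pronouns, ancestry) intersection.

    `records` holds dicts with keys "path", "pronouns", and "ancestry";
    the field names are assumptions for this sketch.
    """
    counts: dict[tuple, Counter] = defaultdict(Counter)
    for r in records:
        counts[(r["pronouns"], r["ancestry"])][top_prompt(r["path"])] += 1
    return counts
```

Comparing how prompt associations shift across intersections is what surfaces patterns like the ancestry-environment and gendered-prompt findings described above.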

The goal: to make bias checking part of the development pipeline, not an afterthought.
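
One way to operationalize that is a simple gate in the evaluation pipeline that fails when the gap between the best- and worst-performing subgroup exceeds a tolerance. The metric, threshold, and numbers below are hypothetical; each team would set its own.

```python
def assert_fairness_gate(rates_by_group: dict[str, float], max_gap: float = 0.05) -> None:
    """Fail the run if the best-vs-worst subgroup gap exceeds `max_gap`."""
    gap = max(rates_by_group.values()) - min(rates_by_group.values())
    if gap > max_gap:
        raise AssertionError(
            f"Subgroup gap {gap:.3f} exceeds tolerance {max_gap:.3f}: {rates_by_group}"
        )

# Example with made-up per-ancestry detection rates from an evaluation run.
assert_fairness_gate({"Group A": 0.97, "Group B": 0.95, "Group C": 0.94}, max_gap=0.05)
```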

Setting a New Standard for AI’s Supply Chain

FHIBE is proof that global, consent-driven data collection is possible, even if difficult.

“Ethical data collection is really hard, but really important,” Xiang emphasizes. “Our hope is that FHIBE raises the bar for everyone—showing that high standards are possible, and giving teams fewer excuses not to do fairness evaluation.”

Looking Ahead

FHIBE is just the beginning. Xiang sees FHIBE as a pivotal moment (like ImageNet was for deep learning) that can spark an ethical AI revolution. By proving it can be done, FHIBE may reduce the friction for others who take on similar efforts in the future.

For Sony AI, FHIBE is a call to action:

• For developers, it’s a blueprint for building responsible AI.
• For researchers, it’s a precision tool for bias diagnosis.
• For the field at large, it’s a reminder that the data we use shapes the outcomes we get.

Explore the research, download FHIBE, and watch A Fair Reflection, the short film documenting its creation and the ethical challenges it seeks to address, at FairnessBenchmark.ai.sony.
