What If Fairness Started at the Dataset Level?
AI Ethics
November 4, 2025
At Sony AI, we believe ethical AI starts with the inputs. And that means reexamining how datasets are collected and shared. Our research has consistently shown that fairness and representation cannot be afterthoughts; they are foundational to trustworthy AI.
As AI systems are increasingly deployed in all areas of public and private life, the stakes of ignoring dataset-level issues are only rising.
So, we ask: what if fairness started at the dataset?
Laying the Groundwork for Something New
At Sony AI, our AI Ethics team has spent years orienting our research to challenge the status quo of dataset construction.
Our work has shown how existing human-centric computer vision datasets often rely on non-consensual web scraping, lack critical demographic metadata, and perpetuate representational gaps—leading to models that can be both unfair and unreliable (Andrews et al., Ethical Considerations for Responsible Data Curation, NeurIPS 2023).
We’ve also highlighted a deeper paradox: protecting privacy by limiting data collection can sometimes leave marginalized groups “unseen,” yet this very absence increases the risk of being “mis-seen” by AI systems—misclassified, misrecognized, or misrepresented (Xiang, Being Seen Versus Mis-Seen, Harvard JOLT 2022). The harms of invisibility and mis-visibility are inseparable, and solving them requires a shift in how datasets are built.
And we have argued that fairness cannot be treated as an afterthought. From our ethical audits to our analysis of how demographic data shapes bias detection, our research has made clear that datasets must be designed with purpose, consent, and diversity from the outset (Xiang et al., Mirror, Mirror: Reflections on Dataset Bias and Fairness, 2021).
Together, these works revealed not just the importance of ethical data collection, but also the difficulty of reconciling best practices, conflicting priorities, and technical specifications of fairness. The literature has offered many guidelines, but these guidelines often clash with one another, and translating them into technical requirements is far from straightforward. This tension set the stage for our next set of explorations.
When Models Learn the Wrong Thing
To address these challenges, we examined practical strategies for mitigating bias. One such approach is Targeted Augmentations for Bias Mitigation (TAB), a method developed as part of our ongoing research into shortcut learning and fairness.
As Sony AI researcher Jerone Andrews explains, “A model might detect kitchen utensils in an image and wrongly infer the presence of a woman, even if no woman is actually in the image” (Mitigating Bias in AI Models, 2024). These kinds of spurious correlations often stem from datasets with imbalanced or stereotyped representations, leading to biased model behavior.
TAB examines what a model has learned so far and pinpoints two kinds of training examples: those that reinforce its existing biases and those that challenge them. By giving the model more of the challenging cases, much like harder questions on a test, it learns in a more balanced way.
For instance, if a model consistently associates kitchen utensils with women, as in the example above, TAB will deliberately feed it more examples of men cooking. This forces the model to learn from tougher, more informative cases instead of simply repeating old assumptions.
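To make the pattern concrete, here is a minimal, hypothetical sketch in Python. It is not the TAB implementation; it only illustrates the general idea described above: treat examples that a reference model fits poorly as bias-conflicting, and sample them more often. The function name, the quantile threshold, and the loss-based split are all assumptions for illustration.

```python
import numpy as np

def challenging_example_weights(per_example_losses, quantile=0.8):
    """Build sampling weights that favor bias-conflicting examples.

    Assumption (not from the TAB paper): examples a reference model
    fits poorly (high loss) are more likely to conflict with its
    learned shortcuts, while low-loss examples tend to reinforce them.
    """
    losses = np.asarray(per_example_losses, dtype=float)
    threshold = np.quantile(losses, quantile)
    conflicting = losses >= threshold

    # Give each group equal total probability mass, so the smaller
    # "challenging" group is sampled as often as the larger one.
    n_conflicting = max(int(conflicting.sum()), 1)
    n_aligned = max(int((~conflicting).sum()), 1)
    weights = np.where(conflicting, n_aligned, n_conflicting).astype(float)
    return weights / weights.sum()

# Toy usage: oversample challenging examples when drawing a batch.
rng = np.random.default_rng(0)
losses = rng.random(10)  # stand-in for real per-example reference losses
weights = challenging_example_weights(losses)
batch_indices = rng.choice(len(losses), size=8, p=weights)
```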
While TAB is not a silver bullet, it represents one practical tool within a broader portfolio of research Sony AI is pursuing to make AI systems more equitable and reliable.
Redefining Diversity
In parallel, we investigated how diversity in datasets can be meaningfully defined, measured, and evaluated.
In the ICML 2024 paper Measure Dataset Diversity, Don’t Just Claim It, Sony AI researchers argue that defining and validating diversity must go beyond intuition and surface-level representation.
“Achieving this requires robust, standardized methods of conceptualizing and operationalizing diversity,” write Zhao et al. (2024), calling for stronger scientific foundations behind claims of inclusiveness.
The paper outlines a three-part process rooted in measurement theory:
- Conceptualization: Translate values like diversity into specific, contextual definitions.
- Operationalization: Develop measurable indicators aligned with those definitions.
- Evaluation: Test whether a dataset consistently reflects the intended diversity goals.
This structure allows creators and reviewers alike to assess whether datasets are genuinely representative—or just claiming to be.
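As a simplified illustration of what operationalization and evaluation can look like, the sketch below computes normalized Shannon entropy over a single annotated attribute and checks it against a target. Both the indicator and the 0.9 threshold are assumptions chosen for this example, not metrics prescribed by the paper.

```python
import math
from collections import Counter

def normalized_entropy(values):
    """Normalized Shannon entropy of a categorical attribute.

    Returns 1.0 when all categories are equally represented and
    approaches 0.0 as a single category dominates.
    """
    counts = Counter(values)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts)) if len(counts) > 1 else 0.0

# Evaluation step: does the dataset meet a stated diversity target?
ages = ["18-29", "30-44", "18-29", "45-64", "30-44", "18-29", "65+"]
score = normalized_entropy(ages)
meets_target = score >= 0.9  # threshold chosen for illustration only
print(f"normalized entropy = {score:.2f}, meets target: {meets_target}")
```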
As Sony AI researchers note, the process of data creation is deeply social—not just technical. Following critiques such as Paullada et al. (2021), we aim to “focus on the social and epistemic processes involved in dataset creation, rather than treating datasets as neutral raw material.”
Fairness as a Lifecycle
Finally, we turned to the human processes behind dataset creation, analyzing the labor, documentation, and transparency practices that make fairness a lived reality.
In our NeurIPS 2024 paper, A Taxonomy of Challenges to Curating Fair Datasets, we studied the real-world experiences of curators working with vision, language, and multimodal datasets. What emerged was a practical framework for embedding fairness across the dataset lifecycle.
As the authors explain, “Fairness is not only a property of the final artifact—the dataset—but also a constant consideration curators must account for throughout the curation process.”
This includes:
- Composition: Are all relevant communities represented?
- Process: Are annotators compensated fairly, and is their labor respected?
- Release: Is the dataset shared with sufficient context and documentation to prevent misuse?
Each phase brings its own set of trade-offs. But acknowledging these limitations, and documenting them transparently, is critical to building trust.
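To suggest how these phases might be tracked in practice, here is a hypothetical sketch of a lightweight dataset-card check in Python. The field names and the completeness rule are illustrative assumptions inspired by the taxonomy above, not an artifact from the paper.

```python
from dataclasses import dataclass, fields

@dataclass
class DatasetCard:
    # Composition: who is represented, and how was that assessed?
    represented_groups: str = ""
    known_representation_gaps: str = ""
    # Process: how was annotation labor organized and compensated?
    annotator_compensation: str = ""
    annotation_guidelines_url: str = ""
    # Release: what context ships with the data to prevent misuse?
    intended_uses: str = ""
    out_of_scope_uses: str = ""
    license: str = ""

def missing_fields(card: DatasetCard) -> list[str]:
    """Return the names of any fields left undocumented."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]

card = DatasetCard(intended_uses="Benchmarking fairness metrics",
                   license="CC BY-NC 4.0")
print("Still undocumented:", missing_fields(card))
```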
Coming Soon
Together, these threads—groundwork on ethical dataset design, practical bias mitigation, diversity measurement, and lifecycle practices—have brought us to the next step.
Coming soon, we’ll be introducing a new kind of dataset, one shaped by these very principles. Stay tuned.