Measure dataset diversity, don’t just claim it

VIEW PUBLICATION

Dora Zhao*

Jerone T. A. Andrews

Orestis Papakyriakopoulos*

Alice Xiang

* External authors

ICML-24

2024

Abstract

Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract and disputed social constructs. Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets. Despite their prevalence, these terms lack clear definitions and validation. Our research explores the implications of this issue by analyzing "diversity" across 135 image and text datasets. Drawing from social sciences, we apply principles from measurement theory to identify considerations and offer recommendations for conceptualizing, operationalizing, and evaluating diversity in datasets. Our findings have broader implications for ML research, advocating for a more nuanced and precise approach to handling value-laden properties in dataset construction.

Related Publications

A Taxonomy of Challenges to Curating Fair Datasets

NeurIPS, 2024
Dora Zhao*, Morgan Klaus Scheuerman, Pooja Chitre*, Jerone Andrews, Georgia Panagiotidou*, Shawn Walker*, Kathleen H. Pine*, Alice Xiang

Despite extensive efforts to create fairer machine learning (ML) datasets, there remains a limited understanding of the practical aspects of dataset curation. Drawing from interviews with 30 ML dataset curators, we present a comprehensive taxonomy of the challenges and trade…

Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes

EMNLP, 2024
Yusuke Hirota, Jerone Andrews, Dora Zhao*, Orestis Papakyriakopoulos*, Apostolos Modas, Yuta Nakashima*, Alice Xiang

We tackle societal bias in image-text datasets by removing spurious correlations between protected groups and image attributes. Traditional methods only target labeled attributes, ignoring biases from unlabeled ones. Using text-guided inpainting models, our approach ensures …

Efficient Bias Mitigation Without Privileged Information

ECCV, 2024
Mateo Espinosa Zarlenga*, Swami Sankaranarayanan, Jerone Andrews, Zohreh Shams, Mateja Jamnik*, Alice Xiang

Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups, particularly when group and task labels are spuriously correlated (e.g., “grassy background” and “cows”). Existing bias mitigation methods that aim t…

SEE ALL

HOME
Publications
Measure dataset diversity, don’t just claim it

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.

LEARN MORE