Considerations for Ethical Speech Recognition Datasets

Orestis Papakyriakopoulos*

Alice Xiang

* External authors

WSDM 2023

2023

Abstract

Speech AI Technologies are largely trained on publicly available datasets or by the massive web-crawling of speech. In both cases, data acquisition focuses on minimizing collection effort, without necessarily taking the data subjects’ protection or user needs into consideration. This results to models that are not robust when used on users who deviate from the dominant demographics in the train- ing set, discriminating individuals having different dialects, accents, speaking styles, and disfluencies. In this talk, we use automatic speech recognition as a case study and examine the properties that ethical speech datasets should possess towards responsible AI ap- plications. We showcase diversity issues, inclusion practices, and necessary considerations that can improve trained models, while facilitating model explainability and protecting users and data sub- jects. We argue for the legal & privacy protection of data subjects, targeted data sampling corresponding to user demographics & needs, appropriate meta data that ensure explainability & account- ability in cases of model failure, and the sociotechnical & situated model design. We hope this talk can inspire researchers & practi- tioners to design and use more human-centric datasets in speech technologies and other domains, in ways that empower and respect users, while improving machine learning models’ robustness and utility.

Related Publications

A Taxonomy of Challenges to Curating Fair Datasets

NeurIPS, 2024
Dora Zhao*, Morgan Klaus Scheuerman, Pooja Chitre*, Jerone Andrews, Georgia Panagiotidou*, Shawn Walker*, Kathleen H. Pine*, Alice Xiang

Despite extensive efforts to create fairer machine learning (ML) datasets, there remains a limited understanding of the practical aspects of dataset curation. Drawing from interviews with 30 ML dataset curators, we present a comprehensive taxonomy of the challenges and trade…

Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes

EMNLP, 2024
Yusuke Hirota, Jerone Andrews, Dora Zhao*, Orestis Papakyriakopoulos*, Apostolos Modas, Yuta Nakashima*, Alice Xiang

We tackle societal bias in image-text datasets by removing spurious correlations between protected groups and image attributes. Traditional methods only target labeled attributes, ignoring biases from unlabeled ones. Using text-guided inpainting models, our approach ensures …

Efficient Bias Mitigation Without Privileged Information

ECCV, 2024
Mateo Espinosa Zarlenga*, Swami Sankaranarayanan, Jerone Andrews, Zohreh Shams, Mateja Jamnik*, Alice Xiang

Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups, particularly when group and task labels are spuriously correlated (e.g., “grassy background” and “cows”). Existing bias mitigation methods that aim t…

SEE ALL

HOME
Publications
Considerations for Ethical Speech Recognition Datasets

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.

LEARN MORE