Considerations for Ethical Speech Recognition Datasets

Orestis Papakyriakopoulos

Alice Xiang

WSDM 2023



Speech AI technologies are largely trained on publicly available datasets or by large-scale web crawling of speech. In both cases, data acquisition focuses on minimizing collection effort, without necessarily taking the data subjects' protection or user needs into consideration. This results in models that are not robust when used on users who deviate from the dominant demographics in the training set, discriminating against individuals with different dialects, accents, speaking styles, and disfluencies. In this talk, we use automatic speech recognition as a case study and examine the properties that ethical speech datasets should possess to support responsible AI applications. We showcase diversity issues, inclusion practices, and necessary considerations that can improve trained models, while facilitating model explainability and protecting users and data subjects. We argue for the legal & privacy protection of data subjects, targeted data sampling corresponding to user demographics & needs, appropriate metadata that ensure explainability & accountability in cases of model failure, and sociotechnical & situated model design. We hope this talk can inspire researchers & practitioners to design and use more human-centric datasets in speech technologies and other domains, in ways that empower and respect users, while improving machine learning models' robustness and utility.

Related Publications

Upvotes? Downvotes? No Votes? Understanding the relationship between reaction mechanisms and political discourse on Reddit

CHI, 2023
Orestis Papakyriakopoulos, Severin Engelmann*, Amy Winecoff*

A significant share of political discourse occurs online on social media platforms. Policymakers and researchers try to understand the role of social media design in shaping the quality of political discourse around the globe. In the past decades, scholarship on political di…

Causality for Temporal Unfairness Evaluation and Mitigation

NeurIPS, 2022
Aida Rahmattalabi, Alice Xiang

Recent interest in causality for fair decision-making systems has been accompanied by great skepticism due to practical and epistemological challenges in applying existing causal fairness approaches. Existing works mainly seek to remove the causal effect of social categ…

Men Also Do Laundry: Multi-Attribute Bias Amplification

NeurIPS, 2022
Dora Zhao, Jerone T. A. Andrews, Alice Xiang

As computer vision systems become more widely deployed, there is increasing concern from both the research community and the public that these systems are not only reproducing but amplifying harmful social biases. The phenomenon of bias amplification, which is the focus of t…

