Authors

* External authors

Venue

Date

Share

Considerations for Ethical Speech Recognition Datasets

Orestis Papakyriakopoulos*

Alice Xiang

* External authors

WSDM 2023

2023

Abstract

Speech AI Technologies are largely trained on publicly available datasets or by the massive web-crawling of speech. In both cases, data acquisition focuses on minimizing collection effort, without necessarily taking the data subjects’ protection or user needs into consideration. This results to models that are not robust when used on users who deviate from the dominant demographics in the train- ing set, discriminating individuals having different dialects, accents, speaking styles, and disfluencies. In this talk, we use automatic speech recognition as a case study and examine the properties that ethical speech datasets should possess towards responsible AI ap- plications. We showcase diversity issues, inclusion practices, and necessary considerations that can improve trained models, while facilitating model explainability and protecting users and data sub- jects. We argue for the legal & privacy protection of data subjects, targeted data sampling corresponding to user demographics & needs, appropriate meta data that ensure explainability & account- ability in cases of model failure, and the sociotechnical & situated model design. We hope this talk can inspire researchers & practi- tioners to design and use more human-centric datasets in speech technologies and other domains, in ways that empower and respect users, while improving machine learning models’ robustness and utility.

Related Publications

Measure dataset diversity, don’t just claim it

ICML, 2024
Dora Zhao*, Jerone T. A. Andrews, Orestis Papakyriakopoulos*, Alice Xiang

Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract and disputed social constructs. Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets. Despite their prevalence, these ter…

Not My Voice! A Taxonomy of Ethical and Safety Harms of Speech Generators

FaccT, 2024
Wiebke Hutiri*, Orestis Papakyriakopoulos*, Alice Xiang

The rapid and wide-scale adoption of AI to generate human speech poses a range of significant ethical and safety risks to society that need to be addressed. For example, a growing number of speech generation incidents are associated with swatting attacks in the United States…

Ethical Considerations for Responsible Data Curation

NeurIPS, 2023
Jerone Andrews, Dora Zhao*, William Thong, Apostolos Modas, Orestis Papakyriakopoulos*, Alice Xiang

Human-centric computer vision (HCCV) data curation practices often neglect privacy and bias concerns, leading to dataset retractions and unfair models. HCCV datasets constructed through nonconsensual web scraping lack crucial metadata for comprehensive fairness and robustnes…

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.