Navigating Responsible Data Curation Takes the Spotlight at NeurIPS 2023


January 18, 2024

The field of Human-Centric Computer Vision (HCCV) is rapidly progressing, and some researchers are raising a red flag on the current ethics of data curation. A primary concern is that today’s practices in HCCV data curation – which prioritize dataset size and utility – have sidelined critical issues related to privacy and bias. This emphasis has resulted not only in the retraction of well-known datasets but also in unfair models. In particular, datasets obtained through nonconsensual web scraping lack the metadata essential for conducting comprehensive fairness and robustness assessments.

The paper, “Ethical Considerations for Responsible Data Curation,” led by Jerone T.A. Andrews, Sony AI Research Scientist, and Alice Xiang, Sony AI Lead Research Scientist and Global Head of AI Ethics at Sony Group Corporation – with contributions from the AI Ethics team at Sony AI – aims to address these issues, offering proactive, domain-specific recommendations for curating HCCV evaluation datasets. This research was accepted as an oral presentation and shared by Jerone at the 2023 Conference on Neural Information Processing Systems (NeurIPS).

Jerone, Alice, and the other authors of this paper are experts in machine learning (ML), computer vision (CV), algorithmic fairness, philosophy, and social science, and used contemporary interdisciplinary practices to develop recommendations for responsible data curation. With a diverse range of backgrounds, they bring extensive experience in CV dataset design, model training, and ethical guideline development. The team identified ethical issues in HCCV data curation and conducted a thorough literature review on topics such as human-centered artificial intelligence (HCAI), HCCV datasets, and bias detection and mitigation, resulting in refined considerations and detailed recommendations for responsible data curation.

We spoke with Jerone and Alice about the impact of unethical data collection and Sony AI’s recommendations for responsible data collection.

What is the impact of unethical data collection on society if practitioners do not reverse course and adopt more ethical practices?

Unethical data collection practices, such as nonconsensual web scraping in HCCV datasets, can result in underrepresentation, biases, and privacy concerns. This approach treats image subjects as free raw material and yields datasets that lack the ground-truth metadata essential for fair evaluations. This significantly inhibits the ability to fully understand model blind spots and potential harms across various dimensions, including data subjects, instruments, and environments. Inferring attributes such as race and gender, which has been done in several datasets, introduces additional biases and carries the risk of causing psychological harm when the inferences are incorrect. This perpetuates existing inequalities and biases in AI systems and calls for careful attention to responsible and ethical data curation practices.

For example, in the healthcare industry, the absence of diverse data – particularly regarding age and minority populations – can lead AI systems to prioritize younger individuals due to life expectancy assumptions. This then perpetuates ageism and neglects the healthcare needs of older populations, which can lead to biased resource allocation.

Many harms associated with AI development occur at the point of data collection and cannot be easily addressed after the fact. These include the extent to which people have control over how their data is used, how those involved in the data collection pipeline are compensated and credited for their contributions, and what kind of world view the AI model learns and entrenches. Nonetheless, AI researchers and developers have historically undervalued data collection, leading to widespread reliance on problematically sourced datasets. Through this paper and our broader research agenda, we hope to encourage and enable practitioners to adopt more ethical practices going forward.

What are the actionable recommendations practitioners can take from this research on ethical considerations for data curation?

The first step is to prioritize fairness in dataset collection through explicit design for fairness and robustness assessments, which should preclude using "dirty data." Here, dirty data includes inferred data. Dirty data is characterized by missing or incorrect information and is distorted by individual and societal biases, which can inadvertently compromise downstream research, policy, and decision-making validity. Through mechanisms like informed consent, dataset curators can engage data subjects in the data collection process. This enables the collection of self-identified information directly from data subjects, who inherently possess contextual knowledge of their environment and are aware of their own attributes. Gathering these labels improves datasets' fairness, respect, and accuracy, acknowledging individuals' autonomy and fostering their well-being. This approach promotes the ethical and inclusive creation of datasets.

Prioritization of fairness must, however, start at the inception of a dataset, which requires practitioners to refrain from repurposing existing web-scraped, fairness-unaware datasets – for example, datasets collected without data subject participation or fairness in mind. Second, practitioners should delimit the scope of their data collection effort before any data is collected via purpose statements, which will help ensure alignment with data subjects' consent, intentions, and best interests, preventing purpose creep and hindsight bias.

Previous scholarship and our own research have highlighted that institutional protocols are unsuitable for data-centric research because of their limited definition of what counts as human-subjects research. Current protocols classify publicly available data as minimal risk without considering the broader societal consequences beyond the immediate research study context. Therefore, in designing and collecting HCCV data, researchers must embrace heightened ethical responsibility to safeguard the well-being of human subjects in research. This involves acknowledging that most data either represents or directly influences individuals.

What are the current challenges to adopting these recommendations, and how can they be avoided or solved?

Adopting more ethical data curation practices faces several obstacles rooted in entrenched norms, organizational inertia, diffusion of responsibility, and concerns about legal liability. These barriers collectively contribute to a reluctance to embrace change and hinder the integration of practices that prioritize ethical considerations.

Another obstacle is seeking consent from all depicted individuals, which introduces resource-intensive logistical challenges. Obtaining explicit permission on a large scale requires significant human resources, time, and financial support. This task, in particular, can be daunting for smaller organizations that may lack the infrastructure to implement and maintain consent management systems.

Moreover, extending ethical data curation recommendations to the large training datasets often used to develop complex machine learning models, known as foundation models, can incur substantial costs. The need for meticulous curation, verification, and documentation of such extensive datasets demands a delicate balance between advancing model development and upholding ethical standards. Organizations must navigate these financial considerations to strike an equilibrium that ensures both model improvements and adherence to ethical guidelines.

To address these challenges, leading machine learning conferences could consider adopting a registered reports model, where dataset proposals are pre-accepted before data collection. This would help to alleviate financial uncertainties associated with more ethical practices. Organizations can redirect research funds from data-intensive methods and channel financial resources toward approaches developed with responsibly curated data. This allows for a balanced alignment between technological advancement and ethical imperatives.

As with any ethical recommendations, there is always the need to balance multiple ethical desiderata (which can sometimes conflict) with practical constraints. As a result, there will never be perfectly ethically collected data for AI development. That said, current practices common across AI development, such as reliance on uncurated, nonconsensual, web-scraped datasets, provide a very low baseline that can certainly be improved upon. Thus, the goal should be incremental improvement, adoption of deliberate ethical practices, and greater allocation of resources toward appropriate data sourcing.

Is there an opportunity to repair the damage done by current models based on unethical datasets?

Repairing the damage of existing models trained on unethical datasets is challenging but possible. New models can be designed with ethical considerations to ensure fairness, and existing models can be improved through retraining on responsibly collected datasets or adding fairness constraints. While it may not fully undo the impact of unethical data, these efforts contribute to a more ethical AI landscape.

You mentioned that financial considerations are a gating factor in instituting the recommended changes. Does the economic feasibility of operationalizing fairness create another kind of bias in that only large companies or institutions with the economic means can create fair datasets?

The cost associated with data collection, consent management systems, and ensuring fairness can be substantial, especially when dealing with large-scale datasets. Smaller organizations and academic research groups may face challenges meeting these financial requirements. This barrier could result in an uneven distribution of resources, where larger entities are better equipped to adhere to ethical data practices, potentially perpetuating biases in favor of well-funded organizations.

To mitigate this specific bias, there is a need for broader initiatives and collaborations – such as data consortia – to pool resources and knowledge for ethical data collection. Additionally, regulatory frameworks and incentives for organizations to prioritize fairness in data collection can help level the playing field and ensure that financial constraints do not lead to a bias in ethical data practices.

What is the inflection point that might shift the tide toward more ethical data collection, and how far off are we from this reality?

The inflection point to drive more ethical data collection could be shaped by implementing purpose statements, as suggested in our paper, before data collection, as well as stricter regulations, peer pressure, public awareness, industry leadership, and research community reflection. Purpose statements can maintain transparency and prevent hindsight bias and purpose creep.

While a growing corpus of research discusses these issues, top-tier computer vision-centric conferences have yet to adopt ethics review practices. Therefore, many ethically dubious datasets are still being used and collected.

Collaboration is a cornerstone in making this endeavor a reality. The speed of progress is dependent on the cooperation between all stakeholders, evolving regulations, and the commitment of organizations and researchers to adopt more ethical data collection practices. This can be done even if established institutional research protocols do not mandate these practices. By leveraging diverse perspectives, knowledge, and resources, collaborative initiatives can catalyze innovative solutions and establish best practices that contribute to the ethical advancement of data-related activities.

Recent interest in generative AI and growing calls for AI regulation will likely provide an inflection point. Given the volume of data involved in developing generative AI technologies, there is growing attention to the need to source data for AI development more ethically. This need is even more pressing given the additional ethical concerns that such models could leak confidential or private information or create content that mimics their training data. In addition, the EU AI Act and other regulatory movements are putting more pressure on organizations to provide transparency on the data used to develop their models, such that these practices will likely face more public and regulatory scrutiny going forward.

