GenDataAgent: On-the-fly Dataset Augmentation with Synthetic Data

Authors: Zhiteng Li, Lele Chen, Jerone Andrews, Yunhao Ba, Yulun Zhang, Alice Xiang

Venue: ICLR-25

Date: 2026

Abstract

We propose a generative agent that augments training datasets with synthetic data
for model fine-tuning. Unlike prior work, which uniformly samples synthetic data,
our agent iteratively generates relevant samples on-the-fly, aligning with the target
distribution. It prioritizes synthetic data that complements difficult training samples,
focusing on those with high variance in gradient updates. Experiments across
several image classification tasks demonstrate the effectiveness of our approach.
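
The abstract does not say how "high variance in gradient updates" is measured, so the following is only a minimal sketch of one plausible reading: record a gradient norm per training sample at each step, then rank samples by the variance of those norms and treat the top ones as the "difficult" samples that synthetic data should complement. The `grad_history` structure, the function name, and the sample ids are all hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def rank_by_gradient_variance(grad_history, k):
    """Return the k sample ids whose recorded gradient norms vary the most.

    grad_history maps a sample id to a list of gradient norms observed for
    that sample over training steps. This bookkeeping is an assumption made
    for illustration, not the authors' method.
    """
    # Population variance of each sample's gradient norms.
    scores = {sid: pvariance(norms) for sid, norms in grad_history.items()}
    # Highest variance first; keep the top k.
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy history: img_a oscillates strongly, img_b is stable, img_c wobbles slightly.
history = {
    "img_a": [0.9, 0.1, 0.8, 0.2],
    "img_b": [0.5, 0.5, 0.5, 0.5],
    "img_c": [0.3, 0.4, 0.3, 0.4],
}
hard = rank_by_gradient_variance(history, k=2)
print(hard)  # ['img_a', 'img_c']
```

In a loop of this kind, the selected ids would then be used to condition a generator so that new synthetic samples resemble the difficult ones. That generation step is left out here because the abstract gives no details about it.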


