Authors
- Zhiteng Li
- Lele Chen
- Jerone Andrews
- Yunhao Ba
- Yulun Zhang
- Alice Xiang
Venue
- ICLR-25
Date
- 2026
GenDataAgent: On-the-fly Dataset Augmentation with Synthetic Data
Abstract
We propose a generative agent that augments training datasets with synthetic data
for model fine-tuning. Unlike prior work, which uniformly samples synthetic data,
our agent iteratively generates relevant samples on-the-fly, aligning with the target
distribution. It prioritizes synthetic data that complements difficult training samples,
focusing on those with high variance in gradient updates. Experiments across
several image classification tasks demonstrate the effectiveness of our approach.
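
The prioritization idea described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the variance in gradient updates is measured, so the code assumes, purely for illustration, that a per-sample history of gradient norms has been logged during fine-tuning. The names `grad_norm_history`, `gradient_variance_scores`, and `select_difficult_samples` are hypothetical.

```python
from statistics import pvariance


def gradient_variance_scores(grad_norm_history):
    """Return {sample_id: variance of its logged gradient norms}.

    Assumption: grad_norm_history maps each training sample id to a
    non-empty list of gradient norms recorded at different training steps.
    """
    return {sid: pvariance(norms) for sid, norms in grad_norm_history.items()}


def select_difficult_samples(grad_norm_history, k):
    """Pick the k samples whose gradient norms fluctuate the most.

    These would be the samples for which complementary synthetic data is
    requested first (hypothetical selection rule, not taken from the paper).
    """
    scores = gradient_variance_scores(grad_norm_history)
    # Sort by descending variance; break ties by sample id for determinism.
    ranked = sorted(scores, key=lambda sid: (-scores[sid], sid))
    return ranked[:k]


# Toy history: three samples, four logged gradient norms each.
history = {
    "img_a": [0.9, 0.1, 0.8, 0.2],  # unstable updates
    "img_b": [0.5, 0.5, 0.5, 0.5],  # perfectly stable updates
    "img_c": [0.3, 0.4, 0.3, 0.4],  # mildly unstable updates
}
difficult = select_difficult_samples(history, k=2)
print(difficult)
```

In an iterative setup, the selected ids would be passed to the generator at each round so that new synthetic samples target the examples the model is currently least stable on, rather than being drawn uniformly.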



