CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI
Siyuan Cheng
Zhenting Wang
Xiangyu Zhang
CVPR-25
2025
Abstract
With the rapid advancement of generative AI, it is now possible to synthesize high-quality images in a few seconds. Despite the power of these technologies, they raise significant concerns regarding misuse. Current efforts to distinguish between real and AI-generated images may lack generalization, being effective for only certain types of generative models and susceptible to post-processing techniques like JPEG compression. To overcome these limitations, we propose a novel framework, CO-SPY, that first enhances existing semantic features (e.g., the number of fingers in a hand) and artifact features (e.g., pixel value differences), and then adaptively integrates them to achieve more general and robust synthetic image detection. Additionally, we create CO-SPYBENCH, a comprehensive dataset comprising 5 real image datasets and 22 state-of-the-art generative models, including the latest models like FLUX. We also collect 50k synthetic images in the wild from the Internet to enable evaluation in a more practical setting. Our extensive evaluations demonstrate that our detector outperforms existing methods under identical training conditions, achieving an average accuracy improvement of approximately 11% to 34%. The code is available at https://github.com/Megum1/Co-Spy.
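The abstract does not specify how the semantic and artifact features are integrated, so the following is only an illustrative sketch of one common way to combine two feature branches adaptively: an input-dependent softmax gate that weights each branch before a logistic classifier. The function names (`fuse`, `predict_synthetic`), the gate vectors, and the classifier weights are all hypothetical and are not taken from the CO-SPY code base.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(semantic, artifact, gate_sem, gate_art):
    """Blend two equal-length feature vectors with input-dependent weights.

    Assumption: each branch gets a scalar logit from a (hypothetical)
    learned gate vector; a softmax turns the two logits into weights.
    """
    w_sem, w_art = softmax([dot(gate_sem, semantic), dot(gate_art, artifact)])
    fused = [w_sem * s + w_art * a for s, a in zip(semantic, artifact)]
    return fused, (w_sem, w_art)


def predict_synthetic(fused, clf_weights, bias=0.0):
    """Logistic score in (0, 1); higher means 'more likely synthetic'."""
    return 1.0 / (1.0 + math.exp(-(dot(clf_weights, fused) + bias)))


if __name__ == "__main__":
    # Toy feature vectors standing in for the two branches' outputs.
    semantic = [1.0, 0.0]
    artifact = [0.0, 1.0]
    fused, weights = fuse(semantic, artifact, [2.0, 0.0], [0.0, 0.5])
    print("branch weights:", weights)
    print("score:", predict_synthetic(fused, [1.0, -1.0]))
```

In this sketch the gate lets the detector lean on whichever branch is more informative for a given image, for example the semantic branch when compression has degraded pixel-level artifacts. Whether CO-SPY uses this particular gating form is an assumption.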