2025-01-14-yamaguchi25a.md

File metadata and controls

62 lines (62 loc) · 2.71 KB
---
title: Analyzing Diffusion Models on Synthesizing Training Datasets
booktitle: Proceedings of the 16th Asian Conference on Machine Learning
year: '2025'
volume: '260'
series: Proceedings of Machine Learning Research
month: 0
publisher: PMLR
pdf:
url:
openreview: x2PY6GP9GJ
abstract: Synthetic samples from diffusion models are promising for training discriminative models, whether as replications of or augmentations to real training datasets. However, we found that synthetic datasets degrade classification performance compared with real datasets of the same size. This means that synthetic samples from modern diffusion models are less informative for training discriminative models. This paper investigates the gap between synthetic and real samples by analyzing synthetic samples reconstructed from real samples through the noising (diffusion) and denoising (reverse) processes of diffusion models. By varying the time step at which the reverse process starts in the reconstruction, we can control the trade-off between the information retained from the original real data and the information produced by the diffusion model. By assessing the reconstructed samples and the models trained on them, we found that the synthetic samples concentrate in the modes of the training data distribution as the reverse step increases, and thus they have difficulty covering the outer edges of the distribution. In contrast, we found that these synthetic samples yield significant improvements in the data augmentation setting, where both real and synthetic samples are used, indicating that samples around the modes serve as useful interpolations for learning classification boundaries. These findings suggest that modern diffusion models are currently insufficient for replicating a real training dataset at the same dataset size, but are well suited to interpolating the real training samples as augmented datasets.
layout: inproceedings
issn: 2640-3498
id: yamaguchi25a
tex_title: Analyzing Diffusion Models on Synthesizing Training Datasets
firstpage: '335'
lastpage: '350'
page: 335-350
order: '335'
cycles: false
bibtex_editor: Nguyen, Vu and Lin, Hsuan-Tien
editor:
- given: Vu
  family: Nguyen
- given: Hsuan-Tien
  family: Lin
bibtex_author: Yamaguchi, Shin'ya
author:
- given: Shin'ya
  family: Yamaguchi
date: 2025-01-14
address:
container-title: Proceedings of the 16th Asian Conference on Machine Learning
genre: inproceedings
issued:
  date-parts:
  - 2025
  - 1
  - 14
extras:
---