The codebase for the evaluation of deep generative models as presented in "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models", accepted to NeurIPS 2023.
We studied 41 generative models across a diverse range of image datasets and found:
- The state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics when using the default Inception-V3 network.
- Supervised networks do not provide a perceptual space that generalizes well for image evaluation, and neither do self-supervised methods from particular families.
- DINOv2 provides such a generalized representation space and allows for much richer evaluation of generative models. Researchers should replace Inception-V3 in all evaluation metrics. We provide an extensive DINOv2 leaderboard below and have added the results to paperswithcode.com.
- Generative models directly memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that currently proposed diagnostic metrics do not properly detect memorization.
Here we provide code to compute the following 15 generative evaluation metrics using 8 different encoder networks:
Metrics:
- Fréchet Distance: FD
- FD∞
- Spatial FID: sFID
- Kernel Distance
- Inception Score
- FLS
- Precision & Recall
- Density & Coverage
- Vendi score
- AuthPct
- CT score
- FLS-POG
- Realism
- Approximate Sliced Wasserstein: ASW
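For orientation, the Fréchet distance at the top of this list is conventionally defined between two Gaussians fitted to encoder features. The expression below is the standard textbook form, written with symbols introduced here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of features from real images, and $\mu_g, \Sigma_g$ those of generated images. It is not taken from this repository's code, whose numerical details may differ.

```latex
\mathrm{FD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Under this definition, changing the encoder changes the features and therefore the value, which is why scores from different encoders are not directly comparable.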
Encoders:
First clone this repository, then navigate to the directory and run pip install to install all required packages.
git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .
We recommend you do this in a conda environment:
conda create --name dgm-eval pip python==3.10
conda activate dgm-eval
git clone git@github.com:layer6ai-labs/dgm-eval
cd dgm-eval
pip install -e .
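After installing, one quick way to confirm that the active environment can see the package is to look it up without importing it. The snippet below is a minimal sketch using only the Python standard library; the helper name `is_installed` is hypothetical, and `dgm_eval` is the module name used by the `python -m dgm_eval` commands in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Reports whether the editable install is visible to this interpreter.
    status = "found" if is_installed("dgm_eval") else "not found"
    print(f"dgm_eval: {status}")
```

If this reports "not found", the most likely cause is that the conda environment was not activated before running pip.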
Computing metrics only requires the paths to either locally hosted image datasets or torchvision.datasets. Encoders are automatically downloaded. For example, the following will compute the Fréchet distance (fd), kernel distance (kd), precision/recall/density/coverage (prdc), and the CT score (ct) using DINOv2 (default) as the encoder.
python -m dgm_eval path/to/training_dataset path/to/generated_dataset \
--test_path path/to/test_dataset \
--model dinov2 \
--metrics fd kd prdc ct
See scripts/run_experiments.sh or run python -m dgm_eval -h for further details on command-line parameters. As we suggest in the paper, metrics should be reported using the default model size (DINOv2-ViT-L/14) for final leaderboard values, but tracking progress during training is a factor of 4 more efficient with DINOv2-ViT-B/14. To use this architecture instead, simply add --arch vitb14 as a command-line parameter.
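When the evaluation is launched from a training script, it can be convenient to assemble the command programmatically. The following is a minimal sketch that only builds the argument list from the flags shown above (`--test_path`, `--model`, `--arch`, `--metrics`); the helper `build_command` and its parameter names are hypothetical and not part of this repository.

```python
from __future__ import annotations

import shlex
import sys
from typing import Sequence


def build_command(
    train_path: str,
    generated_path: str,
    metrics: Sequence[str],
    model: str = "dinov2",
    test_path: str | None = None,
    arch: str | None = None,
) -> list[str]:
    """Assemble a `python -m dgm_eval ...` invocation as an argument list."""
    cmd = [sys.executable, "-m", "dgm_eval", train_path, generated_path]
    if test_path is not None:
        cmd += ["--test_path", test_path]
    cmd += ["--model", model]
    if arch is not None:
        # e.g. "vitb14" for the smaller DINOv2 model mentioned above
        cmd += ["--arch", arch]
    cmd += ["--metrics", *metrics]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "path/to/training_dataset",
        "path/to/generated_dataset",
        metrics=["fd", "kd"],
        arch="vitb14",
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell quoting problems when dataset paths contain spaces.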
Local datasets should either be unconditional:
local/path/
    000000.png
    000001.png
    ...
or conditional:
local/path/
    0/
        000000.png
        000001.png
        ...
    1/
        000000.png
        000001.png
        ...
    ...
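Before pointing the tool at a local folder, it may help to check which of the two layouts it follows and whether anything other than images is present. The sketch below is an illustration only: the helper `inspect_dataset` is hypothetical, and the set of accepted image suffixes is an assumption, since this README does not list the supported formats.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: these suffixes count as images; adjust to match your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def inspect_dataset(root: str | Path) -> tuple[str, list[Path]]:
    """Classify a dataset folder and list entries that are not image files.

    Returns (layout, stray), where layout is one of "unconditional",
    "conditional", "mixed", or "empty".
    """
    root = Path(root)
    entries = sorted(root.iterdir())
    files = [p for p in entries if p.is_file()]
    dirs = [p for p in entries if p.is_dir()]

    if files and dirs:
        layout = "mixed"  # images and class folders side by side
    elif files:
        layout = "unconditional"
    elif dirs:
        layout = "conditional"
    else:
        layout = "empty"

    # Collect top-level files plus the direct contents of each class folder.
    candidates = list(files)
    for class_dir in dirs:
        candidates.extend(sorted(class_dir.iterdir()))

    # Anything that is a nested folder or lacks an image suffix is flagged.
    stray = [
        p for p in candidates
        if p.is_dir() or p.suffix.lower() not in IMAGE_SUFFIXES
    ]
    return layout, stray
```

For example, `layout, stray = inspect_dataset("local/path")` returns the detected layout and any files that should be removed before evaluation.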
The directory should only include image files. To download and use a dataset from torchvision.datasets, just specify the dataset and train/test string:
python -m dgm_eval CIFAR10:train CIFAR10:test
A full example is as follows:
python -m dgm_eval CIFAR10:train CIFAR10:test \
--model dinov2 \
--metrics fd kd prdc \
--device cuda \
--batch_size 256 \
--nsample 512
>>> ....
>>> Num real: 512 Num fake: 512
>>> fd: 862.53745
>>> kd_value: 0.01095
>>> kd_variance: 0.00000
>>> precision: 0.90430
>>> recall: 0.91797
>>> density: 0.97969
>>> coverage: 0.94141
All generated data shown in this work can be accessed at the following link:
drive.google.com/drive/folders/1X0MFaUta90d3zF9xG4KchjR-8SE0cT_7?usp=sharing
Including:
- Datasets of 100,000 image samples from 41 generative models across CIFAR10/, imagenet256/, LSUN Bedroom/, and FFHQ256/.
- Training & test data at 256 x 256 resolution
- Generated datasets for controlled experiments presented in the Appendix can be found in toy-datasets/
Data for human evaluation of image realism can be found at data/human-evaluation-realism/
We have included leaderboard values on paperswithcode (links), and list all metrics in a table below:
Heatmaps can be visualized for each model on any given image dataset with the following command; examples are shown below:
python -m dgm_eval CIFAR10:train CIFAR10:test \
--model inception \
--metrics fd \
--device cuda \
--batch_size 256 \
--nsample 50000 \
--heatmaps
Images | Inception | DINOv2 |
---|---|---|
If you use any part of this repository in your research, please cite the associated paper with the following bibtex entry:
Authors: George Stein, Jesse C. Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L. Caterini, J. Eric T. Taylor, Gabriel Loaiza-Ganem
@inproceedings{stein2023exposing,
title={Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models},
author={Stein, George and Cresswell, Jesse and Hosseinzadeh, Rasa and Sui, Yi and Ross, Brendan and Villecroze, Valentin and Liu, Zhaoyan and Caterini, Anthony L and Taylor, Eric and Loaiza-Ganem, Gabriel},
booktitle={Advances in Neural Information Processing Systems},
volume={36},
year={2023}
}
This data and code are licensed under the MIT License, copyright by Layer 6 AI.