v1.2.1
v1.2.1 adds two new benchmark datasets: the GlobalWheat wheat head detection dataset and the RxRx1 cellular microscopy dataset. Please see our paper for more details on these datasets.
It also simplifies saving and evaluating predictions made across different replicates and datasets.
New datasets
New benchmark dataset: GlobalWheat-WILDS v1.0
- The Global Wheat Head detection dataset comprises images of wheat fields collected from 12 countries around the world. The task is to draw bounding boxes around instances of wheat heads in each image, and the distribution shift is over images taken in different locations.
- Model performance is measured by the proportion of predicted bounding boxes that sufficiently overlap with the ground-truth bounding boxes (IoU > 0.5); a rough sketch of this criterion follows this list. The example script implements a FasterRCNN baseline.
- This dataset is adapted from the Global Wheat Head Dataset 2021, which was recently used in a public competition held in conjunction with the Computer Vision in Plant Phenotyping and Agriculture Workshop at ICCV 2021.
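The following is a self-contained sketch of that matching criterion; `box_iou` and the example coordinates are made up for illustration and are not the benchmark's actual implementation.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box counts as a hit if it overlaps a ground-truth box with IoU > 0.5.
is_hit = box_iou((10, 10, 50, 50), (12, 8, 48, 52)) > 0.5  # True here (IoU ≈ 0.83)
```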
New benchmark dataset: RxRx1-WILDS v1.0
- The RxRx1 dataset comprises images of genetically-perturbed cells taken with fluorescent microscopy and collected across 51 experimental batches. The task is to classify the identity of the genetic perturbation applied to each cell, and the distribution shift is over different experimental batches.
- Model performance is measured by average classification accuracy. The example script implements a ResNet-50 baseline (a minimal sketch follows this list).
- This dataset is adapted from the RxRx1 dataset released by Recursion.
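As a minimal sketch of how such a baseline might be set up (assuming RxRx1-WILDS has already been downloaded under `data/`; the actual training setup lives in the example scripts):

```python
import torch.nn as nn
import torchvision
from wilds import get_dataset

# Load RxRx1-WILDS; assumes the data is already present under root_dir.
dataset = get_dataset(dataset="rxrx1", root_dir="data")

# ResNet-50 baseline: swap the final layer for one output per perturbation class.
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, dataset.n_classes)
```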
Additional dataset: ENCODE
- The ENCODE dataset is based on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge. The task is to classify if a given genomic location will be bound by a particular transcription factor, and the distribution shift is over different cell types.
- We did not include this dataset in the official benchmark as we were unable to learn a model that could generalize across all the cell types simultaneously, even in an in-distribution setting, which suggested that the model family and/or feature set might not be rich enough.
Other changes
Saving and evaluating predictions
To ease evaluation and leaderboard submission, we have made the following changes:
- Predictions are now automatically saved in the format described in our submission guidelines.
- We have added an evaluation script that evaluates these saved predictions across multiple replicates and datasets. See the updated README and `examples/evaluate.py` for more details.
Code changes to support detection tasks
To support detection tasks, we have modified the example scripts and made slight changes to the WILDS data loaders. All interfaces should be backwards-compatible.
- The labels `y` and the model outputs no longer need to be a `Tensor`. For example, for detection tasks, a model might return a dictionary containing bounding box coordinates as well as class predictions for each bounding box. Accordingly, several helper functions have been rewritten to be more flexible.
- Models can now optionally take in `y` in the forward call. For example, during training, a model might use ground-truth bounding boxes to train a bounding box classifier.
- Data transforms can now transform both `x` and `y`. We have also merged the `train_transform` and `eval_transform` functions into a single function that takes an `is_training` parameter (see the sketch after this list).
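The sketch below only illustrates the kind of interface these changes allow; `ToyDetector` and `joint_transform` are hypothetical names, not part of WILDS, and the actual FasterRCNN baseline lives in the example scripts.

```python
import torch
import torch.nn as nn

class ToyDetector(nn.Module):
    """Illustrative model whose output is a dict rather than a single Tensor."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.box_head = nn.Linear(8, 4)

    def forward(self, x, y=None):
        # y is optional: during training, ground-truth boxes could be used here,
        # e.g. to supervise a bounding box classifier.
        feats = self.backbone(x).mean(dim=(2, 3))    # (batch, 8)
        outputs = {"boxes": self.box_head(feats)}    # predicted box coordinates
        if y is not None:
            outputs["targets"] = y
        return outputs

def joint_transform(x, y, is_training=True):
    """Single transform over both inputs and labels, with an is_training flag."""
    if is_training:
        x = torch.flip(x, dims=[-1])  # horizontal flip of the image
        # ...box coordinates in y would be flipped correspondingly here...
    return x, y
```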
Miscellaneous changes
- We have changed the names of the in-distribution `split_scheme`s to match the terminology in Section 5 of the updated paper.
- The FMoW-WILDS and PovertyMap-WILDS constructors no longer use the `oracle_training_set` parameter to select an in-distribution split. This is now controlled through `split_scheme`, to be consistent with the other datasets (see the sketch after this list).
- We fixed a minor bug in the PovertyMap-WILDS in-distribution baseline. The Val (ID) and Test (ID) splits are slightly changed.
- The FMoW-WILDS constructor now sets `use_ood_val=True` by default. This change has no effect for users using the example scripts, as `use_ood_val` is already set in `config/datasets.py`.
- Users who are only using the data loaders and not the evaluation metrics or example scripts no longer need to install `torch_scatter` (thanks Ke Alexander Wang).
- The Waterbirds dataset now computes the adjusted average accuracy on the validation and test sets, as described in Appendix C.1 of the corresponding paper.
- The behavior of `algorithm.eval()` is now consistent with `algorithm.model.eval()` in that both preserve the `grad_fn` attribute (thanks Divya Shanmugam). See #45.
- The dataset name for OGB-MolPCBA has been changed from `ogbg-molpcba` to `ogb-molpcba` for consistency.
- We have updated the OGB-MolPCBA data loader to be compatible with v1.7 of the `pytorch_geometric` dependency (thanks arnaudvl). See #52.
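As a quick illustration of the constructor interface after these changes (paths are placeholders, the data is assumed to already be downloaded, and `"official"` is the default split scheme):

```python
from wilds import get_dataset

# In-distribution vs. official splits are now selected via split_scheme;
# the in-distribution split_scheme names follow Section 5 of the updated paper.
fmow = get_dataset(dataset="fmow", root_dir="data", split_scheme="official")

# OGB-MolPCBA is now registered under the name "ogb-molpcba"
# (requires the ogb and pytorch_geometric dependencies).
molpcba = get_dataset(dataset="ogb-molpcba", root_dir="data")
```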