v1.2.1
v1.2.1 adds two new benchmark datasets: the GlobalWheat wheat head detection dataset and the RxRx1 cellular microscopy dataset. Please see our paper for more details on these datasets.
It also simplifies saving and evaluating predictions made across different replicates and datasets.
New datasets
New benchmark dataset: GlobalWheat-WILDS v1.0
- The Global Wheat Head detection dataset comprises images of wheat fields collected from 12 countries around the world. The task is to draw bounding boxes around instances of wheat heads in each image, and the distribution shift is over images taken in different locations.
- Model performance is measured by the proportion of predicted bounding boxes that sufficiently overlap with the ground-truth bounding boxes (IoU > 0.5); a rough sketch of this criterion follows this list. The example script implements a FasterRCNN baseline.
- This dataset is adapted from the Global Wheat Head Dataset 2021, which was recently used in a public competition held in conjunction with the Computer Vision in Plant Phenotyping and Agriculture Workshop at ICCV 2021.
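The following is a self-contained sketch of that matching criterion; `box_iou` and the example coordinates are made up for illustration and are not the benchmark's actual implementation.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box counts as a hit if it overlaps a ground-truth box with IoU > 0.5.
is_hit = box_iou((10, 10, 50, 50), (12, 8, 48, 52)) > 0.5  # True here (IoU ≈ 0.83)
```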
New benchmark dataset: RxRx1-WILDS v1.0
- The RxRx1 dataset comprises images of genetically-perturbed cells taken with fluorescent microscopy and collected across 51 experimental batches. The task is to classify the identity of the genetic perturbation applied to each cell, and the distribution shift is over different experimental batches.
- Model performance is measured by average classification accuracy. The example script implements a ResNet-50 baseline (a minimal sketch follows this list).
- This dataset is adapted from the RxRx1 dataset released by Recursion.
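As a minimal sketch of how such a baseline might be set up (assuming RxRx1-WILDS has already been downloaded under `data/`; the actual training setup lives in the example scripts):

```python
import torch.nn as nn
import torchvision
from wilds import get_dataset

# Load RxRx1-WILDS; assumes the data is already present under root_dir.
dataset = get_dataset(dataset="rxrx1", root_dir="data")

# ResNet-50 baseline: swap the final layer for one output per perturbation class.
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, dataset.n_classes)
```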
Additional dataset: ENCODE
- The ENCODE dataset is based on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge. The task is to classify if a given genomic location will be bound by a particular transcription factor, and the distribution shift is over different cell types.
- We did not include this dataset in the official benchmark as we were unable to learn a model that could generalize across all the cell types simultaneously, even in an in-distribution setting, which suggested that the model family and/or feature set might not be rich enough.
Other changes
Saving and evaluating predictions
To ease evaluation and leaderboard submission, we have made the following changes:
- Predictions are now automatically saved in the format described in our submission guidelines.
- We have added an evaluation script that evaluates these saved predictions across multiple replicates and datasets. See the updated README and `examples/evaluate.py` for more details.
Code changes to support detection tasks
To support detection tasks, we have modified the example scripts and made slight changes to the WILDS data loaders. All interfaces should be backwards-compatible.
- The labels `y` and the model outputs no longer need to be a `Tensor`. For example, for detection tasks, a model might return a dictionary containing bounding box coordinates as well as class predictions for each bounding box. Accordingly, several helper functions have been rewritten to be more flexible.
- Models can now optionally take in `y` in the forward call. For example, during training, a model might use ground-truth bounding boxes to train a bounding box classifier.
- Data transforms can now transform both `x` and `y`. We have also merged the `train_transform` and `eval_transform` functions into a single function that takes an `is_training` parameter (see the sketch after this list).
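The sketch below only illustrates the kind of interface these changes allow; `ToyDetector` and `joint_transform` are hypothetical names, not part of WILDS, and the actual FasterRCNN baseline lives in the example scripts.

```python
import torch
import torch.nn as nn

class ToyDetector(nn.Module):
    """Illustrative model whose output is a dict rather than a single Tensor."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.box_head = nn.Linear(8, 4)

    def forward(self, x, y=None):
        # y is optional: during training, ground-truth boxes could be used here,
        # e.g. to supervise a bounding box classifier.
        feats = self.backbone(x).mean(dim=(2, 3))    # (batch, 8)
        outputs = {"boxes": self.box_head(feats)}    # predicted box coordinates
        if y is not None:
            outputs["targets"] = y
        return outputs

def joint_transform(x, y, is_training=True):
    """Single transform over both inputs and labels, with an is_training flag."""
    if is_training:
        x = torch.flip(x, dims=[-1])  # horizontal flip of the image
        # ...box coordinates in y would be flipped correspondingly here...
    return x, y
```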
Miscellaneous changes
- We have changed the names of the in-distribution `split_scheme`s to match the terminology in Section 5 of the updated paper.
- The FMoW-WILDS and PovertyMap-WILDS constructors no longer use the `oracle_training_set` parameter to select an in-distribution split. This is now controlled through `split_scheme`, to be consistent with the other datasets (see the sketch after this list).
- We fixed a minor bug in the PovertyMap-WILDS in-distribution baseline. The Val (ID) and Test (ID) splits are slightly changed.
- The FMoW-WILDS constructor now sets `use_ood_val=True` by default. This change has no effect for users using the example scripts, as `use_ood_val` is already set in `config/datasets.py`.
- Users who are only using the data loaders and not the evaluation metrics or example scripts no longer need to install `torch_scatter` (thanks Ke Alexander Wang).
- The Waterbirds dataset now computes the adjusted average accuracy on the validation and test sets, as described in Appendix C.1 of the corresponding paper.
- The behavior of `algorithm.eval()` is now consistent with `algorithm.model.eval()` in that both preserve the `grad_fn` attribute (thanks Divya Shanmugam). See #45.
- The dataset name for OGB-MolPCBA has been changed from `ogbg-molpcba` to `ogb-molpcba` for consistency.
- We have updated the OGB-MolPCBA data loader to be compatible with v1.7 of the `pytorch_geometric` dependency (thanks arnaudvl). See #52.
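As a quick illustration of the constructor interface after these changes (paths are placeholders, the data is assumed to already be downloaded, and `"official"` is the default split scheme):

```python
from wilds import get_dataset

# In-distribution vs. official splits are now selected via split_scheme;
# the in-distribution split_scheme names follow Section 5 of the updated paper.
fmow = get_dataset(dataset="fmow", root_dir="data", split_scheme="official")

# OGB-MolPCBA is now registered under the name "ogb-molpcba"
# (requires the ogb and pytorch_geometric dependencies).
molpcba = get_dataset(dataset="ogb-molpcba", root_dir="data")
```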