Releases: p-lambda/wilds
v2.0.0
The v2.0.0 release adds unlabeled data to 8 datasets, along with several new algorithms for taking advantage of that unlabeled data. It also updates the standard data augmentations used for the image datasets.
All labeled (training, validation, and test) datasets are exactly the same. Evaluation metrics are also exactly the same. All results on the datasets in v1.x are therefore still current and directly comparable to results obtained on v2.
For more information, please read our paper on the unlabeled data.
New datasets with unlabeled data
We have added unlabeled data to the following datasets:
- iwildcam
- camelyon17
- ogb-molpcba
- globalwheat
- civilcomments
- fmow
- poverty
- amazon
The following datasets have no unlabeled data and have not been changed:
- rxrx1
- py150
The labeled training, validation, and test data in all datasets have been kept exactly the same.
The unlabeled data comes from the same underlying sources as the original labeled data and can be from the source, validation, extra, or target domains. We describe each dataset in detail in our paper.
Each unlabeled dataset has its own corresponding data loader, defined in `wilds/datasets/unlabeled`. Please see the README for more details on how to use them.
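For reference, loading an unlabeled dataset follows the same pattern as the labeled datasets. The sketch below is based on the usage in the README; the split name `extra_unlabeled` and the transform are chosen purely for illustration, and the available unlabeled splits differ across datasets.

```python
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader
import torchvision.transforms as transforms

# Load the unlabeled variant of a dataset, downloading it if necessary.
dataset = get_dataset(dataset="iwildcam", unlabeled=True, download=True)

# Get an unlabeled split; the available split names vary by dataset
# ("extra_unlabeled" is used here purely for illustration).
unlabeled_data = dataset.get_subset(
    "extra_unlabeled",
    transform=transforms.Compose(
        [transforms.Resize((448, 448)), transforms.ToTensor()]
    ),
)

# Unlabeled loaders are used in the same way as the labeled loaders.
unlabeled_loader = get_train_loader("standard", unlabeled_data, batch_size=16)
```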
New algorithms for using unlabeled data
In our scripts in the `examples` folder, we have updated and/or added new algorithms that make use of the unlabeled data (a conceptual sketch of one of them, pseudo-labeling, follows the list):
- CORAL (Sun and Saenko, 2016)
- DANN (Ganin et al., 2016)
- AFN (Xu et al., 2019)
- Pseudo-Label (Lee, 2013)
- FixMatch (Sohn et al., 2020)
- Noisy Student (Xie et al., 2020)
- SwAV pre-training (Caron et al., 2020)
- Masked language model pre-training (Devlin et al., 2019)
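To give a sense of how these methods consume the unlabeled data, here is a conceptual PyTorch sketch of a single pseudo-labeling step that mixes a supervised loss on a labeled batch with a confidence-thresholded loss on an unlabeled batch. It is only an illustration of the idea, not the WILDS implementation; see the `examples` folder for the actual code.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, labeled_batch, unlabeled_batch,
                      confidence_threshold=0.95, unlabeled_weight=1.0):
    """One conceptual Pseudo-Label training step (illustrative only)."""
    x, y = labeled_batch           # labeled inputs and targets
    x_unlabeled = unlabeled_batch  # unlabeled inputs

    # Supervised loss on the labeled batch.
    loss = F.cross_entropy(model(x), y)

    # Assign pseudo-labels from the model's own confident predictions.
    with torch.no_grad():
        probs = F.softmax(model(x_unlabeled), dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = confidence >= confidence_threshold

    # Add an unsupervised loss on the confidently pseudo-labeled examples.
    if mask.any():
        loss = loss + unlabeled_weight * F.cross_entropy(
            model(x_unlabeled[mask]), pseudo_labels[mask]
        )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```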
Other changes
GlobalWheat v1.0 -> v1.1
We have corrected some errors in the metadata for the previous version of the GlobalWheat (labeled) dataset.
Users who did not explicitly make use of the location or stage metadata (which should be most users) will not be affected.
All baseline results are unchanged.
DomainNet support
We have included data loaders for the DomainNet dataset (Peng et al., 2019) as a means of benchmarking the algorithms we implemented on existing datasets.
Data augmentation
We have added support for RandAugment (Cubuk et al., 2019) for RGB images, and we have also implemented a set of data augmentations for the multi-spectral Poverty dataset. These augmentations are used in all of the algorithms for unlabeled data listed above.
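As a rough illustration, an RGB augmentation pipeline along these lines could be built with torchvision's built-in `RandAugment` (available from torchvision v0.11); the exact augmentation parameters used in our example scripts may differ.

```python
import torchvision.transforms as transforms

# Illustrative RGB augmentation pipeline; the exact parameters in the
# WILDS example scripts may differ.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),  # torchvision >= 0.11
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])
```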
Hyperparameters
In our experiments to benchmark the algorithms for using unlabeled data, we tuned hyperparameters by random search instead of grid search. The default hyperparameters in `examples/configs/datasets.py` still work well but do not reflect the exact hyperparameters we used for our experiments. To see those, please view our CodaLab worksheet.
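For readers unfamiliar with the distinction: rather than evaluating every point of a fixed grid, random search samples each configuration independently, e.g. drawing the learning rate log-uniformly. The sketch below is only illustrative; the actual search spaces are documented on the CodaLab worksheet.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    # Illustrative search space only; see the CodaLab worksheet for the
    # ranges actually used in our experiments.
    return {
        "lr": 10 ** rng.uniform(-5, -3),            # log-uniform learning rate
        "weight_decay": 10 ** rng.uniform(-5, -2),  # log-uniform weight decay
    }

configs = [sample_config() for _ in range(10)]
```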
Miscellaneous
- In our example scripts, we have added support for gradient accumulation by specifying the `gradient_accumulation_steps` parameter (a minimal sketch follows this list).
- We have also added support for logging using Weights and Biases.
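Gradient accumulation simulates a larger effective batch size by summing gradients over several batches before each optimizer step. A minimal sketch of the idea (not the exact logic in our example scripts), assuming batches of `(x, y, metadata)` as yielded by the WILDS loaders:

```python
import torch.nn.functional as F

def train_epoch_with_accumulation(model, optimizer, train_loader,
                                  gradient_accumulation_steps=4):
    """Accumulate gradients over several batches before each optimizer step."""
    optimizer.zero_grad()
    for step, (x, y, metadata) in enumerate(train_loader):
        # Scale the loss so the accumulated gradient matches one large batch.
        loss = F.cross_entropy(model(x), y) / gradient_accumulation_steps
        loss.backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```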
v1.2.2
v1.2.2 contains several minor changes:
- Added a check to make sure that a group data loader is used whenever `n_groups_per_batch` or `distinct_groups` are passed in as arguments to `examples/run_expt.py`. (#79) See the sketch after this list for how a group data loader is constructed.
- Data augmentations now only transform `x` by default. Set `do_transform_y` when initializing the `WILDSSubset` to modify both `x` and `y`. (#77)
- For FasterRCNN, we now use the PyTorch implementation of `smooth_l1_loss` instead of the custom torchvision implementation, which was removed in torchvision v0.10.
- Updated the requirements to include torchvision, scipy, and scikit-learn. Previously, torchvision was only needed for the example scripts. However, it is now also used for computing metrics in the GlobalWheat-WILDS dataset, so we have moved it into the core set of requirements.
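For context on the first item, a group data loader is constructed roughly as follows (based on the usage in the README; the dataset and grouping field here are illustrative):

```python
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader
from wilds.common.grouper import CombinatorialGrouper

dataset = get_dataset(dataset="camelyon17", download=True)
train_data = dataset.get_subset("train")

# Group examples by hospital and sample a fixed number of groups per batch.
grouper = CombinatorialGrouper(dataset, ["hospital"])
train_loader = get_train_loader(
    "group", train_data,
    grouper=grouper, n_groups_per_batch=2, batch_size=16,
)
```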
v1.2.1
v1.2.1 adds two new benchmark datasets: the GlobalWheat wheat head detection dataset, and the RxRx1 cellular microscopy dataset. Please see our paper for more details on these datasets.
It also simplifies saving and evaluating predictions made across different replicates and datasets.
New datasets
New benchmark dataset: GlobalWheat-WILDS v1.0
- The Global Wheat Head detection dataset comprises images of wheat fields collected from 12 countries around the world. The task is to draw bounding boxes around instances of wheat heads in each image, and the distribution shift is over images taken in different locations.
- Model performance is measured by the proportion of the predicted bounding boxes that sufficiently overlap with the ground truth bounding boxes (IoU > 0.5; a minimal IoU sketch follows this list). The example script implements a FasterRCNN baseline.
- This dataset is adapted from the Global Wheat Head Dataset 2021, which was recently used in a public competition held in conjunction with the Computer Vision in Plant Phenotyping and Agriculture Workshop at ICCV 2021.
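For readers less familiar with detection metrics, a predicted box counts as a match when its intersection-over-union (IoU) with a ground-truth box exceeds 0.5. A minimal sketch for axis-aligned boxes in `(x1, y1, x2, y2)` format:

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33, below the 0.5 threshold
```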
New benchmark dataset: RxRx1-WILDS v1.0
- The RxRx1 dataset comprises images of genetically-perturbed cells taken with fluorescent microscopy and collected across 51 experimental batches. The task is to classify the identity of the genetic perturbation applied to each cell, and the distribution shift is over different experimental batches.
- Model performance is measured by average classification accuracy. The example script implements a ResNet-50 baseline.
- This dataset is adapted from the RxRx1 dataset released by Recursion.
Additional dataset: ENCODE
- The ENCODE dataset is based on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge. The task is to classify if a given genomic location will be bound by a particular transcription factor, and the distribution shift is over different cell types.
- We did not include this dataset in the official benchmark as we were unable to learn a model that could generalize across all the cell types simultaneously, even in an in-distribution setting, which suggested that the model family and/or feature set might not be rich enough.
Other changes
Saving and evaluating predictions
To ease evaluation and leaderboard submission, we have made the following changes:
- Predictions are now automatically saved in the format described in our submission guidelines.
- We have added an evaluation script that evaluates these saved predictions across multiple replicates and datasets. See the updated README and `examples/evaluate.py` for more details.
Code changes to support detection tasks
To support detection tasks, we have modified the example scripts as well as made slight changes to the WILDS data loaders. All interfaces should be backwards-compatible.
- The labels `y` and the model outputs no longer need to be a `Tensor`. For example, for detection tasks, a model might return a dictionary containing bounding box coordinates as well as class predictions for each bounding box. Accordingly, several helper functions have been rewritten to be more flexible.
- Models can now optionally take in `y` in the forward call. For example, during training, a model might use ground truth bounding boxes to train a bounding box classifier. (See the sketch after this list.)
- Data transforms can now transform both `x` and `y`. We have also merged the `train_transform` and `eval_transform` functions into a single function that takes an `is_training` parameter.
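Both changes mirror the interface of torchvision's FasterRCNN, which the example scripts use for the GlobalWheat baseline: in training mode it consumes the ground-truth boxes and returns a dict of losses, and in eval mode it returns a list of per-image prediction dicts. The snippet below illustrates that interface only, not the WILDS training loop:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(num_classes=2)  # e.g. background + wheat head

images = [torch.rand(3, 448, 448)]
targets = [{
    "boxes": torch.tensor([[10.0, 20.0, 50.0, 60.0]]),  # (x1, y1, x2, y2)
    "labels": torch.tensor([1]),
}]

# Training mode: the model takes y (the ground-truth boxes) in the forward
# call and returns a dict of losses rather than a Tensor.
model.train()
loss_dict = model(images, targets)

# Eval mode: the model returns a list of dicts with predicted boxes,
# labels, and scores for each image.
model.eval()
with torch.no_grad():
    predictions = model(images)
```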
Miscellaneous changes
- We have changed the names of the in-distribution `split_scheme`s to match the terminology in Section 5 of the updated paper.
- The FMoW-WILDS and PovertyMap-WILDS constructors no longer use the `oracle_training_set` parameter to select an in-distribution split. This is now controlled through `split_scheme` to be consistent with the other datasets.
- We fixed a minor bug in the PovertyMap-WILDS in-distribution baseline. The Val (ID) and Test (ID) splits are slightly changed.
- The FMoW-WILDS constructor now sets `use_ood_val=True` by default. This change has no effect for users using the example scripts, as `use_ood_val` is already set in `config/datasets.py`.
- Users who are only using the data loaders and not the evaluation metrics or example scripts will no longer need to install `torch_scatter` (thanks Ke Alexander Wang).
- The Waterbirds dataset now computes the adjusted average accuracy on the validation and test sets, as described in Appendix C.1 of the corresponding paper.
- The behavior of `algorithm.eval()` is now consistent with `algorithm.model.eval()` in that both preserve the `grad_fn` attribute (thanks Divya Shanmugam). See #45.
- The dataset name for OGB-MolPCBA has been changed from `ogbg-molpcba` to `ogb-molpcba` for consistency.
- We have updated the OGB-MolPCBA data loader to be compatible with v1.7 of the `pytorch_geometric` dependency (thanks arnaudvl). See #52.
v1.1.0
The v1.1.0 release contains a new Py150 benchmark dataset for code completion, as well as updates to several existing datasets and default models to make them significantly faster and easier to use.
Some of these changes are breaking changes that will impact users who are currently running experiments with WILDS. We sincerely apologize for the inconvenience. We ask all users to update their package to v1.1.0, which will automatically update your datasets. In addition, please update your default models, for example by using the latest example scripts in this repo. These changes were primarily made to accelerate model training, which was a bottleneck for many users; at this time, we do not expect to have to make further changes to the existing datasets or default models.
New datasets
New benchmark dataset: Py150
- The Py150-WILDS dataset is a code completion dataset, where the distribution shift is over code from different Github repositories.
- We focus on accuracy on the subpopulation of class and method tokens, as prior work has shown that those are the most frequent queries in real-world code completion settings.
- It is a variant of the Py150 dataset from Raychev et al., 2016.
- See our paper for more details.
Additional dataset: SQF
- The SQF dataset is based on the stop-question-and-frisk dataset released by the New York Police Department. We adapt the version processed by Goel et al., 2016. The task is to predict criminal possession of a weapon.
- We use this dataset to study distribution shifts in an algorithmic fairness context. Specifically, we consider subpopulation shifts across locations and race groups. However, while there are large performance gaps, we did not find that they were caused by the distribution shift. We therefore did not include this dataset as part of the official benchmark.
Major updates to existing datasets
Note that datasets are versioned separately from the main WILDS version. We have two major updates (i.e., breaking, non-backwards-compatible changes) to datasets.
Amazon v1.0 -> v2.0
- To speed up model training, we have subsampled the number of reviewers in this dataset to 25% of its original size, while keeping the same number of reviews per reviewer.
iWildCam v1.0 -> v2.0
- Previously, the ID split was done uniformly at random, meaning that images from the same sequence (i.e., taken within a few seconds of each other by the same camera) could be found across all of the training / validation (ID) / test (ID) sets.
- In v2.0, we have redone the ID split so that all images taken on the same day by the same camera are in only one of the training, validation (ID), or test (ID) sets. In other words, these sets still comprise images from the same cameras, but taken on different days.
- In line with the new iWildCam 2021 challenge on Kaggle, we have also removed the following images:
- images that include humans or were taken indoors.
- images with non-animal categories such as `start` and `unidentifiable`.
- images in categories such as `unknown`, `unknown raptor`, and `unknown rat`.
- We added back location 537, which was previously removed because we mistakenly believed its images were corrupted.
- We have re-split the data into training, validation (ID), test (ID), validation (OOD), and test (OOD) sets. This is a different random split from v1.0.
- Since we remove any classes that do not end up in the train split, removing those images and redoing the split gave us a different set of species. There are now 182 classes instead of 186. Specifically, the following classes have been removed: `['unknown', 'macaca fascicularis', 'proechimys sp', 'unidentifiable', 'turtur calcospilos', 'streptopilia senegalensis', 'equus africanus', 'macaca nemestrina', 'start', 'paleosuchus sp', 'unknown raptor', 'unknown rat', 'misfire', 'mustela lutreolina', 'canis latrans', 'myoprocta pratti', 'xerus rutilus', 'end', 'psophia crepitans', 'ictonyx striatus']`. The following classes have been added: `['praomys tullbergi', 'polyplectron chalcurum', 'ardeotis kori', 'phaetornis sp', 'mus minutoides', 'raphicerus campestris', 'tigrisoma mexicanum', 'leptailurus serval', 'malacomys longipes', 'oenomys hypoxanthus', 'turdus olivaceus', 'macaca sp', 'leiothrix argentauris', 'lophura sp', 'mazama temama', 'hippopotamus amphibius']`. For convenience, we have also added a `categories.csv` that maps from label IDs to species names.
- To speed up downloading and model training (by reducing the I/O bottleneck), we have also resized all images to have a height of 448px while keeping the original aspect ratio. All images are wide, so they now have a minimum dimension of 448px. Note that as JPEG compression is lossy, this procedure gives different images from resizing the full-sized image in the code after loading it.
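For reference, the aspect-ratio-preserving resize described above can be reproduced with PIL roughly as follows; this is an illustrative sketch, not the exact preprocessing script we used:

```python
from PIL import Image

def resize_to_height(path_in, path_out, target_height=448):
    """Resize an image to a fixed height while keeping its aspect ratio."""
    img = Image.open(path_in)
    width, height = img.size
    new_width = round(width * target_height / height)
    img = img.resize((new_width, target_height), Image.BILINEAR)
    # Saving as JPEG is lossy, so the saved file differs slightly from
    # resizing the original full-sized image in memory at load time.
    img.save(path_out, quality=95)
```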
Minor updates to existing datasets
We made two backwards-compatible changes to existing datasets. We encourage all users to update these datasets; these updates should leave results unchanged (modulo training randomness). In future versions of the WILDS package, we will deprecate the older versions of these datasets.
FMoW v1.0 -> v1.1
- Previously, the images were stored as chunks in .npy files and read in using NumPy memmapping.
- Now, we have converted them (losslessly) into individual PNG images. This should help with disk I/O and memory usage, and make them more convenient to visualize and use in other pipelines.
PovertyMap v1.0 -> v1.1
- Previously, the images were stored in a single .npy file and read in using NumPy memmapping.
- Now, we have converted them (losslessly) into individual compressed .npz files. This should help with disk I/O and memory usage.
- We have correspondingly updated the default number of workers for the data loader from 1 to 4.
Default model updates
We have updated the default models for several datasets. Please take note of these changes if you are currently running experiments with these datasets.
Amazon and CivilComments
- To speed up model training, we have switched from BERT-base-uncased to DistilBERT-base-uncased. This obtains roughly similar accuracy but at twice the speed.
- For CivilComments, we have also increased the number of replicates from 3 to 5, to reduce variability in the reported performance.
Camelyon17
- Previously, we were upsizing each image to 224x224 before passing it into the model.
- We now leave the images at their original resolution of 96x96, which significantly speeds up model training.
iWildCam
- Previously, we were resizing each image to 224x224 before passing it into the model. However, this limited model accuracy, as the animals in the images can sometimes be quite small.
- We now resize each image to 448x448 before passing it into the model, which improves accuracy and macro F1 across the board.
FMoW
- For consistency with the other datasets, we have changed the early stopping validation criterion (`val_metric`) from `acc_avg` to `acc_worst_region`.
PovertyMap
- For consistency with the other datasets, we have changed the early stopping validation criterion (`val_metric`) from `r_all` to `r_wg`.
Other changes
- We have uploaded an executable version of our paper to CodaLab. This contains the exact commands, code, and data used for each experiment reported in our paper. The trained model weights for every experiment can also be found there.
- To ease downloading, we have added `wilds/download_datasets.py`, which allows users to download all (or a subset of) datasets at once. Please see the README for instructions.
- We have added a convenience function for getting the appropriate constructor for each dataset in `wilds/get_dataset.py`. This function allows you to specify a `version` argument. If this is not specified, it defaults to the latest available version for that dataset. If that version is not downloaded and the `download` argument is also set, then it will automatically download that version. (See the sketch after this list.)
- The example script `examples/run_expt.py` now also takes in a `version` argument.
- We have added download sizes and expected training times to the README.
- We have updated the default inputs for the `WILDSDatasets.eval` methods for various datasets. For example, `eval` for most classification datasets now takes in predicted labels by default, whereas predictions were previously passed in as logits. The default inputs vary across datasets, and we document them in the docstring of each `eval` method.
- We made a few updates to the code in `examples/` to interface better with language modeling tasks (for Py150). None of these changes affect the results or the interface with algorithms.
- We updated the code in `examples/` to save model predictions in an appropriate format for submission to the leaderboard.
- Finally, we have also updated our paper to streamline the writing and include these new numbers and datasets.
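For example, version-pinned loading looks roughly like the following; the commented `eval` call shows the general calling convention, with the exact default inputs varying by dataset as noted above:

```python
from wilds import get_dataset

# Request a specific dataset version; if it is not present locally and
# download=True, it will be downloaded automatically.
dataset = get_dataset(dataset="iwildcam", version="2.0", download=True)

# After running a model over a split to collect predictions:
# results, results_str = dataset.eval(all_y_pred, all_y_true, all_metadata)
```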