Code and data for the manuscript: Detecting noisy labels with repeated cross-validations
This repository contains the code and data for the paper "Detecting noisy labels with repeated cross-validations", which has been accepted for publication at MICCAI 2024.
In this work we propose a novel algorithm for identifying cases with noisy labels within a given dataset. We found that noisy cases consistently lead to worse validation-fold performance in n-fold cross-validation. By pinpointing the examples that most frequently contribute to inferior cross-validation results, our methods, ReCoV and fastReCoV, effectively identify the noisy samples within the dataset.
- State-of-the-art performance: fastReCoV outperforms existing methods for noisy label detection on popular computer vision and medical imaging datasets.
- Plug-and-play for most supervised learning tasks and network structures, with potential applications beyond computer vision.
- Does not require prior knowledge of the percentage of noisy examples.
- Efficient when embeddings (extracted with pre-trained models) are used as inputs.
The methodology consists of two algorithms:
- ReCoV, the original algorithm that is grounded in mathematical foundations. Recommended for tabular datasets and embeddings.
- fastReCoV, a computationally efficient variant that trades slightly reduced performance for a significant speed-up. Recommended for deep learning tasks with large datasets. An overview of the methodology and its results is shown below.
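For intuition, here is a minimal, self-contained sketch of the core idea on a toy tabular dataset using scikit-learn: run many repeated cross-validations and count how often each sample lands in the worst-performing validation fold. This is an illustration only, not the repository's implementation; the classifier, fold count, and number of repeats are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Toy binary classification data with a few artificially flipped labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(y), size=25, replace=False)
y[noisy_idx] = 1 - y[noisy_idx]

n_repeats, n_folds = 200, 10
worst_fold_counts = np.zeros(len(y))  # how often each sample lands in the worst fold

for repeat in range(n_repeats):
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=repeat)
    fold_scores, fold_members = [], []
    for train_idx, val_idx in kf.split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        fold_scores.append(accuracy_score(y[val_idx], clf.predict(X[val_idx])))
        fold_members.append(val_idx)
    # Samples in the worst-performing validation fold each receive one "vote".
    worst_fold_counts[fold_members[int(np.argmin(fold_scores))]] += 1

# Samples that accumulate the most votes across repeats are the suspected noisy ones.
suspected = np.argsort(worst_fold_counts)[::-1][:25]
print("suspected noisy samples:", np.sort(suspected))
```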
The code requires the following packages:

- opencv
- pytorch-gpu
- wandb
- openslide
- scikit-learn
- scipy
- scikit-image
- warmup-scheduler
- nystrom-attention
The project is applied to four datasets:

- Mushroom (`./mushroom`) - https://www.kaggle.com/datasets/uciml/mushroom-classification
- HECKTOR (`./HECKTOR`) - https://hecktor.grand-challenge.org/
- CIFAR-10N (`./cifar10n`) - http://noisylabels.com/
- PANDA (`./PANDA`) - https://www.kaggle.com/competitions/prostate-cancer-grade-assessment
For the Mushroom dataset, the data can be downloaded from the link above. Both algorithms can be run directly with `python mushroom_recov.py` or `python mushroom_fastrecov.py`.
For CIFAR-10N, features must first be extracted from the individual images with `python cifar_featextract.py`. After this step, fastReCoV can be run with `python cifar_fastrecov.py`.
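As a rough illustration of this feature-extraction step, the sketch below computes image embeddings with a pre-trained ResNet-18 from a recent torchvision release. The model choice, transforms, and output file names are assumptions for illustration and do not necessarily match what `cifar_featextract.py` does.

```python
import numpy as np
import torch
import torchvision
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained ResNet-18 with the classification head removed (illustrative choice).
weights = torchvision.models.ResNet18_Weights.DEFAULT
model = torchvision.models.resnet18(weights=weights)
model.fc = torch.nn.Identity()
model = model.to(device).eval()

# Use the preprocessing transforms that match the pre-trained weights.
transform = weights.transforms()
dataset = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                       transform=transform)
loader = DataLoader(dataset, batch_size=256, num_workers=4)

features, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        features.append(model(images.to(device)).cpu().numpy())
        labels.append(targets.numpy())

np.save("cifar_features.npy", np.concatenate(features))
np.save("cifar_labels.npy", np.concatenate(labels))
```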
For HECKTOR, the radiomics features are first extracted with `python HECKTOR_extraction.py`. After this step, both ReCoV and fastReCoV can be run with `python HECKTOR_recov.py` or `python HECKTOR_fastrecov.py`. The identified noisy labels can be evaluated with `python hecktor_evaluate.py`.
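For reference, radiomics feature extraction typically looks like the minimal PyRadiomics sketch below. The package choice, file paths, and settings are assumptions for illustration and are not necessarily what `HECKTOR_extraction.py` uses.

```python
import pandas as pd
from radiomics import featureextractor  # PyRadiomics

# Default extractor; a parameter file could be passed instead to control settings.
extractor = featureextractor.RadiomicsFeatureExtractor()

# Hypothetical (image, mask) pairs for a handful of cases.
cases = {
    "case_001": ("images/case_001_ct.nii.gz", "masks/case_001_gtv.nii.gz"),
    "case_002": ("images/case_002_ct.nii.gz", "masks/case_002_gtv.nii.gz"),
}

rows = {}
for case_id, (image_path, mask_path) in cases.items():
    result = extractor.execute(image_path, mask_path)
    # Keep only the radiomics features, dropping the diagnostic metadata entries.
    rows[case_id] = {k: v for k, v in result.items() if not k.startswith("diagnostics")}

pd.DataFrame.from_dict(rows, orient="index").to_csv("hecktor_radiomics_features.csv")
```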
For PANDA, features are first extracted with `python featureextraction.py`. After this step, fastReCoV can be run with `python panda_fastrecov.py`. Since the test set is hosted on Kaggle, users can save the fastReCoV noise-cleaned model with `python test_fastrecov.py` and evaluate it on Kaggle using the `panda-recov-submission.ipynb` notebook.
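Conceptually, the noise-cleaned model is obtained by discarding the samples flagged by fastReCoV before the final training run, as in the sketch below. The file names, the noise-score format, the classifier, and the 10% cut-off are all illustrative assumptions and do not reflect the repository's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: precomputed features, labels, and per-sample noise scores
# (e.g. how often each sample was flagged by fastReCoV).
X = np.load("panda_features.npy")
y = np.load("panda_labels.npy")
noise_scores = np.load("panda_noise_scores.npy")

# Drop the 10% of samples with the highest noise scores (arbitrary cut-off).
keep = noise_scores <= np.quantile(noise_scores, 0.90)
clean_model = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
print(f"trained on {keep.sum()} of {len(y)} samples after noise cleaning")
```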
To apply ReCoV to your own dataset, follow the steps below:
- Prepare your dataset by converting it into a tabular form (e.g. -omics data; see the sketch below) or by extracting features from your cases with a pre-trained model (as in the CIFAR-10N example above).
- Run the fastReCoV algorithm by following the pseudocode or the example scripts.
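As a small example of the tabular preparation step, the sketch below one-hot encodes a categorical CSV into a feature matrix and label vector that the repeated cross-validation procedure shown earlier can consume. The file name and the `class` target column are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical tabular dataset with a target column named "class".
df = pd.read_csv("mushrooms.csv")

y = LabelEncoder().fit_transform(df["class"])
X = OneHotEncoder().fit_transform(df.drop(columns=["class"])).toarray()

print(X.shape, y.shape)  # X and y can now be fed to the repeated cross-validation sketch above
```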
You can reach the authors by raising an issue in this repository or by emailing them at [email protected] / [email protected] / [email protected].
```bibtex
@article{chen2023cross,
  title={Cross-Validation Is All You Need: A Statistical Approach To Label Noise Estimation},
  author={Chen, Jianan and Martel, Anne},
  journal={arXiv preprint arXiv:2306.13990},
  year={2023}
}
```
To be updated with new citations after publication.