This repository contains the source code for the experiments conducted in the AISTATS 2024 paper *From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance*.
First, run `load_corrupt_and_test_datasets.ipynb` to download and corrupt the datasets and to set up the expected structure of the data directory.
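If you prefer the command line, the notebook can also be executed headlessly. A minimal sketch, assuming Jupyter and `nbconvert` are installed:

```bash
# Run the setup notebook non-interactively, writing outputs back into the notebook
jupyter nbconvert --to notebook --execute --inplace load_corrupt_and_test_datasets.ipynb
```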
`run_experiment.py` implements a simple CLI script (`run-experiment`) that makes it easy to run experiments.
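The available arguments and subcommands can be listed with the usual help flag (assuming the CLI exposes one, as most argument parsers do):

```bash
run-experiment --help
```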
Conformal Data Cleaning:
```bash
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    experiment \
    --confidence_level "0.999"
```
ML Baseline:
```bash
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    baseline \
    --method "AutoGluon" \
    --method_hyperparameter "0.999"
```
PyOD Baseline (not included in the paper):
```bash
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    baseline \
    --method "PyodECOD" \
    --method_hyperparameter "0.3"
```
For Garf, please use `main.py`:
```bash
python main.py \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments"
```
We ran our experiments on Kubernetes using Helm. Please check out the Helm charts and change the `image` and `imagePullSecrets` settings in the `values.yaml` files to match your setup.
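A minimal sketch of the relevant `values.yaml` entries (registry and secret names are placeholders; the exact keys may differ in the charts):

```yaml
# Illustrative values only; adapt to your registry and cluster
image: registry.example.com/conformal-data-cleaning:latest
imagePullSecrets:
  - name: my-registry-secret
```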
Some read-write-many volumes are necessary to store the experiment results; please check out the `infrastructure/k8s` directory (and don't forget to set up the data directory as described above).
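Assuming the manifests in that directory define the required volumes, they can be applied with `kubectl`:

```bash
# Create the read-write-many volumes (adjust namespace/storage class to your cluster)
kubectl apply -f infrastructure/k8s/
```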
`make docker` builds and pushes the necessary Docker images, and `make helm-install` uses `deploy_experiments.py` to start our experimental setup.
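In short, the deployment flow is:

```bash
make docker        # build and push the Docker images
make helm-install  # run deploy_experiments.py to launch the experiments
```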
`notebooks/evaluation` contains the notebooks we use for evaluating the results; `5_plotting.ipynb` outputs the plots shown in the paper.
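For example, to regenerate the plots after the experiments have finished (assuming the notebook lives in `notebooks/evaluation`):

```bash
jupyter nbconvert --to notebook --execute --inplace notebooks/evaluation/5_plotting.ipynb
```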