We should add an interface for users to run a specific model on a specific dataset locally. This will help drive adoption of TabRepo for method papers that are introducing a new model and want to compare against other baselines, similar to how TabZilla is currently being used. The hope is that this feature will do a great deal to resolve the reproducibility / baseline consistency crisis for tabular method papers.
A major benefit of having this logic is that we can incorporate any strong and trusted result of a method into TabRepo's main EvaluationRepository. If someone runs a stronger configuration of a known method, we can either add it alongside the weaker results of a known method or replace the weaker results with the stronger results, depending on what makes more sense. This way we can work to ensure each method in TabRepo is represented by its strongest configuration/search space/preprocessing/etc., greatly reducing the chance methods are misrepresented in terms of their peak capabilities.
Proposal
The fit logic should feature two modes: Basic mode and Simulator mode.
Basic mode doesn't require the user to generate out-of-fold predictions. Therefore the model will not be compatible with TabRepo simulation, but it can still be compared to TabRepo results via the test scores. It is important to have a basic mode so that users can avoid doing k-fold bagging if they don't want to. Basic mode should be very similar to what is done in AutoMLBenchmark.
Simulator mode will require the user to additionally produce out-of-fold predictions & probabilities for every row of the training data. We can provide templates to make this easy to do, such as relying on AutoGluon's k-fold bagging implementation or generic sklearn k-fold split. Simulator mode results will be fully compatible with TabRepo, and will allow for simulating ensembles of the user's method with prior TabRepo artifacts.
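As a rough illustration of what a Simulator-mode template could look like, here is a minimal sketch that produces out-of-fold predictions with a generic sklearn k-fold split. The function name, its signature, and the output format are assumptions for illustration, not an existing TabRepo API.

```python
# Minimal sketch of a Simulator-mode template using a generic sklearn k-fold split.
# `fit_with_oof`, its signature, and the output format are illustrative assumptions.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold


def fit_with_oof(model, X, y, X_test, n_splits=8, seed=0):
    """Fit one model per fold; return OOF probabilities for every training row
    plus fold-averaged test probabilities. Assumes numpy arrays as input."""
    n_classes = len(np.unique(y))
    oof_proba = np.zeros((len(X), n_classes))
    test_proba = np.zeros((len(X_test), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        fold_model = clone(model)
        fold_model.fit(X[train_idx], y[train_idx])
        oof_proba[val_idx] = fold_model.predict_proba(X[val_idx])
        test_proba += fold_model.predict_proba(X_test) / n_splits
    return oof_proba, test_proba
```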
Requirements:
Model Code
Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the exec.py files for frameworks. They should ensure that their model is lazy imported to avoid increasing the dependency requirements in TabRepo.
As an alternative to supplying their own model code from scratch, they can instead supply an AutoGluon-compatible custom model implementation that runs via AutoGluon, similar to how we ran the original TabRepo baseline methods (a rough sketch of this route follows this list).
We should ensure onboarding to this logic is as simple as possible, with helpful unit tests to check for compatibility similar to sklearn.utils.estimator_checks.check_estimator. We should also check how TabZilla does this and if we want to re-use any design patterns.
We should provide a TabRepo extension library template for method contributions so that the user can essentially do pip install TabRepo followed by pip install MyTabRepoExtension and use their model extension directly in TabRepo. This will help minimize TabRepo's maintenance burden by avoiding all method contributions being part of TabRepo's source code. We can move proven high-performing / important methods into main TabRepo when we deem it worthwhile. The code required for the extension library would be the model source code that would be run on a given task (essentially the AutoMLBenchmark exec.py and setup.sh files).
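To illustrate the AutoGluon-compatible custom model route mentioned above, here is a rough sketch loosely following AutoGluon's custom-model tutorial. The class name is illustrative, it assumes purely numeric features for brevity, and exact hook names may differ across AutoGluon versions.

```python
import numpy as np
from autogluon.core.models import AbstractModel


class MyCustomModel(AbstractModel):
    """Toy AutoGluon-compatible model; assumes purely numeric input features for brevity."""

    def _preprocess(self, X, **kwargs):
        # Apply AutoGluon's default preprocessing, then convert to a dense float matrix.
        X = super()._preprocess(X, **kwargs)
        return X.fillna(0).to_numpy(dtype=np.float32)

    def _fit(self, X, y, **kwargs):
        # Lazy import keeps the dependency out of TabRepo's requirements, as noted above.
        from sklearn.ensemble import RandomForestClassifier

        X = self.preprocess(X)
        params = self._get_model_params()  # hyperparameters supplied by the user
        self.model = RandomForestClassifier(**params)
        self.model.fit(X, y)
```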
Inputs
OpenML task + fold (ex: Airlines fold 2; see the sketch after this list)
(Stretch) Add support for custom datasets (not OpenML) -> Refer to AutoMLBenchmark implementation
train data
test data (maybe w/o labels?)
OpenML feature types
User specified arguments (model hyperparameters, etc., same as AutoMLBenchmark)
Benchmark specified arguments (constraints such as time limit, infer limit, etc.)
Positive Class in Binary Classification
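For the OpenML task + fold input, here is a sketch of how such an input could be resolved into train/test data with the openml package. The task id is illustrative, and the exact loading helpers and return types may differ across openml versions.

```python
# Sketch of resolving an OpenML task + fold into train/test splits via the `openml`
# package. The task id below is illustrative; API details vary by openml version.
import openml

task = openml.tasks.get_task(189354)  # hypothetical task id standing in for Airlines
dataset = task.get_dataset()
X, y, _, _ = dataset.get_data(target=task.target_name)
train_idx, test_idx = task.get_train_test_split_indices(fold=2)
X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
```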
Run Artifacts
The resulting artifact should be either an instance of EvaluationRepository or something very similar to it (a hypothetical schema sketch follows the lists below).
General
test eval_metric scores
test predictions & prediction probabilities
test inference time
test inference time by batch size
val predictions & prediction probabilities (if val exists)
train time
artifact size on disk
log dump of stdout/stderr
custom artifact support
Potentially collaborate with OpenML to add upload support for these run artifacts to OpenML.
Special exception handling artifacts, in case an error occurs, to help with debugging failures.
Simulator Mode
OOF prediction probabilities & predictions
OOF eval_metric scores
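Purely as an illustration of the artifact contents listed above, here is a hypothetical schema sketch. Every field name is an assumption, not an existing TabRepo schema.

```python
# Hypothetical run-artifact container for a single (task, fold, config) run;
# all field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class RunArtifact:
    # General
    test_score: float                          # test eval_metric score
    y_pred_test: np.ndarray                    # test predictions
    y_pred_proba_test: np.ndarray              # test prediction probabilities
    predict_time_test: float                   # test inference time (seconds)
    fit_time: float                            # train time (seconds)
    disk_usage: int                            # artifact size on disk (bytes)
    logs: str = ""                             # dump of stdout/stderr
    extra: dict = field(default_factory=dict)  # custom artifact support
    # Simulator mode only
    y_pred_proba_oof: Optional[np.ndarray] = None  # OOF prediction probabilities
    oof_score: Optional[float] = None              # OOF eval_metric score
```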
Result Aggregation
We should add logic to cache results so that we don't re-run successful jobs (a minimal sketch follows this list).
Logic to automatically retry failed jobs.
We should add logic to aggregate results across tasks / methods.
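Here is a minimal sketch of the caching and retry behavior described above; the cache layout, file format, and run_job entry point are illustrative assumptions, not existing TabRepo logic.

```python
# Minimal caching/retry sketch; the cache layout, file names, and `run_job` are
# all illustrative assumptions.
import json
from pathlib import Path


def run_cached(task_name, fold, config_name, run_job, cache_dir="results", max_retries=2):
    """Return a cached result if the job already succeeded, otherwise run it (with retries)."""
    path = Path(cache_dir) / task_name / str(fold) / f"{config_name}.json"
    if path.exists():
        return json.loads(path.read_text())
    last_err = None
    for _ in range(max_retries + 1):
        try:
            result = run_job(task_name, fold, config_name)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(result))
            return result
        except Exception as err:  # keep the failure around for the debugging artifacts
            last_err = err
    raise last_err
```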
Parallelization / Distribution (Stretch)
To simplify running many jobs at once, we can try to leverage Ray for single machines or distributed clusters (a rough sketch follows this section).
Alternatively, we could do a similar approach to AutoMLBenchmark AWS mode / docker mode.
We could potentially add a compatibility layer to AutoMLBenchmark, where we convert our objects into AutoMLBenchmark objects so that the logic runs via AutoMLBenchmark.
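A rough sketch of the Ray option follows. The run_job body is a placeholder for the hypothetical per-(task, fold, config) entry point; none of this is existing TabRepo logic.

```python
# Rough sketch of fanning jobs out with Ray on a single machine or cluster.
import ray

ray.init()  # or ray.init(address="auto") to attach to an existing cluster


@ray.remote(num_cpus=8)
def run_job(task_name: str, fold: int, config_name: str) -> dict:
    # Placeholder body; in practice this would call the hypothetical
    # per-(task, fold, config) fit-and-evaluate entry point.
    return {"task": task_name, "fold": fold, "config": config_name, "test_score": 0.0}


# Fan out all (task, fold, config) combinations; Ray schedules them across workers.
futures = [
    run_job.remote(task, fold, config)
    for task in ["Airlines", "adult"]
    for fold in range(3)
    for config in ["MyModel_c1"]
]
results = ray.get(futures)
```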
Ensuring reproducibility (Stretch)
We could include pip freeze output as part of the run artifacts, along with various other information such as num_cpus, num_gpus, OS, date, Python version, memory size, disk size, etc. This would help improve reproducibility (see the sketch after this section).
We could dockerize the environment similar to what is available in AutoMLBenchmark's docker mode. The downside of this is that it becomes quite complicated, is time consuming, and most users wouldn't know how to do this properly without a lot of engineering effort on our part to make it seamless.
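A small sketch of what capturing environment metadata for the run artifact could look like; the returned field names are illustrative assumptions.

```python
# Sketch of collecting environment metadata for the run artifact;
# the returned field names are illustrative.
import datetime
import os
import platform
import subprocess
import sys


def collect_environment_info() -> dict:
    pip_freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout
    return {
        "pip_freeze": pip_freeze,
        "python_version": platform.python_version(),
        "os": platform.platform(),
        "num_cpus": os.cpu_count(),
        "date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```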
Evaluation
Users should be able to take their output artifact and pass it into a function/method in an EvaluationRepository object to generate a bunch of tables/plots/statistics on how their method performs vs various baselines/simulated results/etc. For example, repo.compare(my_model_benchmark_results_object).
Alternatively, leverage the EvaluationRepository join logic (Add EvaluationRepository Join Logic #65) to merge the user's results with the target comparison repository to run the evaluations.
Open Questions
Should this logic live in TabRepo or a new GitHub repository? The answer likely depends on how many dependencies would need to be added to support this and the demand users would have to run it standalone without using the rest of TabRepo's functionality.
A major benefit of having this logic is that we can incorporate any strong and trusted result
True, another big use case (at least for me) is to be able to quickly see how a method performs on a wide range of datasets even if the predictions are not included.
Basic mode/Simulator mode
I agree it makes sense to have the option to have only metrics for ease of use.
The names may be a bit disconnected from what the modes are; why not just call the first mode "metric-only" and make it clear that ensemble simulations are only supported with model predictions?
Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the exec.py files for frameworks
This could be quite complicated for users. In TabZilla and in FTTransformer, they provide an example of how to run a simple scikit-learn-like class; would it be possible to support something like that? I think it would make things much easier for users.
For instance, something like this (just to give the high-level idea):
repo = ...
X_train, y_train, X_test = repo.get_Xy(dataset="Airlines", fold=0)
y_pred = CatBoost().fit(X_train, y_train).predict(X_test)
# output metrics that are comparable with repo.metrics(datasets=["Airlines"], configs=["CatBoost_r22_BAG_L1"], fold=0)
print(repo.evaluate(y_pred))
Related: #55