We should add an interface for users to run a specific model on a specific dataset locally. This will help drive adoption of TabRepo for method papers that are introducing a new model and want to compare against other baselines, similar to how TabZilla is currently being used. The hope is that this feature will do a great deal to resolve the reproducibility / baseline consistency crisis for tabular method papers.
A major benefit of having this logic is that we can incorporate any strong and trusted result of a method into TabRepo's main EvaluationRepository. If someone runs a stronger configuration of a known method, we can either add it alongside the weaker results of a known method or replace the weaker results with the stronger results, depending on what makes more sense. This way we can work to ensure each method in TabRepo is represented by its strongest configuration/search space/preprocessing/etc., greatly reducing the chance methods are misrepresented in terms of their peak capabilities.
Proposal
The fit logic should feature two modes: Basic mode and Simulator mode.
Basic mode doesn't require the user to generate out-of-fold predictions. Therefore the model will not be compatible with TabRepo simulation, but it can still be compared to TabRepo results via the test scores. It is important to have a basic mode so that users can avoid doing k-fold bagging if they don't want to. Basic mode should be very similar to what is done in AutoMLBenchmark.
Simulator mode will require the user to additionally produce out-of-fold predictions & probabilities for every row of the training data. We can provide templates to make this easy to do, such as relying on AutoGluon's k-fold bagging implementation or generic sklearn k-fold split. Simulator mode results will be fully compatible with TabRepo, and will allow for simulating ensembles of the user's method with prior TabRepo artifacts.
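As a rough illustration of what a Simulator-mode template could look like, here is a minimal sketch that produces out-of-fold predictions with a generic sklearn k-fold split. The function name, its signature, and the output format are assumptions for illustration, not an existing TabRepo API.

```python
# Minimal sketch of a Simulator-mode template using a generic sklearn k-fold split.
# `fit_with_oof`, its signature, and the output format are illustrative assumptions.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold


def fit_with_oof(model, X, y, X_test, n_splits=8, seed=0):
    """Fit one model per fold; return OOF probabilities for every training row
    plus fold-averaged test probabilities. Assumes numpy arrays as input."""
    n_classes = len(np.unique(y))
    oof_proba = np.zeros((len(X), n_classes))
    test_proba = np.zeros((len(X_test), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        fold_model = clone(model)
        fold_model.fit(X[train_idx], y[train_idx])
        oof_proba[val_idx] = fold_model.predict_proba(X[val_idx])
        test_proba += fold_model.predict_proba(X_test) / n_splits
    return oof_proba, test_proba
```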
Requirements:
Model Code
Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the exec.py files for frameworks. They should ensure that their model is lazy imported to avoid increasing the dependency requirements in TabRepo.
As an alternative to supplying their own model code from scratch, they can instead supply an AutoGluon-compatible custom model implementation that runs via AutoGluon, similar to how we ran the original TabRepo baseline methods (a rough sketch of this route follows this list).
We should ensure onboarding to this logic is as simple as possible, with helpful unit tests to check for compatibility similar to sklearn.utils.estimator_checks.check_estimator. We should also check how TabZilla does this and if we want to re-use any design patterns.
We should provide a TabRepo extension library template for method contributions so that the user can essentially do pip install TabRepo followed by pip install MyTabRepoExtension and use their model extension directly in TabRepo. This will help minimize TabRepo's maintenance burden by avoiding all method contributions being part of TabRepo's source code. We can move proven high-performing / important methods into main TabRepo when we deem it worthwhile. The code required for the extension library would be the model source code that would be run on a given task (essentially the AutoMLBenchmark exec.py and setup.sh files).
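To illustrate the AutoGluon-compatible custom model route mentioned above, here is a rough sketch loosely following AutoGluon's custom-model tutorial. The class name is illustrative, it assumes purely numeric features for brevity, and exact hook names may differ across AutoGluon versions.

```python
import numpy as np
from autogluon.core.models import AbstractModel


class MyCustomModel(AbstractModel):
    """Toy AutoGluon-compatible model; assumes purely numeric input features for brevity."""

    def _preprocess(self, X, **kwargs):
        # Apply AutoGluon's default preprocessing, then convert to a dense float matrix.
        X = super()._preprocess(X, **kwargs)
        return X.fillna(0).to_numpy(dtype=np.float32)

    def _fit(self, X, y, **kwargs):
        # Lazy import keeps the dependency out of TabRepo's requirements, as noted above.
        from sklearn.ensemble import RandomForestClassifier

        X = self.preprocess(X)
        params = self._get_model_params()  # hyperparameters supplied by the user
        self.model = RandomForestClassifier(**params)
        self.model.fit(X, y)
```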
Inputs
OpenML task + fold (ex: Airlines fold 2; see the sketch after this list)
(Stretch) Add support for custom datasets (not OpenML) -> Refer to AutoMLBenchmark implementation
train data
test data (maybe w/o labels?)
OpenML feature types
User specified arguments (model hyperparameters, etc., same as AutoMLBenchmark)
Benchmark specified arguments (constraints such as time limit, infer limit, etc.)
Positive Class in Binary Classification
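For the OpenML task + fold input, here is a sketch of how such an input could be resolved into train/test data with the openml package. The task id is illustrative, and the exact loading helpers and return types may differ across openml versions.

```python
# Sketch of resolving an OpenML task + fold into train/test splits via the `openml`
# package. The task id below is illustrative; API details vary by openml version.
import openml

task = openml.tasks.get_task(189354)  # hypothetical task id standing in for Airlines
dataset = task.get_dataset()
X, y, _, _ = dataset.get_data(target=task.target_name)
train_idx, test_idx = task.get_train_test_split_indices(fold=2)
X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
```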
Run Artifacts
The resulting artifact should be either an instance of EvaluationRepository or something very similar to it (a hypothetical schema sketch follows the lists below).
General
test eval_metric scores
test predictions & prediction probabilities
test inference time
test inference time by batch size
val predictions & prediction probabilities (if val exists)
train time
artifact size on disk
log dump of stdout/stderr
custom artifact support
Potentially collaborate with OpenML to add upload support for these run artifacts to OpenML.
Special exception handling artifacts, in case an error occurs, to help with debugging failures.
Simulator Mode
OOF prediction probabilities & predictions
OOF eval_metric scores
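Purely as an illustration of the artifact contents listed above, here is a hypothetical schema sketch. Every field name is an assumption, not an existing TabRepo schema.

```python
# Hypothetical run-artifact container for a single (task, fold, config) run;
# all field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class RunArtifact:
    # General
    test_score: float                          # test eval_metric score
    y_pred_test: np.ndarray                    # test predictions
    y_pred_proba_test: np.ndarray              # test prediction probabilities
    predict_time_test: float                   # test inference time (seconds)
    fit_time: float                            # train time (seconds)
    disk_usage: int                            # artifact size on disk (bytes)
    logs: str = ""                             # dump of stdout/stderr
    extra: dict = field(default_factory=dict)  # custom artifact support
    # Simulator mode only
    y_pred_proba_oof: Optional[np.ndarray] = None  # OOF prediction probabilities
    oof_score: Optional[float] = None              # OOF eval_metric score
```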
Result Aggregation
We should add logic to cache results so that we don't re-run successful jobs (a minimal sketch follows this list).
Logic to automatically retry failed jobs.
We should add logic to aggregate results across tasks / methods.
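Here is a minimal sketch of the caching and retry behavior described above; the cache layout, file format, and run_job entry point are illustrative assumptions, not existing TabRepo logic.

```python
# Minimal caching/retry sketch; the cache layout, file names, and `run_job` are
# all illustrative assumptions.
import json
from pathlib import Path


def run_cached(task_name, fold, config_name, run_job, cache_dir="results", max_retries=2):
    """Return a cached result if the job already succeeded, otherwise run it (with retries)."""
    path = Path(cache_dir) / task_name / str(fold) / f"{config_name}.json"
    if path.exists():
        return json.loads(path.read_text())
    last_err = None
    for _ in range(max_retries + 1):
        try:
            result = run_job(task_name, fold, config_name)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(result))
            return result
        except Exception as err:  # keep the failure around for the debugging artifacts
            last_err = err
    raise last_err
```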
Parallelization / Distribution (Stretch)
To simplify running many jobs at once, we can try to leverage Ray for single machines or distributed clusters (a rough sketch follows this section).
Alternatively, we could do a similar approach to AutoMLBenchmark AWS mode / docker mode.
We could potentially add a compatibility layer to AutoMLBenchmark, where we convert our objects into AutoMLBenchmark objects so that the logic runs via AutoMLBenchmark.
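A rough sketch of the Ray option follows. The run_job body is a placeholder for the hypothetical per-(task, fold, config) entry point; none of this is existing TabRepo logic.

```python
# Rough sketch of fanning jobs out with Ray on a single machine or cluster.
import ray

ray.init()  # or ray.init(address="auto") to attach to an existing cluster


@ray.remote(num_cpus=8)
def run_job(task_name: str, fold: int, config_name: str) -> dict:
    # Placeholder body; in practice this would call the hypothetical
    # per-(task, fold, config) fit-and-evaluate entry point.
    return {"task": task_name, "fold": fold, "config": config_name, "test_score": 0.0}


# Fan out all (task, fold, config) combinations; Ray schedules them across workers.
futures = [
    run_job.remote(task, fold, config)
    for task in ["Airlines", "adult"]
    for fold in range(3)
    for config in ["MyModel_c1"]
]
results = ray.get(futures)
```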
Ensuring reproducibility (Stretch)
We could include pip freeze output as part of the run artifacts, along with various other information such as num_cpus, num_gpus, OS, date, Python version, memory size, disk size, etc. This would help improve reproducibility (see the sketch after this section).
We could dockerize the environment similar to what is available in AutoMLBenchmark's docker mode. The downside of this is that it becomes quite complicated, is time consuming, and most users wouldn't know how to do this properly without a lot of engineering effort on our part to make it seamless.
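A small sketch of what capturing environment metadata for the run artifact could look like; the returned field names are illustrative assumptions.

```python
# Sketch of collecting environment metadata for the run artifact;
# the returned field names are illustrative.
import datetime
import os
import platform
import subprocess
import sys


def collect_environment_info() -> dict:
    pip_freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
    ).stdout
    return {
        "pip_freeze": pip_freeze,
        "python_version": platform.python_version(),
        "os": platform.platform(),
        "num_cpus": os.cpu_count(),
        "date": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```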
Evaluation
Users should be able to take their output artifact and pass it into a function/method in an EvaluationRepository object to generate a bunch of tables/plots/statistics on how their method performs vs various baselines/simulated results/etc. For example, repo.compare(my_model_benchmark_results_object).
Alternatively, leverage the EvaluationRepository join logic (Add EvaluationRepository Join Logic #65) to merge the user's results with the target comparison repository to run the evaluations.
Open Questions
Should this logic live in TabRepo or a new GitHub repository? The answer likely depends on how many dependencies would need to be added to support this and the demand users would have to run it standalone without using the rest of TabRepo's functionality.
A major benefit of having this logic is that we can incorporate any strong and trusted result
True, another big use case (at least for me) is to be able to quickly see how a method performs on a wide range of datasets even if the predictions are not included.
Basic mode/Simulator mode
I agree it makes sense to have the option to have only metrics for ease of use.
The names may be a bit disconnected from what the modes are; why not just call the first mode "metric-only" and make it clear that ensemble simulations are only supported with model predictions?
Users will need to define their model running code similar to how it is done in AutoMLBenchmark in the exec.py files for frameworks
This could be quite complicated for users. In TabZilla and in FTTransformer, they provide an example of how to run a simple scikit-learn-like class; would it be possible to support something like that? I think it would make things much easier for users.
For instance, something like this (just to give the high-level idea):
repo = ...
X_train, y_train, X_test = repo.get_Xy(dataset="Airlines", fold=0)
y_pred = CatBoost().fit(X_train, y_train).predict(X_test)
# output metrics that are comparable with repo.metrics(datasets=["Airlines"], configs=["CatBoost_r22_BAG_L1"], fold=0)
print(repo.evaluate(y_pred))
Related: #55