Add option to choose which fold to use as a final predictor #614
base: master
Hi @drskd! Thank you for the contribution. You are the first person to ask for this feature. If more users need it, I will merge it. |
This sounds nice, but isn't it the same as just making a shorter test split? Can I ask what you use it for? Do you use it only for prediction and not for training? |
@brainmosaik In my case I did a time series split with 4 folds, let's say one for each season over the last 12 months. |
That sounds good. So the workflow for this should be?: So it should be better on new upcoming days? Or should it be used like this? Because mljar is already saving models for each fold? So we can skip the re-training? Big thanks for this idea and implementation. Just trying to get my head around it. edit: |
Here, as is already the case, AutoML will train each chosen model and set of hyperparameters on the 4 splits, then look at the average of the chosen validation metric over the 4 validation sets to rank models on the leaderboard, and finally predict using the model at the top of the leaderboard. The chosen_fold parameter would impact only the prediction step. Each model in the leaderboard would still be trained on 4 different datasets, but the prediction would come only from the model trained on the last split.
Classic usage:
Custom usage:
I could indeed use only the last split if I'm interested in the most recent part of the data, but as the model search is really powerful, having only one split to validate on is risky in terms of overfitting. The additional regularization on hyperparameter selection that comes with more splits helps to limit that risk. |
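To make the difference between the two prediction modes concrete, here is a minimal sketch in plain Python. It does not use the mljar API: `make_fold_model` and `predict` are hypothetical helpers, the "models" are toy mean predictors, and the expanding training windows are an assumption made for illustration. Only the `chosen_fold` name and the `-1` convention come from this PR.

```python
from statistics import mean


def make_fold_model(train_values):
    """Toy 'model' (hypothetical): always predicts the mean of its training targets."""
    level = mean(train_values)
    return lambda x: level


def predict(fold_models, x, chosen_fold=None):
    """Average all fold models (classic) or use a single fold (custom).

    chosen_fold follows Python indexing, so -1 selects the last fold.
    """
    if chosen_fold is None:
        return mean(model(x) for model in fold_models)
    return fold_models[chosen_fold](x)


# Four expanding time-series training windows over a drifting target (assumed data).
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fold_models = [make_fold_model(series[:end]) for end in (2, 4, 6, 8)]

classic = predict(fold_models, x=None)                 # average over the 4 fold models
custom = predict(fold_models, x=None, chosen_fold=-1)  # only the last-fold model
```

On a drifting series like this one, the last-fold model sits closer to the most recent values than the average does, which is the motivation given above. All four folds would still be used for ranking models.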
I must say thanks again for this commit. I think this should be in the main branch; it could be really useful. I wonder if this has been implemented in the main branch by now? I have some additional ideas, to make it take a weighted prediction, something like
Or this, but I think there is an error in my reasoning here?
|
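One possible reading of the weighted-prediction idea, as a rough sketch only: combine the per-fold predictions with weights that favour recent folds. The function name `weighted_fold_prediction`, the sample predictions, and the linear recency weights are all assumptions for illustration. None of this is part of mljar or of this PR.

```python
def weighted_fold_prediction(fold_predictions, weights):
    """Weighted average of per-fold predictions (hypothetical helper)."""
    if len(fold_predictions) != len(weights):
        raise ValueError("need exactly one weight per fold prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(fold_predictions, weights)) / total


# Assumed per-fold predictions for one sample, oldest fold first.
fold_predictions = [1.5, 2.5, 3.5, 4.5]

# Linear recency weights: the last fold counts four times as much as the first.
recency_weights = [1, 2, 3, 4]
weighted = weighted_fold_prediction(fold_predictions, recency_weights)
```

Equal weights reproduce the plain average over folds, and putting all the weight on the last fold reproduces `chosen_fold=-1`, so both existing behaviours are special cases of this scheme.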
Hi @pplonski ,
Thanks for this great package!
I use it pretty often so I wanted to add my contribution to it.
I needed to test the difference between taking the average of models fitted on each fold, and looking at the prediction of only the last fold.
This was especially interesting in my case as it was a time series split, and I wanted my final model to be the one trained on the most recent data.
I added a parameter in the AutoML class called chosen_fold, which I ultimately set to -1 in my case to get the model of the last fold. It's a bit linked to #475.
Feel free to tell me if I should continue working on this evolution!
P.S.: I think the changes in requirements_dev.txt are needed because the latest click versions are no longer compatible with the pinned version of black (see https://stackoverflow.com/questions/71673404/importerror-cannot-import-name-unicodefun-from-click).
Probably upgrading black could also be a good move.