Progress of EBM fitting #575
A related set of questions has also come up recently in the AutoGluon project with regard to timeouts and callbacks autogluon/autogluon#4480 (comment), so this would be good for us to solve. My default thinking until recently has been that other improvements like speed, memory, and model quality were more immediately important, but considering how useful it would be to AutoGluon, I now think this is one of the more important things that could be added to EBMs. Is this something you'd be interested in looking into @DerWeh? Whatever gets added will need to go somewhere in this loop:
Allowing the user to pass a callback, which can also stop the iterations early, is in principle simple enough and quite general (as long as we define the stopping criteria beforehand). We could either define the exact variables that are passed to the callback, or simply pass the locals, which might be more powerful but provides no stable API. What complicates things is the multiprocessing EBMs default to: the callback needs to be pickled and runs in a different process... Getting a meaningful timeout for fewer jobs than outer bags is non-trivial in the current form (as you mentioned in the issue). Canceling the loop after a certain time is straightforward, but if the bags run one after another, some bags might have finished, one aborted early, and some not run at all...

```python
import datetime

# the previous while loop
for step_idx in call_back(max_steps):
    ...

# definition of a callback
def timeout_callback_generator(maxtime):
    def timeout_callback(max_steps):
        start_time = datetime.datetime.now()
        for step_idx in range(max_steps):
            yield step_idx
            if datetime.datetime.now() - start_time > maxtime:
                print("Maximal time exceeded, stopping before converged")
                break
    return timeout_callback
```

Just a general idea, not working code. Of course, this would be a "soft timeout", as we check the time after every iteration instead of killing the process after a fixed time. Killing would also be possible, but I doubt the benefit is worth the effort.
We use ctypes, which I thought didn't have an option to release the GIL, so I'm not sure why there isn't an impact when using default="threads". I agree the API might be unstable and the process state handling would change once we move threading into C. I'm thinking the cost of potentially making a breaking change in the future is worth the benefit today. Hopefully, callback usage would be niche, and therefore not break too many users. We might also be able to version the function by looking at the number of parameters.

In terms of the callback API, I was thinking that instead of looping and yielding inside the callback, the boost function would contain the loop, and we'd call the callback function on each loop iteration. That would allow us to pass things like the current validation metric, the loop iteration count, etc. The callback could return a bool to terminate boosting. The callback would have to somehow hold per-outer-bag state, which could be done via a global dictionary if we pass in the bag index.

We do still have the messy issues with determining which work to include in the final model. That seems like a pretty fundamental problem. Maybe a simple heuristic would work, like only including completed models if any model reaches completion, otherwise including all partly completed models. Not ideal, but sometimes messiness is required to get something practical.

I think JobLibProvider isn't required and could be simplified away. I didn't write that section, so maybe there's something I'm missing, but I think you're correct on that.
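A minimal sketch of that shape, assuming a hypothetical `(bag_index, step, metric)` callback signature and a stand-in boosting loop (none of these names exist in interpret; the real loop lives in C):

```python
# Hypothetical sketch: the boosting loop owns the iteration and calls the
# callback each round; returning True terminates boosting for that bag.
def boost_bag(bag_index, max_steps, callback=None):
    metric = 1.0
    for step in range(max_steps):
        metric *= 0.9  # stand-in for the real validation-metric update
        if callback is not None and callback(bag_index, step, metric):
            break
    return step, metric

# Per-outer-bag state held in a global dictionary keyed by bag index.
state = {}

def stop_after_three(bag_index, step, metric):
    count = state.setdefault(bag_index, 0) + 1
    state[bag_index] = count
    return count >= 3  # True requests early termination

steps, metric = boost_bag(0, 100, stop_after_three)
```

The global-dictionary state is exactly the awkward part discussed above: it works, but nothing about the signature tells the user where per-bag state should live.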
I am by no means an expert on `ctypes`. FYI: NumPy provides `numpy.ctypeslib` with some convenience functions simplifying the usage of `ctypes`. So multithreading is fine, as long as the library is thread-safe.
This was also my first thought. However, to use a timeout, I think we need state (the start time). A generator seems the most natural fit; `send` would allow for providing values to the generator. But I really have to implement a prototype to see if this works out nicely or not.
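A rough sketch of the `send`-driven variant, assuming the fitting loop pushes the current metric into the generator (illustrative names only, not interpret's API):

```python
import datetime

# Generator-based callback: the fitting loop sends the current validation
# metric in via gen.send(); the generator stops early on timeout by breaking,
# which surfaces as StopIteration in the driving loop.
def timeout_callback(max_steps, maxtime):
    start_time = datetime.datetime.now()
    for step_idx in range(max_steps):
        metric = yield step_idx  # receives the metric from gen.send(metric)
        if datetime.datetime.now() - start_time > maxtime:
            break

# Driving loop, as the fitting code might use it:
gen = timeout_callback(5, datetime.timedelta(hours=1))
steps = []
try:
    step = next(gen)  # prime the generator
    while True:
        steps.append(step)
        step = gen.send(0.5)  # pass the current metric in
except StopIteration:
    pass
```

The priming `next()` call and the `StopIteration` handling are the main ergonomic costs of this design compared to a plain callable.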
My approach would have been to give every bag an equal time limit. Of course, this places an additional burden on the user, as they have to consider the number of bags and cores. What are your thoughts?

Another issue is the iterative nature: we first fit the main effects and then add interactions. How do we handle this? First fit mains and, if time is left, spend it on interactions? Or reserve a fixed budget each for mains and interactions?

Last, but very important: how precisely do we have to follow the timeout? Is it enough to check the time every iteration? Can we neglect everything else and focus only on the boosting? This could be realized by (asynchronously) writing out a file tree like:
and providing a helper that creates an EBM from the latest configuration. This would again require some heuristic for which results to include. One idea would be to write out the boosted metric with every result and include everything that is not worse than the median metric + X%. Small addendum:
I fully agree, the root issue is that fitting EBMs takes too long for big datasets (whatever "too long" means), forcing us to work around it. But unless you have ways to speed up training by one or two orders of magnitude, we're stuck.
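The median-plus-X% selection heuristic mentioned above could look roughly like this (a sketch only; `select_bags` and the metrics mapping are hypothetical, and lower metric is assumed better):

```python
import statistics

# Keep every checkpointed bag whose validation metric is no worse than the
# median metric across bags plus a tolerance of X percent.
def select_bags(metrics, tolerance_pct=10.0):
    median = statistics.median(metrics.values())
    cutoff = median * (1 + tolerance_pct / 100.0)
    return [bag for bag, m in metrics.items() if m <= cutoff]

# Bag 2 aborted very early, so its metric is far worse and gets dropped.
kept = select_bags({0: 0.50, 1: 0.52, 2: 0.80, 3: 0.49}, tolerance_pct=10.0)
```

A relative cutoff like this adapts to the scale of the metric, which matters since different objectives (log loss, RMSE, ...) live on different scales.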
Ah, very interesting. I didn't know that about CDLL. Yes, the library is thread-safe, so we can switch to using threads, which simplifies things at least a little bit.

I think, probably, if my goal was to build the best model possible in a given amount of time, and I had to choose between shallow boosting all the bags or deep boosting just a few, boosting just a few as deeply as possible would result in the better model more often. Since we can't currently use multiple cores to advance a single bag, I think if we had N available cores, the best strategy currently would be to boost N bags to completion and then move to the next N bags if time allows. The nice thing about this is that it aligns with our current processing order.

For pairs, we currently choose them after all the mains are done. I don't think we want to change that, since the pairs are chosen universally across all bags, so all the bags need to be done before we can do that in a consistent way. In theory, perhaps having a fixed time budget for the mains to allow some pair boosting time would result in a better model within a given amount of time, but that feels like it's getting rather complicated in terms of how these things would be specified.

I agree, the state-holding methodology is non-ideal. It is possible, but it isn't clear to me how obvious this will be to our users. Something like this works:
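For instance, a closure holding the start times per outer bag (the `(bag_index, step, metric)` signature is an assumption from this thread, not interpret's actual API):

```python
import datetime

# State (the start time) lives in a closure; one entry per outer bag,
# keyed by the bag index, since bags may start at different times.
def make_timeout_callback(maxtime):
    start_times = {}
    def callback(bag_index, step, metric):
        start = start_times.setdefault(bag_index, datetime.datetime.now())
        # Returning True asks the boosting loop to terminate this bag.
        return datetime.datetime.now() - start > maxtime
    return callback

cb = make_timeout_callback(datetime.timedelta(hours=1))
stop = cb(0, 0, 0.5)  # well before the one-hour budget
```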
Perhaps it would be more obvious to our users if we added a `callback_args` parameter along with a `callback` parameter to the EBM constructors. Then the user could do something like this:
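An illustration of how that might look; note that neither `callback` nor `callback_args` exists on the EBM constructors today, and the callback signature is assumed:

```python
import datetime

# User-provided state travels in callback_args instead of a module global.
def timeout_callback(bag_index, step, metric, state):
    start = state.setdefault(bag_index, datetime.datetime.now())
    return datetime.datetime.now() - start > state["maxtime"]

callback_args = {"maxtime": datetime.timedelta(hours=1)}

# Hypothetical constructor usage:
# ebm = ExplainableBoostingClassifier(callback=timeout_callback,
#                                     callback_args=callback_args)

# Inside fit, the library would then call something like:
stop = timeout_callback(0, 0, 0.5, callback_args)
```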
I somewhat lean toward the first option, since it keeps the main interface simpler by adding only a single new parameter to the constructor. We could always include an example in our documentation to show users how to handle state. I'm not really familiar with `send`, so I'll read up on it. It would be interesting to see how it changes the feel of the API if used. I'll also think about the question of writing to disk and add some thoughts to this thread later. It doesn't feel like we've hit the best API yet, so let's keep discussing.
What is the recommended way to indicate the progress of fitting EBMs? In version 0.5.0, the logger provided the current boosting round and the value of the metric every 10 rounds. This was removed in version 0.6.0. Currently, I can find no progress information on boosting at all.

For large datasets, fitting an EBM can take several days, so any form of progress indication would be highly welcome. Ideally, we would also be able to save and resume intermediate results (in case of power outages, or because we might realize that the results are already good enough for our purpose, or so bad that there is no point in further boosting).

To me, a progress indicator is quite important. It is very hard to estimate the runtime of boosting, as ideally we rely on early stopping. Thus, the final runtime depends on the "difficulty" of the dataset, making it hard to extrapolate runtimes.
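Under the callback proposal discussed above, even a plain progress indicator becomes a degenerate callback that never stops boosting and just reports every N rounds, much like the 0.5.0 logger did. A sketch (the `(bag_index, step, metric)` signature and all names are hypothetical):

```python
# Progress-only callback: record the boosting round and current validation
# metric every `every` rounds, never requesting early termination.
def make_progress_callback(every=10):
    lines = []
    def callback(bag_index, step, metric):
        if step % every == 0:
            lines.append(f"bag {bag_index} round {step}: metric={metric:.4f}")
        return False  # never stop boosting
    return callback, lines

cb, log = make_progress_callback(every=10)
for step in range(25):  # stand-in for 25 boosting rounds of one bag
    cb(0, step, 1.0 / (step + 1))
```

Reporting and early stopping sharing one mechanism would keep the constructor surface small.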