feat: For long ML tasks, make intermediate saves #1024

Closed
sylvaincom opened this issue Dec 27, 2024 · 9 comments · Fixed by #1287
Labels: enhancement (New feature or request), user-reported


@sylvaincom (Contributor) commented Dec 27, 2024

Is your feature request related to a problem? Please describe.

As a data scientist, I might launch some long ML tasks on an unreliable server, and I might lose all my results if the server crashes.
This issue came up in 2 user interviews.

Describe the solution you'd like

Save some intermediate results. For example, in a cross-validation with 5 splits, we could store the results of the 1st split as soon as it finishes, so that if the run crashes in the middle of the 2nd split, the 1st split is not lost.
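
To make this concrete, here is a minimal sketch of the idea (not skore's API; the checkpoint directory and file layout are made up for illustration), persisting each split's result as soon as it is available:

```python
# Minimal sketch: run the splits one by one and persist each result to
# disk as soon as it is available, so a crash during split k still
# leaves splits 0..k-1 on disk.
from pathlib import Path

import joblib
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(random_state=0)
estimator = LogisticRegression()
checkpoint_dir = Path("cv_checkpoints")  # hypothetical location
checkpoint_dir.mkdir(exist_ok=True)

for i, (train, test) in enumerate(KFold(n_splits=5).split(X)):
    fitted = clone(estimator).fit(X[train], y[train])
    score = fitted.score(X[test], y[test])
    # Persist this split immediately; earlier splits survive a later crash.
    joblib.dump({"split": i, "score": score}, checkpoint_dir / f"split_{i}.joblib")
```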

Related to #989

Edit: neptune.ai does continued tracking (but for foundation models)

@sylvaincom added the enhancement (New feature or request), needs-triage (This has been recently submitted and needs attention), and user-reported labels on Dec 27, 2024
@glemaitre (Member)

joblib.Memory allows caching results. In scikit-learn, the Pipeline exposes a memory parameter to allow for such behaviour. It would be nice to go down to the estimator level to get more aggressive caching, but it is not a straightforward task, because sometimes hashing the inputs is more costly than just calling the function itself.

So it would be valuable if skore could put a sensible caching mechanism into place.

Example regarding the caching: https://joblib.readthedocs.io/en/stable/auto_examples/memory_basic_usage.html#sphx-glr-auto-examples-memory-basic-usage-py

In #997, there is an in-memory caching mechanism. Persisting it on disk would be useful to avoid recomputing some of the results.
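
For reference, a small self-contained example of those building blocks (the cache directory name is arbitrary): joblib.Memory caching a plain function, and scikit-learn's Pipeline reusing a Memory instance through its memory parameter:

```python
from joblib import Memory
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

memory = Memory("cachedir", verbose=0)  # on-disk cache location

@memory.cache
def expensive_computation(a, b):
    # Recomputed only for input combinations not seen before.
    return a + b

X, y = make_classification(random_state=0)
# The fitted transformers (here PCA) are cached: refitting an identical
# pipeline on identical data reuses the cache instead of recomputing.
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression())], memory=memory)
pipe.fit(X, y)
```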

@tuscland removed the needs-triage (This has been recently submitted and needs attention) label on Jan 3, 2025
@augustebaum changed the title from "feat: For long ML tasks, make same intermediate saves" to "feat: For long ML tasks, make intermediate saves" on Jan 8, 2025
@auguste-probabl (Contributor) commented Jan 30, 2025

For this to work, we need to move the fitting logic out of CrossValidationReport.__init__. Indeed, right now

```python
report = CrossValidationReport(estimator, ...)
```

immediately fits the estimators, which can take a long time; if the fitting is interrupted, then the whole __init__ fails, so report is not defined and there is no way to access anything.

Instead, the fitting could be done as soon as some metric/plot method is called; a sketch of this lazy variant is below.
Another solution (maybe quicker to implement) could be to expose the _fit_estimator_reports method and have the user call it explicitly.
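
A minimal sketch of the lazy variant, using a toy stand-in class (LazyReport and _ensure_fitted are hypothetical names, not skore's actual API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

class LazyReport:
    """Toy stand-in: __init__ only stores arguments, no long computation."""

    def __init__(self, estimator, X, y):
        self._estimator, self._X, self._y = estimator, X, y
        self._fitted = None  # nothing fitted yet, so construction is instant

    def _ensure_fitted(self):
        # The (potentially long) fit runs on first use, not at construction.
        if self._fitted is None:
            self._fitted = self._estimator.fit(self._X, self._y)
        return self._fitted

    def score(self):
        # Any metric/plot method triggers the fit if needed.
        return self._ensure_fitted().score(self._X, self._y)

X, y = make_classification(random_state=0)
report = LazyReport(LogisticRegression(), X, y)  # returns immediately
print(report.score())  # the fit happens here, on first use
```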

@glemaitre (Member)

> immediately fits the estimators, which can take a long time; if the fitting is interrupted, then the whole __init__ fails, so report is not defined and there is no way to access anything.

I don't think so. The fitting uses a joblib generator, which means that, theoretically, we can catch the exception and terminate the process properly with the current results. It is not related to the __init__ itself.
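
A minimal illustration of that with plain joblib (assuming joblib >= 1.3 for return_as="generator"): results that finished before the interruption are kept, and the caller can terminate cleanly with them:

```python
import time

from joblib import Parallel, delayed

def slow_task(i):
    time.sleep(1)
    return i

results = []
generator = Parallel(n_jobs=2, return_as="generator")(
    delayed(slow_task)(i) for i in range(5)
)
try:
    for res in generator:
        results.append(res)  # each finished task lands here immediately
except KeyboardInterrupt:
    # Interrupted mid-run: `results` still holds the completed tasks,
    # so we can terminate properly with the partial results.
    pass
```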

The more challenging follow-up feature is being able to "resume" the execution from what has already been computed, so as not to recompute previously cached information. Maybe in this case we should have a resume(...) method, because you will already have the instance at hand:

```python
report = CrossValidationReport(...)
# report is interrupted or crashes for some reason
report.resume()
# eventually, to restart from scratch
report.restart()
```

(Disclosure: I'm not sure about the naming, since it might not be explicit what we are restarting or resuming.)

@auguste-probabl (Contributor)

I can confirm that the "call _fit_estimator_reports explicitly" solution works. I can push my local experiment if you're interested.

Doing this brought up the same question @glemaitre raised: what happens if _fit_estimator_reports is called again, should it resume or restart? IMO, for a first iteration, restarting from scratch every time is fine. If you want to keep your previous "run", you can put it in storage.

Even if it's theoretically possible to do everything in __init__, I still think the fit pattern from sklearn is preferable: __init__ should not be a long computation.

With that said, I'll now look at catching exceptions directly in _fit_estimator_reports.

@glemaitre (Member)

> I can confirm that the "call _fit_estimator_reports explicitly" solution works. I can push my local experiment if you're interested.

I'm interested in seeing it, just to be sure which "user setting" you are in.

> With that said, I'll now look at catching exceptions directly in _fit_estimator_reports.

My thought would be to catch it in the loop consuming the generator (but I need to see your code to understand the exact use case).

@auguste-probabl (Contributor) commented Jan 30, 2025

See 3acfb1f for the "call _fit_estimator_reports explicitly" attempt

See 928aa5e for the "catch exceptions in _fit_estimator_reports" attempt

@auguste-probabl (Contributor)

I took some time to try to write an automatic test, and I'm not getting anywhere. The clone-ing of the estimator and the parallelism when we run the cross-validation make it difficult. Any ideas @glemaitre @rouk1 @thomass-dev? Otherwise I'll just write a manual testing guide in the PR.

@thomass-dev (Collaborator)

Try by mocking clone?

@auguste-probabl (Contributor)

> Try by mocking clone?

Thanks, worked perfectly :)
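
For reference, a self-contained sketch of the trick (toy stand-ins, sequential for clarity; fit_splits plays the role of _fit_estimator_reports):

```python
from unittest.mock import patch

import sklearn.base
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(random_state=0)

def fit_splits(estimator, X, y):
    # Stand-in for _fit_estimator_reports: keep the results computed so far.
    results = []
    try:
        for train, test in KFold(n_splits=5).split(X):
            est = sklearn.base.clone(estimator)  # looked up at call time
            results.append(est.fit(X[train], y[train]).score(X[test], y[test]))
    except RuntimeError:
        pass  # simulated crash: return what we have
    return results

real_clone, calls = sklearn.base.clone, {"n": 0}

def failing_clone(estimator, **kwargs):
    # Let the first two clones succeed, then simulate a crash mid-run.
    calls["n"] += 1
    if calls["n"] > 2:
        raise RuntimeError("simulated crash")
    return real_clone(estimator, **kwargs)

with patch("sklearn.base.clone", failing_clone):
    results = fit_splits(LogisticRegression(), X, y)

assert len(results) == 2  # only the splits that finished before the "crash"
```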

@auguste-probabl self-assigned this on Feb 4, 2025