feat: For long ML tasks, make intermediate saves #1024

Closed
sylvaincom opened this issue Dec 27, 2024 · 9 comments · Fixed by #1287
Labels: enhancement (New feature or request), user-reported


@sylvaincom (Contributor) commented Dec 27, 2024

Is your feature request related to a problem? Please describe.

As a data scientist, I might launch some long ML tasks on an unreliable server, and I might lose all my results if the server crashes.
This issue came up in 2 user interviews.

Describe the solution you'd like

Save some intermediate results. For example, in a cross-validation with 5 splits, we could store the results of the 1st split as soon as it finishes, so that if the run crashes in the middle of the 2nd split, the 1st split is not lost.
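
To make this concrete, here is a minimal sketch of the idea (not skore's API; the checkpoint directory and file layout are made up for illustration), persisting each split's result as soon as it is available:

```python
# Minimal sketch: run the splits one by one and persist each result to
# disk as soon as it is available, so a crash during split k still
# leaves splits 0..k-1 on disk.
from pathlib import Path

import joblib
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(random_state=0)
estimator = LogisticRegression()
checkpoint_dir = Path("cv_checkpoints")  # hypothetical location
checkpoint_dir.mkdir(exist_ok=True)

for i, (train, test) in enumerate(KFold(n_splits=5).split(X)):
    fitted = clone(estimator).fit(X[train], y[train])
    score = fitted.score(X[test], y[test])
    # Persist this split immediately; earlier splits survive a later crash.
    joblib.dump({"split": i, "score": score}, checkpoint_dir / f"split_{i}.joblib")
```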

Related to #989

Edit: neptune.ai does continued tracking (but for foundation models)

@sylvaincom added the enhancement (New feature or request), needs-triage (This has been recently submitted and needs attention), and user-reported labels on Dec 27, 2024
@glemaitre (Member)

joblib.Memory allows caching results. In scikit-learn, the Pipeline exposes a memory parameter to allow for such behaviour. It would be nice to go down to the estimator level to get more aggressive caching, but it is not a straightforward task, because sometimes hashing the inputs is more costly than just calling the function itself.

So it would be valuable if skore could put a sensible caching mechanism into place.

Example regarding the caching: https://joblib.readthedocs.io/en/stable/auto_examples/memory_basic_usage.html#sphx-glr-auto-examples-memory-basic-usage-py

In #997, there is an in-memory caching mechanism. Persisting it on disk would be useful to avoid recomputing some of the results.
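
For reference, a small self-contained example of those building blocks (the cache directory name is arbitrary): joblib.Memory caching a plain function, and scikit-learn's Pipeline reusing a Memory instance through its memory parameter:

```python
from joblib import Memory
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

memory = Memory("cachedir", verbose=0)  # on-disk cache location

@memory.cache
def expensive_computation(a, b):
    # Recomputed only for input combinations not seen before.
    return a + b

X, y = make_classification(random_state=0)
# The fitted transformers (here PCA) are cached: refitting an identical
# pipeline on identical data reuses the cache instead of recomputing.
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression())], memory=memory)
pipe.fit(X, y)
```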

@tuscland removed the needs-triage (This has been recently submitted and needs attention) label on Jan 3, 2025
@augustebaum changed the title from "feat: For long ML tasks, make same intermediate saves" to "feat: For long ML tasks, make intermediate saves" on Jan 8, 2025
@auguste-probabl (Contributor) commented Jan 30, 2025

For this to work, we need to move the fitting logic out of CrossValidationReport.__init__. Indeed, right now

```python
report = CrossValidationReport(estimator, ...)
```

immediately fits the estimators, which can take a long time; if the fitting is interrupted, then the whole __init__ fails, so report is not defined and there is no way to access anything.

Instead, the fitting could be done as soon as some metric/plot method is called; a sketch of this lazy variant is below.
Another solution (maybe quicker to implement) could be to expose the _fit_estimator_reports method and have the user call it explicitly.
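
A minimal sketch of the lazy variant, using a toy stand-in class (LazyReport and _ensure_fitted are hypothetical names, not skore's actual API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

class LazyReport:
    """Toy stand-in: __init__ only stores arguments, no long computation."""

    def __init__(self, estimator, X, y):
        self._estimator, self._X, self._y = estimator, X, y
        self._fitted = None  # nothing fitted yet, so construction is instant

    def _ensure_fitted(self):
        # The (potentially long) fit runs on first use, not at construction.
        if self._fitted is None:
            self._fitted = self._estimator.fit(self._X, self._y)
        return self._fitted

    def score(self):
        # Any metric/plot method triggers the fit if needed.
        return self._ensure_fitted().score(self._X, self._y)

X, y = make_classification(random_state=0)
report = LazyReport(LogisticRegression(), X, y)  # returns immediately
print(report.score())  # the fit happens here, on first use
```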

@glemaitre (Member)

> immediately fits the estimators, which can take a long time; if the fitting is interrupted, then the whole __init__ fails, so report is not defined and there is no way to access anything.

I don't think so. The fitting uses a joblib generator, which means that, theoretically, we can catch the exception and terminate the process properly with the current results. It is not related to the __init__ itself.
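
A minimal illustration of that with plain joblib (assuming joblib >= 1.3 for return_as="generator"): results that finished before the interruption are kept, and the caller can terminate cleanly with them:

```python
import time

from joblib import Parallel, delayed

def slow_task(i):
    time.sleep(1)
    return i

results = []
generator = Parallel(n_jobs=2, return_as="generator")(
    delayed(slow_task)(i) for i in range(5)
)
try:
    for res in generator:
        results.append(res)  # each finished task lands here immediately
except KeyboardInterrupt:
    # Interrupted mid-run: `results` still holds the completed tasks,
    # so we can terminate properly with the partial results.
    pass
```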

The more challenging follow-up feature is being able to "resume" the execution from what has already been computed, so as not to recompute previously cached information. Maybe in this case we should have a resume(...) method, because you will already have the instance at hand:

```python
report = CrossValidationReport(...)
# report is interrupted or crashes for some reason
report.resume()
# eventually, to restart from scratch
report.restart()
```

(Disclosure: I'm not sure about the naming, since it might not be explicit what we are restarting or resuming.)

@auguste-probabl (Contributor)

I can confirm that the "call _fit_estimator_reports explicitly" solution works. I can push my local experiment if you're interested.

Doing this brought up the same question @glemaitre raised: what happens if _fit_estimator_reports is called again, should it resume or restart? IMO, for a first iteration, restarting from scratch every time is fine. If you want to keep your previous "run", you can put it in storage.

Even if it's theoretically possible to do everything in __init__, I still think the fit pattern from sklearn is preferable: __init__ should not be a long computation.

With that said, I'll now look at catching exceptions directly in _fit_estimator_reports.

@glemaitre (Member)

> I can confirm that the "call _fit_estimator_reports explicitly" solution works. I can push my local experiment if you're interested.

I'm interested in seeing it, just to be sure which "user setting" you are in.

> With that said, I'll now look at catching exceptions directly in _fit_estimator_reports.

My thought would be to catch it in the loop consuming the generator (but I need to see your code to understand the exact use case).

@auguste-probabl (Contributor) commented Jan 30, 2025

See 3acfb1f for the "call _fit_estimator_reports explicitly" attempt

See 928aa5e for the "catch exceptions in _fit_estimator_reports" attempt

@auguste-probabl (Contributor)

I took some time to try to write an automatic test, and I'm not getting anywhere. The clone-ing of the estimator and the parallelism when we run the cross-validation make it difficult. Any ideas @glemaitre @rouk1 @thomass-dev? Otherwise I'll just write a manual testing guide in the PR.

@thomass-dev (Collaborator)

Try by mocking clone?

@auguste-probabl (Contributor)

> Try by mocking clone?

Thanks, worked perfectly :)
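
For reference, a self-contained sketch of the trick (toy stand-ins, sequential for clarity; fit_splits plays the role of _fit_estimator_reports):

```python
from unittest.mock import patch

import sklearn.base
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(random_state=0)

def fit_splits(estimator, X, y):
    # Stand-in for _fit_estimator_reports: keep the results computed so far.
    results = []
    try:
        for train, test in KFold(n_splits=5).split(X):
            est = sklearn.base.clone(estimator)  # looked up at call time
            results.append(est.fit(X[train], y[train]).score(X[test], y[test]))
    except RuntimeError:
        pass  # simulated crash: return what we have
    return results

real_clone, calls = sklearn.base.clone, {"n": 0}

def failing_clone(estimator, **kwargs):
    # Let the first two clones succeed, then simulate a crash mid-run.
    calls["n"] += 1
    if calls["n"] > 2:
        raise RuntimeError("simulated crash")
    return real_clone(estimator, **kwargs)

with patch("sklearn.base.clone", failing_clone):
    results = fit_splits(LogisticRegression(), X, y)

assert len(results) == 2  # only the splits that finished before the "crash"
```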

@auguste-probabl self-assigned this on Feb 4, 2025