Add slurm workers for calibration end-to-end test #3461

Merged
merged 1 commit into from
Jan 21, 2025

Conversation

nefrathenrici
Member

@nefrathenrici nefrathenrici commented Dec 2, 2024

This PR updates the calibration end-to-end test to use Distributed.jl and the updated ClimaCalibrate with task-based parallelism. The two main files changed are calibration/model_interface.jl and calibration/test/e2e_test.jl.

ClimaCalibrate v0.0.6 has three changes relevant to this PR:

  • It requires only forward_model instead of set_up_forward_model and run_forward_model
  • addprocs(SlurmManager(n)) can be used to acquire Slurm workers
  • It adds WorkerBackend to distribute forward_model runs across Julia workers
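
As a rough illustration of the worker-based pattern in the list above (not the code in this PR), the sketch below uses only the Distributed standard library. It starts local workers with addprocs(2) so that it runs without a cluster; on Slurm, addprocs(SlurmManager(n)) would take that place. The body of forward_model, the run_iteration helper, and the ensemble size are hypothetical stand-ins, and pmap is only an assumption standing in for whatever scheduling WorkerBackend actually performs.

```julia
using Distributed

# Local workers keep the sketch runnable anywhere. On a Slurm cluster,
# the PR's `addprocs(SlurmManager(n))` would be used here instead.
addprocs(2)
@everywhere using Distributed  # make `myid` available on every process

# Hypothetical stand-in for a real forward model: it only records which
# worker handled which ensemble member.
@everywhere function forward_model(iteration, member)
    return (iteration = iteration, member = member, worker = myid())
end

# Hypothetical helper: distribute one forward_model call per ensemble
# member across the available workers. `pmap` returns results in input order.
function run_iteration(iteration, ensemble_size)
    return pmap(member -> forward_model(iteration, member), 1:ensemble_size)
end

results = run_iteration(0, 4)
for r in results
    println("iteration $(r.iteration), member $(r.member) ran on worker $(r.worker)")
end

# Release the workers once the ensemble has finished.
rmprocs(workers())
```

Because the workers persist across calls, run_iteration could be called again for the next iteration without requesting new resources, which is the main appeal of acquiring workers once up front.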

Other changes:

  • Removed the Calibration test GHA because it does not have a clear purpose
  • Removed prior.toml and moved its contents into the test script
  • Added *.out to the gitignore because the worker output goes to .out log files.

@nefrathenrici nefrathenrici force-pushed the ne/slurm_workers branch 3 times, most recently from 8957ff2 to 8cbe910 Compare December 2, 2024 18:41
@szy21
Member

szy21 commented Dec 3, 2024

I'm not sure if I'm the best person to review this. And I think Charlie is out? Maybe some other people who are more familiar with calibration can take a look?

Member

@Sbozzolo Sbozzolo left a comment

Can you add some docs (inline comments would be fine) to give some context to those who read this file and are not familiar with ClusterManagers?

calibration/test/Project.toml (outdated, resolved)
@nefrathenrici nefrathenrici force-pushed the ne/slurm_workers branch 5 times, most recently from 33a96d3 to a20a84c Compare December 18, 2024 00:54
@nefrathenrici nefrathenrici linked an issue Dec 18, 2024 that may be closed by this pull request
Member

@Sbozzolo Sbozzolo left a comment

Is there documentation about the WorkerBackend?

@nefrathenrici
Member Author

nefrathenrici commented Dec 19, 2024

Is there documentation about the WorkerBackend?

There is no documentation on the WorkerBackend at the moment. I am planning on updating ClimaCalibrate as part of this PR, but I can add some more information in this current PR as well.

@nefrathenrici nefrathenrici added this pull request to the merge queue Jan 16, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Jan 17, 2025
@nefrathenrici nefrathenrici force-pushed the ne/slurm_workers branch 2 times, most recently from 7482e30 to 2cac52d Compare January 17, 2025 21:21
@nefrathenrici nefrathenrici added this pull request to the merge queue Jan 21, 2025
Merged via the queue into main with commit 7ea7c30 Jan 21, 2025
19 of 20 checks passed
@nefrathenrici nefrathenrici deleted the ne/slurm_workers branch January 21, 2025 20:25
Development

Successfully merging this pull request may close these issues.

Add example calibration demonstrating persistent Slurm workers
3 participants