Feature/improve dataloader memory #76

Status: Open. japols wants to merge 19 commits into develop from feature/improve-dataloader-memory.

Conversation

@japols (Member) commented Oct 9, 2024

Describe your changes

This PR adds a configurable read_group_size (config.dataloader.read_group_size) that defines reader groups: subgroups of the model communication groups that share the dataloading workload. Increasing the read_group_size substantially reduces CPU memory usage and increases dataloader throughput, since each task only reads a shard of the train/test/val batch.
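
To make the mechanics concrete, below is a minimal sketch of how ranks could be mapped to reader groups and how each rank could select the shard it reads. The helper names and the contiguous slicing along the grid dimension are illustrative assumptions, not the actual anemoi-training implementation.

```python
# Illustrative sketch only: maps a global rank to its reader group and to the
# contiguous grid shard it reads. Names and the slicing scheme are assumptions,
# not the anemoi-training implementation. read_group_size corresponds to
# config.dataloader.read_group_size and must divide the model comm group size.


def reader_group_info(global_rank: int, model_comm_group_size: int, read_group_size: int) -> tuple[int, int]:
    """Return (reader_group_id, rank_within_reader_group) for a global rank."""
    assert model_comm_group_size % read_group_size == 0
    model_group_id, rank_in_model_group = divmod(global_rank, model_comm_group_size)
    reader_group_id, rank_in_reader_group = divmod(rank_in_model_group, read_group_size)
    readers_per_model_group = model_comm_group_size // read_group_size
    return model_group_id * readers_per_model_group + reader_group_id, rank_in_reader_group


def shard_slice(n_grid_points: int, rank_in_reader_group: int, read_group_size: int) -> slice:
    """Contiguous slice of the grid dimension that this rank reads from the dataset."""
    per_rank = (n_grid_points + read_group_size - 1) // read_group_size  # ceil division
    start = rank_in_reader_group * per_rank
    return slice(start, min(start + per_rank, n_grid_points))


if __name__ == "__main__":
    # Example: one model instance sharded across 16 GPUs, read_group_size = 4
    for rank in range(16):
        group_id, local_rank = reader_group_info(rank, model_comm_group_size=16, read_group_size=4)
        print(rank, group_id, local_rank, shard_slice(40_320, local_rank, read_group_size=4))
```

In this picture, read_group_size = 1 means every rank reads the full batch, while larger values split the read across the reader group, which is where the memory saving comes from.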

The following experiment on o1280 shows CPU memory usage decreasing as we increase the read_group_size (1, 4, 16) with model sharding across 16 GPUs:

[Plot: CPU memory usage on o1280 for read_group_size = 1, 4, 16]

In MLflow we can also see a 2x speedup in runtime, thanks to the increased throughput of the dataloader.
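
For intuition on what can happen after the read, here is a hedged sketch (not this PR's code) of how the shards read by the ranks of one reader group could be reassembled on device with an all_gather; it assumes a torch.distributed process group for the reader group and equal shard sizes.

```python
# Hedged illustration, not the anemoi-training implementation: reassemble a full
# field from the per-rank shards of one reader group using torch.distributed.
# Assumes the process group is already initialised, all shards have identical
# shape (pad the last one if needed), and the grid dimension is second-to-last.

import torch
import torch.distributed as dist


def gather_full_batch(local_shard: torch.Tensor, reader_group: dist.ProcessGroup) -> torch.Tensor:
    """Concatenate the grid shards held by each rank of the reader group."""
    world = dist.get_world_size(group=reader_group)
    shards = [torch.empty_like(local_shard) for _ in range(world)]
    dist.all_gather(shards, local_shard, group=reader_group)
    return torch.cat(shards, dim=-2)
```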

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference partition of the model
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

Tag possible reviewers

@ssmmnn11 @mishooax @theissenhelen @JesperDramsch @sahahner @mchantry


📚 Documentation preview 📚: https://anemoi-training--76.org.readthedocs.build/en/76/

@FussyDuck commented Oct 9, 2024

CLA assistant check: all committers have signed the CLA.

@ssmmnn11 (Member) left a comment

Great work! Would be nice to clean up the group creation a bit more and consolidate everything to the strategy :-).

Review threads (resolved): src/anemoi/training/train/forecaster.py; src/anemoi/training/train/forecaster.py (outdated); src/anemoi/training/data/datamodule.py (outdated); src/anemoi/training/train/train.py (outdated)
@japols force-pushed the feature/improve-dataloader-memory branch from 4cc8a78 to 3c6b5c9 on Oct 9, 2024 at 15:50
@japols added the enhancement label on Oct 9, 2024
@gabrieloks (Contributor) commented Oct 10, 2024

Very nice feature, Jan. I tested this on a rollout run on n320. These runs are painful because we need to reduce the number of workers to avoid out-of-memory issues, which drastically reduces training speed.

I did a test with your branch and the develop branch, and the results are quite good!

Here is a comparison in terms of memory usage for num_workers = 6:

[Plot: CPU memory usage, develop vs. this branch, num_workers = 6]

The very good thing is that the job on the develop branch actually crashes at the end of rollout=2, while with your new feature the rollout fine-tuning keeps going. This will considerably speed up rollout fine-tuning.


@mishooax (Member) left a comment

Very nice work, Jan! 👍

Review threads (resolved): src/anemoi/training/data/dataset.py (two threads); src/anemoi/training/train/train.py; src/anemoi/training/data/dataset.py (outdated)
@ssmmnn11 (Member) commented

Does it make sense to link Jira issues here, or should we manage them independently?

https://jira.ecmwf.int/browse/HPCAS-53

@cathalobrien (Contributor) commented

> Does it make sense to link Jira issues here, or should we manage them independently?
> https://jira.ecmwf.int/browse/HPCAS-53

Probably better to link to this PR from the Jira ticket; more discussion will likely take place here.

Review thread on CHANGELOG.md (outdated, resolved)
@HCookie (Member) left a comment

This looks really good. Are there any blockers to merging this soon?

@ssmmnn11 (Member) commented

I don't think there is anything outstanding. I agree with merging it sooner rather than later.

ssmmnn11 previously approved these changes on Nov 18, 2024
CHANGELOG.md (outdated)
@@ -46,6 +46,7 @@ Keep it human-readable, your future self will thank you!
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [38](https://github.com/ecmwf/anemoi-training/pull/38/)
- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)
- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)
@HCookie (Member) commented

Due to changelog bot fun, this entry ended up in the wrong place; could you move it to the unreleased section please? (Sorry for the chore.)

@japols (Member, Author) commented

Done: 05665c7
