Feature/improve dataloader memory #76
base: develop
Conversation
…via dataloader.read_frequency
Great work! Would be nice to clean up the group creation a bit more and consolidate everything into the strategy :-).
…nstead of SLURM_PROCID
Very nice feature, Jan. I tested this on a rollout run on n320. These runs are painful because we need to reduce the number of workers to avoid out-of-memory issues, which drastically reduces training speed. I ran a test with your branch and the develop branch, and the results are quite good. Here is a comparison of memory usage for num_workers = 6. The very good thing is that the job on the develop branch actually crashes at the end of rollout=2, while with your new feature the rollout fine-tuning keeps going. This will considerably speed up rollout fine-tuning.
Very nice work, Jan! 👍
Does it make sense to link Jira issues here, or should we manage them independently?
Probably better to link to here on the Jira ticket; most of the discussion will likely take place here.
This looks really good.
Are there any blockers to merging this soon?
I don't think there is anything outstanding. I agree we should merge it sooner rather than later.
CHANGELOG.md
@@ -46,6 +46,7 @@ Keep it human-readable, your future self will thank you!
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [38](https://github.com/ecmwf/anemoi-training/pull/38/)
- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)
- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76) |
Due to changelog bot fun, this is in the wrong place; could you put it in the unreleased section, please? (Sorry for the chore.)
done: 05665c7
Describe your changes
This PR adds a configurable read_group_size (config.dataloader.read_group_size) that defines reader groups: subgroups of the model communication groups that share the dataloading workload. Increasing the read_group_size greatly reduces CPU memory usage and increases dataloader throughput, as each task only reads a shard of the train/test/val batch.
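As an illustration, here is a minimal sketch of the shard-assignment idea, not the actual anemoi-training implementation: the function names, the contiguous-slice layout, and the grid-point count are placeholders, and only read_group_size and the 16-GPU model-sharding setup come from this PR.

```python
# Hypothetical sketch of the reader-group idea (not the anemoi-training code).
# Each model communication group of size model_comm_group_size is split into
# reader groups of read_group_size ranks; every rank reads only its shard of
# the grid dimension of the batch, so CPU memory per task shrinks accordingly.


def reader_group_info(global_rank: int, model_comm_group_size: int, read_group_size: int) -> tuple[int, int]:
    """Return (reader_group_id, rank_in_reader_group) for a global rank."""
    if model_comm_group_size % read_group_size != 0:
        raise ValueError("read_group_size must divide model_comm_group_size")
    rank_in_model_group = global_rank % model_comm_group_size
    return rank_in_model_group // read_group_size, rank_in_model_group % read_group_size


def shard_slice(num_grid_points: int, read_group_size: int, rank_in_reader_group: int) -> slice:
    """Contiguous slice of grid points this rank reads; the last rank takes the remainder."""
    shard = num_grid_points // read_group_size
    start = rank_in_reader_group * shard
    end = num_grid_points if rank_in_reader_group == read_group_size - 1 else start + shard
    return slice(start, end)


if __name__ == "__main__":
    # Example: 16 GPUs sharding one model, read_group_size = 4 -> four reader groups of four ranks.
    for rank in range(16):
        group_id, local_rank = reader_group_info(rank, model_comm_group_size=16, read_group_size=4)
        print(rank, group_id, local_rank, shard_slice(1_000_000, 4, local_rank))  # grid size is arbitrary
```

With read_group_size = 1 every rank reads the full batch (the previous behaviour); with read_group_size equal to the model communication group size, each rank reads only its own fraction, which is where the memory savings come from.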
The following experiment on o1280 shows CPU memory usage decreasing as we increase read_group_size (1, 4, 16) with model sharding across 16 GPUs:
[MLflow memory-usage comparison]
We also see a 2x speedup in runtime, thanks to the increased dataloader throughput.
Type of change
Checklist before requesting a review
Tag possible reviewers
@ssmmnn11 @mishooax @theissenhelen @JesperDramsch @sahahner @mchantry
📚 Documentation preview 📚: https://anemoi-training--76.org.readthedocs.build/en/76/