Feature/improve dataloader memory #76

Status: Open. japols wants to merge 19 commits into develop from feature/improve-dataloader-memory.

Conversation

@japols (Member) commented Oct 9, 2024

Describe your changes

This PR adds a configurable read_group_size (config.dataloader.read_group_size) that defines reader groups: subgroups of the model communication groups that share the dataloading workload. Increasing the read_group_size substantially reduces CPU memory usage and increases dataloader throughput, since each task only reads a shard of the train/test/val batch.
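
To make the mechanics concrete, below is a minimal sketch of how ranks could be mapped to reader groups and how each rank could select the shard it reads. The helper names and the contiguous slicing along the grid dimension are illustrative assumptions, not the actual anemoi-training implementation.

```python
# Illustrative sketch only: maps a global rank to its reader group and to the
# contiguous grid shard it reads. Names and the slicing scheme are assumptions,
# not the anemoi-training implementation. read_group_size corresponds to
# config.dataloader.read_group_size and must divide the model comm group size.


def reader_group_info(global_rank: int, model_comm_group_size: int, read_group_size: int) -> tuple[int, int]:
    """Return (reader_group_id, rank_within_reader_group) for a global rank."""
    assert model_comm_group_size % read_group_size == 0
    model_group_id, rank_in_model_group = divmod(global_rank, model_comm_group_size)
    reader_group_id, rank_in_reader_group = divmod(rank_in_model_group, read_group_size)
    readers_per_model_group = model_comm_group_size // read_group_size
    return model_group_id * readers_per_model_group + reader_group_id, rank_in_reader_group


def shard_slice(n_grid_points: int, rank_in_reader_group: int, read_group_size: int) -> slice:
    """Contiguous slice of the grid dimension that this rank reads from the dataset."""
    per_rank = (n_grid_points + read_group_size - 1) // read_group_size  # ceil division
    start = rank_in_reader_group * per_rank
    return slice(start, min(start + per_rank, n_grid_points))


if __name__ == "__main__":
    # Example: one model instance sharded across 16 GPUs, read_group_size = 4
    for rank in range(16):
        group_id, local_rank = reader_group_info(rank, model_comm_group_size=16, read_group_size=4)
        print(rank, group_id, local_rank, shard_slice(40_320, local_rank, read_group_size=4))
```

In this picture, read_group_size = 1 means every rank reads the full batch, while larger values split the read across the reader group, which is where the memory saving comes from.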

The following experiment on o1280 shows CPU memory usage decreasing as we increase the read_group_size (1, 4, 16) with model sharding across 16 GPUs:

[Plot: CPU memory usage on o1280 for read_group_size = 1, 4, 16]

In MLflow we can also see a 2x speedup in runtime, thanks to the increased throughput of the dataloader.
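
For intuition on what can happen after the read, here is a hedged sketch (not this PR's code) of how the shards read by the ranks of one reader group could be reassembled on device with an all_gather; it assumes a torch.distributed process group for the reader group and equal shard sizes.

```python
# Hedged illustration, not the anemoi-training implementation: reassemble a full
# field from the per-rank shards of one reader group using torch.distributed.
# Assumes the process group is already initialised, all shards have identical
# shape (pad the last one if needed), and the grid dimension is second-to-last.

import torch
import torch.distributed as dist


def gather_full_batch(local_shard: torch.Tensor, reader_group: dist.ProcessGroup) -> torch.Tensor:
    """Concatenate the grid shards held by each rank of the reader group."""
    world = dist.get_world_size(group=reader_group)
    shards = [torch.empty_like(local_shard) for _ in range(world)]
    dist.all_gather(shards, local_shard, group=reader_group)
    return torch.cat(shards, dim=-2)
```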

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have not introduced new dependencies in the inference partition of the model
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

Tag possible reviewers

@ssmmnn11 @mishooax @theissenhelen @JesperDramsch @sahahner @mchantry


📚 Documentation preview 📚: https://anemoi-training--76.org.readthedocs.build/en/76/

@FussyDuck commented Oct 9, 2024

CLA assistant check: all committers have signed the CLA.

@ssmmnn11 (Member) left a comment

Great work! Would be nice to clean up the group creation a bit more and consolidate everything to the strategy :-).

Review threads (resolved): src/anemoi/training/train/forecaster.py; src/anemoi/training/train/forecaster.py (outdated); src/anemoi/training/data/datamodule.py (outdated); src/anemoi/training/train/train.py (outdated)
@japols force-pushed the feature/improve-dataloader-memory branch from 4cc8a78 to 3c6b5c9 on Oct 9, 2024 at 15:50
@japols added the enhancement label on Oct 9, 2024
@gabrieloks (Contributor) commented Oct 10, 2024

Very nice feature, Jan. I tested this on a rollout run on n320. These runs are painful because we need to reduce the number of workers to avoid out-of-memory issues, which drastically reduces training speed.

I did a test with your branch and the develop branch, and the results are quite good!

Here is a comparison in terms of memory usage for num_workers = 6:

[Plot: CPU memory usage, develop vs. this branch, num_workers = 6]

The very good thing is that the job on the develop branch actually crashes at the end of rollout=2, while with your new feature the rollout fine-tuning keeps going. This will considerably speed up rollout fine-tuning.


@mishooax (Member) left a comment

Very nice work, Jan! 👍

Review threads (resolved): src/anemoi/training/data/dataset.py (two threads); src/anemoi/training/train/train.py; src/anemoi/training/data/dataset.py (outdated)
@ssmmnn11 (Member) commented

Does it make sense to link Jira issues here, or should we manage them independently?

https://jira.ecmwf.int/browse/HPCAS-53

@cathalobrien (Contributor) commented

> Does it make sense to link Jira issues here, or should we manage them independently?
> https://jira.ecmwf.int/browse/HPCAS-53

Probably better to link to this PR from the Jira ticket; more discussion will likely take place here.

Review thread on CHANGELOG.md (outdated, resolved)
@HCookie (Member) left a comment

This looks really good. Are there any blockers to merging this soon?

@ssmmnn11 (Member) commented

I don't think there is anything outstanding. I agree with merging it sooner rather than later.

ssmmnn11 previously approved these changes on Nov 18, 2024
CHANGELOG.md (outdated)
@@ -46,6 +46,7 @@ Keep it human-readable, your future self will thank you!
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [38](https://github.com/ecmwf/anemoi-training/pull/38/)
- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)
- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)
@HCookie (Member) commented

Due to changelog bot fun, this entry ended up in the wrong place; could you move it to the unreleased section please? (Sorry for the chore.)

@japols (Member, Author) commented

Done: 05665c7
