
[train] set auto_transfer cuda device #26819

Merged: 4 commits into ray-project:master on Jul 21, 2022

Conversation

matthewdeng (Contributor) commented Jul 21, 2022

Signed-off-by: Matthew Deng [email protected]

Why are these changes needed?

This sets the CUDA Stream on the correct device (and not the default one) when calling train.torch.prepare_data_loader(auto_transfer=True).
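
To illustrate the intent of the change, here is a minimal sketch (not the actual diff from this PR; the device below is a placeholder for whatever GPU Ray Train assigns to the worker). The stream used for background host-to-device copies should be created on the worker's device instead of the default device 0:

import torch

device = torch.device("cuda:1")  # placeholder: the device assigned to this worker
# Previously the stream was created on the default device (cuda:0) in every worker,
# so all background transfers (and their memory) landed on GPU 0.
stream = torch.cuda.Stream(device=device)
with torch.cuda.stream(stream):
    batch = torch.ones(8).to(device, non_blocking=True)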

Repro

import subprocess

from torch.utils.data import DataLoader
from torchvision import datasets

from ray import train
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    training_data = datasets.FashionMNIST(
        root="/tmp/data_fashion_mnist",
        train=True,
        download=True,
    )
    train_dataloader = DataLoader(training_data)
    train_dataloader = train.torch.prepare_data_loader(train_dataloader, auto_transfer=True)
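    # Print per-GPU process memory so we can see which device each worker touches (see Before/After below).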
    subprocess.run(["nvidia-smi"])


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(
            num_workers=4,
            use_gpu=True,
        ),
    )
    result = trainer.fit()

Before:

(BaseWorkerMixin pid=4197) +-----------------------------------------------------------------------------+
(BaseWorkerMixin pid=4197) | Processes:                                                                  |
(BaseWorkerMixin pid=4197) |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
(BaseWorkerMixin pid=4197) |        ID   ID                                                   Usage      |
(BaseWorkerMixin pid=4197) |=============================================================================|
(BaseWorkerMixin pid=4197) |    0   N/A  N/A     47105      C                                    1281MiB |
(BaseWorkerMixin pid=4197) |    0   N/A  N/A     47106      C                                    1281MiB |
(BaseWorkerMixin pid=4197) |    0   N/A  N/A     47107      C                                    1281MiB |
(BaseWorkerMixin pid=4197) |    0   N/A  N/A     47109      C                                    1281MiB |
(BaseWorkerMixin pid=4197) +-----------------------------------------------------------------------------+

After:

(BaseWorkerMixin pid=5604) +-----------------------------------------------------------------------------+
(BaseWorkerMixin pid=5604) | Processes:                                                                  |
(BaseWorkerMixin pid=5604) |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
(BaseWorkerMixin pid=5604) |        ID   ID                                                   Usage      |
(BaseWorkerMixin pid=5604) |=============================================================================|
(BaseWorkerMixin pid=5604) |    0   N/A  N/A     14123      C                                    1281MiB |
(BaseWorkerMixin pid=5604) |    1   N/A  N/A     14124      C                                    1281MiB |
(BaseWorkerMixin pid=5604) |    2   N/A  N/A     14125      C                                    1281MiB |
(BaseWorkerMixin pid=5604) |    3   N/A  N/A     14126      C                                    1281MiB |
(BaseWorkerMixin pid=5604) +-----------------------------------------------------------------------------+

Related issue number

Closes #26707: [air/train] Multi GPU training occupies more memory on first GPU

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@matthewdeng matthewdeng marked this pull request as ready for review July 21, 2022 04:24
JiahaoYao (Contributor)

Hi @matthewdeng, could you print out the CUDA memory?

https://discuss.pytorch.org/t/how-to-check-the-gpu-memory-being-used/131220

say

import torch

cuda_mem = [torch.cuda.memory_allocated(i) / 1024 / 1024 / 1024 for i in range(4)]  # GiB per GPU
assert min(cuda_mem) == max(cuda_mem) and cuda_mem[0] > 0

JiahaoYao (Contributor)

this feature is coool 🚀

JiahaoYao (Contributor) commented Jul 21, 2022

I mean, in the first case:

cuda_mem[0] = 4 × 1281 MiB, cuda_mem[1] = 0, ...

amogkam (Contributor) left a comment

Ah this is a great find!

Do you think we should treat this optimization as experimental and disable it by default until we can test it more rigorously?

Signed-off-by: Matthew Deng <[email protected]>
matthewdeng (Contributor, Author)

@amogkam hah I was actually thinking the same thing, updated!
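
As a usage note, a hypothetical sketch of what the opt-in, experimental behavior looks like for users (the default value is an assumption here, not taken from this diff):

train_dataloader = train.torch.prepare_data_loader(
    train_dataloader,
    auto_transfer=True,  # explicitly opt in to the experimental background transfer
)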

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
krfricke (Contributor) left a comment

LGTM - this should solve the benchmark script GPU util issue, right?

matthewdeng (Contributor, Author)

@krfricke yep if you're referring to #26707!

amogkam merged commit 728e2b3 into ray-project:master on Jul 21, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
This sets the CUDA Stream on the correct device (and not the default one) when calling train.torch.prepare_data_loader(auto_transfer=True).

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
This sets the CUDA Stream on the correct device (and not the default one) when calling train.torch.prepare_data_loader(auto_transfer=True).

Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>