[Core]: Ray threaded actor may cause CUDA memory leak #49360

Open
PanAndy opened this issue Dec 19, 2024 · 0 comments
Assignees
ruisearch42

Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
P1: Issue that should be fixed within a few weeks

Comments


PanAndy commented Dec 19, 2024

What happened + What you expected to happen

When we used a Ray actor to run PyTorch tensor computations, we noticed that CUDA memory usage kept growing with every call.
After investigation, we found that this is triggered by setting max_concurrency > 1 on the actor.
In the reproduction below, the leaked memory reaches about 3 GB after 100 calls: it remains allocated even after all tensors are deleted and torch.cuda.empty_cache() is called.
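
A possible workaround (not a fix) is to keep every CUDA-touching method on a single thread while still allowing concurrency for other methods, e.g. with a dedicated Ray concurrency group of size 1. The sketch below is only illustrative and untested beyond this idea; the actor, method, and group names ("gpu", "io") are assumptions, not part of the failing code.

import ray
import torch

# Minimal workaround sketch (assumption): pin the CUDA-touching method to a
# single-threaded concurrency group so all of its allocations happen on the
# same thread, while other methods keep running concurrently.
@ray.remote(num_gpus=1, concurrency_groups={"gpu": 1, "io": 8})
class PinnedGpuActor:
    @ray.method(concurrency_group="gpu")
    def compute(self):
        # Allocate, compute, free, and report the remaining allocated memory (GB).
        x = torch.randn((1024, 1024), device='cuda')
        y = torch.mm(x, x)
        del x, y
        torch.cuda.empty_cache()
        return torch.cuda.memory_allocated() / 1024 ** 3

    @ray.method(concurrency_group="io")
    def ping(self):
        return "ok"

if __name__ == '__main__':
    ray.init(num_gpus=1, ignore_reinit_error=True)
    actor = PinnedGpuActor.remote()
    print(ray.get([actor.compute.remote() for _ in range(10)]))
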

Versions / Dependencies

Ray version: 2.40.0
Python

Reproduction script

import json

import ray
import torch

def log_gpu_memory_usage(head: str):
    # Report the current CUDA memory usage (allocated and reserved) in GB.
    memory_allocated = torch.cuda.memory_allocated() / 1024 ** 3
    memory_reserved = torch.cuda.memory_reserved() / 1024 ** 3
    message = f'{head}, memory allocated (GB): {memory_allocated}, memory reserved (GB): {memory_reserved}'
    print(message)
    return memory_allocated, memory_reserved


# Actor that allocates a large batch of CUDA tensors, runs a chain of matmuls,
# then frees everything and empties the PyTorch caching allocator.
@ray.remote(num_gpus=1)
class ComputeTensorActor:
    def __init__(self, name):
        self.name = name

    def compute_tensor(self, num=100):
        torch.manual_seed(0)

        tensor_size = (1024, 1024)
        tensor = torch.randn(tensor_size, device='cuda')
        # num * 100 tensors of 4 MiB each (~40 GB for num=100); deleted explicitly below.
        tensor_list = [torch.randn(tensor_size, device='cuda') for _ in range(num * 100)]

        for _ in range(num):
            other_tensor = torch.randn(tensor_size, device='cuda')
            tensor = torch.mm(tensor, other_tensor)
            tensor = torch.relu(tensor)
            tensor += 0.1 * torch.randn(tensor_size, device='cuda')

        metrics = {}
        memory_allocated, memory_reserved = log_gpu_memory_usage(head=f"{self.name} before empty cache")
        metrics["onload/memory_allocated"] = memory_allocated
        metrics["onload/memory_reserved"] = memory_reserved

        # Drop all references and release cached blocks back to CUDA; allocated
        # memory should return to (near) zero after this.
        del tensor_list, tensor
        torch.cuda.empty_cache()

        memory_allocated, memory_reserved = log_gpu_memory_usage(head=f"{self.name} after empty cache")
        metrics["offload/memory_allocated"] = memory_allocated
        metrics["offload/memory_reserved"] = memory_reserved

        return metrics


def test_thread_actor():
    # Threaded actor: max_concurrency=1000 makes the actor execute calls on a
    # thread pool instead of a single thread.
    ray.init(num_gpus=1, ignore_reinit_error=True)
    cp_actor = ComputeTensorActor.options(num_gpus=1, max_concurrency=1000).remote("thread actor")
    num = 100
    metric_list = [ray.get(cp_actor.compute_tensor.remote(num=num)) for _ in range(num)]

    print(metric_list)

    with open("thread_actor_metrics.json", "w") as f:
        json.dump(metric_list, f)


def test_common_actor():
    # Regular single-threaded actor for comparison.
    ray.init(num_gpus=1, ignore_reinit_error=True)
    cp_actor = ComputeTensorActor.options(num_gpus=1).remote("common actor")
    num = 100
    metric_list = [ray.get(cp_actor.compute_tensor.remote(num=num)) for _ in range(num)]

    print(metric_list)
    with open("common_actor_metrics.json", "w") as f:
        json.dump(metric_list, f)


if __name__ == '__main__':
    test_thread_actor()
    test_common_actor()

Last result:

thread_actor_metrics[-1] = {
    "onload/memory_allocated": 42.1953125,
    "onload/memory_reserved": 42.20703125,
    "offload/memory_allocated": 3.12890625,
    "offload/memory_reserved": 3.14453125
}

common_actor_metrics[-1] = {
    "onload/memory_allocated": 39.1015625,
    "onload/memory_reserved": 39.11328125,
    "offload/memory_allocated": 0.03515625,
    "offload/memory_reserved": 0.05078125
}
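
To make the connection to threads easier to see, a small standalone probe (a hypothetical helper, not part of the reproduction above) can log which thread executes each call together with the memory that stays allocated after empty_cache(); on the threaded actor we would expect the reported thread name to vary between calls.

import threading

import ray
import torch

@ray.remote(num_gpus=1)
class DebugActor:
    def probe(self):
        # Allocate and free a tensor, then report the executing thread and the
        # memory that the caching allocator still holds.
        x = torch.randn((1024, 1024), device='cuda')
        del x
        torch.cuda.empty_cache()
        return {
            "thread": threading.current_thread().name,
            "allocated_gb": torch.cuda.memory_allocated() / 1024 ** 3,
            "reserved_gb": torch.cuda.memory_reserved() / 1024 ** 3,
        }

if __name__ == '__main__':
    ray.init(num_gpus=1, ignore_reinit_error=True)
    actor = DebugActor.options(max_concurrency=1000).remote()
    for result in ray.get([actor.probe.remote() for _ in range(5)]):
        print(result)
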

Issue Severity

Medium: It is a significant difficulty but I can work around it.

PanAndy added the bug and triage labels on Dec 19, 2024
jcotant1 added the core label on Dec 19, 2024
ruisearch42 added the P0 label and removed triage on Dec 23, 2024
ruisearch42 self-assigned this on Dec 23, 2024
ruisearch42 added the P1 label and removed P0 on Dec 23, 2024