[Question] Problems with a high number of environments when using Cameras #604

Closed
ArneKlages4444 opened this issue Jun 30, 2024 · 6 comments

ArneKlages4444 commented Jun 30, 2024

We run into problems when using Cameras with a high number of environments. Although the memory problem with cameras is now fixed in Isaac Lab, we cannot scale up the number of environments as far as we would like.

With two graphics cards, we could run 475 parallel environments in the test setup. With 476, the setup sometimes works and sometimes fails with one of two errors (see errors 1 and 2 below). With more than 476 environments (tested up to 512), the tests always fail with error 1. With only one graphics card, we could not even reach 475 environments and got error 3. We tracked memory usage during the experiments, and there was always plenty of VRAM left (see memory usage with 475). We ran these tests on a Linux machine with TITAN RTX GPUs and reproduced them on our HPC with L40S GPUs. The breaking points were the same on both systems, despite the more powerful L40S GPUs.

For our test, we simply put a camera into the FrankaCubeLift environment and added the flattened image to the observations (code below).
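
For scale, the image term itself is small. The following is a minimal sketch of the arithmetic, assuming the raw "rgb" buffer is a 4-channel uint8 array (RGBA) before the last channel is dropped; the helper names are hypothetical and only serve the illustration.

```python
# Back-of-the-envelope size of the camera observation used in this test.
HEIGHT, WIDTH = 84, 84  # taken from the CameraCfg in the test code
CHANNELS_RAW = 4        # assumption: the raw "rgb" buffer has 4 channels (RGBA)
CHANNELS_KEPT = 3       # the test code drops the last channel
BYTES_PER_VALUE = 1     # assumption: uint8 pixel values


def obs_dim_per_env() -> int:
    """Length of the flattened image vector added to each env's observation."""
    return HEIGHT * WIDTH * CHANNELS_KEPT


def raw_batch_bytes(num_envs: int) -> int:
    """Bytes held by one batch of raw camera frames across all envs."""
    return num_envs * HEIGHT * WIDTH * CHANNELS_RAW * BYTES_PER_VALUE


if __name__ == "__main__":
    for n in (475, 476, 512):
        print(n, obs_dim_per_env(), f"{raw_batch_bytes(n) / 2**20:.1f} MiB")
```

Under these assumptions, one batch of frames at 476 environments is roughly 13 MiB, which fits the observation that VRAM capacity is not the limiting resource here.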

Are we doing anything wrong in our setup, or is this a known issue with cameras? When using the TiledCamera camera, we could use 4096 parallel environments, but since it does not provide the desired image output (#493) it is not usable for us.

System Info:

System 1:

  • For testing, we use the Docker container with X11 forwarding disabled.
  • IsaacLab version (last commit): Fixes root view to body view mapping in articulation #497
  • Isaac Sim Version: 4.0.0
  • OS: Ubuntu 22.04.3 LTS
  • GPU: 2 * NVIDIA TITAN RTX (24 GiB)
  • CUDA Toolkit 11.5, Driver 12.0
  • GPU Driver: 525.147.05

System 2 (HPC):

Memory usage with 475:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    Off  | 00000000:17:00.0 Off |                  N/A |
| 41%   35C    P2    78W / 280W |   9934MiB / 24576MiB |     42%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX    Off  | 00000000:B3:00.0  On |                  N/A |
| 40%   37C    P2    86W / 280W |   8375MiB / 24576MiB |     28%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Error 1 (two GPUs and num_envs > 476):

2024-06-24 10:54:07 [399,118ms] [Warning] [carb] Client carb.scenerenderer-rtx.plugin has acquired [carb::settings::ISettings v1.0] 100 times. Consider accessing this interface with carb::getCachedInterface() (Performance warning)
2024-06-24 10:57:26 [598,443ms] [Error] [carb.graphics-vulkan.plugin] VkResult: ERROR_INITIALIZATION_FAILED
2024-06-24 10:57:26 [598,443ms] [Error] [carb.graphics-vulkan.plugin] vkCreateSemaphore failed.
2024-06-24 10:57:26 [598,443ms] [Error] [gpu.foundation.plugin] createSemaphore failed for interop semaphore
2024-06-24 10:57:26 [598,443ms] [Error] [gpu.foundation.plugin] Failed to create CUDA interop semaphore in command list "Render graph command list (Render queue 0, device 1, frame submission index 0)".
2024-06-24 10:57:56 [628,444ms] [Error] [gpu.foundation.plugin] Submission barrier "CUDA event dependency in MGPU resource" timeout in command list "Render graph command list (Render queue 0, device 0, frame submission index 0)"!
2024-06-24 10:57:56 [628,444ms] [Error] [gpu.foundation.plugin] Wait count: 1, Received signals: 0: 
2024-06-24 10:58:26 [658,466ms] [Error] [gpu.foundation.plugin] Submission barrier "CUDA event dependency in MGPU resource" timeout in command list "Render graph command list (Render queue 0, device 0, frame submission index 0)"!
2024-06-24 10:58:26 [658,466ms] [Error] [gpu.foundation.plugin] Wait count: 1, Received signals: 0:

Error 2 (two GPUs and sometimes at num_envs = 476):

Traceback (most recent call last):
  File "/workspace/isaaclab/test_fr.py", line 60, in <module>
    env = ManagerBasedRLEnv(cfg=env_cfg, render_mode="rgb_array")
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_rl_env.py", line 75, in __init__
    super().__init__(cfg=cfg)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env.py", line 117, in __init__
    self.load_managers()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_rl_env.py", line 118, in load_managers
    super().load_managers()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env.py", line 205, in load_managers
    self.observation_manager = ObservationManager(self.cfg.observations, self)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/observation_manager.py", line 41, in __init__
    super().__init__(cfg, env)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/manager_base.py", line 130, in __init__
    self._prepare_terms()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/observation_manager.py", line 254, in _prepare_terms
    obs_dims = tuple(term_cfg.func(self._env, **term_cfg.params).shape[1:])
  File "/workspace/isaaclab/test_fr.py", line 28, in camera_img
    camera_output = env.scene[cam_cfg.name].data.output["rgb"]
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 170, in data
    self._update_outdated_buffers()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/sensor_base.py", line 279, in _update_outdated_buffers
    self._update_buffers_impl(outdated_env_ids)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 473, in _update_buffers_impl
    self._create_annotator_data()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 592, in _create_annotator_data
    data, info = self._process_annotator_output(name, output)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 613, in _process_annotator_output
    data = convert_to_torch(data, device=self.device)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/utils/array.py", line 90, in convert_to_torch
    tensor = tensor.to(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Error 3 (one L40S and 475 environments):

2024-06-30 08:41:49 [281,227ms] [Error] [carb]
../../../source/plugins/rtx.resourcemanager/ResourceManagerContext.cpp(6805): rtx::RtxResult rtx::resourcemanager::Context::allocateDescriptorSets(carb::graphics::ResourceBindingSignature*, uint32_t, uint32_t, carb::graphics::DescriptorSet**, carb::graphics::DescriptorPool**, uint32_t*, size_t, rtx::resourcemanager::DescSetAllocationFlags)(): Assertion (success) failed: Unable to allocate descriptor sets.

Test Code:

import argparse

from omni.isaac.lab.app import AppLauncher

parser = argparse.ArgumentParser(description="TEST")
parser.add_argument("--num_envs", type=int, default=2, help="")
AppLauncher.add_app_launcher_args(parser)
args_cli = parser.parse_args()
app_launcher = AppLauncher(args_cli)
simulation_app = app_launcher.app

import torch

import subprocess

import omni.isaac.lab.sim as sim_utils
from omni.isaac.lab.sensors import CameraCfg
from omni.isaac.lab.managers import ObservationTermCfg as ObsTerm
from omni.isaac.lab.managers import SceneEntityCfg
from omni.isaac.lab_tasks.manager_based.manipulation.lift.config.franka.joint_pos_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.lab.envs import ManagerBasedRLEnv


def camera_img(
        env: ManagerBasedRLEnv,
        cam_cfg: SceneEntityCfg = SceneEntityCfg("camera"),
) -> torch.Tensor:
    camera_output = env.scene[cam_cfg.name].data.output["rgb"]
    camera_output = camera_output[..., :-1]  # drop the 4th channel (alpha; the "rgb" output is RGBA)
    return camera_output.flatten(start_dim=1)


class FrankaObjectLiftEnvCfg(FrankaCubeLiftEnvCfg):
    def __post_init__(self):
        super().__post_init__()
        self.decimation = 6
        self.episode_length_s = 2.5
        self.sim.dt = 0.01

        self.scene.camera = CameraCfg(
            prim_path="{ENV_REGEX_NS}/Table/camera_sensor",
            update_period=self.decimation * self.sim.dt,
            height=84,
            width=84,
            data_types=["rgb"],
            spawn=sim_utils.PinholeCameraCfg(
                focal_length=24.0, focus_distance=400.0, horizontal_aperture=20.955, clipping_range=(0.1, 1.0e5)
            ),
            offset=CameraCfg.OffsetCfg(pos=(0.0, -0.4, 0.4),
                                       rot=(0.6532814862382034, -0.2705980584036595,
                                            0.2705980656543066, 0.6532814687335938),
                                       convention="world"),
        )

        self.observations.policy.image = ObsTerm(func=camera_img)


env_cfg = FrankaObjectLiftEnvCfg()
env_cfg.scene.num_envs = args_cli.num_envs
env = ManagerBasedRLEnv(cfg=env_cfg, render_mode="rgb_array")


def run_nvidia_smi():
    try:
        output = subprocess.check_output(['nvidia-smi'], text=True)
        print(output)
    except subprocess.CalledProcessError as e:
        print(f"Failed to run nvidia-smi: {e}")
    except FileNotFoundError:
        print("nvidia-smi command not found. Ensure you have the NVIDIA drivers installed.")


run_nvidia_smi()
for i in range(100):
    a = torch.randn_like(env.action_manager.action)
    _ = env.step(a)
    if i % 10 == 0:
        print("iteration:", i)
        run_nvidia_smi()

Console command: ./isaaclab.sh -p test_fr.py --headless --enable_cameras --num_envs=476
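
The full `nvidia-smi` table is verbose when printed every ten steps. A possible alternative is to query only the memory columns and log one line per GPU. The sketch below assumes that `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` is available on the machine; `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, not part of Isaac Lab.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse 'index, used, total' CSV rows (values in MiB) into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip empty lines
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def query_gpu_memory() -> list[tuple[int, int, int]]:
    """Run nvidia-smi and return (gpu_index, used_mib, total_mib) per GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_gpu_memory(output)
```

Calling `query_gpu_memory()` in place of `run_nvidia_smi()` inside the step loop would give one compact tuple per GPU, which is easier to compare across runs with different `--num_envs` values.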


btx0424 commented Jul 4, 2024

Just another user here.

Maybe give TiledCameraCfg a try. I am able to get >512 cameras running on my laptop RTX 4070.


ArneKlages4444 commented Jul 4, 2024

Yes, the TiledCamera achieves the necessary number of environments, but as I already mentioned, the image it currently outputs is not comparable to that of the normal Camera. Using the TiledCamera would be our preferred option because of its much better performance, but in our projects, we need realistic images.

"When using the TiledCamera camera, we could use 4096 parallel environments, but since it does not provide the desired image output (#493) it is not usable for us."

@kisisjrlly

I also encountered a similar error when simulating a large number of cameras. Is your problem solved?


ArneKlages4444 commented Jul 17, 2024

No, sadly, it's not solved yet. The only way I found to increase the number of environments is to add a second GPU. However, the number of parallel environments does not scale linearly with the number of GPUs, so it's up to you whether it's worth it, especially because only a fraction of each GPU's VRAM capacity is used. Using more than two GPUs did not help in my tests (I tested with up to 4 L40S GPUs).

@johnnylu305

I got the same warning, and my simulation freezes with 256 environments on my A6000 machine.

@glvov-bdai (Collaborator)

RGB tiled rendering has been improved in the 1.2 release, which fixes #493.

I am able to simulate 2048 low-resolution tiled cameras on a laptop 3080.

We are keeping an eye on #1031.

To try many cameras with your task environment, see this guide: https://isaac-sim.github.io/IsaacLab/source/how-to/estimate_how_many_cameras_can_run.html

If you still experience this issue with tiled rendering, let us know!
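
Independently of the linked guide, the breaking point for a given setup can be narrowed down with a bisection over `--num_envs`. This is only a sketch: `largest_passing` is a hypothetical helper, and it assumes that failures are monotonic in the number of environments, which the flaky behaviour at 476 reported earlier in this thread shows is only approximately true. In practice, `runs_ok` would launch the test script in a subprocess (likely with a timeout, since some failures hang instead of exiting) and report whether it completed.

```python
from typing import Callable


def largest_passing(runs_ok: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest n in [low, high] for which runs_ok(n) is True.

    Assumes runs_ok(low) is True and that once a count fails, every larger
    count fails as well (monotonic failure).
    """
    if not runs_ok(low):
        raise ValueError("the lower bound must be a passing configuration")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if runs_ok(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

For example, with a stand-in predicate that passes up to 475 environments, `largest_passing(lambda n: n <= 475, 1, 4096)` returns 475 after about a dozen probes instead of thousands of sequential runs.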
