
[Question] Problems with a high number of environments when using Cameras #604

Closed

Description

@ArneKlages4444

We are running into issues with a high number of environments when using cameras. Although the camera memory problem is now fixed in Isaac Lab, we cannot scale up the number of environments as we would like.

When using two graphics cards, we could run 475 environments in parallel in our test setup. At 476, the setup sometimes works and sometimes fails with one of two errors (errors 1 and 2 below). With more than 476 environments (tested up to 512), the runs always fail with error 1. With only one graphics card, we could not even reach 475 environments and got error 3. We tracked memory usage throughout the experiments, and there was always plenty of VRAM left (see the nvidia-smi output for 475 environments below). We ran these tests on a Linux machine with TITAN RTX GPUs and reproduced them on our HPC with L40S GPUs. The breaking points were the same on both systems, despite the L40S being the more powerful GPU.
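
For completeness, VRAM can also be polled from inside the process rather than by shelling out to nvidia-smi; a minimal sketch using torch.cuda.mem_get_info, which reports driver-level free/total memory and therefore also captures the renderer's allocations:

import torch

def log_vram(tag: str = "") -> None:
    # print free/total VRAM for every CUDA device visible to torch
    for dev in range(torch.cuda.device_count()):
        free_b, total_b = torch.cuda.mem_get_info(dev)
        print(f"{tag} GPU {dev}: {free_b / 2**30:.2f} GiB free / {total_b / 2**30:.2f} GiB total")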

For our test, we simply added a camera to the FrankaCubeLift environment and appended the flattened image to the observations (code below).

Are we doing anything wrong in our setup, or is this a known issue with cameras? With the TiledCamera sensor we could run 4096 parallel environments, but since it does not provide the desired image output (#493), it is not usable for us.
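
For reference, the TiledCamera variant we tested differs only in the sensor configuration; a minimal sketch, assuming TiledCameraCfg accepts the same fields as the CameraCfg used in the test code below:

from omni.isaac.lab.sensors import TiledCameraCfg

# drop-in replacement for the CameraCfg in the test code below; all per-env
# cameras render into one tiled buffer instead of one render product each
self.scene.camera = TiledCameraCfg(
    prim_path="{ENV_REGEX_NS}/Table/camera_sensor",
    update_period=self.decimation * self.sim.dt,
    height=84,
    width=84,
    data_types=["rgb"],
    spawn=sim_utils.PinholeCameraCfg(
        focal_length=24.0, focus_distance=400.0, horizontal_aperture=20.955, clipping_range=(0.1, 1.0e5)
    ),
    offset=TiledCameraCfg.OffsetCfg(pos=(0.0, -0.4, 0.4), convention="world"),
)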

System Info:

System 1:

  • For testing, we use the Docker container with X11 forwarding disabled.
  • IsaacLab version (last commit): Fixes root view to body view mapping in articulation #497
  • Isaac Sim Version: 4.0.0
  • OS: Ubuntu 22.04.3 LTS
  • GPU: 2 * NVIDIA TITAN RTX (24 GiB)
  • CUDA Toolkit: 11.5 (driver reports CUDA 12.0)
  • GPU Driver: 525.147.05

System 2 (HPC):

  • GPU: NVIDIA L40S

Memory usage with 475 environments (nvidia-smi output):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    Off  | 00000000:17:00.0 Off |                  N/A |
| 41%   35C    P2    78W / 280W |   9934MiB / 24576MiB |     42%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX    Off  | 00000000:B3:00.0  On |                  N/A |
| 40%   37C    P2    86W / 280W |   8375MiB / 24576MiB |     28%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Error 1 (two GPUs and num_envs > 476):

2024-06-24 10:54:07 [399,118ms] [Warning] [carb] Client carb.scenerenderer-rtx.plugin has acquired [carb::settings::ISettings v1.0] 100 times. Consider accessing this interface with carb::getCachedInterface() (Performance warning)
2024-06-24 10:57:26 [598,443ms] [Error] [carb.graphics-vulkan.plugin] VkResult: ERROR_INITIALIZATION_FAILED
2024-06-24 10:57:26 [598,443ms] [Error] [carb.graphics-vulkan.plugin] vkCreateSemaphore failed.
2024-06-24 10:57:26 [598,443ms] [Error] [gpu.foundation.plugin] createSemaphore failed for interop semaphore
2024-06-24 10:57:26 [598,443ms] [Error] [gpu.foundation.plugin] Failed to create CUDA interop semaphore in command list "Render graph command list (Render queue 0, device 1, frame submission index 0)".
2024-06-24 10:57:56 [628,444ms] [Error] [gpu.foundation.plugin] Submission barrier "CUDA event dependency in MGPU resource" timeout in command list "Render graph command list (Render queue 0, device 0, frame submission index 0)"!
2024-06-24 10:57:56 [628,444ms] [Error] [gpu.foundation.plugin] Wait count: 1, Received signals: 0: 
2024-06-24 10:58:26 [658,466ms] [Error] [gpu.foundation.plugin] Submission barrier "CUDA event dependency in MGPU resource" timeout in command list "Render graph command list (Render queue 0, device 0, frame submission index 0)"!
2024-06-24 10:58:26 [658,466ms] [Error] [gpu.foundation.plugin] Wait count: 1, Received signals: 0:

Error 2 (two GPUs and sometimes at num_envs = 476):

Traceback (most recent call last):
  File "/workspace/isaaclab/test_fr.py", line 60, in <module>
    env = ManagerBasedRLEnv(cfg=env_cfg, render_mode="rgb_array")
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_rl_env.py", line 75, in __init__
    super().__init__(cfg=cfg)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env.py", line 117, in __init__
    self.load_managers()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_rl_env.py", line 118, in load_managers
    super().load_managers()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env.py", line 205, in load_managers
    self.observation_manager = ObservationManager(self.cfg.observations, self)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/observation_manager.py", line 41, in __init__
    super().__init__(cfg, env)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/manager_base.py", line 130, in __init__
    self._prepare_terms()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/observation_manager.py", line 254, in _prepare_terms
    obs_dims = tuple(term_cfg.func(self._env, **term_cfg.params).shape[1:])
  File "/workspace/isaaclab/test_fr.py", line 28, in camera_img
    camera_output = env.scene[cam_cfg.name].data.output["rgb"]
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 170, in data
    self._update_outdated_buffers()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/sensor_base.py", line 279, in _update_outdated_buffers
    self._update_buffers_impl(outdated_env_ids)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 473, in _update_buffers_impl
    self._create_annotator_data()
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 592, in _create_annotator_data
    data, info = self._process_annotator_output(name, output)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 613, in _process_annotator_output
    data = convert_to_torch(data, device=self.device)
  File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/utils/array.py", line 90, in convert_to_torch
    tensor = tensor.to(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
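
Following the hint in the traceback, the failing call can be localized by forcing synchronous CUDA launches; the variable must be set before anything initializes CUDA, e.g. at the very top of the test script:

import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before the AppLauncher or torch touch CUDA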

Error 3 (one L40S and 475 environments):

2024-06-30 08:41:49 [281,227ms] [Error] [carb]
../../../source/plugins/rtx.resourcemanager/ResourceManagerContext.cpp(6805): rtx::RtxResult rtx::resourcemanager::Context::allocateDescriptorSets(carb::graphics::ResourceBindingSignature*, uint32_t, uint32_t, carb::graphics::DescriptorSet**, carb::graphics::DescriptorPool**, uint32_t*, size_t, rtx::resourcemanager::DescSetAllocationFlags)(): Assertion (success) failed: Unable to allocate descriptor sets.

Test Code:

import argparse

from omni.isaac.lab.app import AppLauncher

parser = argparse.ArgumentParser(description="TEST")
parser.add_argument("--num_envs", type=int, default=2, help="")
AppLauncher.add_app_launcher_args(parser)
args_cli = parser.parse_args()
app_launcher = AppLauncher(args_cli)
simulation_app = app_launcher.app

import subprocess

import torch

import omni.isaac.lab.sim as sim_utils
from omni.isaac.lab.sensors import CameraCfg
from omni.isaac.lab.managers import ObservationTermCfg as ObsTerm
from omni.isaac.lab.managers import SceneEntityCfg
from omni.isaac.lab_tasks.manager_based.manipulation.lift.config.franka.joint_pos_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.lab.envs import ManagerBasedRLEnv


def camera_img(
        env: ManagerBasedRLEnv,
        cam_cfg: SceneEntityCfg = SceneEntityCfg("camera"),
) -> torch.Tensor:
    camera_output = env.scene[cam_cfg.name].data.output["rgb"]
    camera_output = camera_output[..., :-1]  # drop the alpha channel (RGBA -> RGB)
    return camera_output.flatten(start_dim=1)


class FrankaObjectLiftEnvCfg(FrankaCubeLiftEnvCfg):
    def __post_init__(self):
        super().__post_init__()
        self.decimation = 6
        self.episode_length_s = 2.5
        self.sim.dt = 0.01

        self.scene.camera = CameraCfg(
            prim_path="{ENV_REGEX_NS}/Table/camera_sensor",
            update_period=self.decimation * self.sim.dt,
            height=84,
            width=84,
            data_types=["rgb"],
            spawn=sim_utils.PinholeCameraCfg(
                focal_length=24.0, focus_distance=400.0, horizontal_aperture=20.955, clipping_range=(0.1, 1.0e5)
            ),
            offset=CameraCfg.OffsetCfg(pos=(0.0, -0.4, 0.4),
                                       rot=(0.6532814862382034, -0.2705980584036595,
                                            0.2705980656543066, 0.6532814687335938),
                                       convention="world"),
        )

        self.observations.policy.image = ObsTerm(func=camera_img)


env_cfg = FrankaObjectLiftEnvCfg()
env_cfg.scene.num_envs = args_cli.num_envs
env = ManagerBasedRLEnv(cfg=env_cfg, render_mode="rgb_array")


def run_nvidia_smi():
    try:
        output = subprocess.check_output(['nvidia-smi'], text=True)
        print(output)
    except subprocess.CalledProcessError as e:
        print(f"Failed to run nvidia-smi: {e}")
    except FileNotFoundError:
        print("nvidia-smi command not found. Ensure you have the NVIDIA drivers installed.")


run_nvidia_smi()
for i in range(100):
    a = torch.randn_like(env.action_manager.action)
    _ = env.step(a)
    if i % 10 == 0:
        print("iteration:", i)
        run_nvidia_smi()
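
# added for a clean shutdown (not part of the original snippet)
env.close()
simulation_app.close()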

Console command: ./isaaclab.sh -p test_fr.py --headless --enable_cameras --num_envs=476
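
To narrow down the breaking point without editing the script, a hypothetical driver can bisect over --num_envs by re-running the test; the works() helper and the 1800 s timeout are assumptions, and since 476 fails only sometimes, each size should ideally be tried several times:

import subprocess

def works(n: int) -> bool:
    # one full run of the test script at the given environment count
    cmd = ["./isaaclab.sh", "-p", "test_fr.py", "--headless", "--enable_cameras", f"--num_envs={n}"]
    try:
        return subprocess.run(cmd, timeout=1800).returncode == 0
    except subprocess.TimeoutExpired:
        return False  # treat hangs (e.g. the error 1 timeouts) as failures

lo, hi = 1, 512  # known-good lower bound, known-bad upper bound
while lo + 1 < hi:
    mid = (lo + hi) // 2
    lo, hi = (mid, hi) if works(mid) else (lo, mid)
print("largest working num_envs:", lo)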
