Description
We encounter issues when scaling up the number of environments while using Cameras. Although the memory problem with cameras is now fixed in Isaac Lab, we still cannot scale the number of environments as far as we would like.
With two graphics cards, we could run 475 parallel environments in our test setup. At 476, the setup sometimes works and sometimes fails with one of two errors (see errors 1 and 2 below). With more than 476 environments (tested up to 512), the tests always fail with error 1. With only one graphics card, we could not even reach 475 environments and got error 3. We tracked GPU memory usage during the experiments, and there was always plenty of VRAM left (see the memory usage with 475 environments below). We ran these tests on a Linux machine with TITAN RTX GPUs and reproduced them on our HPC with L40S GPUs. The breaking points were the same on both systems, despite the L40S GPUs being more powerful.
For our test, we simply added a camera to the FrankaCubeLift environment and appended the flattened image to the observations (code below).
Is there anything we are doing wrong in our setup, or is this a known issue with cameras? With the TiledCamera we could run 4096 parallel environments, but since it does not provide the desired image output (#493), it is not usable for us; a sketch of that variant is shown below.
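For reference, a minimal sketch of how we swapped the camera for a TiledCamera inside the env config's __post_init__ for that comparison run. The exact parameter values from that run are not reproduced here, so treat them as assumptions carried over from the regular camera config below:
from omni.isaac.lab.sensors import TiledCameraCfg

# Sketch: replace the regular camera with a tiled camera (values assumed from our CameraCfg setup)
self.scene.camera = TiledCameraCfg(
    prim_path="{ENV_REGEX_NS}/Table/camera_sensor",
    update_period=self.decimation * self.sim.dt,
    height=84,
    width=84,
    data_types=["rgb"],
    spawn=sim_utils.PinholeCameraCfg(
        focal_length=24.0, focus_distance=400.0, horizontal_aperture=20.955, clipping_range=(0.1, 1.0e5)
    ),
    offset=TiledCameraCfg.OffsetCfg(pos=(0.0, -0.4, 0.4), convention="world"),
)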
System Info:
System 1:
- For testing, we use the Docker container with X11 forwarding disabled.
- IsaacLab version (last commit): Fixes root view to body view mapping in articulation #497
- Isaac Sim Version: 4.0.0
- OS: Ubuntu 22.04.3 LTS
- GPU: 2 * NVIDIA TITAN RTX (24 GiB)
- CUDA Toolkit: 11.5 (driver reports CUDA 12.0)
- GPU Driver: 525.147.05
System 2 (HPC):
- IsaacLab version (last commit): Fixes open cabinet state machine run instructions in script #600
- Isaac Sim Version: 4.0.0
- GPU: 2 * L40S (48 GiB)
- CUDA Version: 12.2
- GPU Driver Version: 535.161.08
Memory usage with 475 environments:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN RTX Off | 00000000:17:00.0 Off | N/A |
| 41% 35C P2 78W / 280W | 9934MiB / 24576MiB | 42% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN RTX Off | 00000000:B3:00.0 On | N/A |
| 40% 37C P2 86W / 280W | 8375MiB / 24576MiB | 28% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Error 1 (two GPUs and num_envs > 476):
2024-06-24 10:54:07 [399,118ms] [Warning] [carb] Client carb.scenerenderer-rtx.plugin has acquired [carb::settings::ISettings v1.0] 100 times. Consider accessing this interface with carb::getCachedInterface() (Performance warning)
2024-06-24 10:57:26 [598,443ms] [Error] [carb.graphics-vulkan.plugin] VkResult: ERROR_INITIALIZATION_FAILED
2024-06-24 10:57:26 [598,443ms] [Error] [carb.graphics-vulkan.plugin] vkCreateSemaphore failed.
2024-06-24 10:57:26 [598,443ms] [Error] [gpu.foundation.plugin] createSemaphore failed for interop semaphore
2024-06-24 10:57:26 [598,443ms] [Error] [gpu.foundation.plugin] Failed to create CUDA interop semaphore in command list "Render graph command list (Render queue 0, device 1, frame submission index 0)".
2024-06-24 10:57:56 [628,444ms] [Error] [gpu.foundation.plugin] Submission barrier "CUDA event dependency in MGPU resource" timeout in command list "Render graph command list (Render queue 0, device 0, frame submission index 0)"!
2024-06-24 10:57:56 [628,444ms] [Error] [gpu.foundation.plugin] Wait count: 1, Received signals: 0:
2024-06-24 10:58:26 [658,466ms] [Error] [gpu.foundation.plugin] Submission barrier "CUDA event dependency in MGPU resource" timeout in command list "Render graph command list (Render queue 0, device 0, frame submission index 0)"!
2024-06-24 10:58:26 [658,466ms] [Error] [gpu.foundation.plugin] Wait count: 1, Received signals: 0:
Error 2 (two GPUs and sometimes at num_envs = 476):
Traceback (most recent call last):
File "/workspace/isaaclab/test_fr.py", line 60, in <module>
env = ManagerBasedRLEnv(cfg=env_cfg, render_mode="rgb_array")
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_rl_env.py", line 75, in __init__
super().__init__(cfg=cfg)
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env.py", line 117, in __init__
self.load_managers()
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_rl_env.py", line 118, in load_managers
super().load_managers()
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/manager_based_env.py", line 205, in load_managers
self.observation_manager = ObservationManager(self.cfg.observations, self)
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/observation_manager.py", line 41, in __init__
super().__init__(cfg, env)
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/manager_base.py", line 130, in __init__
self._prepare_terms()
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/managers/observation_manager.py", line 254, in _prepare_terms
obs_dims = tuple(term_cfg.func(self._env, **term_cfg.params).shape[1:])
File "/workspace/isaaclab/test_fr.py", line 28, in camera_img
camera_output = env.scene[cam_cfg.name].data.output["rgb"]
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 170, in data
self._update_outdated_buffers()
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/sensor_base.py", line 279, in _update_outdated_buffers
self._update_buffers_impl(outdated_env_ids)
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 473, in _update_buffers_impl
self._create_annotator_data()
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 592, in _create_annotator_data
data, info = self._process_annotator_output(name, output)
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/sensors/camera/camera.py", line 613, in _process_annotator_output
data = convert_to_torch(data, device=self.device)
File "/workspace/isaaclab/source/extensions/omni.isaac.lab/omni/isaac/lab/utils/array.py", line 90, in convert_to_torch
tensor = tensor.to(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error 3 (one L40S and 475 environments):
2024-06-30 08:41:49 [281,227ms] [Error] [carb]
../../../source/plugins/rtx.resourcemanager/ResourceManagerContext.cpp(6805): rtx::RtxResult rtx::resourcemanager::Context::allocateDescriptorSets(carb::graphics::ResourceBindingSignature*, uint32_t, uint32_t, carb::graphics::DescriptorSet**, carb::graphics::DescriptorPool**, uint32_t*, size_t, rtx::resourcemanager::DescSetAllocationFlags)(): Assertion (success) failed: Unable to allocate descriptor sets.
Test Code:
import argparse
from omni.isaac.lab.app import AppLauncher
parser = argparse.ArgumentParser(description="TEST")
parser.add_argument("--num_envs", type=int, default=2, help="")
AppLauncher.add_app_launcher_args(parser)
args_cli = parser.parse_args()
app_launcher = AppLauncher(args_cli)
simulation_app = app_launcher.app
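# Isaac Lab modules can only be imported after the simulation app has been launched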
import torch
import subprocess
import omni.isaac.lab.sim as sim_utils
from omni.isaac.lab.sensors import CameraCfg
from omni.isaac.lab.managers import ObservationTermCfg as ObsTerm
from omni.isaac.lab.managers import SceneEntityCfg
from omni.isaac.lab_tasks.manager_based.manipulation.lift.config.franka.joint_pos_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.lab.envs import ManagerBasedRLEnv
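# Observation term: read the camera's RGB output and return it flattened per environment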
def camera_img(
    env: ManagerBasedRLEnv,
    cam_cfg: SceneEntityCfg = SceneEntityCfg("camera"),
) -> torch.Tensor:
    camera_output = env.scene[cam_cfg.name].data.output["rgb"]
    camera_output = camera_output[..., :-1]  # drop the 4th (alpha) channel of the RGBA output
    return camera_output.flatten(start_dim=1)
class FrankaObjectLiftEnvCfg(FrankaCubeLiftEnvCfg):
    def __post_init__(self):
        super().__post_init__()
        self.decimation = 6
        self.episode_length_s = 2.5
        self.sim.dt = 0.01
        self.scene.camera = CameraCfg(
            prim_path="{ENV_REGEX_NS}/Table/camera_sensor",
            update_period=self.decimation * self.sim.dt,
            height=84,
            width=84,
            data_types=["rgb"],
            spawn=sim_utils.PinholeCameraCfg(
                focal_length=24.0, focus_distance=400.0, horizontal_aperture=20.955, clipping_range=(0.1, 1.0e5)
            ),
            offset=CameraCfg.OffsetCfg(
                pos=(0.0, -0.4, 0.4),
                rot=(0.6532814862382034, -0.2705980584036595, 0.2705980656543066, 0.6532814687335938),
                convention="world",
            ),
        )
        self.observations.policy.image = ObsTerm(func=camera_img)
env_cfg = FrankaObjectLiftEnvCfg()
env_cfg.scene.num_envs = args_cli.num_envs
env = ManagerBasedRLEnv(cfg=env_cfg, render_mode="rgb_array")
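# Helper to print nvidia-smi output so VRAM usage can be tracked during the test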
def run_nvidia_smi():
    try:
        output = subprocess.check_output(['nvidia-smi'], text=True)
        print(output)
    except subprocess.CalledProcessError as e:
        print(f"Failed to run nvidia-smi: {e}")
    except FileNotFoundError:
        print("nvidia-smi command not found. Ensure you have the NVIDIA drivers installed.")
run_nvidia_smi()
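# Step the environment with random actions and log memory usage every 10 iterations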
for i in range(100):
    a = torch.randn_like(env.action_manager.action)
    _ = env.step(a)
    if i % 10 == 0:
        print("iteration:", i)
        run_nvidia_smi()
Console command: ./isaaclab.sh -p test_fr.py --headless --enable_cameras --num_envs=476