
Memory Leak in GPU During Continuous Scene Loading and Coverage Map Calculation #662

Open
hieutrungle opened this issue Nov 15, 2024 · 1 comment

@hieutrungle

I'm encountering a GPU memory leak while performing repeated scene loading and coverage map calculations for my research work. The issue occurs specifically when loading a new scene and running the coverage_map function within a loop. Interestingly, the compute_paths function doesn't exhibit this problem under similar conditions.

[Figure: GPU memory usage over successive iterations]

The figure shows the linear increase in GPU memory each time I load a new scene and run the coverage_map function.

I suspect that after loading a scene and computing the coverage map, some global parameters or cached data might be persisting in the GPU memory, leading to the observed leak.
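For reference, the loop pattern looks roughly like this (the scene, antenna configuration, and transmitter position below are only illustrative placeholders, not my actual setup):

import sionna
from sionna.rt import load_scene, PlanarArray, Transmitter

for i in range(100):
    # Load a fresh scene on every iteration
    scene = load_scene(sionna.rt.scene.munich)
    scene.tx_array = PlanarArray(num_rows=1, num_cols=1,
                                 vertical_spacing=0.5, horizontal_spacing=0.5,
                                 pattern="iso", polarization="V")
    scene.rx_array = scene.tx_array
    scene.add(Transmitter(name="tx", position=[8.5, 21., 27.]))

    # GPU memory grows by a roughly constant amount on every call
    cm = scene.coverage_map(max_depth=5, cm_cell_size=(5., 5.), num_samples=int(1e6))

    # Deleting the Python objects does not return the memory to the GPU
    del cm, scene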

Could you please investigate this issue? Specifically, I'd appreciate if you could:

  • Examine the coverage_map function for potential memory management issues.
  • Check if there are any global parameters or caching mechanisms that aren't being properly cleared.
  • Implement a function to explicitly release a scene and its GPU resources.

Any insights or solutions to resolve this memory leak would be greatly appreciated. Thank you for your time and assistance.

My system:

  • OS: Ubuntu 22.04
  • CUDA: 550 and 535
  • GPU: NVIDIA 2060 Ti, NVIDIA 3090, NVIDIA A100, NVIDIA A30, NVIDIA H100
  • Sionna version: 0.17, 0.18, 0.19
  • Tensorflow: 2.14, 2.15
@wang-chenlong

wang-chenlong commented Dec 11, 2024

I encountered this issue as well. In a fixed scene, I have about 10 transmitters, and each transmitter corresponds to approximately 1500 receiver positions. I simulate one transmitter at a time and process its receivers in batches of 120. After each batch, I remove the receivers and the computed paths from the scene. There is usually no issue while calculating the first transmitter, but during the second transmitter I get an out-of-memory (ResourceExhaustedError) error. I suspect that something in the scene isn't being properly cleared, which causes this.

from tqdm import tqdm
from sionna.rt import Transmitter, Receiver

batch_size = 120
for idx, (x_tx, y_tx, z_tx) in tqdm(enumerate(tx_position), total=len(tx_position), desc="Simulating Transmitter Positions", unit="transmitter"):
    tx = Transmitter(name=f"tx-{idx}",
                     position=tx_position[idx])
    scene.add(tx)
    rx_pos = rx_position[idx]

    # Process the receiver positions in batches
    receiver_batches = len(rx_pos) // batch_size + (1 if len(rx_pos) % batch_size != 0 else 0)
    for batch_start in tqdm(range(0, len(rx_pos), batch_size),
                            total=receiver_batches,
                            desc=f"Processing Receivers for tx-{idx}",
                            unit="batch",
                            leave=False,
                            dynamic_ncols=True):
        # Receiver positions of the current batch
        rx_batch = rx_pos[batch_start:batch_start + batch_size]

        for i, rx in enumerate(rx_batch):
            receiver_name = f"rx-{batch_start + i}"
            rx = Receiver(name=receiver_name,
                          position=rx)
            scene.add(rx)

        # Compute paths
        paths = scene.compute_paths(max_depth=5,
                                    num_samples=1e6,
                                    diffraction=True,
                                    scattering=True)

        # Remove the receivers of the current batch
        for i in range(len(rx_batch)):
            scene.remove(f"rx-{batch_start + i}")

        del paths  # Free memory
    scene.remove(f"tx-{idx}")

My GPU memory ends up fully utilized, even though I have already enabled memory growth:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_memory_growth(gpus[0], True)
    except RuntimeError as e:
        print(e)
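To check whether memory is actually released between batches, something along these lines could be used to log TensorFlow's allocator usage right after each compute_paths call (this is only a monitoring sketch; get_memory_info reports memory held by TensorFlow itself, not by other processes):

import tensorflow as tf

def log_gpu_memory(tag=""):
    # Memory currently held by TensorFlow's GPU allocator, in bytes
    info = tf.config.experimental.get_memory_info("GPU:0")
    print(f"{tag}: current={info['current'] / 1e9:.2f} GB, peak={info['peak'] / 1e9:.2f} GB")

# e.g. call log_gpu_memory(f"tx-{idx}, batch {batch_start}") right after `del paths`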


---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
Cell In[5], line 31
     28     scene.add(rx)
     30 # Compute paths
---> 31 paths = scene.compute_paths(max_depth=5,
     32                             num_samples=1e6,
     33                             diffraction=True,
     34                             scattering=True)
     36 # Remove the receivers of the current batch
     37 for i in range(len(rx_batch)):

File /usr/local/lib/python3.11/dist-packages/sionna/rt/scene.py:958, in Scene.compute_paths(self, max_depth, method, num_samples, los, reflection, diffraction, scattering, ris, scat_keep_prob, edge_diffraction, check_scene, scat_random_phases, testing)
    784 r"""
    785 Computes propagation paths.
    786 
   (...)
    954     Simulated paths
    955 """
    957 # Trace the paths
--> 958 traced_paths = self.trace_paths(max_depth, method, num_samples, los,
    959     reflection, diffraction, scattering, ris, scat_keep_prob,
    960     edge_diffraction, check_scene)
    962 # Compute the fields and merge the paths
    963 # Check scene is not done twice
    964 paths = self.compute_fields(*traced_paths, False, scat_random_phases,
    965                             testing)

File /usr/local/lib/python3.11/dist-packages/sionna/rt/scene.py:654, in Scene.trace_paths(self, max_depth, method, num_samples, los, reflection, diffraction, scattering, ris, scat_keep_prob, edge_diffraction, check_scene)
    651     self._check_scene(False)
    653 # Trace the paths
--> 654 paths = self._solver_paths.trace_paths(max_depth,
    655                                        method=method,
    656                                        num_samples=num_samples,
    657                                        los=los, reflection=reflection,
    658                                        diffraction=diffraction,
    659                                        scattering=scattering,
    660                                        ris=ris,
    661                                        scat_keep_prob=scat_keep_prob,
    662                                        edge_diffraction=edge_diffraction)
    664 return paths

File /usr/local/lib/python3.11/dist-packages/sionna/rt/solver_paths.py:642, in SolverPaths.trace_paths(self, max_depth, method, num_samples, los, reflection, diffraction, scattering, ris, scat_keep_prob, edge_diffraction)
    639 scat_paths_tmp = PathsTmpData(sources, targets, self._dtype)
    640 if scattering and tf.shape(candidates_scat)[0] > 0:
--> 642     scat_paths, scat_paths_tmp = self._scat_test_rx_blockage(targets,sources,
    643                                                         candidates_scat,
    644                                                         hit_points,
    645                                                         ris_objects)
    646     scat_paths, scat_paths_tmp =\
    647         self._compute_directions_distances_delays_angles(scat_paths,
    648                                                          scat_paths_tmp,
    649                                                          True)
    651     scat_paths, scat_paths_tmp =\
    652         self._scat_discard_crossing_paths(scat_paths, scat_paths_tmp,
    653                                           scat_keep_prob)

File /usr/local/lib/python3.11/dist-packages/sionna/rt/solver_paths.py:3511, in SolverPaths._scat_test_rx_blockage(self, targets, sources, candidates, hit_points, ris_objects)
   3508 hit_points_ = tf.gather_nd(hit_points_, gather_indices)
   3509 # Store the valid intersection points
   3510 # [max_depth, num_targets, num_sources, max_num_paths, 3]
-> 3511 opt_hit_points = tf.tensor_scatter_nd_update(opt_hit_points,
   3512                                 scatter_indices_, hit_points_)
   3514 # Intersected primitives
   3515 # [num_targets, num_sources, num_samples]
   3516 candidates_ = tf.gather(candidates, depth, axis=0)

File /usr/local/lib/python3.11/dist-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File /usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/ops.py:5883, in raise_from_not_ok_status(e, name)
   5881 def raise_from_not_ok_status(e, name) -> NoReturn:
   5882   e.message += (" name: " + str(name if name is not None else ""))
-> 5883   raise core._status_to_exception(e) from None

ResourceExhaustedError: {{function_node __wrapped__TensorScatterUpdate_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[5,120,1,213905,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TensorScatterUpdate] name:
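For scale, the single tensor in the failing TensorScatterUpdate is already large on its own, so once earlier allocations stop being released the allocator runs out quickly:

# Shape from the error: [max_depth, num_receivers_in_batch, num_sources, max_num_paths, 3]
n_elems = 5 * 120 * 1 * 213905 * 3          # 385,029,000 float32 elements
print(f"{n_elems * 4 / 1024**3:.2f} GiB")   # ~1.43 GiB for this one tensor alone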
