Memory leak issue when loading a new world #81

aaronh65 · 2021-01-21T17:00:29Z

I'm trying to create a custom RL environment for CARLA using leaderboard_evaluator.py as a template, but I'm running into some issues when trying to reset the custom environment (after an episode is done). The functions that load world/scenario, and cleanup after a scenario's end closely matches what's done in leaderboard_evaluator.py (e.g. the load_and_wait_for_world and cleanup functions), but there's a memory leak somewhere that happens every time I reset the environment.

Using a memory profiler shows that each time the environment resets, the carla.Client takes up more and more memory. This eventually leads to an out-of-memory error which kills the process. Is there a clean up method I'm missing/some common pitfall when resetting environments that I should resolve to stop this from happening?

I can provide code if needed, but I wanted first check if this was a known issue.

The text was updated successfully, but these errors were encountered:

aaronh65 · 2021-01-22T00:20:34Z

Quick update: I used some memory profiler tools and something I've seen is that every time I load a new world (in my environment reset method) using self.client.load_world(config.town) where config is a ScenarioConfiguration, memory usage jumps by 50-100 Mb and doesn't go back down by the same amount when I call my cleanup method. I tried looking for any hanging references to the world, but I can't seem to find anything.

I decided to run the memory profiler tool through leaderboard_evaluator.py, and it looks like the memory used increases with each successive scenario that's run, specifically within the _load_and_wait_for_world method. As shown in the graph below, each call to this method adds at least 50 Mb in memory used that doesn't seem to be reclaimed after the scenario's conclusion.

totalmayhem96 · 2021-01-26T04:56:18Z

Great detective work! I have been trying to figure out why I keep running out of GPU memory, and this seems to be the problem. I get through about 20 route scenarios before i get the "out of memory" error... has anyone discovered a way to fix this (even if its a bit hacky)? I need to run 100+ route scenarios and with the current issue I cant get anywhere near that.

aaronh65 · 2021-01-26T06:49:37Z

If you don't have to run the 100 route scenarios all at once, perhaps you could do something hacky with bash scripting. I think I'd try to figure out the approximate number of times you can load a new world before you get an OOM error, and setup your code to run for that number of route scenarios (around 20 in your case). Then, you could use some bash scripting to run the code multiple times to get through all the route scenarios you need e.g. for 100 route scenarios, loop 5 times running 20 route scenarios per loop.

The memory that gets used up by loading new worlds seems relatively consistent and is freed up when you finish running code, so I think looping it with a bash script should work

glopezdiest · 2021-02-11T09:40:35Z

Hey @aaronh65. Thanks for the information. We've also detected the issue and are trying to solve it. I'll report back when we have some answers to this issue.

ap229997 · 2021-04-27T07:58:39Z

Great detective work! I have been trying to figure out why I keep running out of GPU memory, and this seems to be the problem. I get through about 20 route scenarios before i get the "out of memory" error... has anyone discovered a way to fix this (even if its a bit hacky)? I need to run 100+ route scenarios and with the current issue I cant get anywhere near that.

There is another source of memory leakage - checkpoint memory is not freed after the completion of each route and is reloaded in the next route. Just check if you override the destroy() method in your agent file.

SwapnilPande · 2021-10-14T20:32:49Z

Are there any updates on this? I have been encountering the issue with memory growing on the self.client.load_world(config.town) call.

Kin-Zhang · 2022-02-02T11:12:30Z

Great detective work! I have been trying to figure out why I keep running out of GPU memory, and this seems to be the problem. I get through about 20 route scenarios before i get the "out of memory" error... has anyone discovered a way to fix this (even if its a bit hacky)? I need to run 100+ route scenarios and with the current issue I cant get anywhere near that.

There is another source of memory leakage - checkpoint memory is not freed after the completion of each route and is reloaded in the next route. Just check if you override the destroy() method in your agent file.

I didn't see someone override the destroy() method,
e.g:

transfuser/blob/main/leaderboard/team_code/auto_pilot.py
leaderboard/autoagents/autonomous_agent
but the autonomous agent file on destroy function also just pass it.

so, the point on destroy method did you mean that it should destroy some things?

Kin-Zhang · 2022-02-02T11:14:13Z

I also met this problem but it's not so often,
In my script terminal

terminate called after throwing an instance of 'clmdep_msgpack::v1::type_error'
  what():  std::bad_cast

Or Carla terminal:

terminating with uncaught exception of type clmdep_msgpack::v1::type_error: std::bad_castterminating with uncaught exception of type clmdep_msgpack::v1::type_error: std::bad_cast

Signal 6 caught.
Signal 6 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=6
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=120416 LargeMemoryPoolOffset=251552 
Engine crash handling finished; re-raising signal 6 for the default handler. Good bye.
Aborted

ap229997 · 2022-02-02T14:42:42Z

Great detective work! I have been trying to figure out why I keep running out of GPU memory, and this seems to be the problem. I get through about 20 route scenarios before i get the "out of memory" error... has anyone discovered a way to fix this (even if its a bit hacky)? I need to run 100+ route scenarios and with the current issue I cant get anywhere near that.

There is another source of memory leakage - checkpoint memory is not freed after the completion of each route and is reloaded in the next route. Just check if you override the destroy() method in your agent file.

I didn't see someone override the destroy() method, e.g:

transfuser/blob/main/leaderboard/team_code/auto_pilot.py

leaderboard/autoagents/autonomous_agent
but the autonomous agent file on destroy function also just pass it.

so, the point on destroy method did you mean that it should destroy some things?

auto_pilot.py does not load any model checkpoint on the GPU so it does not need a destroy() method. If you are running a model on the GPU then you need to clear the checkpoint memory after every route explicitly in the code, e.g. using destroy() method.

You'll encounter this problem only when running on a large number of routes (<10 shouldn't be an issue). If you want to verify this, you can try running an evaluation with say 50 routes and monitor GPU memory usage after every route. Also, we noticed this issue in CARLA 0.9.10 and earlier version of the leaderboard framework, so I don't know if this has been resolved in the newer versions.

Kin-Zhang · 2022-02-03T15:09:13Z

Great detective work! I have been trying to figure out why I keep running out of GPU memory, and this seems to be the problem. I get through about 20 route scenarios before i get the "out of memory" error... has anyone discovered a way to fix this (even if its a bit hacky)? I need to run 100+ route scenarios and with the current issue I cant get anywhere near that.

There is another source of memory leakage - checkpoint memory is not freed after the completion of each route and is reloaded in the next route. Just check if you override the destroy() method in your agent file.

I didn't see someone override the destroy() method, e.g:

transfuser/blob/main/leaderboard/team_code/auto_pilot.py

leaderboard/autoagents/autonomous_agent
but the autonomous agent file on destroy function also just pass it.

so, the point on destroy method did you mean that it should destroy some things?

auto_pilot.py does not load any model checkpoint on the GPU so it does not need a destroy() method. If you are running a model on the GPU then you need to clear the checkpoint memory after every route explicitly in the code, e.g. using destroy() method.

You'll encounter this problem only when running on a large number of routes (<10 shouldn't be an issue). If you want to verify this, you can try running an evaluation with say 50 routes and monitor GPU memory usage after every route. Also, we noticed this issue in CARLA 0.9.10 and earlier version of the leaderboard framework, so I don't know if this has been resolved in the newer versions.

Thanks for pointing out. I think not only the GPU memory grown but also the CPU Memory, and according to commit in this repo, I think this problem did't solve... I tried to use python library to collect useless memory but it seems didn't help the situation.
I tried the 100-120 routes file, and the things just like this issue author said, the CPU Memory growning every time after loading the world...

varunjammula · 2022-03-09T20:26:06Z

@glopezdiest Any updates on this ?

glopezdiest · 2022-03-10T07:35:30Z

Not really, we are working in a new Leaderboard 2.0, and while this issue is part of our plan to improve the leaderboard, we haven't had a chance to check it out yet.

VinalAsodia1999 · 2022-07-15T22:42:43Z

Hi, I'm trying to run a Reinforcement Learning experiment using the DynamicObjectCrossing example scenario on Carla 0.9.13. I've used the psutil python module as a memory profiler and I can see the RAM memory usage go up by a constant amount every-time self.client.load_world(town) is called. I've tried to del self.world before the new world is loaded in, but the problem still persists. Has anyone found a solution to this?

yhh-IV · 2024-03-07T13:39:40Z

Anyone has solved this issue?

memolo99 · 2024-04-10T10:21:06Z

I'm having this issue training a DQN agent on different scenarios in CARLA 0.9.12? Any suggestions?

aaronh65 changed the title ~~Memory leak issue when resetting a custom RL environment based on Leaderboard~~ Memory leak issue when loading a new world Jan 26, 2021

glopezdiest mentioned this issue Apr 26, 2021

terminating with uncaught exception of type clmdep_msgpack::v1::type_error: std::bad_cast carla-simulator/carla#3540

Open

raozhongyu mentioned this issue Apr 27, 2021

terminate called after throwing an instance of 'clmdep_msgpack::v1::type_error' what(): std::bad_cast autonomousvision/transfuser#5

Closed

MoritzNekolla mentioned this issue Jan 17, 2023

Server error when creating multiple worlds. “Failed to find symbol file, expected location”. Potential memory leak? carla-simulator/carla#6122

Open

Kin-Zhang mentioned this issue Oct 23, 2023

About collecting data Kin-Zhang/mmfn#12

Closed

Mariusmarten mentioned this issue Jan 31, 2024

GPU Memory release carla-simulator/carla#7103

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leak issue when loading a new world #81

Memory leak issue when loading a new world #81

aaronh65 commented Jan 21, 2021

aaronh65 commented Jan 22, 2021 •

edited

Loading

totalmayhem96 commented Jan 26, 2021

aaronh65 commented Jan 26, 2021

glopezdiest commented Feb 11, 2021 •

edited

Loading

ap229997 commented Apr 27, 2021

SwapnilPande commented Oct 14, 2021

Kin-Zhang commented Feb 2, 2022

Kin-Zhang commented Feb 2, 2022

ap229997 commented Feb 2, 2022

Kin-Zhang commented Feb 3, 2022

varunjammula commented Mar 9, 2022

glopezdiest commented Mar 10, 2022

VinalAsodia1999 commented Jul 15, 2022

yhh-IV commented Mar 7, 2024

memolo99 commented Apr 10, 2024

Memory leak issue when loading a new world #81

Memory leak issue when loading a new world #81

Comments

aaronh65 commented Jan 21, 2021

aaronh65 commented Jan 22, 2021 • edited Loading

totalmayhem96 commented Jan 26, 2021

aaronh65 commented Jan 26, 2021

glopezdiest commented Feb 11, 2021 • edited Loading

ap229997 commented Apr 27, 2021

SwapnilPande commented Oct 14, 2021

Kin-Zhang commented Feb 2, 2022

Kin-Zhang commented Feb 2, 2022

ap229997 commented Feb 2, 2022

Kin-Zhang commented Feb 3, 2022

varunjammula commented Mar 9, 2022

glopezdiest commented Mar 10, 2022

VinalAsodia1999 commented Jul 15, 2022

yhh-IV commented Mar 7, 2024

memolo99 commented Apr 10, 2024

aaronh65 commented Jan 22, 2021 •

edited

Loading

glopezdiest commented Feb 11, 2021 •

edited

Loading