
Memory leak issue when loading a new world #81

Open
aaronh65 opened this issue Jan 21, 2021 · 15 comments

@aaronh65

I'm trying to create a custom RL environment for CARLA using leaderboard_evaluator.py as a template, but I'm running into some issues when trying to reset the custom environment (after an episode is done). The functions that load the world/scenario and clean up after a scenario ends closely match what's done in leaderboard_evaluator.py (e.g. the load_and_wait_for_world and cleanup functions), but there's a memory leak somewhere that happens every time I reset the environment.

Using a memory profiler shows that each time the environment resets, the carla.Client takes up more and more memory. This eventually leads to an out-of-memory error which kills the process. Is there a cleanup method I'm missing, or some common pitfall when resetting environments that I should resolve to stop this from happening?

I can provide code if needed, but I wanted to first check if this was a known issue.
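
For reference, here's a rough sketch of the reset/cleanup pattern I mean (simplified, not my actual code; CustomEnv, the attribute names, and the settings values are just placeholders):

```python
# Rough sketch of the reset/cleanup pattern, assuming the plain CARLA 0.9.x Python API.
# CustomEnv and the attribute names are illustrative placeholders.
import carla

class CustomEnv:
    def __init__(self, host="localhost", port=2000, town="Town01"):
        self.client = carla.Client(host, port)
        self.client.set_timeout(60.0)
        self.town = town
        self.world = None

    def reset(self):
        self.cleanup()
        # this is the call whose memory usage keeps growing
        self.world = self.client.load_world(self.town)
        settings = self.world.get_settings()
        settings.synchronous_mode = True
        settings.fixed_delta_seconds = 0.05
        self.world.apply_settings(settings)
        self.world.tick()
        return self.world

    def cleanup(self):
        if self.world is None:
            return
        # destroy spawned actors and hand control back to the server
        self.client.apply_batch([carla.command.DestroyActor(a)
                                 for a in self.world.get_actors().filter("vehicle.*")])
        settings = self.world.get_settings()
        settings.synchronous_mode = False
        settings.fixed_delta_seconds = None
        self.world.apply_settings(settings)
        self.world = None
```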

@aaronh65
Author

aaronh65 commented Jan 22, 2021

Quick update: I used some memory profiler tools, and something I've seen is that every time I load a new world (in my environment's reset method) using self.client.load_world(config.town), where config is a ScenarioConfiguration, memory usage jumps by 50-100 MB and doesn't go back down by the same amount when I call my cleanup method. I tried looking for any hanging references to the world, but I can't seem to find anything.

I also ran the memory profiler through leaderboard_evaluator.py, and it looks like the memory used increases with each successive scenario that's run, specifically within the _load_and_wait_for_world method. As shown in the graph below, each call to this method adds at least 50 MB of memory that doesn't seem to be reclaimed after the scenario's conclusion.

(attached graph: leaderboard_memory_leak, showing memory usage growing with each call to _load_and_wait_for_world)

@totalmayhem96

Great detective work! I have been trying to figure out why I keep running out of GPU memory, and this seems to be the problem. I get through about 20 route scenarios before I get the "out of memory" error... has anyone discovered a way to fix this (even if it's a bit hacky)? I need to run 100+ route scenarios, and with the current issue I can't get anywhere near that.

@aaronh65
Author

If you don't have to run the 100 route scenarios all at once, perhaps you could do something hacky with scripting. I'd figure out the approximate number of times you can load a new world before you get an OOM error, and set up your code to run for that many route scenarios (around 20 in your case). Then you could use a script to run the code multiple times to get through all the route scenarios you need, e.g. for 100 route scenarios, loop 5 times running 20 route scenarios per loop.

The memory that gets used up by loading new worlds seems relatively consistent and is freed when the process exits, so I think looping it with a script should work.
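
Something along these lines, for example (a Python take on the same batching idea using subprocess; the script name and CLI arguments are placeholders you'd swap for your own entry point):

```python
# Hypothetical batching wrapper: run the routes in a fresh process every N routes
# so the OS reclaims the leaked memory when each batch exits. "run_my_routes.py"
# and its arguments are placeholders -- adapt them to your own runner.
import subprocess

ROUTES_PER_BATCH = 20   # roughly how many worlds you can load before hitting OOM
TOTAL_ROUTES = 100

for start in range(0, TOTAL_ROUTES, ROUTES_PER_BATCH):
    end = min(start + ROUTES_PER_BATCH, TOTAL_ROUTES)
    # each batch runs in its own process, so the leaked memory is released on exit
    subprocess.run(
        ["python", "run_my_routes.py",
         "--first-route", str(start), "--last-route", str(end)],
        check=True,
    )
```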

@aaronh65 changed the title from "Memory leak issue when resetting a custom RL environment based on Leaderboard" to "Memory leak issue when loading a new world" on Jan 26, 2021
@glopezdiest
Contributor

glopezdiest commented Feb 11, 2021

Hey @aaronh65, thanks for the information. We've also detected this issue and are trying to solve it. I'll report back when we have some answers.

@ap229997

(replying to @totalmayhem96's comment above about running out of GPU memory after ~20 route scenarios)

There is another source of memory leakage: checkpoint memory is not freed after each route completes, and the checkpoint is reloaded for the next route. Check whether you override the destroy() method in your agent file.

@SwapnilPande

Are there any updates on this? I have been encountering the issue with memory growing on the self.client.load_world(config.town) call.

@Kin-Zhang

(quoting @ap229997's reply above about freeing checkpoint memory in the agent's destroy() method)

I didn't see anyone override the destroy() method, e.g.:

  1. transfuser/blob/main/leaderboard/team_code/auto_pilot.py
  2. leaderboard/autoagents/autonomous_agent
     (the destroy() function in the autonomous agent file also just passes)

So, regarding the destroy() method, do you mean that it should actually destroy some things?

@Kin-Zhang

I also ran into this problem, though not very often.
In my script's terminal:

terminate called after throwing an instance of 'clmdep_msgpack::v1::type_error'
  what():  std::bad_cast

Or in the CARLA terminal:

terminating with uncaught exception of type clmdep_msgpack::v1::type_error: std::bad_cast
terminating with uncaught exception of type clmdep_msgpack::v1::type_error: std::bad_cast

Signal 6 caught.
Signal 6 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=6
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=120416 LargeMemoryPoolOffset=251552 
Engine crash handling finished; re-raising signal 6 for the default handler. Good bye.
Aborted

@ap229997

ap229997 commented Feb 2, 2022

(replying to @Kin-Zhang's question above about what the destroy() method should actually destroy)

auto_pilot.py does not load any model checkpoint on the GPU, so it does not need a destroy() method. If you are running a model on the GPU, then you need to explicitly clear the checkpoint memory after every route, e.g. in the destroy() method.

You'll encounter this problem only when running a large number of routes (<10 shouldn't be an issue). If you want to verify this, try running an evaluation with, say, 50 routes and monitor GPU memory usage after every route. Also, we noticed this issue in CARLA 0.9.10 and an earlier version of the leaderboard framework, so I don't know whether it has been resolved in newer versions.
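
As a rough illustration, something like this (the AutonomousAgent base class comes from the leaderboard module mentioned above; self.net, torch, and the exact cleanup steps are assumptions about a typical PyTorch-based agent, not leaderboard API):

```python
# Hypothetical destroy() override for a GPU agent. The attribute name self.net and
# the torch calls are assumptions about how the agent stores its checkpoint.
import gc
import torch
from leaderboard.autoagents.autonomous_agent import AutonomousAgent

class MyGPUAgent(AutonomousAgent):
    def destroy(self):
        # drop the reference to the model loaded in setup() and release cached GPU memory
        self.net = None
        gc.collect()
        torch.cuda.empty_cache()
```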

@Kin-Zhang

(quoting @ap229997's reply above about explicitly clearing GPU checkpoint memory in destroy())

Thanks for pointing that out. I think it's not only GPU memory that grows but also CPU memory, and judging by the commits in this repo, this problem hasn't been solved... I tried using Python's garbage collector to free unused memory, but it didn't seem to help.
I tried a file with 100-120 routes, and it's just as the issue author said: CPU memory grows every time after loading the world...

@varunjammula

@glopezdiest Any updates on this?

@glopezdiest
Contributor

Not really. We are working on a new Leaderboard 2.0, and while this issue is part of our plan to improve the leaderboard, we haven't had a chance to look into it yet.

@VinalAsodia1999

Hi, I'm trying to run a reinforcement learning experiment using the DynamicObjectCrossing example scenario on CARLA 0.9.13. I've used the psutil Python module as a memory profiler, and I can see RAM usage go up by a constant amount every time self.client.load_world(town) is called. I've tried to del self.world before the new world is loaded, but the problem still persists. Has anyone found a solution to this?
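
This is roughly what I'm measuring (simplified; the host, port, and town list are just example values):

```python
# Simplified version of the measurement: print the process RSS before and after
# each load_world call. Host, port, and the town list are example values.
import os
import carla
import psutil

proc = psutil.Process(os.getpid())
client = carla.Client("localhost", 2000)
client.set_timeout(120.0)

for i, town in enumerate(["Town01", "Town02", "Town03", "Town01"]):
    before = proc.memory_info().rss / 1e6
    client.load_world(town)
    after = proc.memory_info().rss / 1e6
    print(f"reset {i}: {town}: RSS {before:.0f} MB -> {after:.0f} MB (+{after - before:.0f} MB)")
```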

@yhh-IV

yhh-IV commented Mar 7, 2024

Has anyone solved this issue?

@memolo99

I'm having this issue training a DQN agent on different scenarios in CARLA 0.9.12. Any suggestions?
