
Question: What is simulation speed bottleneck? #680

Open
peci1 opened this issue Oct 22, 2020 · 25 comments
Labels
help wanted Extra attention is needed

Comments

@peci1
Collaborator

peci1 commented Oct 22, 2020

We tried to run the simulator on a beefy machine (40 cores, 4 GPUs) with a full team of robots (3 UGVs, 3 UAVs, approx. 30 cameras in total). Neither the CPUs nor any of the GPUs were anywhere near full load, yet the real-time factor was between 1 and 2 percent. Is there any clear performance bottleneck that could be worked on? E.g. aren't the sensors rendered serially (i.e. first one camera, then a second one, and so on)? Or is there something else? I'm pretty sure the physics computations shouldn't be that costly (and the performance doesn't drop linearly with the number of robots).

@AravindaDP

AravindaDP commented Oct 22, 2020

I'm also interested in the same question. Is it possible to achieve a similar level of performance to Cloudsim using docker compose, and if so, what kind of machine should we use? (I'm primarily targeting an AWS EC2 instance.)

My experience has been as follows (numbers could be slightly off, as I'm quoting them from memory).
For a single X2 UGV with 3D lidar + 4x RGBD, or a UAV with 2D lidar + RGBD + 2 point lidars:
Local PC with a 7th-gen i7 (4C/8T @ 2.8 GHz) + GTX 1060: real-time factor around 30%
Cloudsim: 40~50%

For 2x UGV (same as above) + 2x UAV (same as above) + Teambase:
Local PC: 2~3%
Cloudsim: 10%

These local numbers are even without solution containers, i.e. just using ign launch from a catkin workspace without any solution nodes. I haven't tried headless mode yet either.

I still haven't tested using docker compose on an Amazon EC2 instance. I understand that even then it won't be an apples-to-apples comparison, since in Cloudsim the simulation container and the solution containers potentially run on different EC2 instances.

In general I'm looking for the following information, if it's possible to know:

  1. What is the instance type of the host EC2 machine (p3.8xlarge etc.) used to run the simulation docker container? How much vCPU/GPU/RAM does that container get?

  2. What is the instance type of the host EC2 machines used to run the bridge containers and solution containers? How much vCPU/GPU/RAM does each container get? (Or how many docker containers run on each EC2 instance?)

Probably the relationship between EC2 instances and container count is not straightforward, as it's a dynamically managed Kubernetes cluster spanning multiple nodes. I'm just looking for some rough figures so that I could try to recreate the setup using docker compose on AWS (probably with an even beefier EC2 instance whose power equals the combined EC2 instances required for a Cloudsim simulation, assuming such a single beefier instance type exists).

@peci1
Collaborator Author

peci1 commented Oct 22, 2020

@AravindaDP Most of your questions have answers in https://github.com/osrf/subt/wiki/Cloudsim%20Architecture . This issue is, however, about finding the bottlenecks on systems that have enough resources. Your local PC tests most probably suffer from resource exhaustion...

@AravindaDP

@peci1 Thanks for pointing out the resources.

@pauljurczak

Here are the specs of the EC2 instance:

Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz: 
#1-16  2660 MHz
99.72Gi free RAM
119.99Gi total RAM

@AravindaDP

AravindaDP commented Oct 23, 2020

Not sure if it has any effect, but it seems some of the 3D lidars have a much higher horizontal resolution than the real sensors they are based on.

E.g. for X1 Config 8 and EXPLORER_X1 Config 2 it is 10000 horizontal points per ring:
https://github.com/osrf/subt/blob/master/submitted_models/explorer_x1_sensor_config_2/model.sdf#L580

Whereas the VLP-16 (on which I believe these were modeled) only has about 1200 horizontal points per ring at 15 Hz:
http://www.mapix.com/wp-content/uploads/2018/07/63-9229_Rev-H_Puck-_Datasheet_Web-1.pdf
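A quick way to check the configured sample count yourself, run from the root of a subt checkout (plain grep, nothing SubT-specific):

```
# show the <horizontal> scan block (sample count and resolution) of the 3D lidar in this model
grep -n -A 3 '<horizontal>' submitted_models/explorer_x1_sensor_config_2/model.sdf
```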

Probably this is not the actual bottleneck, but I guess it still needs correction.

Maybe the culprit is the camera sensor? I see a difference between COSTAR_HUSKY and EXPLORER_X1 (a single RGBD vs. 4x RGBD). Is it CPU-based rendering? Could it be made to use the GPU?

@zbynekwinkler

According to my tests, the simulation does not use more than 4 CPU cores. As for the GPU, most of the usage is in the GUI. I do all local runs headless; if I don't, the GUI takes all available GPU memory (8 GB in my case) and nothing else works on the computer running the simulation.

Not knowing anything about the actual implementation, I am also surprised by the drop in performance when simulating multiple robots. From my (possibly naive) point of view (considering current games), the resolution of the cameras is small and the quality requirements are not that big either. We are mostly using 640x480 cameras, which is 0.3 MP. 1920x1080 (2.1 MP) seems to be the minimum current games use, and their FPS starts at 60 Hz, so pixel-wise the ratio is 6.75x and FPS-wise 3x (at minimum). Given this comparison, it should be possible to render about 20 cameras at 480p and 20 Hz in real time, while in reality we get only about 3% of that.

So yes, I'd also like to know where the bottleneck is. It is really difficult to get anything done at 3% of real time.
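For reference, a headless local run can be started with something along these lines. This is only a sketch: the launch file name and the circuit/world/robot arguments below are illustrative and depend on your checkout; headless:=true is the toggle I mean.

```
# illustrative headless run of a single robot; adjust launch file and arguments to your setup
ign launch -v 4 competition.ign circuit:=cave worldName:=simple_cave_01 \
  robotName1:=X1 robotConfig1:=X1_SENSOR_CONFIG_1 headless:=true
```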

@peci1
Collaborator Author

peci1 commented Oct 23, 2020

GUI takes all available GPU memory (8GB in my case) and nothing else works on the computer running the simulation.

How many robots are you talking about? I get like 800 MB GPU memory for the GUI with a single robot, and it seems to scale more or less linearly with more robots.

We've actually found out that the Absolem robot is quite a greedy-guts regarding cameras - the main 6-lens omnicamera adds up to something like 4K... So I wouldn't be surprised that it takes some time to simulate, but I wonder why the GPU isn't fully used. Or maybe it's just an artifact of the way nvidia-smi computes GPU usage? I know there are many different computation/rendering pipelines in the GPU...
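To break that number down a bit, plain nvidia-smi can sample the individual utilization counters (nothing SubT-specific here):

```
# overall GPU and memory-controller utilization plus used memory, sampled every second
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv -l 1

# per-device view with separate sm/mem/enc/dec utilization columns
nvidia-smi dmon -s u
```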

@zbynekwinkler

I have re-run the test. Currently, when running a headless simulation with a single X2 robot, there is one ruby process taking about 2 GB of memory and the GPU utilization stays around 5%. When running the same setup with the GUI, there are two ruby processes each taking 2 GB, but the GPU utilization jumps to 100% and even mouse movement slows down (even when the window is not visible). So in my book the GUI is still broken for me and I'll continue running headless.

I have an Ubuntu 18.04 system with NVIDIA driver 450 and a GeForce GTX 1050 with 8 GB.

@peci1
Collaborator Author

peci1 commented Oct 23, 2020

Was this test performed via Docker or in a direct catkin install?

@zbynekwinkler

Docker

@peci1
Collaborator Author

peci1 commented Oct 23, 2020

Could you re-do the test with a direct install? I'd like to clearly separate the performance loss introduced by Docker from the performance of the simulator itself. When I run the simulator directly and headless, there is no noticeable slowdown on my 8th-gen Core i7 ultrabook with an external GPU (as long as there is a single robot without that many cameras).

@zbynekwinkler

Could you re-do the test with a direct install?

Actually, sorry, no. I am not going to risk messing up the whole computer by installing all the ROS and ign stuff directly into a system I depend on. However, rviz works just fine from inside the docker container, taking a hardly noticeable hit on GPU utilization when displaying an image from the front camera and the depth cloud at the same time, and using only 18 MB of GPU memory - that aligns more with my expectations.

@zbynekwinkler

We might get some improvement in speed by building the plugins in this repository with optimizations enabled. See #688

@dan-riley

The simulation speed is clearly related to the RGBD cameras. I have modified our models to run with only LIDAR and no cameras, and two robots can be run at about 60-80% real time. If the same models are used but with the cameras enabled, the same two robots run at about 20% real time. Our models use a 64-beam LIDAR versus the 16-beam one present on most systems, so a higher-resolution LIDAR does not seem to impact performance much.

@peci1
Collaborator Author

peci1 commented Oct 30, 2020

I agree the speed goes down very much with cameras. However, I wonder why the computer doesn't utilize more resources in order to keep it running as fast as possible.

A GPU lidar is basically just a depth camera in Gazebo - the resolution would be something like 2048x64. So its performance impact would be hard to notice (even more so with the 16-ray ones).

@AravindaDP

AravindaDP commented Oct 30, 2020

My knowledge of Ignition Gazebo is very limited, so take the following observations/hunches with a pinch of salt.

I believe Ignition Gazebo renders the cameras serially, so that might explain why we don't see an increase in resource utilization with respect to the number of robots/cameras.

I'm also curious about the usage of manual scene updates in RenderingSensor (the base class of all cameras as I understand it, but also of the GPU lidar): https://github.com/ignitionrobotics/ign-sensors/blob/main/src/RenderingSensor.cc#L89
Probably something that can be optimized? But if it has any effect, it should affect the GPU lidar as well.

I think the best way to find the bottleneck is to use the profiler to analyze what's taking the time:
https://ignitionrobotics.org/api/common/3.6/profiler.html
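A rough sketch of how that would look, assuming a from-source colcon build of the Ignition libraries (the CMake flag and the viewer script come from the linked page, so double-check the details there):

```
# rebuild the ignition packages with the profiler compiled in
colcon build --cmake-args -DIGN_PROFILER_ENABLE=1

# run the simulation as usual, then open the Remotery web viewer shipped with ign-common
ign_remotery_vis
```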

@AravindaDP

I can confirm that using a release build of subt_ws as suggested in #688 makes a noticeable improvement in performance (in my case, approx. a 2x speed-up for a single X2C6 in GUI mode).

@pauljurczak

@AravindaDP How did you pass compiler flags and build parameters to catkin?

@AravindaDP

@pauljurczak I just used catkin_make -DCMAKE_BUILD_TYPE=Release install as the last command in step 4 here: https://github.com/osrf/subt/wiki/Catkin%20System%20Setup
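In context, the whole release rebuild is roughly the following (workspace path and ROS distro as in that wiki page; adjust to your setup):

```
# rebuild the SubT workspace with compiler optimizations enabled
cd ~/subt_ws
source /opt/ros/melodic/setup.bash
catkin_make -DCMAKE_BUILD_TYPE=Release install
source ~/subt_ws/install/setup.bash
```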

@zbynekwinkler

@pauljurczak See #688.

@pauljurczak

Thank you. I rediscovered that cmake-gui works with this project if launched from the command line. It makes editing configuration options much easier.

@peci1
Collaborator Author

peci1 commented Feb 8, 2021

This might help a lot: gazebosim/gz-sensors#95 . I created a PR that adds a set_rate service to each sensor in the simulation. By calling this service, you can selectively decrease the update rates of the sensors. E.g. the RealSenses on EXPLORER_X1 run at 30 Hz, but we process them at 6 Hz. That's 80% of the images that we just throw away. So let's not even render them!

Thanks for the idea @tpet !

@peci1
Collaborator Author

peci1 commented Mar 1, 2021

The ign-sensors PR has been merged and a new version has been released in the binary distribution. Now, #791 contains the required SubT part, through which teams will be able to control the rendering rate of sensors via ROS services. Even before #791 is merged, you can already set the rate in locally running simulations by calling the Ignition services directly (ign service -l | grep set_rate to get the list of services, then call each of them with the desired rate).
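As a rough sketch, throttling a single sensor looks like this. The service path below is made up for illustration, and I'm assuming the request type is ignition.msgs.Double - check the listing output for the actual names in your world.

```
# list all per-sensor set_rate services exposed by the running simulation
ign service -l | grep set_rate

# throttle one sensor to 6 Hz (service path is illustrative; adjust to your world/robot)
ign service -s /world/example_world/model/X1/link/base_link/sensor/front_camera/set_rate \
  --reqtype ignition.msgs.Double --reptype ignition.msgs.Empty \
  --timeout 1000 --req 'data: 6.0'
```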

@peci1
Collaborator Author

peci1 commented Mar 1, 2021

With #791, I achieve 50-70% RTF with EXPLORER_X1 (3D lidar + 4x RealSense).

@Space-Swarm

Hi all, just bumping this issue as I'm using the simulator and encountering similar problems with a speed bottleneck. Is there a summary somewhere of the options to speed up the simulation?

I have access to a supercomputer, but it requires a lot of specialist software to set up, and I've learnt from @peci1 that a supercomputer may not resolve the bottlenecks. The old Gazebo version with Ignition Dome does not work well for parallel processing, so I'm interested in finding out whether Gazebo Fortress or any other fixes were used by the competing SubT teams. I'm looking for a way to use multiple cores to speed up the simulation.
