Zombie processes after recent update #1236

Open
boldtrn opened this issue Apr 23, 2024 · 11 comments

Comments

@boldtrn
Contributor

boldtrn commented Apr 23, 2024

We recently updated from 4.5.1 to 4.10.3. Since the update we have seen a number of performance issues with our tile server. One thing that stands out to me is that we are getting zombie processes. We are using the Docker image.

The zombie processes are node commands apparently, so maybe there was an issue introduced along the way?

ps aux | grep 'Z'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
systemd+    1551  0.0  0.0      0     0 ?        Z    09:46   0:00 [node] <defunct>
systemd+   66028  0.0  0.0      0     0 ?        Z    16:30   0:00 [node] <defunct>
systemd+   93659  0.0  0.0      0     0 ?        Z    19:26   0:00 [node] <defunct>
systemd+   94458  0.0  0.0      0     0 ?        Z    19:33   0:00 [node] <defunct>
systemd+   96641  0.0  0.0      0     0 ?        Z    19:46   0:00 [node] <defunct>
systemd+  101128  0.0  0.0      0     0 ?        Z    20:15   0:00 [node] <defunct>
systemd+  101942  0.0  0.0      0     0 ?        Z    20:21   0:00 [node] <defunct>
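
Note that `grep 'Z'` matches any line containing a capital Z, not only the STAT column. A stricter filter could look like the minimal sketch below, assuming a procps-style `ps` that supports `-eo pid,ppid,stat,comm`. The function name `list_zombies` is hypothetical. It also keeps the parent PID column, which points at the process that has not reaped the child.

```shell
# list_zombies: hypothetical helper, not part of tileserver-gl.
# Reads rows in the layout produced by `ps -eo pid,ppid,stat,comm`
# and keeps the header plus every row whose STAT column starts with Z.
list_zombies() {
  awk 'NR == 1 || $3 ~ /^Z/'
}
```

On a live system this would be used as `ps -eo pid,ppid,stat,comm | list_zombies`.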
@acalcutt
Collaborator

I am not aware of any change that would cause that issue, but can you see if the just-released 4.11.0 helps?

@boldtrn
Contributor Author

boldtrn commented Apr 23, 2024

We had to revert to 4.5.1 for now. I will give this another try soon :)

@acalcutt
Collaborator

acalcutt commented Apr 23, 2024

Where are you running the command above: inside the Docker image or on the host that Docker is running on? Does it take time to build up like that?

@boldtrn
Contributor Author

boldtrn commented Apr 24, 2024

I ran this on the host. I believe these might have been Docker processes that were killed or something similar, as the user is systemd+ and Docker is managed by systemd, but this is only an idea at this point.
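
One way to narrow down where the defunct processes come from is to look up their parent, since a zombie stays in the process table until its parent reaps it. A minimal sketch, assuming a Linux host with `/proc` mounted; `parent_of` is a hypothetical helper name:

```shell
# parent_of: hypothetical helper (Linux only).
# Prints the parent PID of the given PID by reading the PPid field
# from /proc/<pid>/status.
parent_of() {
  awk '/^PPid:/ { print $2 }' "/proc/$1/status"
}
```

For example, `parent_of 1551` would print the parent of the first defunct process listed above. Reading `/proc/<parent PID>/cgroup` afterwards typically indicates whether that parent runs inside a Docker container.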

@boldtrn
Contributor Author

boldtrn commented May 3, 2024

I can verify that the issue still persists with the latest release. I can't see anything obvious in the logs, but I have to admit it's a production system, so there are a lot of logs. If you have a hint about what to search for in the logs, I can give this a try. Obvious terms like ERROR or FATAL did not show anything interesting.

@acalcutt
Collaborator

acalcutt commented May 5, 2024

Unfortunately I don't have any good answers on what to look for. If I had to guess, it would be a rendering issue, since rendering starts its own threads. I find that when maplibre-native has an issue, it doesn't always give back an error.

When I am troubleshooting something like that, I try to find a URL that isn't loading as expected. I then test that URL in a more controllable instance. Usually in testing I uncomment https://github.com/maptiler/tileserver-gl/blob/master/src/serve_rendered.js#L874 to get an idea of what is being loaded when maplibre-native fails.

Have you seen anything that is failing to load with the new version? You were using static images, right?

@boldtrn
Contributor Author

boldtrn commented May 5, 2024

> Have you seen anything that is failing to load with the new version? You were using static images, right?

We are using raster and vector tiles as well as static images. I haven't seen anything failing, but we are serving several million tile requests per day, so it's hard to track down isolated issues. We had some performance issues, but I doubt these are related to the version. We are currently running different versions of tileserver-gl, and CPU and other resource usage looks somewhat similar (the latest version actually seems to consume about 5% fewer resources).

@acalcutt
Collaborator

acalcutt commented May 17, 2024

Just an FYI, I did find an issue in the Docker build caused by the change to use "is-ci" when dev utils were not included. I put that back to the old method in #1250. That should be fixed in 4.11.1.

I'm not sure it has anything to do with your issue, but I thought it could be a possibility.

@boldtrn
Contributor Author

boldtrn commented May 17, 2024

I will give this a try, thanks 👍

@boldtrn
Contributor Author

boldtrn commented Jun 5, 2024

OK, I think the latest update, 4.11.1, did indeed fix the zombie processes; I haven't seen them since. Thanks for looking into this @acalcutt 👍

@boldtrn boldtrn closed this as completed Jun 5, 2024
@boldtrn
Contributor Author

boldtrn commented Jun 15, 2024

Unfortunately, I have to reopen this issue. Zombie processes reappeared yesterday on one of our servers. The container even went down and we had to restart it. Again, the logs did not show anything new.

Screenshot 2024-06-15 at 09 04 03
