Drastically improve zombie process cleanup efficiency #362

Merged


andy108369
Contributor

This commit overhauls the zombie process cleanup script, addressing a critical performance issue:

  • Previously, the script would scan all processes and zombies, gathering parent PIDs for EACH zombie.
    This approach was extremely inefficient, taking up to 20 minutes when the server had 465,276 zombie processes.

  • Now, the script stops after finding the first zombie process and focuses on terminating its parent.
    This relies on the observation that many zombies typically share a single parent process.
    Once that parent exits, its zombie children are re-parented to init (or a subreaper), which reaps them, so one termination clears them all.
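A minimal sketch of the "stop at the first zombie" idea, not the script's actual code. The helper name `first_zombie` and the `pid ppid stat` input layout are assumptions made for illustration; the point is only that the scan ends as soon as one row with a `Z` state is seen.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print "PID PPID" of the first zombie
# found in "pid ppid stat" rows read from stdin, then stop reading.
# Exit status is 0 if a zombie was found, 1 otherwise.
first_zombie() {
    awk '$3 ~ /^Z/ { print $1, $2; found = 1; exit }
         END { exit !found }'
}

# Possible use against the live process table (procps ps assumed):
#   ps -eo pid=,ppid=,stat= | first_zombie
```

Because `awk` exits on the first match, the cost no longer grows with the number of zombies, only with how far into the process list the first one appears.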

Other improvements include:

  • Add safety checks to avoid terminating containerd-shim processes
  • Remove the zombie threshold check, allowing immediate action on all zombies. This prevents the misconception
    that a small number of zombie processes is acceptable and encourages users to properly address the root cause
    by integrating process reapers like tini.
  • Increase crontab execution frequency from every 15 minutes to every 5 minutes
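The containerd-shim safety check can be pictured as walking the parent chain and only acting when its top-level entry is a shim. The sketch below is an illustration under assumptions: `parent_chain` and `top_is_containerd_shim` are hypothetical names, the input is `pid ppid comm` rows, and command names containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given "pid ppid comm" rows on stdin and a start PID,
# print the chain "pid:(comm) ..." from the top-level ancestor (the one whose
# parent is PID 1) down to the start PID.
parent_chain() {
    awk -v start="$1" '
        { ppid[$1] = $2; comm[$1] = $3 }
        END {
            chain = ""
            # Walk upward; stop at PID 1, at an unknown PID, or after 64 hops.
            for (pid = start; (pid in ppid) && pid > 1 && hops < 64; pid = ppid[pid]) {
                entry = pid ":(" comm[pid] ")"
                if (chain == "") chain = entry
                else chain = entry " " chain
                hops++
            }
            print chain
        }'
}

# Hypothetical helper: succeed only if the first (top-level) chain entry
# is a containerd-shim process.
top_is_containerd_shim() {
    first=${1%% *}
    case "$first" in
        *":(containerd-shim"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use (procps ps assumed):
#   chain=$(ps -eo pid=,ppid=,comm= | parent_chain "$parent_pid")
#   top_is_containerd_shim "$chain" || echo "No action taken."
```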

These changes result in a much more responsive and efficient handling of zombie processes,
reducing cleanup time from 20 minutes to just seconds, even with hundreds of thousands of zombies.

Address an issue where using `xargs` caused argument lists to become too long on systems with many processes. This resulted in errors such as:

    /usr/local/bin/kill_zombie_parents.sh: line 14: /usr/bin/ps: Argument list too long

Changes include:
- Replacing `xargs` with shell loops for reading process IDs in `detect_zombies` and `find_zombie_parents`.
- Using `while read` loops to safely build the list of child processes without exceeding argument length limits.
- Ensuring more robust zombie process detection and signaling by avoiding argument overflows.

This update ensures the script works reliably even in environments with a large number of processes.
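To illustrate the `while read` pattern described above, here is a small sketch that streams PIDs through stdin instead of packing them into one command line. `children_of` is a hypothetical name and the `pid ppid` input layout is an assumption; the real functions in the script may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the direct children of the PID given as $1,
# one per line, reading "pid ppid" rows from stdin. Rows are handled one at
# a time, so no command ever receives the whole PID list as arguments.
children_of() {
    parent=$1
    while read -r pid ppid; do
        if [ "$ppid" = "$parent" ]; then
            printf '%s\n' "$pid"
        fi
    done
    return 0
}

# Possible use (procps ps assumed):
#   ps -eo pid=,ppid= | children_of "$some_pid"
```

Since the data travels over a pipe rather than through `argv`, its size is not bounded by the kernel's argument-length limit.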

Affected functions:
- Modified zombie process handling logic in the `detect_zombies`, `pidtree`, and `find_zombie_parents` functions.
@andy108369
Contributor Author

Test results

  • When the zombie reproducer runs directly on the server:

    root@x1:~# /usr/local/bin/kill_zombie_parents.sh
    Found zombie process 325103 with immediate parent 325101
    Parent chain: 2695:(systemd) 201404:(gnome-terminal-) 225806:(bash) 325101:(zombie)
    Top-level parent is not containerd-shim. No action taken.

  • When the zombie reproducer runs under containerd:

    root@x1:~# /usr/local/bin/kill_zombie_parents.sh
    Found zombie process 328874 with immediate parent 325591
    Parent chain: 325569:(containerd-shim) 325591:(tail)
    Top-level parent is containerd-shim
    Attempting to send SIGCHLD to parent process 325591
    Zombie process 328874 still exists after SIGCHLD
    Attempting to send SIGTERM to parent process 325591
    Zombie process 328874 still exists after SIGTERM
    Attempting to send SIGKILL to parent process 325591
    Zombie process 328874 no longer exists after SIGKILL
    Zombie process cleaned up after SIGKILL

I additionally tested this on a production server with 465288 processes running. Detecting a zombie took less than 5 seconds.
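The container run in the test results shows an escalation from SIGCHLD to SIGTERM to SIGKILL, with a check after each signal. A rough sketch of that sequence follows; the function names, the `/proc` existence check, and the `SETTLE_SECONDS` wait are all assumptions for illustration and may not match the script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, kept separate so they can be swapped out in tests.
zombie_exists() { [ -d "/proc/$1" ]; }           # Linux-only check (assumption)
send_signal()   { kill -s "$1" "$2" 2>/dev/null; }

# Send CHLD, then TERM, then KILL to the parent ($1), stopping as soon as
# the zombie ($2) is gone. Returns 0 on cleanup, 1 if the zombie survived.
escalate_signals() {
    parent=$1
    zombie=$2
    for sig in CHLD TERM KILL; do
        echo "Attempting to send SIG$sig to parent process $parent"
        send_signal "$sig" "$parent"
        sleep "${SETTLE_SECONDS:-1}"             # give the kernel time to reap
        if ! zombie_exists "$zombie"; then
            echo "Zombie process $zombie no longer exists after SIG$sig"
            return 0
        fi
        echo "Zombie process $zombie still exists after SIG$sig"
    done
    return 1
}
```

Starting with SIGCHLD gives a well-behaved parent the chance to reap its children without being terminated; the harsher signals are only a fallback.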

@andy108369
Contributor Author

@HoomanDgtl

@HoomanDgtl HoomanDgtl merged commit 60df26f into akash-network:main Sep 4, 2024