Resolve Storage Leak Problem in automation.pdap.io #228

Open
maxachis opened this issue Sep 4, 2024 · 2 comments
maxachis commented Sep 4, 2024

Upon checking automation.pdap.io, I noticed that none of the jobs had run for over 24 hours. After snurfling around more closely, I realized that the "Built-In" Node which runs Jenkins jobs had stopped running because it had no storage space left.

More specifically, when I went to /var/lib/docker and ran du -sh overlay2 (which reports the size of the overlay2 directory inside the Docker directory), it showed 42 GB in use, with our node having only 50 GB of storage in total.
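For reference, a quick way to check this (paths are the Docker defaults; assumes root access on the droplet):

    # Overall usage of the root filesystem
    df -h /
    # Size of Docker's overlay2 layer storage
    sudo du -sh /var/lib/docker/overlay2
    # Docker's own breakdown of images, containers, volumes, and build cache
    docker system df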

After doing some additional online snurfling, I found I could quickly solve this by

  1. Running docker system prune --all --volumes --force, which removes unused images, stopped containers, unused volumes, and build cache.
  2. Restarting Jenkins by visiting https://automation.pdap.io/safeRestart in the browser.

Once I did that, I reclaimed 37.93 GB of space. Neat! So in the future, whenever this happens, we can do that again as a stopgap measure.
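For the record, both steps can also be run as a single shell snippet. The curl restart is an assumption on my part: Jenkins accepts a POST to /safeRestart, but depending on our security setup it may also need credentials and a CSRF crumb.

    # 1. Reclaim space from unused images, containers, volumes, and build cache
    docker system prune --all --volumes --force
    # 2. Ask Jenkins for a safe restart (same effect as visiting /safeRestart;
    #    JENKINS_USER/JENKINS_TOKEN are placeholders for however we authenticate)
    curl -X POST -u "$JENKINS_USER:$JENKINS_TOKEN" https://automation.pdap.io/safeRestart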

Of course, ideally we wouldn't have to do this at all. Unfortunately, at the moment it's not clear what the underlying problem is, and we also have no way of being alerted when the node goes offline. So we should solve both of those:

Requirements

  • Set up a means of sending a notification when the Jenkins Built-In Node goes offline, or when a storage threshold on the droplet is exceeded (a sketch of one possible approach follows this list).
  • Figure out and resolve the Docker storage leak problem.
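As a starting point for the first requirement, here's a minimal sketch of a cron-able script that fires a notification when disk usage crosses a threshold. The webhook URL and threshold are placeholders, not something we've set up:

    #!/bin/sh
    # disk_alert.sh -- notify when root filesystem usage crosses a threshold.
    THRESHOLD=80
    WEBHOOK_URL="https://example.com/our-alert-webhook"  # placeholder
    # Extract the usage percentage for / as a bare number, e.g. "84"
    USAGE=$(df / --output=pcent | tail -1 | tr -dc '0-9')
    if [ "$USAGE" -ge "$THRESHOLD" ]; then
        curl -s -X POST -H 'Content-Type: application/json' \
            -d "{\"text\": \"automation.pdap.io disk usage at ${USAGE}%\"}" \
            "$WEBHOOK_URL"
    fi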

Additional Resources

This Docker forum thread discusses the issue and postulates some possible causes.

  • One of those possible causes is the images themselves writing log files or other data to disk that are never properly cleaned up.
maxachis commented Nov 14, 2024

This occurred again on 2024-11-14.

For the moment, I've set up a stopgap solution by creating a cron job with the following command:

    0 0 1 * * docker system prune --all --volumes --force

For those lacking a cron-to-English translator, that means it runs at midnight on the first day of every month.
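If we want the prune's output captured for later debugging, a slightly expanded entry could be used instead (the log path here is arbitrary):

    # m h dom mon dow  command
    0 0 1 * * docker system prune --all --volumes --force >> /var/log/docker-prune.log 2>&1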

maxachis commented Dec 16, 2024

This occurred again on 2024-12-16.

This post in the thread linked above suggests the problem is due to too many log files being produced, and proposes several possible solutions.
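If unbounded container logs are indeed the culprit, one commonly suggested fix (a sketch, not something I've verified on our droplet yet) is to cap the json-file log driver in /etc/docker/daemon.json and restart the daemon. Note that existing containers keep their old log settings until they are recreated:

    # Write (or merge into) /etc/docker/daemon.json
    sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "10m",
        "max-file": "3"
      }
    }
    EOF
    sudo systemctl restart docker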

Additionally, there is this article on optimizing Docker storage.
