[BUG] Massive drain on CPU and disk I/O on container launch #116

Closed
1 task done
schklom opened this issue Nov 26, 2024 · 14 comments

Comments

@schklom

schklom commented Nov 26, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

On boot, the container moves a large number of Python libraries:

if [[ -d "${PY_LOCAL_PATH}.bak" ]]; then
    echo "**** New container detected, fixing python package permissions. This may take a while. ****"
    mv "${PY_LOCAL_PATH}.bak" "${PY_LOCAL_PATH}"
    chown -R abc:abc "${PY_LOCAL_PATH}"
fi
and this takes forever at launch.

Instead, this can be done during the build: https://github.com/schklom/Mirror-workflows/blob/850796baa622217b8c65a1d729cb83c98303111e/Dockerfile#L7-L18

It does make the image larger, though:
your latest amd64 image is 508.78 MB: https://hub.docker.com/layers/linuxserver/homeassistant/latest/images/sha256-153fd08a9645b2c96334b86764d7b89d557f49b8845814e982f9f8e15a4cb9ed?context=explore
whereas mine is 927.39 MB: https://hub.docker.com/layers/schklom/home-assistant/latest/images/sha256-efd225b5432868268f4f7cac0514f94c1798150b0c7161f68c1412be56b3b82a?context=repo

That mv step is skipped on container launch on my image, and everything runs smoothly.

I can submit a PR if you like?

Expected Behavior

NA

Steps To Reproduce

NA

Environment

NA

CPU architecture

x86-64

Docker creation

NA

Container logs

NA

Thanks for opening your first issue here! Be sure to follow the relevant issue templates, or risk having this issue marked as invalid.

@aptalca
Member

aptalca commented Nov 26, 2024

It's done to prevent a different issue where the following chown operation can take longer than 20 minutes due to an overlayfs bug.

@schklom
Author

schklom commented Nov 26, 2024

On my machine, the mv at launch often takes over 10 minutes.
With my image, the container takes about 1 minute to start.
Just thought you might be interested.

@aptalca
Member

aptalca commented Nov 26, 2024

Doing a chown -R abc:abc in the Dockerfile is not a solution. The chown needs to be done at runtime, as the abc user's uid/gid gets modified to match the PUID/PGID from env vars during early init.
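
To illustrate the point: files keep the numeric UID/GID they were given at build time, while the IDs the abc user ends up with are only known once PUID/PGID are read at container start. The following is a minimal sketch of that comparison, not code from the image. `needs_chown` is a hypothetical helper, and it assumes a `stat` that supports `-c` (GNU coreutils or BusyBox).

```bash
# Hypothetical helper: exit 0 when the owner recorded on a path differs
# from the UID/GID requested at runtime, exit 1 when it already matches,
# and exit 2 when the path cannot be inspected.
# $1 = path, $2 = wanted UID (PUID), $3 = wanted GID (PGID).
needs_chown() {
    local current
    current=$(stat -c '%u:%g' "$1") || return 2
    [ "$current" != "$2:$3" ]
}
```

For example, `needs_chown "$PY_LOCAL_PATH" "$PUID" "$PGID"` would succeed whenever the build-time owner does not match the requested IDs. It only inspects the top-level directory, so it is an illustration rather than a replacement for the recursive chown.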

@schklom
Author

schklom commented Nov 27, 2024

Does mv not take a while at runtime for you, when creating the container?

> The chown needs to be done at runtime, as the abc user's uid/gid gets modified to match the PUID/PGID from env vars during early init.

Sure, my concern is about mv. Can't the chown alone be done at runtime, like this?

-if [[ -d "${PY_LOCAL_PATH}.bak" ]]; then
+if [[ -d "${PY_LOCAL_PATH}" ]]; then
     echo "**** New container detected, fixing python package permissions. This may take a while. ****"
-    mv "${PY_LOCAL_PATH}.bak" "${PY_LOCAL_PATH}"
     chown -R abc:abc "${PY_LOCAL_PATH}"
 fi

Doing mv at build time should ensure that the correct folder is there.

@aptalca
Member

aptalca commented Nov 27, 2024

> Does mv not take a while at runtime for you, when creating the container?

It doesn't. Not for our ci test either. https://ci-tests.linuxserver.io/linuxserver/homeassistant/latest/index.html (it takes 4 seconds for the move, and a split second for the chown).

The chown without the move triggers the overlayfs COW (copy-on-write) bug, which results in the chown taking 20 minutes or longer. The move is an ugly hack to bypass the overlayfs bug. It's the best solution we have to ensure HA can run as a non-root user and still install Python packages at runtime.
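
To compare these numbers on another host, the move and the chown can be timed separately. This is a minimal sketch, not the image's actual init script. `time_fix` is a hypothetical helper that takes the path and the owner as arguments so it can be tried on a scratch directory instead of the real `PY_LOCAL_PATH`, and it assumes the whole-second resolution of `date +%s` is good enough.

```bash
# Hypothetical helper: time the move and the chown separately.
# $1 = target path (PY_LOCAL_PATH in the real script)
# $2 = owner (abc:abc in the real script)
time_fix() {
    local py_path="$1" owner="$2" t0 t1 t2
    if [ ! -d "${py_path}.bak" ]; then
        echo "no ${py_path}.bak found, nothing to do"
        return 0
    fi
    t0=$(date +%s)
    mv "${py_path}.bak" "${py_path}"
    t1=$(date +%s)
    chown -R "${owner}" "${py_path}"
    t2=$(date +%s)
    echo "mv: $((t1 - t0))s, chown: $((t2 - t1))s"
}
```

On a scratch copy this could be called as `time_fix /tmp/scratch/site-packages "$(id -u):$(id -g)"`, where the scratch path is only an example.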

@thespad
Member

thespad commented Nov 27, 2024

For another data point, it takes just under 2 minutes on my Pi to complete the move + chown:

homeassistant  | 2024-11-25T17:46:51.796496691Z **** New container detected, fixing python package permissions. This may take a while. ****
homeassistant  | 2024-11-25T17:48:44.422127570Z Setting permissions
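
The gap between the two log lines above can be worked out as follows. This is a small sketch with hypothetical helper names (`to_seconds`, `elapsed`); it assumes both timestamps fall on the same day and that `awk` is available.

```bash
# Convert the time-of-day part of a timestamp such as
# 2024-11-25T17:46:51.796496691Z to seconds since midnight (fraction kept).
to_seconds() {
    local t="${1#*T}"   # strip the date and the "T"
    t="${t%Z}"          # strip the trailing "Z"
    awk -v t="$t" 'BEGIN { split(t, p, ":"); printf "%.3f", p[1] * 3600 + p[2] * 60 + p[3] }'
}

# Print the seconds elapsed between two timestamps on the same day.
elapsed() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" 'BEGIN { printf "%.1f\n", b - a }'
}

elapsed "2024-11-25T17:46:51.796496691Z" "2024-11-25T17:48:44.422127570Z"
# prints 112.6
```

That is about 1 minute 53 seconds, consistent with "just under 2 minutes".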

The problem is it's impossible to detect the overlayfs bug ahead of time, so either we give everyone a slight delay on first run or we give a subset of users a 20+ minute startup delay.

Really this is all because HA decided to randomly switch from using pip to using uv for installing packages and the latter can't do user installs properly.

@schklom
Author

schklom commented Nov 27, 2024

Ok, no worries, and thanks for the replies!

I will keep this patch for myself then, as I don't mind the PUID/PGID not being done properly. On my Pi, there are no bugs for now, and no delay on start.
Feel free to close if you want.

@thespad thespad closed this as completed Nov 27, 2024
@LinuxServer-CI LinuxServer-CI moved this from Issues to Done in Issue & PR Tracker Nov 27, 2024
@schklom
Author

schklom commented Nov 27, 2024

> The chown without the move triggers the overlayfs COW (copy-on-write) bug, which results in the chown taking 20 minutes or longer. The move is an ugly hack to bypass the overlayfs bug. It's the best solution we have to ensure HA can run as a non-root user and still install Python packages at runtime.

@aptalca Out of curiosity, by "without the move", do you mean that you tried to do it at buildtime instead of runtime, and it triggered the bug? I'm wondering if me not having a bug (despite not doing the move at runtime) is worrisome or if it makes sense :P

@aptalca
Member

aptalca commented Nov 27, 2024

You're focusing on the wrong thing.

We intentionally move/rename the folder to .bak at build time so we can move/rename it back at runtime to avoid the overlayfs bug triggered by chown.

Not everyone is affected by the bug, but many are (including some of our ci builders). We haven't been able to identify a common denominator for the devices affected by the bug. We just know that it's very common and we know that a move operation prior to chown prevents it.

@schklom
Author

schklom commented Nov 27, 2024

@aptalca I guess I could have been clearer, my bad.
What I'm doing at buildtime now is pretty much mv python_folder python_folder.bak then mv python_folder.bak python_folder.
And you're doing the second step at runtime instead.

I understand that you started with not doing any move operations and it was bad, but I am wondering if you tried to do the 2 move operations at buildtime, like what I'm basically doing now.
My thought is that the layer at the end is not exactly the same as the one before the 2 operations (my image size is almost double yours), and it might fix that bug without needing a move command at runtime.

As you mentioned, there would still be a need to chown at runtime though.

@thespad
Member

thespad commented Nov 27, 2024

The overlayfs bug is a copy-on-write issue. Making changes to the container filesystem permissions at runtime causes a COW operation to write the changed file metadata to the host overlayfs storage and that's what causes the bug. For whatever reason a move operation does not trigger the same issue.

It doesn't matter what you do at build time, if you change permissions on the container filesystem at runtime you can run into the bug.
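
As a side note, whether a given mount point is backed by overlayfs at all can be read from the mounts table. The sketch below uses a hypothetical helper, `fstype_at`, and assumes the Linux `/proc/mounts` layout (device, mount point, filesystem type, options). It only shows the filesystem type; it says nothing about whether a particular host will hit the slow chown, which the thread notes cannot be predicted.

```bash
# Hypothetical helper: print the filesystem type mounted at an exact mount point.
# $1 = mount point, $2 = mounts table to read (defaults to /proc/mounts).
# If several entries share the mount point, the last one listed wins,
# since later mounts cover earlier ones.
fstype_at() {
    awk -v mp="$1" '$2 == mp { fs = $3 } END { if (fs != "") print fs }' "${2:-/proc/mounts}"
}
```

Inside a container whose root filesystem comes from Docker's overlay2 storage driver, `fstype_at /` would be expected to print `overlay`.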

@aptalca
Member

aptalca commented Nov 27, 2024

> As you mentioned, there would still be a need to chown at runtime though.

You're not following what I'm trying to tell you.

  1. We HAVE TO do a chown at runtime to run HA as non-root.
  2. Doing a chown in the container file system causes the bug for a lot of people.
  3. Doing a move on the files prior to the chown at runtime prevents the issue.

You're trying to find a way to avoid a runtime move. What I'm trying to tell you is that we WANT a move operation at runtime to prevent the chown bug.
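
The sequence described in the three points above can be sketched as two steps. The function names are hypothetical and the path and owner are passed as arguments so the sketch can run on a scratch directory; the runtime half mirrors the init snippet quoted at the top of the issue.

```bash
# Build time (hypothetical helper): park the packages under a .bak name.
# $1 = path (PY_LOCAL_PATH in the real image)
build_time_prepare() {
    mv "$1" "$1.bak"
}

# Runtime: move the folder back first, then fix ownership.
# The .bak check means this only runs on the first start of a new container.
# $1 = path (PY_LOCAL_PATH), $2 = owner (abc:abc in the real script)
runtime_fix() {
    if [ -d "$1.bak" ]; then
        echo "**** New container detected, fixing python package permissions. This may take a while. ****"
        mv "$1.bak" "$1"
        chown -R "$2" "$1"
    fi
}
```

The design choice under discussion is the order inside `runtime_fix`: according to the maintainers, it is the move happening at runtime, before the chown, that avoids the slow path.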

@schklom
Author

schklom commented Nov 28, 2024

Looks like I'm misunderstanding something, so I won't push further. My patch works for me, I'm happy :P
Thanks for the replies again!
