The removal of container root directory should be after poststop run successfully. #4363

abel-von · 2024-08-01T09:21:05Z

Description

Hi, We recently added a runc hook to manage our special devices, and ran into a problem.
If our post stop hook run failed(maybe because the resource is still not ready to be recycled), we expect the containerd or kubelet could retry runc delete and when the hook can return success, runc delete succeed, and all the resources will be recycled, with nothing residual.

but actually we found that when the second time containerd call runc delete, the container root directory is removed and a NotFound error returned, so that containerd will consider the container removed normally. but actually the post stop hook never be called successfully.

I am wondering maybe we can change the removal of root directory to after the hook called.

--- a/libcontainer/state_linux.go
+++ b/libcontainer/state_linux.go
@@ -54,13 +54,16 @@ func destroy(c *Container) error {
                        return fmt.Errorf("unable to remove container's IntelRDT group: %w", err)
                }
        }
+       c.initProcess = nil
+       if err := runPoststopHooks(c); err != nil {
+               return fmt.Errorf("unable to run post stop hooks: %w", err)
+       }
+       c.state = &stoppedState{c: c}
+
        if err := os.RemoveAll(c.stateDir); err != nil {
                return fmt.Errorf("unable to remove container state dir: %w", err)
        }
-       c.initProcess = nil
-       err := runPoststopHooks(c)
-       c.state = &stoppedState{c: c}
-       return err
+       return nil
 }

Steps to reproduce the issue

write a runc hook to return error, and config it into runtime of containerd
start a container
delete a container

Describe the results you received and expected

expected:
deletion of the container should fail until the hook can be executed succeessfuly.
received:
container is removed, but the hook is never called with a success, so with resources residual.

What version of runc are you using?

The newest version

Host OS information

I think all os is effected.

Host kernel information

all kernel versions

The text was updated successfully, but these errors were encountered:

lifubang · 2024-08-02T04:14:04Z

In runtime-spec, it says:
The poststop hooks MUST be called after the container is deleted but before the delete operation returns.

Please see:
https://github.com/opencontainers/runtime-spec/blob/v1.2.0/config.md?plain=1#L622

lifubang · 2024-08-02T04:22:40Z

So I think it's not a bug. If you want change this behavior, you should open a PR in runtime-spec first.

BTW, there is another bug for postStop hooks, do you have some interests to fix it:
The poststart hooks MUST be invoked by the runtime. If any poststart hook fails, the runtime MUST log a warning, but the remaining hooks and lifecycle continue as if the hook had succeeded.

Please see:
https://github.com/opencontainers/runtime-spec/blob/v1.2.0/runtime.md?plain=1#L73

But maybe the behavior would be redefined in the future, because there is a similar bug for poststart, you can see:
opencontainers/runtime-spec#1262 (review)

abel-von · 2024-08-02T07:12:31Z

The poststop hooks MUST be invoked by the runtime. If any poststop hook fails, the runtime MUST log a warning, but the remaining hooks and lifecycle continue as if the hook had succeeded.

This implies that all poststop hook can not fail, because even it is failed the platform can not do retry.

But maybe the behavior would be redefined in the future, because there is a similar bug for poststart, you can see:
opencontainers/runtime-spec#1262 (review)

Maybe we can make a correspond change to the poststop too.

rata · 2024-08-02T10:21:40Z

Steps to reproduce the issue

1. write a runc hook to return error, and config it into runtime of containerd

How do you do that?

Please make the steps easy to follow, provide as much info as possible so someone can just follow some steps, instead of search how to do X or Y. The whole point of the steps to repro is to avoid that.

I didn't know there was a way to ask containerd to set OCI hooks without an external tool (like a wrapper for the oci-runtime that adds the hooks).

abel-von linked a pull request Aug 2, 2024 that will close this issue

fix post stop hook not always called successfully #4364

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The removal of container root directory should be after poststop run successfully. #4363

The removal of container root directory should be after poststop run successfully. #4363

abel-von commented Aug 1, 2024 •

edited

Loading

lifubang commented Aug 2, 2024

lifubang commented Aug 2, 2024

abel-von commented Aug 2, 2024 •

edited

Loading

rata commented Aug 2, 2024 •

edited

Loading

Steps to reproduce the issue

The removal of container root directory should be after poststop run successfully. #4363

The removal of container root directory should be after poststop run successfully. #4363

Comments

abel-von commented Aug 1, 2024 • edited Loading

Description

Steps to reproduce the issue

Describe the results you received and expected

What version of runc are you using?

Host OS information

Host kernel information

lifubang commented Aug 2, 2024

lifubang commented Aug 2, 2024

abel-von commented Aug 2, 2024 • edited Loading

rata commented Aug 2, 2024 • edited Loading

Steps to reproduce the issue

abel-von commented Aug 1, 2024 •

edited

Loading

abel-von commented Aug 2, 2024 •

edited

Loading

rata commented Aug 2, 2024 •

edited

Loading