Error checkpointing Java daemon in previously restored container #2577

Closed
hanwen-flow opened this issue Jan 29, 2025 · 6 comments

@hanwen-flow

Hi there,

I'm using podman to checkpoint and restore containers. In my use case, the main goal is checkpointing and restoring a daemonized Java process.

This works the first time around, but checkpointing the restored container yields:

2025-01-29T13:20:24.965110Z: CRIU checkpointing failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4/userdata/dump.log
Error: `/usr/local/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4/userdata f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4` failed: exit status 1
2025/01/29 14:20:24 executing /usr/bin/sudo /home/hanwen/vc/containers/podman/bin/podman container inspect restored-1738156820
2025/01/29 14:20:25 error snapshotting: exit status 125, log tail: eezing processes: 100000 attempts with 100 ms steps
(00.001351) cgroup.freeze=0
(00.001358) cgroup.freeze=1
(00.101772) cgroup.freeze=1
(00.101914) freezing processes: 1 attempts done
(00.101999) SEIZE 136088 (comm python3): success
(00.102965) Seized task 136088, state 1
(00.102976) seccomp: Collected tid_real 136088 mode 0x2
(00.103009) Collected (0 attempts, 0 in_progress)
(00.103051) Seized task 136109, state 0
(00.103056) plugin: `cuda_plugin' hook 10 -> 0x717362f0c4d0
(00.103138) Error (compel/src/lib/infect.c:263): Unseizable non-zombie 136109 found, state S, err -1/10
(00.103186) Seized task 136109, state 0
(00.103190) plugin: `cuda_plugin' hook 10 -> 0x717362f0c4d0
(00.103227) Error (compel/src/lib/infect.c:263): Unseizable non-zombie 136109 found, state S, err -1/10
(00.103231) Collected (-1 attempts, 1 in_progress)
(00.103251) net: Unlock network
(00.103317) Unfreezing tasks into 1
(00.103318) 	Unseizing 136088 into 1
(00.103328) Error (criu/cr-dump.c:2111): Dumping FAILED.
exit status 1

I can provide a repro scenario on request. I looked at /proc/$PID/task/$PID/status to see if something was off, but nothing relevant stood out.

@adrianreber
Member

Please share your reproducer. Curious if I can also see this.

@hanwen-flow
Author

Repro sent over e-mail.

@adrianreber
Member

Try runc instead of crun. crun has a bug where only the first process of the container is put in the correct cgroup after restore.
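
One way to check for this condition is to compare the cgroup of every process in the restored container against the cgroup of its first process. The following is a minimal diagnostic sketch, assuming cgroup v2 (the `cgroup.freeze` lines in the log above suggest it); the helper names `parse_cgroup_v2`, `read_cgroup`, and `stray_pids` are hypothetical and not part of podman, crun, or CRIU.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def parse_cgroup_v2(text: str) -> Optional[str]:
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup."""
    for line in text.splitlines():
        # On cgroup v2 the unified hierarchy entry has the form "0::<path>".
        if line.startswith("0::"):
            return line[3:]
    return None


def read_cgroup(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Read the cgroup v2 path of a live process, or None if it is unreadable."""
    try:
        return parse_cgroup_v2(Path(proc_root, str(pid), "cgroup").read_text())
    except OSError:
        return None  # the process exited or we lack permission


def stray_pids(
    pids: Iterable[int],
    expected: str,
    read: Callable[[int], Optional[str]] = read_cgroup,
) -> List[int]:
    """Return the PIDs whose cgroup differs from the expected container cgroup."""
    return [pid for pid in pids if read(pid) != expected]
```

With the PIDs from the log above, a call such as `stray_pids([136088, 136109], read_cgroup(136088))` would list the processes that ended up outside the first process's cgroup. Whether 136109 is actually one of them was not confirmed in this thread.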

@hanwen-flow
Author

Thanks! That worked. Is there a crun issue for this problem?

@adrianreber
Member

No. If you want you can open one. I can provide the details.

If I remember correctly, the difference is that runc moves the CRIU binary into the correct cgroup and runs it there, whereas crun creates the container and moves only one PID into the container's cgroup after restore. So crun must either walk the process tree and move all PIDs into the right cgroup, or create a helper process that is moved into the cgroup and then calls CRIU. Something like that.
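
The first option (walking the process tree and moving every PID) can be sketched as follows. This is an illustration only, not crun's or runc's code: the function names are hypothetical, the tree is rebuilt from the `PPid` field in `/proc/<pid>/status`, and the approach is racy if processes fork while it runs.

```python
import os
from collections import defaultdict
from typing import Dict, List


def read_parents(proc_root: str = "/proc") -> Dict[int, int]:
    """Map each live PID to its parent PID using PPid from /proc/<pid>/status."""
    parents = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("PPid:"):
                        parents[int(entry)] = int(line.split()[1])
                        break
        except OSError:
            continue  # the process exited while we were scanning
    return parents


def descendants(root: int, parents: Dict[int, int]) -> List[int]:
    """Return root and every process below it in the tree, sorted by PID."""
    children = defaultdict(list)
    for pid, ppid in parents.items():
        children[ppid].append(pid)
    found, stack = [], [root]
    while stack:
        pid = stack.pop()
        found.append(pid)
        stack.extend(children.get(pid, []))
    return sorted(found)


def move_to_cgroup(pids: List[int], cgroup_dir: str) -> None:
    """Move each PID into a cgroup v2 directory (one PID per write, needs root)."""
    for pid in pids:
        with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
            f.write(str(pid))
```

A caller would combine these as `move_to_cgroup(descendants(init_pid, read_parents()), cgroup_dir)`, where `init_pid` and `cgroup_dir` are placeholders for the restored container's first process and its cgroup directory. The helper-process option avoids the race because children forked inside the cgroup inherit it.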

@hanwen-flow
Author

Filed containers/crun#1651
