Error checkpointing Java daemon in previously restored container #2577

hanwen-flow · 2025-01-29T13:56:16Z

Hi there,

I'm using podman to checkpoint and restore containers. In my use case, the main goal is checkpointing and restoring a daemonized Java process.

This works the first time around, but checkpointing the restored container yields:

2025-01-29T13:20:24.965110Z: CRIU checkpointing failed -52.  Please check CRIU logfile /var/lib/containers/storage/overlay-containers/f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4/userdata/dump.log
Error: `/usr/local/bin/crun checkpoint --image-path /var/lib/containers/storage/overlay-containers/f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4/userdata f2fbb33c68f37c475f7b720d9c116071d6c516c190d91f7222d346c0b09170b4` failed: exit status 1
2025/01/29 14:20:24 executing /usr/bin/sudo /home/hanwen/vc/containers/podman/bin/podman container inspect restored-1738156820
2025/01/29 14:20:25 error snapshotting: exit status 125, log tail: eezing processes: 100000 attempts with 100 ms steps
(00.001351) cgroup.freeze=0
(00.001358) cgroup.freeze=1
(00.101772) cgroup.freeze=1
(00.101914) freezing processes: 1 attempts done
(00.101999) SEIZE 136088 (comm python3): success
(00.102965) Seized task 136088, state 1
(00.102976) seccomp: Collected tid_real 136088 mode 0x2
(00.103009) Collected (0 attempts, 0 in_progress)
(00.103051) Seized task 136109, state 0
(00.103056) plugin: `cuda_plugin' hook 10 -> 0x717362f0c4d0
(00.103138) Error (compel/src/lib/infect.c:263): Unseizable non-zombie 136109 found, state S, err -1/10
(00.103186) Seized task 136109, state 0
(00.103190) plugin: `cuda_plugin' hook 10 -> 0x717362f0c4d0
(00.103227) Error (compel/src/lib/infect.c:263): Unseizable non-zombie 136109 found, state S, err -1/10
(00.103231) Collected (-1 attempts, 1 in_progress)
(00.103251) net: Unlock network
(00.103317) Unfreezing tasks into 1
(00.103318) 	Unseizing 136088 into 1
(00.103328) Error (criu/cr-dump.c:2111): Dumping FAILED.
exit status 1

I have a repro scenario available on request. I looked at /proc/$PID/task/$PID/status to see if something was off, but nothing relevant stood out.


adrianreber · 2025-01-29T14:00:19Z

Please share your reproducer. Curious if I can also see this.

hanwen-flow · 2025-01-29T14:18:08Z

Repro sent over e-mail.

adrianreber · 2025-01-29T14:41:00Z

Try runc instead of crun. crun has a bug where only the first process of the container is put in the correct cgroup after restore.
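
One way to check for this symptom, as a minimal sketch that is not taken from the thread: compare the cgroup v2 path in `/proc/<pid>/cgroup` for every process of the restored container against the path of its first process. The function names, the PIDs, and the cgroup paths in the usage note are hypothetical, and the sketch assumes a cgroup v2 (unified hierarchy) host.

```python
import os


def cgroup_v2_path(cgroup_text):
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    On the unified hierarchy the relevant entry has the form "0::<path>".
    Returns None if no such entry is present.
    """
    for line in cgroup_text.splitlines():
        if line.startswith("0::"):
            return line[3:]
    return None


def misplaced_pids(expected_path, cgroup_texts):
    """Return the sorted PIDs whose cgroup v2 path differs from expected_path.

    cgroup_texts maps each PID to the text of its /proc/<pid>/cgroup file.
    """
    return sorted(
        pid
        for pid, text in cgroup_texts.items()
        if cgroup_v2_path(text) != expected_path
    )


def read_cgroup_texts(pids, proc_root="/proc"):
    """Read /proc/<pid>/cgroup for each PID, skipping PIDs that have exited."""
    texts = {}
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "cgroup")) as f:
                texts[pid] = f.read()
        except (FileNotFoundError, ProcessLookupError):
            continue
    return texts
```

With host PIDs 100 (the container's first process) and 101 (a daemonized child), both made up for illustration, `misplaced_pids(cgroup_v2_path(texts[100]), texts)` would list 101 if it ended up outside the container's cgroup after restore.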

hanwen-flow · 2025-01-29T15:23:27Z

Thanks! That worked. Is there a crun issue for this problem?

adrianreber · 2025-01-29T15:27:45Z

No. If you want, you can open one. I can provide the details.

If I remember correctly, the difference is that runc moves the CRIU binary into the correct cgroup and runs it there, whereas crun creates the container and moves only one PID into the container's cgroup after restore. So crun must either walk the process tree and move all PIDs into the right cgroup, or create a helper process that is moved into the cgroup and then calls CRIU. Something like that.
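
The first option, walking the process tree, could look roughly like the sketch below. It is written in Python for illustration only and is not crun's code or a proposed patch: it reads each process's parent from the `PPid:` field of `/proc/<pid>/status`, collects the descendants of the container's first process, and writes each PID to the target cgroup's `cgroup.procs` file. All function names are hypothetical, cgroup v2 is assumed, and a walk like this is racy unless the processes are prevented from forking while it runs.

```python
import os
from collections import defaultdict, deque


def read_ppid(pid, proc_root="/proc"):
    """Return the parent PID from the PPid: field of /proc/<pid>/status."""
    with open(os.path.join(proc_root, str(pid), "status")) as f:
        for line in f:
            if line.startswith("PPid:"):
                return int(line.split()[1])
    raise ValueError(f"no PPid field for pid {pid}")


def snapshot_parents(proc_root="/proc"):
    """Map every visible PID to its parent PID, skipping processes that exit."""
    parents = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            parents[int(name)] = read_ppid(int(name), proc_root)
        except (FileNotFoundError, ProcessLookupError, ValueError):
            continue
    return parents


def descendants(root_pid, parents):
    """Return root_pid followed by all of its descendants, breadth-first."""
    children = defaultdict(list)
    for pid, ppid in parents.items():
        children[ppid].append(pid)
    order = [root_pid]
    queue = deque([root_pid])
    while queue:
        for child in sorted(children[queue.popleft()]):
            order.append(child)
            queue.append(child)
    return order


def move_tree(root_pid, cgroup_dir, parents):
    """Move root_pid and its descendants into the cgroup at cgroup_dir.

    Each PID is written to cgroup.procs with a separate write. Requires
    sufficient privileges on a real cgroup v2 mount.
    """
    procs_file = os.path.join(cgroup_dir, "cgroup.procs")
    for pid in descendants(root_pid, parents):
        with open(procs_file, "w") as f:
            f.write(str(pid))
```

A caller would take one snapshot with `snapshot_parents()` and pass it to `move_tree()` together with the host PID of the container's first process. The helper-process option avoids the race entirely, because processes created by a restorer that already sits in the right cgroup inherit that cgroup.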

hanwen-flow · 2025-01-29T15:42:08Z

Filed containers/crun#1651

hanwen-flow mentioned this issue Jan 29, 2025

restore puts processes in wrong cgroup containers/crun#1651

Open

hanwen-flow closed this as completed Jan 29, 2025
