Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

criu: Add support for pidfds #2449

Merged
merged 8 commits into from
Oct 3, 2024

Conversation

bsach64
Copy link
Member

@bsach64 bsach64 commented Jul 21, 2024

A PR for support of process file descriptors.

Here's a article that talks about features of pidfds, which we might have to support (and test):
http://www.corsix.org/content/what-is-a-pidfd

To Do:

  • Check if the pid of pidfd is part of the process tree we are dumping. So, that we can ensure it exists on restore.
  • Add support for pidfds that point to dead processes.
    If the process for which a pidfd is open dies, the Pid and NSPid entries change to -1. So, we don't have the Pid necessary to recreate the process. We cannot just create a random process, open a pidfd to it, and kill it since pidfds opened for a specific process have the same inode number used to compare them.
  • Change the magic in criu/include/magic.h
  • test for pidfd_getfd
  • Make CI green
  • Check functionality before and after 6.9
  • Better commit messages (add related issues that are fixed as a result of this PR)

Support for PIDFD_THREAD (allows creation of a pidfd that points to a specific thread) not a part of this PR.

Known Limitations:
CRIU doesn't support nested PID namespaces. As such we cannot C/R pidfd's that point to PIDs in nested namespaces. If we introduce support for nested PID namespaces we will have to rework some of this code.

This is work being done as part of Google Summer of Code 2024 (https://summerofcode.withgoogle.com/programs/2024/projects/vgpqxIxx)
A huge thank you to Alex @mihalicyn for guiding me and making this a wonderful experience :)

@mihalicyn mihalicyn self-requested a review July 22, 2024 11:31
@mihalicyn mihalicyn marked this pull request as ready for review July 23, 2024 20:01
@mihalicyn mihalicyn marked this pull request as draft July 23, 2024 20:03
images/pidfd.proto Outdated Show resolved Hide resolved
criu/include/magic.h Outdated Show resolved Hide resolved
criu/pidfd.c Outdated Show resolved Hide resolved
criu/pidfd.c Outdated Show resolved Hide resolved
criu/pidfd.c Outdated Show resolved Hide resolved
@bsach64 bsach64 force-pushed the pidfd-support branch 6 times, most recently from acac25e to 729891a Compare July 29, 2024 20:21
@bsach64 bsach64 force-pushed the pidfd-support branch 3 times, most recently from 8f702f6 to c455eaa Compare August 2, 2024 18:32
@rst0git rst0git mentioned this pull request Aug 3, 2024
@bsach64 bsach64 force-pushed the pidfd-support branch 4 times, most recently from f0e574a to a5c9e56 Compare August 8, 2024 16:12
@rst0git
Copy link
Member

rst0git commented Aug 9, 2024

Better commit messages (add related issues that are fixed as a result of this PR)

@bsach64 The following blog post provides information on how to write good commit messages: https://cbea.ms/git-commit/

Similar to the Linux kernel, we use the commit history to make it easier for reviewers (and maintainers) to understand the problem that motivated particular patch and how this problem is being addressed.

@bsach64
Copy link
Member Author

bsach64 commented Aug 11, 2024

Better commit messages (add related issues that are fixed as a result of this PR)

@bsach64 The following blog post provides information on how to write good commit messages: https://cbea.ms/git-commit/

Similar to the Linux kernel, we use the commit history to make it easier for reviewers (and maintainers) to understand the problem that motivated particular patch and how this problem is being addressed.

Hello @rst0git!
I have updated the commit messages and have tried to follow the rules stated here: https://cbea.ms/git-commit/

criu/proc_parse.c Outdated Show resolved Hide resolved
@bsach64 bsach64 force-pushed the pidfd-support branch 2 times, most recently from cc230c5 to e2746eb Compare August 12, 2024 09:45
@bsach64
Copy link
Member Author

bsach64 commented Aug 12, 2024

I am having a hard time understanding why the Vagrant Fedora Rawhide and Vagrant Fedora Rawhide (No VDSO) fail on the pidfd_child test.
Can I get some help regarding this?

criu/pidfd.c Outdated Show resolved Hide resolved
Copy link
Member

@mihalicyn mihalicyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mihalicyn
Copy link
Member

Hey @bsach64, if you have a time and interest, we can discuss support for threaded pidfds.
Next week I'll be on LPC 2024, but just after that I'll be happy chat with you.

@mihalicyn
Copy link
Member

@mihalicyn could you help with reviewing this pr?

yeah, sorry, I just forgot to press an "approve" button earlier. As I was following a whole development process of this thing there is no doubt that it's approved/reviewed from my side :-)

@bsach64
Copy link
Member Author

bsach64 commented Sep 15, 2024

Hey @bsach64, if you have a time and interest, we can discuss support for threaded pidfds.

Next week I'll be on LPC 2024, but just after that I'll be happy chat with you.

Sure Alex :)
Please send me an email, whenever you are free!

@bsach64 bsach64 force-pushed the pidfd-support branch 2 times, most recently from 5578e26 to e572f5b Compare September 15, 2024 09:54
criu/pidfd.c Show resolved Hide resolved
@bsach64 bsach64 force-pushed the pidfd-support branch 3 times, most recently from 56c8f7d to 3b8efda Compare September 21, 2024 13:04
We only use the last pid from the list in NSpid entry (from
/proc/<pid>/fdinfo/<pidfd>) while restoring pidfds.
The last pid refers to the pid of the process in the most deeply nested
pid namespace. Since CRIU does not currently support nested pid
namespaces, this entry is the one we want.

After Linux 6.9, inode numbers can be used to compare pidfds. pidfds
referring to the same process will have the same inode numbers. We use
inode numbers to restore pidfds that point to dead processes.

Signed-off-by: Bhavik Sachdev <[email protected]>
criu/pidfd.c Show resolved Hide resolved
criu/pidfd.c Outdated Show resolved Hide resolved
@avagin
Copy link
Member

avagin commented Oct 3, 2024

LGTM. Good job.

Process file descriptors (pidfds) were introduced to provide a stable
handle on a process. They solve the problem of pid recycling.

For a detailed explanation, see https://lwn.net/Articles/801319/ and
http://www.corsix.org/content/what-is-a-pidfd

Before Linux 6.9, anonymous inodes were used for the implementation of
pidfds. So, we detect them in a fashion similiar to other fd types that
use anonymous inodes by calling `readlink()`.
After 6.9, pidfs (a file system for pidfds) was introduced.
In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with
6.10.
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285)
After this change, pidfs inodes have no file type in st_mode in
userspace.
We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9
Hence, check for pidfds occurs before the check for regular files.

For pidfds that refer to dead processes, we lose the pid of the process
as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1.
So, we create a temporary process for each unique inode and open pidfds
that refer to this process. After all pidfds have been opened we kill
this temporary process.

This commit does not include support for pidfds that point to a specific
thread, i.e pidfds opened with `PIDFD_THREAD` flag.

Fixes: checkpoint-restore#2258

Signed-off-by: Bhavik Sachdev <[email protected]>
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same.

Signed-off-by: Bhavik Sachdev <[email protected]>
Ensure `pidfd_send_signal()` syscall works as expected after C/R.

Signed-off-by: Bhavik Sachdev <[email protected]>
Validate that pidfds can been used to send signals to different
processes after C/R using the `pidfd_send_signal()` syscall.

Signed-off-by: Bhavik Sachdev <[email protected]>
After, C/R of pidfds that point to dead processes their inodes might
change. But if two pidfds point to same dead process they should
continue to do so after C/R.

This test ensures that this happens by calling `statx()` on pidfds after
C/R and then comparing their inode numbers.

Support for comparing pidfds by using `statx()` and inode numbers was
introduced alongside pidfs. So if `f_type` of pidfd is not equal to
`PID_FS_MAGIC` then we skip this test.

signed-off-by: Bhavik Sachdev <[email protected]>
We get the read end of a pipe using `pidfd_getfd` and check if we can
read from it after C/R.

signed-off-by: Bhavik Sachdev <[email protected]>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R
ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`.

signed-off-by: Bhavik Sachdev <[email protected]>
@avagin avagin merged commit 56bc739 into checkpoint-restore:criu-dev Oct 3, 2024
33 of 42 checks passed
@@ -54,6 +54,7 @@ TST_NOFILE := \
shm-mp \
ptrace_sig \
pidfd_self \
pidfd_dead \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this test fails in Fedora Rawhide:

======================= Run zdtm/static/pidfd_dead in h ========================
Start test
./pidfd_dead --pidfile=pidfd_dead.pid --outfile=pidfd_dead.out
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/pidfd_dead/168/1/restore.log
------------------------ grep Error ------------------------
b'(00.004951)    168: \t\t\tGoing to dup 1 into 2'
b'(00.004953)    168: \tReceive fd for 2'
b'(00.004961)    168: \t\tCreate fd for 3'
b'(00.005264)    168: \t\tCreate fd for 4'
b'(00.005486)    168: Error (criu/cr-restore.c:1261): 175 killed by signal 9: Killed'
b'(00.005504) Error (criu/cr-restore.c:2314): Restoring FAILED.'
------------------------ ERROR OVER ------------------------
############### Test zdtm/static/pidfd_dead FAIL at CRIU restore ###############
Test output: ================================

https://github.com/checkpoint-restore/criu/runs/31040419148

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to see why this happens!
Is there any way that I can replicate this test locally?
I did try running the tests on a VM with Fedora Rawhide 6.12.0-0.rc1.20241005git27cc6fdf7201.22.fc42.x86_64
where these tests end up passing!

Copy link
Member

@rst0git rst0git Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bsach64 There is a race condition between kill() and waitpid() in free_dead_pidfd().

You can replicate the error with the following change:

diff --git a/criu/pidfd.c b/criu/pidfd.c
index fdf5dec60..23062ae02 100644
--- a/criu/pidfd.c
+++ b/criu/pidfd.c
@@ -151,7 +151,7 @@ static int free_dead_pidfd(struct dead_pidfd *dead)
                dead->pid);
                goto err;
        }
-
+       sleep(1);
        if (waitpid(dead->pid, &status, 0) != dead->pid) {
                pr_perror("Could not wait on temporary process with pid: %d",
                dead->pid);
$ sudo python3 zdtm.py run -t zdtm/static/pidfd_dead
userns is supported
=== Run 1/1 ================ zdtm/static/pidfd_dead
====================== Run zdtm/static/pidfd_dead in uns =======================
Start test
./pidfd_dead --pidfile=pidfd_dead.pid --outfile=pidfd_dead.out
Run criu dump
Run criu restore
=[log]=> dump/zdtm/static/pidfd_dead/64/1/restore.log
------------------------ grep Error ------------------------
b'(00.016134)      1: No ipcns-sem-11.img image'
b'(00.020782)      1: net: Try to restore a link 10:1:lo'
b'(00.020820)      1: net: Restoring link lo type 1'
b'(00.024435)      1: net: \tRunning ip addr restore'
b'Error: ipv4: Address already assigned.'
b'Error: ipv6: address already assigned.'
b'(00.141528)      4: \t\tCreate fd for 3'
b'(00.141715)      1: `- render 1 iovs (0x7f8433c61000:8192...)'
b'(00.141790)      1: Restore via sigreturn'
b'(00.142590)      4: \t\tCreate fd for 4'
b'(00.143514)      4: Error (criu/cr-restore.c:1261): 18 killed by signal 9: Killed'
b'(00.143579)      4: Error (criu/pidfd.c:156): pidfd: Could not wait on temporary process with pid: 18: No child processes'
b'(00.143591)      4: Error (criu/pidfd.c:218): pidfd: Failed to delete dead_pidfd struct'
b"(00.143605)      4: Error (criu/pidfd.c:234): pidfd: Can't create pidfd 0x00000d NSpid: -1 flags: 0"
b'(00.143615)      4: Error (criu/files.c:1221): Unable to open fd=5 id=0xd'
b'(00.147752) uns: calling exit_usernsd (-1, 1)'
b'(00.147902) uns: daemon calls 0x479de0 (91, -1, 1)'
b'(00.147939) uns: `- daemon exits w/ 0'
b'(00.150476) uns: daemon stopped'
b'(00.150514) Error (criu/cr-restore.c:2314): Restoring FAILED.'
------------------------ ERROR OVER ------------------------
############### Test zdtm/static/pidfd_dead FAIL at CRIU restore ###############
Test output: ================================

 <<< ================================
##################################### FAIL #####################################

We might need to add the temporary process PID (i.e., the processes created with create_tmp_process()) in the list of task helpers (ta->helpers). See collect_helper_pids() in cr-restore.c.

cc: @avagin

Copy link
Member Author

@bsach64 bsach64 Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another solution might be to block SIGCHLD?
See sysctl.c line 271 https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/sysctl.c#L271

diff --git a/criu/pidfd.c b/criu/pidfd.c
index fdf5dec60..a63a9a4b8 100644
--- a/criu/pidfd.c
+++ b/criu/pidfd.c
@@ -27,6 +27,7 @@ struct dead_pidfd {
 	unsigned int ino;
 	int pid;
 	size_t count;
+	sigset_t oldmask;
 	mutex_t pidfd_lock;
 	struct hlist_node hash;
 };
@@ -152,12 +153,14 @@ static int free_dead_pidfd(struct dead_pidfd *dead)
 		goto err;
 	}
 
+	sleep(1);
 	if (waitpid(dead->pid, &status, 0) != dead->pid) {
 		pr_perror("Could not wait on temporary process with pid: %d",
 		dead->pid);
 		goto err;
 	}
 
+	sigprocmask(SIG_SETMASK, &dead->oldmask, NULL);
 	if (!WIFSIGNALED(status)) {
 		pr_err("Expected temporary process to be terminated by a signal\n");
 		goto err;
@@ -180,6 +183,7 @@ static int open_one_pidfd(struct file_desc *d, int *new_fd)
 {
 	struct pidfd_info *info;
 	struct dead_pidfd *dead = NULL;
+	sigset_t blockmask;
 	int pidfd;
 
 	info = container_of(d, struct pidfd_info, d);
@@ -199,6 +203,9 @@ static int open_one_pidfd(struct file_desc *d, int *new_fd)
 	BUG_ON(dead->count == 0);
 	dead->count--;
 	if (dead->pid == -1) {
+		sigemptyset(&blockmask);
+		sigaddset(&blockmask, SIGCHLD);
+		sigprocmask(SIG_BLOCK, &blockmask, &dead->oldmask);
 		dead->pid = create_tmp_process();
 		if (dead->pid < 0) {
 			mutex_unlock(&dead->pidfd_lock);
@@ -270,6 +277,7 @@ static int collect_one_pidfd(void *obj, ProtobufCMessage *msg, struct cr_img *i)
 	dead->ino = info->pidfe->ino;
 	dead->count = 1;
 	dead->pid = -1;
+	sigemptyset(&dead->oldmask);
 	mutex_init(&dead->pidfd_lock);
 
 	mutex_lock(dead_pidfd_hash_lock);

cc: @mihalicyn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants