Add kernel infrastructure to read RDT and assign RMIDs #52

Merged: 88 commits, Feb 11, 2025
Commits
ee79f77
add gitignore for build directory
yonch Feb 6, 2025
98dc100
add skeleton perf pmu that uses printk
yonch Feb 6, 2025
3ca9d45
remove get_irq_regs(), unused
yonch Feb 6, 2025
2020885
revert the removal of regs
yonch Feb 6, 2025
8463b13
fix include for get_irq_flags()
yonch Feb 6, 2025
5fd581c
use tracepoints to output events
yonch Feb 7, 2025
953f41d
fix creating kernel counter (needs CPU number)
yonch Feb 7, 2025
c07e2d8
declare tracepoint
yonch Feb 7, 2025
92cc151
fix tracepoint definition
yonch Feb 7, 2025
5923446
make test_pmu.sh executable
yonch Feb 7, 2025
3af7eeb
use random run IDs for testing the PMU
yonch Feb 7, 2025
fb62789
use IPI to collect from all CPUs
yonch Feb 7, 2025
77685ca
add llc miss events
yonch Feb 8, 2025
679cdce
fix trace parameter order
yonch Feb 8, 2025
5597881
add cycle and instruction monitoring
yonch Feb 8, 2025
38067b1
rename sample collection function
yonch Feb 8, 2025
26f4996
add context switch event handler
yonch Feb 8, 2025
760b735
add indication of whether a trace was due to a context switch
yonch Feb 8, 2025
f9573fd
add skeleton initialization of resctrl
yonch Feb 10, 2025
119bdd3
update github workflow for kernel module
yonch Feb 10, 2025
cf7ed75
qualify kernel logs with Memory Collector
yonch Feb 10, 2025
5649aec
add more debugging to workflow, on failure
yonch Feb 10, 2025
ea14af5
fix error handling in kernel module test
yonch Feb 10, 2025
127eb42
update kernel test to bare-metal machine for resctrl tests
yonch Feb 10, 2025
a987bcd
disable resctrl_{init,exit} to see if they cause panic on bare metal
yonch Feb 10, 2025
37f0d77
try a more limited configuration of RDT MSRs
yonch Feb 10, 2025
7a85d13
reduce the timeout for kernel test
yonch Feb 10, 2025
3691c30
make resctrl initialization almost a no-op to test kernel freeze
yonch Feb 10, 2025
c21d581
try writing to the MSR (without reading the MSR)
yonch Feb 10, 2025
3793a76
try the wrmsr in context switch rather than IPI
yonch Feb 10, 2025
2a5866d
run test script
yonch Feb 10, 2025
baabdab
expose a few lines at the beginning and end of trace
yonch Feb 10, 2025
7566b7e
fix indentation in github action
yonch Feb 10, 2025
84c4f96
mount resctrl and check capabilities before loading kernel module
yonch Feb 10, 2025
aea64f1
try disabling preemption around the MSR write
yonch Feb 10, 2025
ec71c7c
call wrmsr rather than wrmsr_safe
yonch Feb 10, 2025
bd171c1
show CPU features for only 10 cores
yonch Feb 10, 2025
796d28a
write 0,0 to PQR_ASSOC
yonch Feb 10, 2025
d81b1c7
add cpuid enumeration as specified in the Intel SDM vol 3 (19.18.3)
yonch Feb 10, 2025
560bb4d
add more debug prints in enumerate_cpuid
yonch Feb 10, 2025
e232325
add incremental checking of cpuid
yonch Feb 10, 2025
1525612
move run_module.sh, fix permissions and path
yonch Feb 10, 2025
2cab06a
fix race where timer fires before the per-cpu structs are initialized
yonch Feb 10, 2025
33996a2
reorder initialization to avoid timer firing before all CPUs are ready
yonch Feb 10, 2025
da36cee
disable context switch monitoring to eliminate nested handling
yonch Feb 10, 2025
50146a8
reset the timer to see the next results
yonch Feb 10, 2025
d8bf94f
add workflow parameter for instance type
yonch Feb 11, 2025
ebc67ce
defer initialization of the sampling timer
yonch Feb 11, 2025
a50d1f6
disable LLC misses reading, to check if they are causing the kernel t…
yonch Feb 11, 2025
faabf66
loop for 30 seconds, then unload module
yonch Feb 11, 2025
ff205bd
use an hrtimer on each cpu to output measurements to avoid IPI problems
yonch Feb 11, 2025
6558b6c
fix access to percpu array
yonch Feb 11, 2025
535a6f5
add debug prints
yonch Feb 11, 2025
368d397
initialize the perf events on the CPU where they will be used
yonch Feb 11, 2025
1ab3ed2
use work queues to set up monitoring
yonch Feb 11, 2025
8870024
fortify hrtimer init and cleanup
yonch Feb 11, 2025
e3856e6
disable all perf events to check stability without perf
yonch Feb 11, 2025
721c81f
continue test even if cannot create resctrl directory
yonch Feb 11, 2025
fb16a6c
disable hrtimer
yonch Feb 11, 2025
07501ac
enforce that init_cpu_state runs only on the intended CPU
yonch Feb 11, 2025
863e00b
stop failing the insert when context switches are not set up
yonch Feb 11, 2025
6d83546
have workqueue enforce locality of work items
yonch Feb 11, 2025
aec0cb6
only try to enqueue work on online CPUs.
yonch Feb 11, 2025
f51979a
install trace-cmd before running a test that requires it
yonch Feb 11, 2025
1866536
start timers
yonch Feb 11, 2025
20463ad
use hrtimer_init rather than hrtimer_setup
yonch Feb 11, 2025
36bc46b
enable perf counters
yonch Feb 11, 2025
d2e3272
best effort to set up perf timers
yonch Feb 11, 2025
249f1d1
give kernel the time we want to reset the timer to
yonch Feb 11, 2025
bb53370
remove pr_info in workqueue when there is no error, to avoid "hogged …
yonch Feb 11, 2025
ff36734
add auto test on push to main
yonch Feb 11, 2025
c4d9c74
use perf_event_read_local (rather than the non-local variant)
yonch Feb 11, 2025
b38ce0e
we cannot read perf counters safely at context switch or timer, plann…
yonch Feb 11, 2025
e1e8a4c
add per-cpu initialization of RDT state
yonch Feb 11, 2025
8301f36
switch to RDT-capable VMs for tests
yonch Feb 11, 2025
79d3ac9
log resctrl capabilities as pr_info
yonch Feb 11, 2025
a5d431e
change default VM type due to lack of inventory
yonch Feb 11, 2025
31d1cfe
read MBM values on every timer tick
yonch Feb 11, 2025
7f7c0b6
set RMID on CPU 2
yonch Feb 11, 2025
cc52614
fix event selection in mbm read
yonch Feb 11, 2025
60da7ef
generalize resctrl read function
yonch Feb 11, 2025
ead8895
add llc and local mbm to resctrl measurements
yonch Feb 11, 2025
115f2fd
polish module test github action
yonch Feb 11, 2025
d28b927
delete the simple runner, now we do not need to aggressively read ker…
yonch Feb 11, 2025
f6dec89
fix error code when verifying unload
yonch Feb 11, 2025
23371a0
use function to write to the PQR_ASSOC MSR
yonch Feb 11, 2025
d495d86
cleanup logging and remove unneeded clauses
yonch Feb 11, 2025
a6c56e5
clarify unloading printks
yonch Feb 11, 2025
96 changes: 86 additions & 10 deletions .github/workflows/test-kernel-module.yaml
@@ -1,5 +1,17 @@
 name: Test Kernel Module
-on: workflow_dispatch # Manual trigger for testing
+on:
+  workflow_dispatch: # Manual trigger for testing
+    inputs:
+      instance-type:
+        description: 'EC2 instance type to use'
+        required: false
+        default: 'm7i.metal-24xl'
+        type: string
+  push:
+    branches:
+      - main
+    paths:
+      - module/**

 permissions:
   id-token: write # Required for requesting the JWT
@@ -26,7 +38,7 @@ jobs:
           mode: start
           github-token: ${{ secrets.REPO_ADMIN_TOKEN }}
           ec2-image-id: ami-0884d2865dbe9de4b # Ubuntu 22.04 LTS in us-east-2
-          ec2-instance-type: t3.large
+          ec2-instance-type: ${{ inputs.instance-type || 'm7i.metal-24xl' }}
           market-type: spot
           subnet-id: ${{ secrets.AWS_SUBNET_ID }}
           security-group-id: ${{ secrets.AWS_SECURITY_GROUP_ID }}
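
The `||` fallback in `ec2-instance-type` matters because the `inputs` context is only populated for `workflow_dispatch` runs; on the new `push` trigger the input is empty and the literal default is used. A rough shell analogue of that behaviour, using `${var:-default}`, is sketched below. The helper name `resolve_instance_type` is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring `inputs.instance-type || 'm7i.metal-24xl'`.
resolve_instance_type() {
    local requested="${1:-}"
    # ${var:-default} falls back when the value is unset OR empty,
    # which is the case that the push trigger produces.
    printf '%s\n' "${requested:-m7i.metal-24xl}"
}

resolve_instance_type ""          # push trigger: no input supplied
resolve_instance_type "t3.large"  # manual dispatch with an explicit type
```

The same default appears twice in the workflow (once in `inputs`, once in the expression), so the two need to be kept in sync by hand.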
@@ -45,7 +57,7 @@ jobs:
   test-module:
     needs: start-runner
     runs-on: ${{ needs.start-runner.outputs.label }}
-    timeout-minutes: 10 # Add timeout in case system hangs
+    timeout-minutes: 2 # Add timeout in case system hangs
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -105,31 +117,95 @@ jobs:
           # Now try the actual build
           make

-          ls -l build/memory_collector.ko
+          ls -l build/collector.ko

+      - name: Check RDT Capabilities
+        run: |
+          sudo mkdir -p /sys/fs/resctrl || true
+          sudo mount -t resctrl resctrl /sys/fs/resctrl || true
+
+          echo "Mounting resctrl filesystem"
+          mount | grep resctrl || true
+
+          echo "Checking RDT capabilities"
+          ls /sys/fs/resctrl/info || true
+
+          echo "Monitoring features:"
+          cat /sys/fs/resctrl/info/L3_MON/mon_features || true
+
+          echo "Number of available RMIDs:"
+          cat /sys/fs/resctrl/info/L3_MON/num_rmids || true
+
+          echo "Number of CAT classes:"
+          cat /sys/fs/resctrl/info/L3/num_closids || true
+
+          echo "head -n 35 /proc/cpuinfo:"
+          head -n 35 /proc/cpuinfo || true
+
+          echo "CPU RDT features (head):"
+          grep -E "cat_l3|cdp_l3|cqm_occup_llc|cqm_mbm_total|cqm_mbm_local" /proc/cpuinfo | head || true
+
+          # we do not unmount, maybe mounting affects the intel_cqm checks below
+          #sudo umount /sys/fs/resctrl || true
+
       - name: Load and test module
+        id: load-and-test-module
+        continue-on-error: true
         working-directory: module
         run: |
+
+          # Check undefined symbols
+          sudo modinfo -F depends build/collector.ko
+          sudo objdump -d build/collector.ko | grep undefined || true
+
           # Load module
-          sudo insmod build/memory_collector.ko
+          echo "insmod build/collector.ko:"
+          sudo insmod build/collector.ko

           # Verify module is loaded
-          lsmod | grep memory_collector
+          echo "lsmod | grep collector:"
+          lsmod | grep collector

           # Check kernel logs for module initialization
-          dmesg | grep "Memory Collector" || true
+          echo "dmesg | grep 'Memory Collector':"
+          dmesg -c | grep "Memory Collector" || true

           # Unload module
-          sudo rmmod memory_collector
+          echo "rmmod collector:"
+          sudo rmmod collector

           # Verify module unloaded successfully
-          if lsmod | grep -q memory_collector; then
+          echo "lsmod | grep collector:"
+          ! lsmod | grep collector
+          if lsmod | grep -q collector; then
             echo "Error: Module still loaded"
             exit 1
           fi

           # Check kernel logs for cleanup message
-          dmesg | grep "Memory Collector" || true
+          echo "dmesg | grep 'Memory Collector':"
+          dmesg -c | grep "Memory Collector" || true
+
+      - name: Check dmesg on failure
+        if: steps.load-and-test-module.outcome == 'failure'
+        run: |
+          echo "load and test module failed, showing last kernel messages:"
+          sudo dmesg | tail -n 100
+          exit 1
+
+      - name: Install trace dependencies
+        run: |
+          sudo apt-get install -y trace-cmd
+
+      - name: Run module test script
+        working-directory: module
+        run: |
+          # run 10 times in quick succession to stress-test insmod/rmmod and collector
+          for i in {1..10}; do
+            echo "*** Run $i:"
+            ./test_module.sh
+          done
+

   stop-runner:
     name: Stop EC2 runner
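
The "Check RDT Capabilities" step prints the resctrl `info` files one by one and tolerates every failure with `|| true`. A minimal sketch of the same probe as a reusable function follows, assuming the standard resctrl layout (`info/L3_MON/mon_features`, `info/L3_MON/num_rmids`). The function name `rdt_summary` and the demo values are illustrative assumptions, and the mount point is a parameter so the function can be tried against a fake directory tree.

```shell
#!/usr/bin/env bash
# Sketch: summarise RDT monitoring support from a resctrl mount.
rdt_summary() {
    local root="${1:-/sys/fs/resctrl}"
    local mon="$root/info/L3_MON"

    if [ ! -d "$mon" ]; then
        echo "l3_mon=unavailable"
        return 1
    fi

    # mon_features lists one event name per line; join them with commas
    local features
    features=$(tr '\n' ',' < "$mon/mon_features")
    echo "mon_features=${features%,}"
    echo "num_rmids=$(cat "$mon/num_rmids")"
}

# Demo on a throwaway directory that imitates the resctrl layout
# (the RMID count here is made up for illustration).
demo=$(mktemp -d)
mkdir -p "$demo/info/L3_MON"
printf 'llc_occupancy\nmbm_total_bytes\nmbm_local_bytes\n' > "$demo/info/L3_MON/mon_features"
echo 256 > "$demo/info/L3_MON/num_rmids"
rdt_summary "$demo"
rdt_summary "$demo/missing" || echo "no monitoring support found"
rm -rf "$demo"
```

Returning a non-zero status when `L3_MON` is absent lets a caller decide whether missing RDT support should fail the job or only be logged, which is the choice the workflow makes with `|| true`.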
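
One detail of the load and unload checks in the workflow: `lsmod | grep collector` is a substring match, so it also succeeds when only a module named `memory_collector` (the previous module name) is loaded. A sketch of an exact-name check is shown below. The helper `module_loaded` is hypothetical, and it assumes the `/proc/modules` format in which the module name is the first whitespace-separated field.

```shell
#!/usr/bin/env bash
# Sketch: exact-name module check against a /proc/modules style listing.
module_loaded() {
    local name="$1" list="${2:-/proc/modules}"
    # Succeed only if some line has the module name as its first field
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$list"
}

# Demo against a fake module list (contents are illustrative only)
fake=$(mktemp)
printf 'memory_collector 16384 0 - Live 0x0\n' > "$fake"
if module_loaded collector "$fake"; then
    echo "collector loaded"
else
    echo "collector not loaded"
fi
if module_loaded memory_collector "$fake"; then
    echo "memory_collector loaded"
fi
rm -f "$fake"
```

With the fake listing above, a plain `grep collector` would report a match, while the exact comparison does not.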
22 changes: 19 additions & 3 deletions Dockerfile.devcontainer
@@ -3,11 +3,27 @@ FROM ubuntu:latest
 # Avoid prompts during package installation
 ENV DEBIAN_FRONTEND=noninteractive

+# Add amd64 architecture
+RUN dpkg --add-architecture amd64
+
+# Set up repositories for both arm64 and amd64
+RUN echo "deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports noble main restricted universe multiverse\n\
+deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports noble-updates main restricted universe multiverse\n\
+deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports noble-backports main restricted universe multiverse\n\
+deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports noble-security main restricted universe multiverse\n\
+deb [arch=amd64] http://archive.ubuntu.com/ubuntu noble main restricted universe multiverse\n\
+deb [arch=amd64] http://archive.ubuntu.com/ubuntu noble-updates main restricted universe multiverse\n\
+deb [arch=amd64] http://archive.ubuntu.com/ubuntu noble-backports main restricted universe multiverse\n\
+deb [arch=amd64] http://security.ubuntu.com/ubuntu noble-security main restricted universe multiverse" > /etc/apt/sources.list
+
+RUN rm /etc/apt/sources.list.d/ubuntu.sources
+
 # Update and install essential build tools and kernel headers
 RUN apt-get update && apt-get install -y \
     build-essential \
-    linux-headers-6.8.0-52-generic \
-    linux-image-6.8.0-52-generic \
+    crossbuild-essential-amd64 \
+    linux-headers-6.8.0-52-generic:amd64 \
+    linux-image-6.8.0-52-generic:amd64 \
     git \
     vim \
     curl \
@@ -18,4 +34,4 @@ RUN apt-get update && apt-get install -y \
 WORKDIR /workspace

 # Keep container running
-CMD ["sleep", "infinity"]
+CMD ["sleep", "infinity"]
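
The `RUN echo "...\n\` line in the Dockerfile relies on the `echo` built into `/bin/sh` expanding `\n`, which dash (the `/bin/sh` on Ubuntu) does and bash's builtin `echo` does not without `-e`. A sketch of a `printf`-based generator that produces the same eight source lines in any POSIX shell is given below. The function name `emit_sources` is an assumption for illustration; the mirrors and suite are the ones used in the diff.

```shell
#!/usr/bin/env bash
# Sketch: emit apt source lines for one architecture with printf, which
# treats "\n" in the format string the same way in every POSIX shell.
emit_sources() {
    local arch="$1" mirror="$2" security_mirror="$3" suite="$4"
    local components="main restricted universe multiverse"
    local pocket
    for pocket in "" "-updates" "-backports"; do
        printf 'deb [arch=%s] %s %s%s %s\n' \
            "$arch" "$mirror" "$suite" "$pocket" "$components"
    done
    # The security pocket may live on a different mirror (as it does for amd64)
    printf 'deb [arch=%s] %s %s-security %s\n' \
        "$arch" "$security_mirror" "$suite" "$components"
}

emit_sources arm64 http://ports.ubuntu.com/ubuntu-ports http://ports.ubuntu.com/ubuntu-ports noble
emit_sources amd64 http://archive.ubuntu.com/ubuntu http://security.ubuntu.com/ubuntu noble
```

Redirecting the combined output to `/etc/apt/sources.list` would reproduce the file that the `RUN echo` line writes.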
1 change: 1 addition & 0 deletions module/.gitignore
@@ -0,0 +1 @@
+build/
30 changes: 27 additions & 3 deletions module/Makefile
@@ -1,4 +1,11 @@
-obj-m += memory_collector.o
+EXTRA_CFLAGS += -I$(src)
+
+# Define the module name and its source files
+obj-m := collector.o
+collector-objs := memory_collector.o resctrl.o
+
+# Always set architecture to x86_64 (not just x86)
+ARCH := x86_64

 # Check if KVERSION is provided on command line
 ifdef KVERSION
@@ -19,15 +26,32 @@ endif
 KDIR := /lib/modules/$(KERNEL_VERSION)/build
 BUILD_DIR := $(PWD)/build

+# Set cross-compilation by default in container
+ifneq (,$(wildcard /.dockerenv))
+CROSS_COMPILE := x86_64-linux-gnu-
+endif
+
+# Also set cross-compilation on ARM machines
+ifeq ($(shell uname -m),arm64)
+CROSS_COMPILE := x86_64-linux-gnu-
+endif
+
 all: | $(BUILD_DIR)
 	@echo "Building for kernel version: $(KERNEL_VERSION)"
-	$(MAKE) -C $(KDIR) M=$(BUILD_DIR) src=$(PWD) modules
+	$(MAKE) -C $(KDIR) M=$(BUILD_DIR) src=$(PWD) \
+		ARCH=$(ARCH) \
+		CROSS_COMPILE=$(CROSS_COMPILE) \
+		modules

 $(BUILD_DIR):
 	mkdir -p $(BUILD_DIR)

 clean:
-	$(MAKE) -C $(KDIR) M=$(BUILD_DIR) src=$(PWD) clean
+	$(MAKE) -C $(KDIR) M=$(BUILD_DIR) src=$(PWD) \
+		ARCH=$(ARCH) \
+		CROSS_COMPILE=$(CROSS_COMPILE) \
+		clean
 	rm -rf $(BUILD_DIR)
 	rm -f Module.symvers Module.markers modules.order

 .PHONY: all clean
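
The Makefile enables cross-compilation when `/.dockerenv` exists or when `uname -m` prints `arm64`. Note that `uname -m` prints `arm64` on macOS but `aarch64` on Linux ARM hosts, so outside a container a Linux ARM machine would not match the second check. A sketch of a host-to-prefix mapping that handles both spellings is shown below. The function `cross_prefix` is a hypothetical helper, not something the PR defines.

```shell
#!/usr/bin/env bash
# Sketch: choose a CROSS_COMPILE prefix from a machine name as printed by `uname -m`.
cross_prefix() {
    case "$1" in
        x86_64|amd64)  printf '' ;;                   # native build, no prefix
        arm64|aarch64) printf 'x86_64-linux-gnu-' ;;  # ARM host targeting x86_64
        *)             return 1 ;;                    # unknown host: caller decides
    esac
}

prefix=$(cross_prefix "$(uname -m)") || echo "unrecognised host: $(uname -m)" >&2
echo "CROSS_COMPILE=${prefix}"
```

The resulting value could be passed on the command line, for example `make CROSS_COMPILE="$prefix"`, since command-line variables override the assignments inside the Makefile.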