Bpf/optimized usdt ci #8664

Draft
wants to merge 1,351 commits into
base: bpf-next_base

Conversation

olsajiri
Contributor

No description provided.

torvalds and others added 30 commits March 5, 2025 07:08
The pipe_occupancy() logic implicitly relied on the natural unsigned
modulo arithmetic in C, but that doesn't work for the new 'pipe_index_t'
case, since any arithmetic will be done in 'int' (and here we had also
made it 'unsigned int' due to the function call boundary).

So make the modulo arithmetic explicit by casting the result to the
proper type.

Cc: Oleg Nesterov <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Manfred Spraul <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Swapnil Sapkal <[email protected]>
Cc: Alexey Gladkov <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Link: https://lore.kernel.org/all/CAHk-=wjyHsGLx=rxg6PKYBNkPYAejgo7=CbyL3=HGLZLsAaJFQ@mail.gmail.com/
Fixes: 3d25216 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")
Signed-off-by: Linus Torvalds <[email protected]>
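
The pipe_occupancy() fix above can be illustrated with a small user-space sketch. It assumes a 16-bit index type; `idx16_t` and both function names are hypothetical stand-ins, not the kernel's code. Once head and tail have been widened to `unsigned int`, the subtraction no longer wraps at 16 bits, so the cast has to do it explicitly:

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit pipe index type. */
typedef uint16_t idx16_t;

/*
 * Implicit version: the arguments are already unsigned int, so the
 * difference wraps modulo 2^32, not modulo 2^16.
 */
static unsigned int occupancy_implicit(unsigned int head, unsigned int tail)
{
	return head - tail;
}

/*
 * Explicit version: the cast reduces the difference modulo 2^16,
 * which is the wraparound the 16-bit counters actually have.
 */
static unsigned int occupancy_explicit(unsigned int head, unsigned int tail)
{
	return (idx16_t)(head - tail);
}
```

With a head that has wrapped to 2 and a tail still at 65534, only the explicit version reports the four occupied slots.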
Add htmldoc annotation for the newly introduced "head_tail" member
describing it to be a union of the pipe_inode_info's @head and @tail
members.

Reported-by: Stephen Rothwell <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Fixes: 3d25216 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Add the __counted_by() compiler attribute to the flexible array member
buf to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

No functional changes intended.

Signed-off-by: Thorsten Blum <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
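
As a rough illustration of the __counted_by() annotation described above, here is a user-space sketch. The struct `msg_buf` and the helper `msg_buf_alloc()` are hypothetical, and the macro falls back to a no-op on compilers that lack the `counted_by` attribute:

```c
#include <stddef.h>
#include <stdlib.h>

/* Use the attribute when the compiler has it, otherwise expand to nothing. */
#ifndef __counted_by
# if defined(__has_attribute)
#  if __has_attribute(counted_by)
#   define __counted_by(member) __attribute__((counted_by(member)))
#  endif
# endif
#endif
#ifndef __counted_by
# define __counted_by(member)
#endif

/* Hypothetical structure: len holds the number of elements in buf[]. */
struct msg_buf {
	size_t len;
	char buf[] __counted_by(len);
};

/* Allocate a zeroed buffer and set the counter before buf[] is used. */
static struct msg_buf *msg_buf_alloc(size_t len)
{
	struct msg_buf *m = calloc(1, sizeof(*m) + len);

	if (!m)
		return NULL;
	m->len = len;
	return m;
}
```

The annotation does not change the layout; it only tells bounds-checking instrumentation which member carries the array length.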
num_gb_pipes was set to a wrong value using r420_pipe_config

This has led to HyperZ glitches on fast Z clearing.

Closes: https://bugs.freedesktop.org/show_bug.cgi?id=110897
Reviewed-by: Marek Olšák <[email protected]>
Signed-off-by: Richard Thier <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 044e59a85c4d84e3c8d004c486e5c479640563a6)
Cc: [email protected]
always allow ih interrupt from fw on smu v14 based on
the interface requirement

Signed-off-by: Kenneth Feng <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit a3199eba46c54324193607d9114a1e321292d7a1)
Cc: [email protected] # 6.12.x
While looking for incorrect users of the pipe head/tail fields (see
commit c27c66a: "fs/pipe: Fix pipe_occupancy() with 16-bit
indexes"), I found a bug in pipe_discard_from() that looked entirely
broken.

However, the fix is trivial: this buggy function isn't actually called
by anything, so let's just remove it ASAP.

Signed-off-by: Linus Torvalds <[email protected]>
…linux/kernel/git/hid/hid

Pull HID fixes from Jiri Kosina:

 - power management fix in intel-thc-hid (Even Xu)

 - nintendo gencon mapping fix (Ryan McClelland)

 - fix for UAF on device disconnect path in hid-steam (Vicki Pfau)

 - two fixes for UAF on device disconnect path in intel-ish-hid (Zhang
   Lixu)

 - fix for potential NULL dereference in hid-appleir (Daniil Dulov)

 - few other small cosmetic fixes (e.g. typos)

* tag 'hid-for-linus-2025030501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: Intel-thc-hid: Intel-quickspi: Correct device state after S4
  HID: intel-thc-hid: Fix spelling mistake "intput" -> "input"
  HID: hid-steam: Fix use-after-free when detaching device
  HID: debug: Fix spelling mistake "Messanger" -> "Messenger"
  HID: appleir: Fix potential NULL dereference at raw event handle
  HID: apple: disable Fn key handling on the Omoton KB066
  HID: i2c-hid: improve i2c_hid_get_report error message
  HID: intel-ish-hid: Fix use-after-free issue in ishtp_hid_remove()
  HID: intel-ish-hid: Fix use-after-free issue in hid_ishtp_cl_remove()
  HID: google: fix unused variable warning under !CONFIG_ACPI
  HID: nintendo: fix gencon button events map
  HID: corsair-void: Update power supply values with a unified work handler
The kernel_recvmsg() function returns an int which could be either
negative error codes or the number of bytes received.  The problem is
that the condition:

        if (ret < sizeof(*icresp)) {

is type promoted to type unsigned long and negative values are treated
as high positive values which is success, when they should be treated as
failure.  Handle invalid positive returns separately from negative
error codes to avoid this problem.

Fixes: 578539e ("nvme-tcp: fix connect failure on receiving partial ICResp PDU")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Caleb Sander Mateos <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
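
The type-promotion problem described above can be reproduced outside the kernel. In this sketch `struct icresp_hdr`, its size, and both helper names are hypothetical; the explicit cast in the first helper spells out the conversion that C applies implicitly when an int is compared against a sizeof result:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical response header; only its size matters here. */
struct icresp_hdr {
	unsigned char bytes[128];
};

/*
 * Buggy pattern: ret is converted to size_t for the comparison, so a
 * negative error code becomes a huge value and passes the check.
 */
static int check_buggy(int ret)
{
	if ((size_t)ret < sizeof(struct icresp_hdr))
		return -ECONNRESET;
	return 0;
}

/* Fixed pattern: negative errors and short reads are handled separately. */
static int check_fixed(int ret)
{
	if (ret < 0)
		return ret;
	if ((size_t)ret < sizeof(struct icresp_hdr))
		return -ECONNRESET;
	return 0;
}
```

Passing -EAGAIN to check_buggy() returns 0 (success), while check_fixed() propagates the error. The choice of -ECONNRESET for a short read is an assumption of this sketch.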
…S35L41 HDA

Add support for ASUS G814PH/PM/PP and G814FH/FM/FP.

Laptops use 2 CS35L41 Amps with HDA, using Internal boost, with I2C.

Signed-off-by: Stefan Binding <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Link: https://patch.msgid.link/[email protected]
… CS35L41 HDA

Add support for ASUS GA603KP, GA603KM and GA603KH.

Laptops use 2 CS35L41 Amps with HDA, using Internal boost, with I2C

Signed-off-by: Stefan Binding <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Link: https://patch.msgid.link/[email protected]
…CS35L41 HDA

Add support for ASUS G614PH/PM/PP and G614FH/FM/FP.

Laptops use 2 CS35L41 Amps with HDA, using Internal boost, with I2C

Signed-off-by: Stefan Binding <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Link: https://patch.msgid.link/[email protected]
… HDA

Add support for ASUS B3405CVA, B5405CVA, B5605CVA, B3605CVA.

Laptops use 2 CS35L41 Amps with HDA, using Internal boost, with SPI

Signed-off-by: Stefan Binding <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Link: https://patch.msgid.link/[email protected]
… CS35L41 HDA

Add support for ASUS B3405CCA / P3405CCA, B3605CCA / P3605CCA,
B3405CCA, B3605CCA.

Laptops use 2 CS35L41 Amps with HDA, using Internal boost, with SPI

Signed-off-by: Stefan Binding <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Link: https://patch.msgid.link/[email protected]
… CS35L41 HDA

Add support for ASUS B5605CCA and B5405CCA.

Laptops use 2 CS35L41 Amps with HDA, using Internal boost, with SPI

Signed-off-by: Stefan Binding <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Link: https://patch.msgid.link/[email protected]
…g CS35L41 HDA

Laptop uses 2 CS35L41 Amps with HDA, using External boost with I2C

Signed-off-by: Stefan Binding <[email protected]>
Signed-off-by: Takashi Iwai <[email protected]>
Link: https://patch.msgid.link/[email protected]
If a userptr vma subject to prefetching was already invalidated
or invalidated during the prefetch operation, the operation would
repeatedly return -EAGAIN which would typically cause an infinite
loop.

Validate the userptr to ensure this doesn't happen.

v2:
- Don't fallthrough from UNMAP to PREFETCH (Matthew Brost)

Fixes: 5bd24e7 ("drm/xe/vm: Subclass userptr vmas")
Fixes: 617eebb ("drm/xe: Fix array of binds")
Cc: Matthew Brost <[email protected]>
Cc: <[email protected]> # v6.9+
Suggested-by: Matthew Brost <[email protected]>
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 03c346d4d0d85d210d549d43c8cfb3dfb7f20e0a)
Signed-off-by: Rodrigo Vivi <[email protected]>
Fix a (harmless) misplaced #endif leading to declarations
appearing multiple times.

Fixes: 0eb2a18 ("drm/xe: Implement VM snapshot support for BO's and userptr")
Cc: Maarten Lankhorst <[email protected]>
Cc: José Roberto de Souza <[email protected]>
Cc: <[email protected]> # v6.12+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Lucas De Marchi <[email protected]>
Reviewed-by: Tejas Upadhyay <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit fcc20a4c752214b3e25632021c57d7d1d71ee1dd)
Signed-off-by: Rodrigo Vivi <[email protected]>
Fix fault mode invalidation racing with unbind leading to the
PTE zapping potentially traversing an invalid page-table tree.
Do this by holding the notifier lock across PTE zapping. This
might transfer any contention waiting on the notifier seqlock
read side to the notifier lock read side, but that shouldn't be
a major problem.

At the same time get rid of the open-coded invalidation in the bind
code by relying on the notifier even when the vma bind is not
yet committed.

Finally let userptr invalidation call a dedicated xe_vm function
performing a full invalidation.

Fixes: e8babb2 ("drm/xe: Convert multiple bind ops into single job")
Cc: Thomas Hellström <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: <[email protected]> # v6.12+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 100a5b8dadfca50d91d9a4c9fc01431b42a25cab)
Signed-off-by: Rodrigo Vivi <[email protected]>
Concurrent VM bind staging and zapping of PTEs from a userptr notifier
do not work because the view of PTEs is not stable. VM binds cannot
acquire the notifier lock during staging, as memory allocations are
required. To resolve this race condition, use a staging tree for VM
binds that is committed only under the userptr notifier lock during the
final step of the bind. This ensures a consistent view of the PTEs in
the userptr notifier.

A follow up may only use staging for VM in fault mode as this is the
only mode in which the above race exists.

v3:
 - Drop zap PTE change (Thomas)
 - s/xe_pt_entry/xe_pt_entry_staging (Thomas)

Suggested-by: Thomas Hellström <[email protected]>
Cc: <[email protected]>
Fixes: e8babb2 ("drm/xe: Convert multiple bind ops into single job")
Fixes: a708f65 ("drm/xe: Update PT layer with better error handling")
Signed-off-by: Matthew Brost <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Thomas Hellström <[email protected]>
(cherry picked from commit 6f39b0c5ef0385eae586760d10b9767168037aa5)
Signed-off-by: Rodrigo Vivi <[email protected]>
Add proper #ifndef around the xe_hmm.h header, proper spacing
and since the documentation mostly follows kerneldoc format,
make it kerneldoc. Also prepare for upcoming -stable fixes.

Fixes: 81e058a ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng <[email protected]>
Cc: <[email protected]> # v6.10+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Auld <[email protected]>
Acked-by: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit bbe2b06b55bc061c8fcec034ed26e88287f39143)
Signed-off-by: Rodrigo Vivi <[email protected]>
The pfns that we obtain from hmm_range_fault() point to pages that
we don't have a reference on, and the guarantee that they are still
in the cpu page-tables is that the notifier lock must be held and the
notifier seqno is still valid.

So while building the sg table and marking the pages accessed / dirty
we need to hold this lock with a validated seqno.

However, the lock is reclaim tainted which makes
sg_alloc_table_from_pages_segment() unusable, since it internally
allocates memory.

Instead build the sg-table manually. For the non-iommu case
this might lead to fewer coalesces, but if that's a problem it can
be fixed up later in the resource cursor code. For the iommu case,
the whole sg-table may still be coalesced to a single contiguous
device va region.

This avoids marking pages that we don't own dirty and accessed, and
it also avoids dereferencing struct pages that we don't own.

v2:
- Use assert to check whether hmm pfns are valid (Matthew Auld)
- Take into account that large pages may cross range boundaries
  (Matthew Auld)

v3:
- Don't unnecessarily check for a non-freed sg-table. (Matthew Auld)
- Add a missing up_read() in an error path. (Matthew Auld)

Fixes: 81e058a ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng <[email protected]>
Cc: <[email protected]> # v6.10+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Auld <[email protected]>
Acked-by: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit ea3e66d280ce2576664a862693d1da8fd324c317)
Signed-off-by: Rodrigo Vivi <[email protected]>
If userptr pages are freed after a call to the xe mmu notifier,
the device will not be blocked out from theoretically accessing
these pages unless they are also unmapped from the iommu, and
this violates some aspects of the iommu-imposed security.

Ensure that userptrs are unmapped in the mmu notifier to
mitigate this. A naive attempt would try to free the sg table, but
the sg table itself may be accessed by a concurrent bind
operation, so settle for only unmapping.

v3:
- Update lockdep asserts.
- Fix a typo (Matthew Auld)

Fixes: 81e058a ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng <[email protected]>
Cc: Matthew Auld <[email protected]>
Cc: <[email protected]> # v6.10+
Signed-off-by: Thomas Hellström <[email protected]>
Reviewed-by: Matthew Auld <[email protected]>
Acked-by: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit ba767b9d01a2c552d76cf6f46b125d50ec4147a6)
Signed-off-by: Rodrigo Vivi <[email protected]>
The IOPOLL path posts CQEs when the io_kiocb is marked as completed,
so it cannot rely on the usual retry that non-IOPOLL requests do for
read/write requests.

If -EAGAIN is received and the request should be retried, go through
the normal completion path and let the normal flush logic catch it and
reissue it, like what is done for !IOPOLL reads or writes.

Fixes: d803d12 ("io_uring/rw: handle -EAGAIN retry at IO completion time")
Reported-by: John Garry <[email protected]>
Link: https://lore.kernel.org/io-uring/[email protected]/
Signed-off-by: Jens Axboe <[email protected]>
There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to processes
when they are observed to be simultaneously running on 4 or more CPUs.

This also allows single-threaded processes to continue using the cheaper, local
TLB invalidation instructions like INVLPG.

Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in
a generically named broadcast_tlb_flush() function which can later also be
used for Intel RAR.

Combined with the removal of unnecessary lru_add_drain() calls (see
https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in
a nice performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:

  - vanilla kernel:           527k loops/second
  - lru_add_drain removal:    731k loops/second
  - only INVLPGB:             527k loops/second
  - lru_add_drain + INVLPGB: 1157k loops/second

Profiling with only the INVLPGB changes showed that while TLB invalidation went
down from 40% of the total CPU time to only around 4% of CPU time, the
contention simply moved to the LRU lock.

Fixing both at the same time about doubles the number of iterations per second
for this case.

Comparing will-it-scale tlb_flush2_threads with several different numbers of
threads on a 72 CPU AMD Milan shows similar results. The number represents the
total number of loops per second across all the threads:

  threads	tip		INVLPGB

  1		315k		304k
  2		423k		424k
  4		644k		1032k
  8		652k		1267k
  16		737k		1368k
  32		759k		1199k
  64		636k		1094k
  72		609k		993k

1 and 2 thread performance is similar with and without INVLPGB, because
INVLPGB is only used on processes using 4 or more CPUs simultaneously.

The number is the median across 5 runs.

Some numbers closer to real world performance can be found at Phoronix, thanks
to Michael:

https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits

  [ bp:
   - Massage
   - :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
   - :%s/\<clear_asid_transition\>/mm_clear_asid_transition/cgi
   - Fold in a 0day bot fix: https://lore.kernel.org/oe-kbuild-all/[email protected]
   ]

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Nadav Amit <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.

This can help reduce the TLB miss rate, by keeping more intermediate mappings
in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit to
1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on TLB
entries. When this bit is 0, these instructions remove the target PTE from the
TLB as well as all upper-level table entries that are cached in the TLB,
whether or not they are associated with the target PTE.  When this bit is set,
these instructions will remove the target PTE and only those upper-level
entries that lead to the target PTE in the page table hierarchy, leaving
unrelated upper-level entries intact.

  [ bp: use cpu_has()... I know, it is a mess. ]

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
When executing the INVLPGB instruction on a bare-metal host or hypervisor, if
the ASID valid bit is not set, the instruction will flush the TLB entries that
match the specified criteria for any ASID, not just those of the host. If
virtual machines are running on the system, this may result in inadvertent
flushes of guest TLB entries.

When executing the INVLPGB instruction in a guest and the INVLPGB instruction is
not intercepted by the hypervisor, the hardware will replace the requested ASID
with the guest ASID and set the ASID valid bit before doing the broadcast
invalidation. Thus a guest is only able to flush its own TLB entries.

So to limit the host TLB flushing reach, always set the ASID valid bit using an
ASID value of 0 (which represents the host/hypervisor). This will result in
the desired effect in both host and guest.

Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/20250304120449.GHZ8bsYYyEBOKQIxBm@fat_crate.local
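
A minimal sketch of the idea above: whatever flags the caller passes, the ASID valid bit is always set and the ASID field is always 0. The flag bit positions and the register layout (page-aligned address plus flags in rAX, PCID in the upper half of EDX and ASID in the lower half) reflect my reading of the AMD manual and should be treated as assumptions; `struct invlpgb_regs` and `host_invlpgb_regs()` are hypothetical names, and no instruction is actually executed:

```c
#include <stdint.h>

/* Assumed rAX flag bits; verify against the AMD manual before relying on them. */
#define INVLPGB_FLAG_VA    (1u << 0)  /* address field is valid */
#define INVLPGB_FLAG_PCID  (1u << 1)  /* PCID field is valid */
#define INVLPGB_FLAG_ASID  (1u << 2)  /* ASID field is valid */

/* Hypothetical container for the two register operands. */
struct invlpgb_regs {
	uint64_t rax;
	uint32_t edx;
};

/* Build operands for a host-side flush that cannot reach guest ASIDs. */
static struct invlpgb_regs host_invlpgb_regs(uint64_t addr, uint16_t pcid,
					     uint32_t flags)
{
	struct invlpgb_regs r;
	const uint32_t host_asid = 0;  /* ASID 0 represents the host/hypervisor */

	/* Always force the ASID valid bit, regardless of the caller's flags. */
	r.rax = (addr & ~0xfffULL) | flags | INVLPGB_FLAG_ASID;
	r.edx = ((uint32_t)pcid << 16) | host_asid;
	return r;
}
```

Because a guest's ASID is substituted by hardware when the instruction is not intercepted, the same operands give the desired effect in both host and guest, as the message explains.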
 Conflicts:
	arch/x86/kernel/cpu/amd.c

Signed-off-by: Ingo Molnar <[email protected]>
On MMIO devices (e.g. MT7988 or EN7581) unicast traffic received on lanX
port is flooded on all other user ports if the DSA switch is configured
without VLAN support since PORT_MATRIX in PCR regs contains all user
ports. Similar to MDIO devices (e.g. MT7530 and MT7531) fix the issue
by defining default VLAN-ID 0 for MT7530 MMIO devices.

Fixes: 110c18b ("net: dsa: mt7530: introduce driver for MT7988 built-in switch")
Signed-off-by: Lorenzo Bianconi <[email protected]>
Reviewed-by: Chester A. Unal <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
This reverts commit a5c6bc5.

The general approach described in commit e076eac ("selftests: break
the dependency upon local header files") was taken one step too far here:
it should not have been extended to include the syscall numbers.  This is
because doing so would require per-arch support in tools/include/uapi, and
no such support exists.

This revert fixes two separate reports of test failures, from Dave
Hansen[1], and Li Wang[2].  An excerpt of Dave's report:

Before this commit (a5c6bc5) things are
fine.  But after, I get:

	running PKEY tests for unsupported CPU/OS

An excerpt of Li's report:

    I just found that mlock2_() return a wrong value in mlock2-test

[1] https://lore.kernel.org/[email protected]
[2] https://lore.kernel.org/CAEemH2eW=UMu9+turT2jRie7+6ewUazXmA6kL+VBo3cGDGU6RA@mail.gmail.com

Link: https://lkml.kernel.org/r/[email protected]
Fixes: a5c6bc5 ("selftests/mm: remove local __NR_* definitions")
Signed-off-by: John Hubbard <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Li Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
We are about to add uprobe trampoline, so cleaning up the namespace.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Making copy_from_page global and adding uprobe prefix.
Adding the uprobe prefix to copy_to_page as well for symmetry.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
The uprobe_write_opcode function currently also updates the refctr offset
if there's one defined for the uprobe.

This is not handy for following changes, which need to make several
updates (writes) to install or remove a uprobe, but update the refctr offset
just once.

Adding set_swbp_refctr/set_orig_refctr which makes sure refctr offset
is updated.

Signed-off-by: Jiri Olsa <[email protected]>
Adding uprobe_write function that does what uprobe_write_opcode did
so far, but allows to pass verify callback function that checks the
memory location before writing the opcode.

It will be used in following changes to simplify the checking logic.

The uprobe_write_opcode now calls uprobe_write with verify_opcode as
the verify callback.

Signed-off-by: Jiri Olsa <[email protected]>
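
The verify-callback pattern described above might look roughly like this in user space. Everything here is a hypothetical sketch rather than the kernel's uprobe_write(): the names, the byte-array "page", and the return convention (negative on error, 0 when the callback says no write is needed, 1 when the bytes were written) are all assumptions:

```c
#include <stddef.h>
#include <string.h>

/* Callback that inspects the target bytes before they are overwritten. */
typedef int (*verify_fn)(const unsigned char *mem,
			 const unsigned char *insn, size_t nbytes);

/* Example check: skip the write when the bytes are already in place. */
static int verify_differs(const unsigned char *mem,
			  const unsigned char *insn, size_t nbytes)
{
	return memcmp(mem, insn, nbytes) != 0;
}

/* Write nbytes of insn at page + off, but only if verify() allows it. */
static int write_checked(unsigned char *page, size_t page_size, size_t off,
			 const unsigned char *insn, size_t nbytes,
			 verify_fn verify)
{
	int ret;

	/* Reject writes that would run past the end of the page. */
	if (off > page_size || nbytes > page_size - off)
		return -1;

	ret = verify(page + off, insn, nbytes);
	if (ret <= 0)
		return ret;

	memcpy(page + off, insn, nbytes);
	return 1;
}
```

Keeping the check behind a function pointer lets different callers (breakpoint install, original instruction restore, multi-byte updates) reuse one write path with their own validation.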
Adding nbytes argument to uprobe_write_opcode as preparation
for writing whole instructions in following changes.

Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
The uprobe_write has special path to restore the original page when
we write original instruction back.

This happens when uprobe_write detects that we want to write anything
else but breakpoint instruction.

In following changes we want to use uprobe_write function for multiple
updates, so adding new function argument to denote that this is the
original instruction update. This way uprobe_write can make appropriate
checks and restore the original page when possible.

Signed-off-by: Jiri Olsa <[email protected]>
@olsajiri olsajiri force-pushed the bpf/optimized_usdt_ci branch 3 times, most recently from bd2fa12 to fcaa810 Compare March 12, 2025 11:42
olsajiri added 17 commits March 12, 2025 15:36
Adding new uprobe syscall that calls uprobe handlers for given
'breakpoint' address.

The idea is that the 'breakpoint' address calls the user space
trampoline which executes the uprobe syscall.

The syscall handler reads the return address of the initial call
to retrieve the original 'breakpoint' address. With this address
we find the related uprobe object and call its consumers.

Adding the arch_uprobe_trampoline_mapping function that provides
uprobe trampoline mapping. This mapping is backed with one global
page initialized at __init time and shared by all the mapping
instances.

We do not allow to execute uprobe syscall if the caller is not
from uprobe trampoline mapping.

Signed-off-by: Jiri Olsa <[email protected]>
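
The address recovery described above can be sketched as plain arithmetic. This assumes the probe site holds a 5-byte call instruction, so the return address pushed by that call is the probe address plus 5; the function names and the trampoline bounds parameters are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the probe site was rewritten to a 5-byte call. */
#define CALL_INSN_SIZE 5

/* True when ip lies inside [tramp_start, tramp_start + tramp_size). */
static bool ip_in_trampoline(uint64_t ip, uint64_t tramp_start,
			     uint64_t tramp_size)
{
	return ip >= tramp_start && ip - tramp_start < tramp_size;
}

/*
 * Refuse callers outside the trampoline; otherwise derive the original
 * 'breakpoint' address from the return address of the initial call.
 */
static int uprobe_syscall_bp_vaddr(uint64_t ip, uint64_t ret_addr,
				   uint64_t tramp_start, uint64_t tramp_size,
				   uint64_t *bp_vaddr)
{
	if (!ip_in_trampoline(ip, tramp_start, tramp_size))
		return -1;

	*bp_vaddr = ret_addr - CALL_INSN_SIZE;
	return 0;
}
```

The recovered address is what would then be used to look up the uprobe object and run its consumers.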
Adding support to add special mapping for user space trampoline
with following functions:

  uprobe_trampoline_get - find or add related uprobe_trampoline
  uprobe_trampoline_put - remove ref or destroy uprobe_trampoline

The user space trampoline is exported as architecture specific user space
special mapping, which is provided by arch_uprobe_trampoline_mapping
function.

The uprobe trampoline needs to be callable/reachable from the probe address,
so while searching for available address we use uprobe_is_callable function
to decide if the uprobe trampoline is callable from the probe address.

All uprobe_trampoline objects are stored in uprobes_state object and are
cleaned up when the process mm_struct goes down. Adding new arch hooks
for that, because this change is x86_64 specific.

Locking is provided by callers in following changes.

Signed-off-by: Jiri Olsa <[email protected]>
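
The "callable from the probe address" condition above comes down to whether the distance fits the signed 32-bit displacement of an x86 near call. A minimal sketch, with a hypothetical function name and assuming a 5-byte call at the probe address:

```c
#include <stdbool.h>
#include <stdint.h>

#define CALL_INSN_SIZE 5

/*
 * A near call encodes its target relative to the end of the call
 * instruction, so the displacement must fit in a signed 32-bit value.
 */
static bool tramp_reachable(uint64_t vtramp, uint64_t vaddr)
{
	int64_t delta = (int64_t)(vtramp - (vaddr + CALL_INSN_SIZE));

	return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

A search for a free trampoline address would apply this predicate to each candidate and reuse an existing trampoline whenever one already passes.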
Adding support to emulate nop5 as the original uprobe instruction.

Signed-off-by: Jiri Olsa <[email protected]>
Putting together all the previously added pieces to support optimized
uprobes on top of 5-byte nop instruction.

The current uprobe execution goes through following:
  - installs breakpoint instruction over original instruction
  - the exception handler is hit and calls related uprobe consumers
  - and either simulates original instruction or does out of line single step
    execution of it
  - returns to user space

The optimized uprobe path

  - checks the original instruction is 5-byte nop (plus other checks)
  - adds (or uses existing) user space trampoline and overwrites original
    instruction (5-byte nop) with call to user space trampoline
  - the user space trampoline executes uprobe syscall that calls related uprobe
    consumers
  - trampoline returns back to next instruction
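
The overwrite step in the optimized path can be sketched as follows. It assumes the common 5-byte nop encoding `0f 1f 44 00 00` and the near-call opcode `e8` followed by a little-endian 32-bit displacement; `optimize_nop5()` is a hypothetical helper operating on a local byte array, not the kernel's implementation:

```c
#include <stdint.h>
#include <string.h>

static const uint8_t nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Replace a 5-byte nop at vaddr with "call tramp".
 * Returns 0 on success, -1 if the bytes are not a nop5 or the
 * trampoline is out of range for a rel32 call.
 */
static int optimize_nop5(uint8_t insn[5], uint64_t vaddr, uint64_t tramp)
{
	int64_t rel = (int64_t)(tramp - (vaddr + 5));
	uint32_t u;

	if (memcmp(insn, nop5, sizeof(nop5)) != 0)
		return -1;
	if (rel < INT32_MIN || rel > INT32_MAX)
		return -1;

	u = (uint32_t)(int32_t)rel;
	insn[0] = 0xe8;                 /* call rel32 */
	insn[1] = u & 0xff;             /* displacement, little-endian */
	insn[2] = (u >> 8) & 0xff;
	insn[3] = (u >> 16) & 0xff;
	insn[4] = (u >> 24) & 0xff;
	return 0;
}
```

For a probe at 0x1000 and a trampoline at 0x2000 the displacement is 0x2000 - 0x1005 = 0xffb, giving the bytes `e8 fb 0f 00 00`. A live update additionally has to be safe against threads executing the instruction concurrently, which this sketch does not attempt.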

This approach won't speed up all uprobes as it's limited to using nop5 as
original instruction, but we could use nop5 as USDT probe instruction (which
uses single byte nop ATM) and speed up the USDT probes.

This patch overloads related arch functions in uprobe_write_opcode and
set_orig_insn so they can install call instruction if needed.

The arch_uprobe_optimize triggers the uprobe optimization and is called after
first uprobe hit. I originally had it called on uprobe installation but then
it clashed with elf loader, because the user space trampoline was added in a
place where loader might need to put elf segments, so I decided to do it after
first uprobe hit when loading is done.

We do not unmap and release uprobe trampoline when it's no longer needed,
because there's no easy way to make sure none of the threads is still
inside the trampoline. But we do not waste memory, because there's just
single page for all the uprobe trampoline mappings.

We do waste a frame on page mapping for every 4GB by keeping the uprobe
trampoline page mapped, but that seems ok.

Attaching the speed up from benchs/run_bench_uprobes.sh script:

current:
        usermode-count :  818.836 ± 2.842M/s
        syscall-count  :    8.917 ± 0.003M/s
        uprobe-nop     :    3.056 ± 0.013M/s
        uprobe-push    :    2.903 ± 0.002M/s
        uprobe-ret     :    1.533 ± 0.001M/s
-->     uprobe-nop5    :    1.492 ± 0.000M/s
        uretprobe-nop  :    1.783 ± 0.000M/s
        uretprobe-push :    1.672 ± 0.001M/s
        uretprobe-ret  :    1.067 ± 0.002M/s
-->     uretprobe-nop5 :    1.052 ± 0.000M/s

after the change:

        usermode-count :  818.386 ± 1.886M/s
        syscall-count  :    8.923 ± 0.003M/s
        uprobe-nop     :    3.086 ± 0.005M/s
        uprobe-push    :    2.751 ± 0.001M/s
        uprobe-ret     :    1.481 ± 0.000M/s
-->     uprobe-nop5    :    4.016 ± 0.002M/s
        uretprobe-nop  :    1.712 ± 0.008M/s
        uretprobe-push :    1.616 ± 0.001M/s
        uretprobe-ret  :    1.052 ± 0.000M/s
-->     uretprobe-nop5 :    2.015 ± 0.000M/s
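
For reference, the relative gain on the two marked rows can be computed directly from the figures above; `speedup()` is just a hypothetical helper for the division:

```c
/* Ratio of throughput after the change to throughput before it. */
static double speedup(double before_mps, double after_mps)
{
	return after_mps / before_mps;
}
```

With the numbers listed, uprobe-nop5 goes from 1.492 to 4.016 M/s (roughly 2.7x) and uretprobe-nop5 from 1.052 to 2.015 M/s (roughly 1.9x).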

Signed-off-by: Jiri Olsa <[email protected]>
Adding __test_uprobe_syscall with non x86_64 stub to execute all the tests,
so we don't need to keep adding non x86_64 stub functions for new tests.

Signed-off-by: Jiri Olsa <[email protected]>
Using 5-byte nop for x86 usdt probes so we can switch
them to optimized uprobes.

Signed-off-by: Jiri Olsa <[email protected]>
Adding tests for optimized uprobe/usdt probes.

Checking that we get expected trampoline and attached bpf programs
get executed properly.

Signed-off-by: Jiri Olsa <[email protected]>
Adding test that makes sure parallel execution of the uprobe and
attach/detach of optimized uprobe on it works properly.

Signed-off-by: Jiri Olsa <[email protected]>
Make sure that calling uprobe syscall from outside uprobe trampoline
results in SIGILL signal.

Signed-off-by: Jiri Olsa <[email protected]>
Adding optimized usdt variant for basic usdt test to check that
usdt arguments are properly passed in optimized code path.

Signed-off-by: Jiri Olsa <[email protected]>
Add 5-byte nop uprobe trigger bench (x86_64 specific) to measure
uprobes/uretprobes on top of nop5 instruction.

Signed-off-by: Jiri Olsa <[email protected]>
Pass uprobe system call through seccomp without depending on configuration.

Note: the uprobe syscall is currently x86_64-only and isn't expected to ever be
supported on i386.

Signed-off-by: Jiri Olsa <[email protected]>
  UPROBE.not_attached.uprobe_default_allow
  UPROBE.not_attached.uprobe_default_block
  UPROBE.not_attached.uprobe_block_syscall
  UPROBE.not_attached.uprobe_default_block_with_syscall
  UPROBE.uprobe_attached.uprobe_default_allow
  UPROBE.uprobe_attached.uprobe_default_block
  UPROBE.uprobe_attached.uprobe_block_syscall
  UPROBE.uprobe_attached.uprobe_default_block_with_syscall
  UPROBE.uretprobe_attached.uprobe_default_allow
  UPROBE.uretprobe_attached.uprobe_default_block
  UPROBE.uretprobe_attached.uprobe_block_syscall
  UPROBE.uretprobe_attached.uprobe_default_block_with_syscall

Signed-off-by: Jiri Olsa <[email protected]>
@olsajiri olsajiri force-pushed the bpf/optimized_usdt_ci branch from fcaa810 to b836e48 Compare March 12, 2025 14:36
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 2 times, most recently from 44c3a1d to 720c696 Compare March 12, 2025 23:33