Kernel: Start reducing the amount of RecursiveSpinlocks #25931


Open · wants to merge 17 commits into master

Conversation

@implicitfield (Contributor) commented May 17, 2025

The Linux kernel doesn't have any recursive spinlocks at all (well, at least not kernel-wide; I don't know about random subsystems), so we certainly don't need to have them either. Primarily, they just lead to rather sloppy code (see all the refactoring in this PR, which could have been fairly easily avoided if we hadn't used them to begin with) and are very SMP-unfriendly (a crash here is both commonplace and nigh-impossible to debug).

Historically, the proliferation of RecursiveSpinlocks has also been driven by the fact that SpinlockProtected (since its introduction in 019ad8a, part of #8851) used a RecursiveSpinlock under the hood. This was only recently corrected (in a264a83), though it naturally left a ton of instances of RecursiveSpinlockProtected all over the place.

This PR is based on a branch where I removed RecursiveSpinlock entirely, but that didn't really pan out for SMP systems either, because when you put a proper Spinlock on the scheduler, only a single thread can run at a time. In the long term, it might prove favorable to rework the scheduling in an xv6 style, where each CPU has its own instance of the scheduler (at least that would make the locking less of a nightmare).

In conclusion, this PR shouldn't have much of an impact on the behavior of the system, but it should be a good step towards getting rid of RecursiveSpinlocks entirely. It probably fixes #22585, because nothing in Kernel/FileSystem uses RecursiveSpinlocks anymore, but otherwise it shouldn't affect the stability of SMP systems one way or the other.

@github-actions bot added the "👀 pr-needs-review" label May 17, 2025
@Hendiadyoin1 (Contributor) left a comment:

Just one thing that stuck out to me
(Also I should revive #20619, as that plays into this)

@Hendiadyoin1 (Contributor) left a comment:

Two more things on the second read-through; otherwise this looks fine by me.

@spholz (Member) left a comment:

Here is my first review. I still need to go through all the "Prefer non-recursive Spinlocks" commits to check whether it's safe to use non-recursive spinlocks in all those cases.

- static RecursiveSpinlock<LockRank::None> s_lock {}; // needs to be recursive because of dump_backtrace()
+ static Spinlock<LockRank::None> s_lock {};

  static void* kmalloc_impl(size_t size, size_t alignment, CallerWillInitializeMemory caller_will_initialize_memory, bool acquire_lock);
A Member commented:

Won't this break if we allocate memory during a panic in kmalloc.cpp?
The call to critical_dmesgln could theoretically allocate memory. But usually the inline capacity of the StringBuilder should be enough.

Ideally we should get rid of all memory allocations during panics of course.

A Contributor replied:

In theory, we could set a global variable during panic to disable the lock, as we are going down anyway, but that's likely a thing for another day.
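
A minimal sketch of that idea, assuming the kmalloc_impl signature quoted above (the g_in_panic flag, the enum value, and the alignment are my inventions, not code from this PR):

```
#include <AK/Atomic.h>

// Hedged sketch only: skip the kmalloc lock once a panic has begun,
// since the system is going down anyway. g_in_panic is a hypothetical name.
static Atomic<bool> g_in_panic { false };

void* kmalloc(size_t size)
{
    // On the panic path, tell kmalloc_impl not to take the (possibly held) lock.
    bool acquire_lock = !g_in_panic.load(AK::memory_order_relaxed);
    return kmalloc_impl(size, alignof(max_align_t), CallerWillInitializeMemory::No, acquire_lock);
}
```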

The author (implicitfield) replied:

I guess force-disabling the lock during a panic would indeed work, but in the end we should really be avoiding allocation if we're already panicking.

@@ -44,7 +44,7 @@ UNMAP_AFTER_INIT void Device::initialize_base_devices()
      s_all_details->base_devices = move(base_devices);
  }

- RefPtr<Device> Device::acquire_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor)
+ void Device::run_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor, Function<void(RefPtr<Device>)> callback)
A Member commented:

I don't really like this name, but I can't think of a good one either.
Maybe something like for_device_by_type_and_major_minor_numbers?

The author (implicitfield) replied:

Hmm, it's not really a "for" either, though, because the callback should only ever get called once, since the underlying map can contain just one instance of a given major/minor number pair.

@implicitfield force-pushed the spinlocks branch 3 times, most recently from 46ee891 to 7477d68 (May 19, 2025 17:24)
@BuggieBot (Member) commented:
Hello!

One or more of the commit messages in this PR do not match the SerenityOS code submission policy; please check the `lint_commits` CI job for details on which commits were flagged and why.
Please do not close this PR and open another; instead, modify your commit message(s) with `git commit --amend` and force-push those changes to update this PR.

@implicitfield force-pushed the spinlocks branch 2 times, most recently from 8f97cb7 to 5be193a (May 19, 2025 18:51)
@implicitfield (Author) commented:

That should fix the deadlock that's been haunting CI... (it was just an instance of calling `kfree_sized` while the big kmalloc lock was already held).

This eliminates the possibility of panicking due to encountering a page
fault while trying to access the cache inside an interrupt handler.

This makes it possible to (just barely) keep up when enabling kmalloc
stack dumping (sysctl -w kmalloc_stacks=1) while running in a
hardware-accelerated aarch64 VM.

Previously, list nodes would keep the loop device itself alive until the
node was manually removed from the list. This patch breaks that
reference cycle and allows normal reference counting to take care of
the lifetimes of loop devices (although we do have to leak a reference
on creation, because nothing truly owns a loop device).

Previously, you'd occasionally see this sort of construct (simplified):
```
return FooDevice::all_instances().with([&](auto& list) {
    for (auto& device : list) {
        if (/* search condition */)
            continue;
        return do_something_that_reacquires(device.index());
    }
    return ENOENT;
});
```

This would only work if the relevant `all_instances()` used a recursive
spinlock, since reacquiring the device would lock that again.

To get around this, a new helper has been added in `Device` through
which devices can be accessed whilst locking the device list, removing
the need to loop through the device type's `all_instances()`.
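
As a hedged illustration (the callback body, the call site, and the DeviceNodeType value are invented for this sketch), a call through the new helper looks something like this:

```
// The helper locks the device list once, resolves the single matching device,
// and hands it to the callback (a null RefPtr if absent), so the callback never
// needs to reacquire the list lock.
Device::run_by_type_and_major_minor_numbers(DeviceNodeType::Block, major, minor,
    [&](RefPtr<Device> device) {
        if (!device)
            return; // nothing registered under this major/minor pair
        do_something(*device); // hypothetical stand-in for the old reacquiring call
    });
```
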
This assumption doesn't hold without the current Custody caching
mechanism.

This is nigh-impossible to reconcile with non-recursive Spinlocks.
@implicitfield (Author) commented:
Just turned MasterPTY's m_slave member back into a RecursiveSpinlockProtected, since that seems to get locked twice if TTY echo is enabled.

@spholz (Member) left a comment:

Second round; I still need to look at the last 4 commits.

@@ -15,12 +15,14 @@ constexpr static AK::Duration refresh_interval = AK::Duration::from_milliseconds(
  NonnullLockRefPtr<Console> Console::initialize(VirtIODisplayConnector& parent_display_connector)
  {
      auto current_resolution = parent_display_connector.current_mode_setting();
-     return adopt_lock_ref(*new Console(parent_display_connector, current_resolution));
+     auto refresh_timer = adopt_nonnull_ref_or_enomem(new (nothrow) Timer()).release_value_but_fixme_should_propagate_errors();
+     return adopt_lock_ref(*new Console(parent_display_connector, current_resolution, *refresh_timer));
A Member commented:

nit: unnecessary `*`

        PRINT_LINE(message.characters_without_null_termination(), message.length());
    } else {
        constexpr auto buffer_size = 255;
        constexpr auto maximum_printable_name_length = buffer_size - (sizeof("Kernel + 0x +\n") + sizeof(FlatPtr) * 4);
A Member commented:

sizeof() includes the null terminator, which is probably not what you want.
And you also forgot to include the second 0x in the calculation.
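
For reference (my addition, not part of the review), the null-terminator point in one line:

```
// sizeof on a string literal counts the trailing '\0':
static_assert(sizeof("Kernel + 0x +\n") == 15);     // 14 visible characters + the NUL
static_assert(sizeof("Kernel + 0x +\n") - 1 == 14); // what strlen() would report
```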

- static RecursiveSpinlock<LockRank::None> s_lock {}; // needs to be recursive because of dump_backtrace()
+ static Spinlock<LockRank::None> s_lock {};

  static void* kmalloc_impl(size_t size, size_t alignment, CallerWillInitializeMemory caller_will_initialize_memory, CallerHasAcquiredLock acquire_lock);
A Member commented:

The variable should probably be called caller_has_acquired_lock now.

@@ -44,7 +44,7 @@ UNMAP_AFTER_INIT void Device::initialize_base_devices()
      s_all_details->base_devices = move(base_devices);
  }

- RefPtr<Device> Device::acquire_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor)
+ void Device::run_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor, Function<void(RefPtr<Device>)> callback)
A Member commented:

A Function isn't cheap to copy, so please use a const reference here.

A Contributor replied:

An rvalue reference should also work, I think.
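
Both suggestions side by side, sketched against the signature from the diff above:

```
// Borrow the callable; the callee may only invoke it:
void run_by_type_and_major_minor_numbers(DeviceNodeType, MajorNumber, MinorNumber,
    Function<void(RefPtr<Device>)> const& callback);

// Or take ownership via an rvalue reference; callers pass a temporary or move():
void run_by_type_and_major_minor_numbers(DeviceNodeType, MajorNumber, MinorNumber,
    Function<void(RefPtr<Device>)>&& callback);
```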

@@ -38,11 +38,18 @@ VMObject::~VMObject()
      VERIFY(m_regions.is_empty());
  }

+ void VMObject::remap_regions_locked()
A Member commented:

Shouldn't remap_regions_locked() also be used in InodeVMObject::try_release_clean_pages() and AnonymousVMObject::purge()?

@@ -87,7 +87,7 @@ class SharedFramebufferVMObject final : public VMObject {
      LockRefPtr<FakeWritesFramebufferVMObject> m_fake_writes_framebuffer_vmobject;
      LockRefPtr<RealWritesFramebufferVMObject> m_real_writes_framebuffer_vmobject;
      bool m_writes_are_faked { false };
-     mutable RecursiveSpinlock<LockRank::None> m_writes_state_lock {};
+     mutable Spinlock<LockRank::None> m_writes_state_lock {};
A Member commented:

This will break when switching TTYs:
SharedFramebufferVMObject::switch_to_fake_sink_framebuffer_writes() locks this lock
-> VMObject::remap_regions()
-> VMObject::remap_regions_locked()
-> Region::remap_with_locked_vmobject()
-> Region::remap_impl()
-> Region::map(PageDirectory& page_directory, ShouldLockVMObject should_lock_vmobject, ShouldFlushTLB should_flush_tlb)
-> Region::map_individual_page_impl(size_t page_index, ShouldLockVMObject should_lock_vmobject)
-> Region::physical_page_locked(size_t index) const
-> SharedFramebufferVMObject::physical_pages() const also locks this lock

Comment on lines +206 to +210
bool Region::should_dirty_on_write_impl(size_t page_index) const
{
    VERIFY(vmobject().m_lock.is_locked());
    return !static_cast<InodeVMObject const&>(vmobject()).is_page_dirty(first_page_index() + page_index);
}
A Member commented:

Please make this a lambda in `should_dirty_on_write` instead.
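
A hedged sketch of that suggestion, assuming the ShouldLockVMObject split works like the other locking branches in this PR:

```
bool Region::should_dirty_on_write(size_t page_index, ShouldLockVMObject should_lock_vmobject) const
{
    // The former _impl helper folded into a local lambda:
    auto is_page_clean = [&] {
        VERIFY(vmobject().m_lock.is_locked());
        return !static_cast<InodeVMObject const&>(vmobject()).is_page_dirty(first_page_index() + page_index);
    };
    if (should_lock_vmobject == ShouldLockVMObject::Yes) {
        SpinlockLocker locker(vmobject().m_lock);
        return is_page_clean();
    }
    return is_page_clean();
}
```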

@@ -32,6 +32,11 @@ enum class ShouldFlushTLB {
      Yes,
  };

+ enum class ShouldLockVMObject {
A Member commented:

No user outside of Region ever uses `ShouldLockVMObject::No`, so I would instead keep the non-locking versions of these functions private, add `_with_locked_vmobject` versions, and keep this enum private.
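
In other words, something like this (a sketch; remap_with_locked_vmobject already appears in the call chain quoted earlier, the rest of the shape is assumed):

```
class Region final {
public:
    void remap();                      // takes vmobject().m_lock itself
    void remap_with_locked_vmobject(); // caller already holds vmobject().m_lock
private:
    enum class ShouldLockVMObject { No, Yes }; // now an implementation detail
    void remap_impl(ShouldLockVMObject);
};
```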

@@ -179,7 +185,7 @@ class Region final

      [[nodiscard]] size_t cow_pages() const;

-     [[nodiscard]] bool should_dirty_on_write(size_t page_index) const;
+     [[nodiscard]] bool should_dirty_on_write(size_t page_index, ShouldLockVMObject should_lock_vmobject) const;
A Member commented:

Not really related to your changes, but this function seems like it should be private

      void remap_regions();
      bool remap_regions_one_page(size_t page_index, NonnullRefPtr<PhysicalRAMPage> page);

      IntrusiveListNode<VMObject> m_list_node;
      FixedArray<RefPtr<PhysicalRAMPage>> m_physical_pages;

-     mutable RecursiveSpinlock<LockRank::None> m_lock {};
+     mutable Spinlock<LockRank::None> m_lock {};
A Member commented:

This lock is used in so many places with quite complicated code, so I can't really check whether this is a safe change. I just hope it doesn't introduce some new deadlock in some weird edge case. In my testing your PR worked fine so far, and I hope you tested this as well.

It would be nice if there were some automated way to verify via static analysis that no spinlock can be locked twice. That can't cover page faults, but I hope we don't cause page faults while this lock is already taken.
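
For what it's worth, Clang's -Wthread-safety analysis can statically catch some double-lock patterns if the lock types are annotated; a minimal sketch (not SerenityOS code, and as noted it can't see through page faults):

```
// Compile with: clang++ -Wthread-safety -c example.cpp
class __attribute__((capability("spinlock"))) AnnotatedSpinlock {
public:
    void lock() __attribute__((acquire_capability()));
    void unlock() __attribute__((release_capability()));
};

AnnotatedSpinlock s_lock;

void double_lock()
{
    s_lock.lock();
    s_lock.lock(); // warning: acquiring 's_lock' that is already held (approximate message)
    s_lock.unlock();
    s_lock.unlock();
}
```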

A Member commented:

This lock was made recursive in 082ed6f. The commit description doesn't give a reason why, though.

@spholz (Member) left a comment:

The last 4 commits look fine, except for these 2 small instances of double locking.

@@ -145,11 +145,11 @@ class Inode : public ListedRefCounted<Inode, LockType::Spinlock>
      };

      Thread::FlockBlockerSet m_flock_blocker_set;
-     RecursiveSpinlockProtected<Vector<Flock>, LockRank::None> m_flocks {};
+     SpinlockProtected<Vector<Flock>, LockRank::None> m_flocks {};
A Member commented:

`Inode::try_apply_flock` appears to double-lock `m_flocks`, since it calls `Inode::can_apply_flock`, which also locks `m_flocks`.
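
The usual fix for this shape of bug, sketched with hypothetical names and a simplified signature (lock once in the outer function, have the inner check take the already-locked data):

```
ErrorOr<void> Inode::try_apply_flock(Flock const& new_lock)
{
    return m_flocks.with([&](Vector<Flock>& flocks) -> ErrorOr<void> {
        // Hypothetical variant of can_apply_flock that inspects the locked
        // Vector directly instead of re-locking m_flocks itself.
        if (!can_apply_flock_on(flocks, new_lock))
            return EAGAIN;
        TRY(flocks.try_append(new_lock));
        return {};
    });
}
```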

@@ -307,7 +299,7 @@ class MemoryManager {
      PhysicalPageEntry* m_physical_page_entries { nullptr };
      size_t m_physical_page_entries_count { 0 };

-     RecursiveSpinlockProtected<GlobalData, LockRank::None> m_global_data;
+     SpinlockProtected<GlobalData, LockRank::None> m_global_data;
A Member commented:

This is locked twice during `find_rsdp_in_ia_pc_specific_memory_locations` if the RSDP wasn't found in the BIOS/EBDA: first by `MemoryManager::for_each_physical_memory_range`, and then by `MemoryManager::allocate_mmio_kernel_region`.

@spholz (Member) commented May 30, 2025:

Thanks for removing all of these RecursiveSpinlocks! I agree that we should preferably have fewer of them, and possibly even none.

Reviewing this PR was just a bit annoying since you replaced so many of them at once.
Next time you make such a large refactor that requires this much attention, please split it into smaller commits, and probably even multiple PRs. Thanks!

@spholz (Member) commented May 30, 2025:

I found a new deadlock(?) with this PR applied:

Running /usr/Tests/LibELF/TestOrder on riscv64 with FileManager open hangs with this backtrace:

(gdb) bt
#0  AK::Atomic<unsigned char, (AK::MemoryOrder)5>::exchange (this=0x2000b4fe10 <initial_kmalloc_memory+5864976>, desired=1 '\001', order=AK::memory_order_acquire) at ././Kernel/Locking/Spinlock.h:29
#1  Kernel::Spinlock<(Kernel::LockRank)0>::lock (this=0x2000b4fe10 <initial_kmalloc_memory+5864976>) at ././Kernel/Locking/Spinlock.h:29
#2  0x0000002000009f7c in Kernel::SpinlockLocker<Kernel::Spinlock<(Kernel::LockRank)0> >::SpinlockLocker (this=0x2017e3f958, lock=<optimized out>) at ././Kernel/Locking/Spinlock.h:135
#3  0x00000020000f4000 in Kernel::SpinlockProtectedBase<AK::HashTable<Kernel::InodeWatcher*, AK::Traits<Kernel::InodeWatcher*>, false>, Kernel::Spinlock<(Kernel::LockRank)0> >::Locked<AK::HashTable<Kernel::InodeWatcher*, AK::Traits<Kernel::InodeWatcher*>, false> >::Locked
    (this=0x2017e3f950, value=..., spinlock=...) at ././Kernel/Locking/SpinlockProtectedBase.h:27
#4  Kernel::SpinlockProtectedBase<AK::HashTable<Kernel::InodeWatcher*, AK::Traits<Kernel::InodeWatcher*>, false>, Kernel::Spinlock<(Kernel::LockRank)0> >::lock_mutable (this=0x2000b4feb0 <initial_kmalloc_memory+5865136>) at ././Kernel/Locking/SpinlockProtectedBase.h:46
#5  Kernel::SpinlockProtectedBase<AK::HashTable<Kernel::InodeWatcher*, AK::Traits<Kernel::InodeWatcher*>, false>, Kernel::Spinlock<(Kernel::LockRank)0> >::with<Kernel::Inode::unregister_watcher(AK::Badge<Kernel::InodeWatcher>, Kernel::InodeWatcher&)::<lambda(auto:199&)> >(struct {...}) (this=this@entry=0x2000b4fdf8 <initial_kmalloc_memory+5864952>, callback=...) at ././Kernel/Locking/SpinlockProtectedBase.h:65
#6  0x00000020000f404c in Kernel::Inode::unregister_watcher (this=this@entry=0x2000b4fd40 <initial_kmalloc_memory+5864768>, watcher=<optimized out>) at ./Kernel/FileSystem/Inode.cpp:188
#7  0x00000020000f9d7c in operator()<Kernel::InodeWatcher::WatchMaps> (__closure=__closure@entry=0x2017e3fa30, watch_maps=...) at ././AK/NonnullOwnPtr.h:98
#8  0x00000020000f9e98 in Kernel::SpinlockProtectedBase<Kernel::InodeWatcher::WatchMaps, Kernel::Spinlock<(Kernel::LockRank)0> >::with<Kernel::InodeWatcher::unregister_by_wd(int)::<lambda(auto:198&)> >(struct {...})
    (this=this@entry=0x2000bcf6f0 <initial_kmalloc_memory+6387440>, callback=...) at ././Kernel/Locking/SpinlockProtectedBase.h:66
#9  0x00000020000f9f18 in Kernel::InodeWatcher::unregister_by_wd (this=0x2000bcf440 <initial_kmalloc_memory+6386752>, wd=wd@entry=22) at ./Kernel/FileSystem/InodeWatcher.cpp:150
#10 0x00000020002063bc in Kernel::Process::sys$inode_watcher_remove_watch (this=<optimized out>, fd=9, wd=22) at ./Kernel/Syscalls/inode_watcher.cpp:64
#11 0x00000020001fd9ba in Kernel::Syscall::handle (regs=..., function=<optimized out>, function@entry=67, arg1=<optimized out>, arg1@entry=9, arg2=<optimized out>, arg2@entry=22, arg3=<optimized out>, arg3@entry=22, arg4=<optimized out>, arg4@entry=42)
    at ./Kernel/Syscalls/SyscallHandler.cpp:95
#12 0x00000020001fdf42 in Kernel::syscall_handler (trap=<optimized out>) at ./Kernel/Syscalls/SyscallHandler.cpp:158
#13 0x0000002000242194 in Kernel::trap_handler (trap_frame=...) at ./Kernel/Arch/riscv64/Interrupts.cpp:137
#14 0x0000002000248016 in --- TRAP --- asm_trap_handler (scause=0x8, stval=0x0, sstatus=0x8000000200006020) at ./Kernel/Arch/riscv64/trap_handler.S:153
#15 0x00000005b00dd052 in Kernel::Syscall::invoke<unsigned long, unsigned long> (function=<optimized out>, arg1=9, arg2=22) at ././Kernel/API/Syscall.h:648
#16 syscall2 (function=<optimized out>, arg0=9, arg1=22) at ./Userland/Libraries/LibSystem/syscall.cpp:24
#17 0x0000001694e7d6ea in syscall<Kernel::Syscall::Function, int, int> (function=Kernel::Syscall::SC_inode_watcher_remove_watch, arg0=<optimized out>, arg1=<optimized out>) at ./Userland/Libraries/LibC/fcntl.cpp:47
#18 inode_watcher_remove_watch (fd=<optimized out>, wd=<optimized out>) at ./Userland/Libraries/LibC/fcntl.cpp:47
#19 0x00000003904e36b0 in Core::FileWatcherBase::remove_watch (this=0x1206dd4040, path=Python Exception <class 'gdb.error'>: value has been optimized out
) at ./Userland/Libraries/LibCore/FileWatcherSerenity.cpp:144
#20 0x00000003904e383c in operator() (__closure=<optimized out>) at ./Userland/Libraries/LibCore/FileWatcherSerenity.cpp:206
#21 0x00000003904e385a in AK::Function<void()>::CallableWrapper<Core::FileWatcher::FileWatcher(int, AK::NonnullRefPtr<Core::Notifier>)::<lambda()> >::call(void) (this=<optimized out>) at ././AK/Function.h:198
#22 0x00000008580385ea in AK::Function<void()>::operator() (this=0x3eb175db8) at ././AK/Function.h:132
#23 0x000000085803aa88 in Core::Notifier::event (this=0x3eb175d40, event=<optimized out>) at ./Userland/Libraries/LibCore/Notifier.cpp:63
#24 0x0000000858037b60 in Core::EventReceiver::dispatch_event (this=this@entry=0x3eb175d40, e=..., stay_within=stay_within@entry=0x0) at ./Userland/Libraries/LibCore/EventReceiver.cpp:162
#25 0x00000008580393da in Core::ThreadEventQueue::process (this=0x16eb7e0790) at ./Userland/Libraries/LibCore/ThreadEventQueue.cpp:121
#26 0x0000000858034188 in Core::EventLoopImplementationUnix::pump (this=<optimized out>, mode=Core::EventLoopImplementation::PumpMode::WaitForEvents) at ./Userland/Libraries/LibCore/EventLoopImplementationUnix.cpp:324
#27 0x00000008580341ac in Core::EventLoopImplementationUnix::exec (this=0x1742fb13a0) at ./Userland/Libraries/LibCore/EventLoopImplementationUnix.cpp:316
#28 0x00000008580307c8 in Core::EventLoop::exec (this=0x16eb7e0770) at ./Userland/Libraries/LibCore/EventLoop.cpp:88
#29 0x000000013769eb02 in GUI::Application::exec (this=<optimized out>) at ./Userland/Libraries/LibGUI/Application.cpp:127
#30 0x0000000403395f12 in run_in_windowed_mode (initial_location=/home/anon, entry_focused_on_init="") at ./Userland/Applications/FileManager/main.cpp:1333
#31 0x00000004033964d6 in serenity_main (arguments=...) at ./Userland/Applications/FileManager/main.cpp:159
#32 0x00000004033a9580 in main (argc=1, argv=0x213c544e0) at ./Userland/Libraries/LibMain/Main.cpp:39
#33 0x000000040337288c in _entry (argc=1, argv=0x213c544e0) at ./Userland/Libraries/LibC/crt0.cpp:47
#34 0x0000000000000000 in ??? ()

But no other thread seems to hold that lock afaict. Weird.

@spholz (Member) commented May 30, 2025:

Rebuilding userland with debug info reveals that it's the temp file created by the test:

(gdb) f 20
#20 0x00000003904e383c in operator() (__closure=<optimized out>) at ./Userland/Libraries/LibCore/FileWatcherSerenity.cpp:206
206	               auto result = remove_watch(event.event_path);
(gdb) p event.event_path
$8 = /tmp/tmp.xs4e5u

Oh and it also happens on x86-64.

Labels: 👀 pr-needs-review (PR needs review from a maintainer or community member)

Successfully merging this pull request may close: "Kernel: Intermittent FS lock panic on shutdown with SMP on"

4 participants