
Kernel: Start reducing the amount of RecursiveSpinlocks #25931

Open

implicitfield wants to merge 17 commits into master from spinlocks

Conversation

@implicitfield (Contributor) commented May 17, 2025

The Linux kernel doesn't have any recursive spinlocks at all (well, at least not kernel-wide; I don't know about random subsystems), so we certainly don't need them either. Primarily, they lead to rather sloppy code (see all the refactoring in this PR, most of which could have been avoided if we hadn't used them to begin with) and are very SMP-unfriendly (a crash here is both commonplace and nigh-impossible to debug).

Historically, the proliferation of RecursiveSpinlocks has also been driven by the fact that SpinlockProtected (since its introduction in 019ad8a, part of #8851) used a RecursiveSpinlock under the hood. That was only recently corrected (in a264a83), but it naturally left a ton of instances of RecursiveSpinlockProtected all over the place.

This PR is based on a branch where I removed RecursiveSpinlock entirely, but that didn't really pan out for SMP systems either: once you put a proper Spinlock on the scheduler, only a single thread can run at a time. In the long term, it might prove favorable to rework the scheduler in an xv6-style way, where each CPU has its own scheduler instance (at least that would make the locking less of a nightmare).

In conclusion, this PR shouldn't have much of an impact on the behavior of the system, but it should be a good step towards getting rid of RecursiveSpinlocks entirely. This probably fixes #22585, because nothing in Kernel/FileSystem uses RecursiveSpinlocks anymore, but otherwise it shouldn't affect the stability of SMP systems one way or the other.
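(For illustration, a minimal sketch of the failure mode being traded off here, using the kernel's Spinlock/SpinlockLocker names; the functions themselves are hypothetical:)

```cpp
static Spinlock<LockRank::None> s_lock {};

static void inner()
{
    // With a plain Spinlock this dead-spins if the caller already holds s_lock;
    // a RecursiveSpinlock would silently permit it, hiding the layering violation.
    SpinlockLocker locker(s_lock);
}

static void outer()
{
    SpinlockLocker locker(s_lock);
    inner(); // "works" with a RecursiveSpinlock, deadlocks with a Spinlock
}
```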

@github-actions bot added the 👀 pr-needs-review label (PR needs review from a maintainer or community member) on May 17, 2025
@Hendiadyoin1 (Contributor) left a comment

Just one thing that stuck out to me
(Also I should revive #20619, as that plays into this)

@Hendiadyoin1 (Contributor) left a comment

Two more things on the second read-through; otherwise it looks fine to me.

@spholz (Member) left a comment

Here is my first review. I still need to go through all the "Prefer non-recursive Spinlocks" commits to check whether it's safe to use non-recursive spinlocks in all those cases.

```diff
-static RecursiveSpinlock<LockRank::None> s_lock {}; // needs to be recursive because of dump_backtrace()
+static Spinlock<LockRank::None> s_lock {};

 static void* kmalloc_impl(size_t size, size_t alignment, CallerWillInitializeMemory caller_will_initialize_memory, bool acquire_lock);
```
Member commented:

Won't this break if we allocate memory during a panic in kmalloc.cpp?
The call to critical_dmesgln could theoretically allocate memory. But usually the inline capacity of the StringBuilder should be enough.

Ideally, of course, we should get rid of all memory allocations during panics.

Contributor commented:

In theory, we could set a global variable during panic to disable the lock, since we're going down anyway, but that's likely a thing for another day.

@implicitfield (Author) replied:

I guess force-disabling the lock during a panic would indeed work, but ultimately we should avoid allocating at all if we're already panicking.
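(A rough sketch of that force-disable idea, assuming a hypothetical g_in_panic flag, an assumed CallerWillInitializeMemory::No enum value, and the kmalloc_impl signature from the diff above; this is not existing kernel API:)

```cpp
// Hypothetical: set from the panic handler so later allocations skip the lock.
static Atomic<bool> g_in_panic { false };

void* kmalloc(size_t size)
{
    // The panicking CPU may already hold s_lock; taking it again would deadlock.
    if (g_in_panic.load())
        return kmalloc_impl(size, 16 /* assumed default alignment */, CallerWillInitializeMemory::No, /* acquire_lock */ false);

    SpinlockLocker locker(s_lock);
    return kmalloc_impl(size, 16 /* assumed default alignment */, CallerWillInitializeMemory::No, /* acquire_lock */ false);
}
```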

```diff
@@ -44,7 +44,7 @@ UNMAP_AFTER_INIT void Device::initialize_base_devices()
     s_all_details->base_devices = move(base_devices);
 }

-RefPtr<Device> Device::acquire_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor)
+void Device::run_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor, Function<void(RefPtr<Device>)> callback)
```
Member commented:

I don't really like this name, but I can't think of a good one either.
Maybe something like `for_device_by_type_and_major_minor_numbers`?

@implicitfield (Author) replied:

Hmm, it's not really a for_ either, though: the callback should only ever be called once, since the underlying map can contain just one instance of a given major/minor number pair.
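(For reference, roughly how a call site of the new helper reads; the call-site details and use_device() are illustrative:)

```cpp
Device::run_by_type_and_major_minor_numbers(DeviceNodeType::Character, major, minor,
    [&](RefPtr<Device> device) {
        if (!device)
            return; // presumably called with a null RefPtr when nothing matches
        // The device list lock is held exactly once here, so nothing re-acquires it.
        use_device(*device); // hypothetical consumer
    });
```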

@implicitfield force-pushed the spinlocks branch 3 times, most recently from 46ee891 to 7477d68 on May 19, 2025 at 17:24
@BuggieBot (Member) commented:

Hello!

One or more of the commit messages in this PR do not match the SerenityOS code submission policy; please check the lint_commits CI job for details on which commits were flagged and why.
Please do not close this PR and open another; instead, modify your commit message(s) with `git commit --amend` and force-push those changes to update this PR.

@implicitfield force-pushed the spinlocks branch 2 times, most recently from 8f97cb7 to 5be193a on May 19, 2025 at 18:51
@implicitfield (Author) commented:

That should fix the deadlock that's been haunting CI... (it was just an instance of calling `kfree_sized` while the big kmalloc lock was already held).

This eliminates the possibility of panicking due to encountering a page
fault while trying to access the cache inside an interrupt handler.

This makes it possible to (just barely) keep up when enabling kmalloc
stack dumping (sysctl -w kmalloc_stacks=1) while running in a
hardware-accelerated aarch64 VM.

Previously, list nodes would keep the loop device itself alive until the
node was manually removed from the list. This patch breaks that
reference cycle and allows normal reference counting to take care of
the lifetimes of loop devices (although we do have to leak a reference
on creation, because nothing truly owns a loop device).

Previously, you'd occasionally see this sort of construct (simplified):
```
return FooDevice::all_instances().with([&](auto& list) {
    for (auto& device : list) {
        if (/* search condition */)
            continue;
        return do_something_that_reacquires(device.index());
    }
    return ENOENT;
});
```

This would only work if the relevant `all_instances()` used a recursive
spinlock, since reacquiring the device would lock that again.

To get around this, a new helper has been added in `Device` through
which devices can be accessed whilst locking the device list, removing
the need to loop through the device type's `all_instances()`.

This assumption doesn't hold without the current Custody caching
mechanism.

This is nigh-impossible to reconcile with non-recursive Spinlocks.
@implicitfield (Author) commented:

Just turned `MasterPTY`'s `m_slave` member back into a `RecursiveSpinlockProtected`, since that seems to get locked twice if TTY echo is enabled.

@spholz (Member) left a comment

Second round; I still need to look at the last 4 commits.

```diff
@@ -15,12 +15,14 @@ constexpr static AK::Duration refresh_interval = AK::Duration::from_milliseconds
 NonnullLockRefPtr<Console> Console::initialize(VirtIODisplayConnector& parent_display_connector)
 {
     auto current_resolution = parent_display_connector.current_mode_setting();
-    return adopt_lock_ref(*new Console(parent_display_connector, current_resolution));
+    auto refresh_timer = adopt_nonnull_ref_or_enomem(new (nothrow) Timer()).release_value_but_fixme_should_propagate_errors();
+    return adopt_lock_ref(*new Console(parent_display_connector, current_resolution, *refresh_timer));
```
Member commented:

nit: unnecessary `*`

```cpp
    PRINT_LINE(message.characters_without_null_termination(), message.length());
} else {
    constexpr auto buffer_size = 255;
    constexpr auto maximum_printable_name_length = buffer_size - (sizeof("Kernel + 0x +\n") + sizeof(FlatPtr) * 4);
```
Member commented:

sizeof() includes the null terminator, which is probably not what you want.
And you also forgot to include the second 0x in the calculation.
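(A hedged sketch of the corrected expression both points imply; the exact format string is an assumption here:)

```cpp
// sizeof("...") - 1 excludes the null terminator; the literal now carries both
// "0x" prefixes, and sizeof(FlatPtr) * 4 covers the two hex-printed values.
constexpr auto maximum_printable_name_length
    = buffer_size - ((sizeof("Kernel + 0x + 0x\n") - 1) + sizeof(FlatPtr) * 4);
```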

```diff
-static RecursiveSpinlock<LockRank::None> s_lock {}; // needs to be recursive because of dump_backtrace()
+static Spinlock<LockRank::None> s_lock {};

 static void* kmalloc_impl(size_t size, size_t alignment, CallerWillInitializeMemory caller_will_initialize_memory, CallerHasAcquiredLock acquire_lock);
```
Member commented:

The parameter should probably be called `caller_has_acquired_lock` now.

```diff
@@ -44,7 +44,7 @@ UNMAP_AFTER_INIT void Device::initialize_base_devices()
     s_all_details->base_devices = move(base_devices);
 }

-RefPtr<Device> Device::acquire_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor)
+void Device::run_by_type_and_major_minor_numbers(DeviceNodeType type, MajorNumber major, MinorNumber minor, Function<void(RefPtr<Device>)> callback)
```
Member commented:

A `Function` isn't cheap to copy, so please use a const reference here.

Contributor commented:

An rvalue reference should also work, I think.
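(In sketch form, the two suggested signatures:)

```cpp
// Const reference: the caller keeps ownership of the callback.
void run_by_type_and_major_minor_numbers(DeviceNodeType, MajorNumber, MinorNumber,
    Function<void(RefPtr<Device>)> const& callback);

// Rvalue reference: the callee takes ownership; callers pass a temporary or move().
void run_by_type_and_major_minor_numbers(DeviceNodeType, MajorNumber, MinorNumber,
    Function<void(RefPtr<Device>)>&& callback);
```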

```diff
@@ -38,11 +38,18 @@ VMObject::~VMObject()
     VERIFY(m_regions.is_empty());
 }

+void VMObject::remap_regions_locked()
```
Member commented:

Shouldn't `remap_regions_locked()` also be used in `InodeVMObject::try_release_clean_pages()` and `AnonymousVMObject::purge()`?

```diff
@@ -87,7 +87,7 @@ class SharedFramebufferVMObject final : public VMObject {
     LockRefPtr<FakeWritesFramebufferVMObject> m_fake_writes_framebuffer_vmobject;
     LockRefPtr<RealWritesFramebufferVMObject> m_real_writes_framebuffer_vmobject;
     bool m_writes_are_faked { false };
-    mutable RecursiveSpinlock<LockRank::None> m_writes_state_lock {};
+    mutable Spinlock<LockRank::None> m_writes_state_lock {};
```
Member commented:

This will break when switching TTYs:

```
SharedFramebufferVMObject::switch_to_fake_sink_framebuffer_writes() locks this lock
-> VMObject::remap_regions()
-> VMObject::remap_regions_locked()
-> Region::remap_with_locked_vmobject()
-> Region::remap_impl()
-> Region::map(PageDirectory& page_directory, ShouldLockVMObject should_lock_vmobject, ShouldFlushTLB should_flush_tlb)
-> Region::map_individual_page_impl(size_t page_index, ShouldLockVMObject should_lock_vmobject)
-> Region::physical_page_locked(size_t index) const
-> SharedFramebufferVMObject::physical_pages() const, which also locks this lock
```

Comment on lines +206 to +210
```cpp
bool Region::should_dirty_on_write_impl(size_t page_index) const
{
    VERIFY(vmobject().m_lock.is_locked());
    return !static_cast<InodeVMObject const&>(vmobject()).is_page_dirty(first_page_index() + page_index);
}
```
Member commented:

Please make this a lambda in `should_dirty_on_write` instead.
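(A sketch of the requested shape, with the locking branch assumed from the surrounding ShouldLockVMObject changes:)

```cpp
bool Region::should_dirty_on_write(size_t page_index, ShouldLockVMObject should_lock_vmobject) const
{
    auto page_is_clean = [&]() {
        VERIFY(vmobject().m_lock.is_locked());
        return !static_cast<InodeVMObject const&>(vmobject()).is_page_dirty(first_page_index() + page_index);
    };
    if (should_lock_vmobject == ShouldLockVMObject::Yes) {
        SpinlockLocker locker(vmobject().m_lock);
        return page_is_clean();
    }
    return page_is_clean();
}
```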

```diff
@@ -32,6 +32,11 @@ enum class ShouldFlushTLB {
     Yes,
 };

+enum class ShouldLockVMObject {
```
Member commented:

No user outside of `Region` ever uses `ShouldLockVMObject::No`, so I would instead keep the non-locking versions of these functions private, add `_with_locked_vmobject` versions, and keep this enum private.
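(In sketch form, with member names assumed from the comment:)

```cpp
class Region final {
public:
    void remap();                      // takes the VMObject lock itself
    void remap_with_locked_vmobject(); // caller already holds the VMObject lock

private:
    enum class ShouldLockVMObject { No, Yes }; // stays an implementation detail
    void remap_impl(ShouldLockVMObject);
};
```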

```diff
@@ -179,7 +185,7 @@ class Region final

     [[nodiscard]] size_t cow_pages() const;

-    [[nodiscard]] bool should_dirty_on_write(size_t page_index) const;
+    [[nodiscard]] bool should_dirty_on_write(size_t page_index, ShouldLockVMObject should_lock_vmobject) const;
```
Member commented:

Not really related to your changes, but this function seems like it should be private.

```diff
     void remap_regions();
     bool remap_regions_one_page(size_t page_index, NonnullRefPtr<PhysicalRAMPage> page);

     IntrusiveListNode<VMObject> m_list_node;
     FixedArray<RefPtr<PhysicalRAMPage>> m_physical_pages;

-    mutable RecursiveSpinlock<LockRank::None> m_lock {};
+    mutable Spinlock<LockRank::None> m_lock {};
```
Member commented:

This lock is used in so many places, in quite complicated code, that I can't really check whether this is a safe change.
I just hope it doesn't introduce some new deadlock in a weird edge case. In my testing your PR has worked fine so far, and I hope you tested this as well.

It would be nice if there were some automated way to verify with static analysis that no spinlock can be locked twice. That can't cover page faults, but I hope we don't cause page faults while this lock is already taken.
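(Short of full static analysis, a runtime check could at least catch same-CPU re-entry; a hedged sketch, where m_owner is an assumed extra field rather than the existing Spinlock layout:)

```cpp
void Spinlock::lock()
{
    u32 const current = Processor::current_id();
    // Fail loudly on same-CPU re-entry instead of silently dead-spinning.
    VERIFY(m_owner.load() != current);
    while (m_lock.exchange(1, AK::memory_order_acquire) != 0)
        Processor::wait_check();
    m_owner.store(current); // unlock() would reset this to an invalid CPU ID
}
```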

Member commented:

This lock was made recursive in 082ed6f, but the commit message doesn't give any reason why.

Labels: 👀 pr-needs-review (PR needs review from a maintainer or community member)

Successfully merging this pull request may close these issues.

Kernel: Intermittent FS lock panic on shutdown with SMP on
4 participants