mm, slab: Fix infinite loop at _slub_get_freelist() #437

Closed
@leitao leitao commented Oct 10, 2024

In some cases, _slub_get_freelist() loops forever when ptr dereferences to itself.

This makes calls like the following hang forever (I hit this with a vmcore):

identify_address(prog, 18446613188003018408)

If I break out of the loop when the pointer is already in the freelist set, drgn gets unstuck:

identify_address(prog, 18446613188003018408)
'slab object: sock_inode_cache+0x2a8'
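The mitigation described above can be illustrated with a minimal sketch. This is not the actual drgn patch: `walk_freelist`, `read_next`, and the toy `memory` dict are hypothetical stand-ins for reading the next-free pointer out of a vmcore. It only shows how a "seen" set turns a self-referencing pointer into a clean loop exit.

```python
def walk_freelist(read_next, head):
    """Collect free-object addresses, stopping as soon as a pointer repeats.

    read_next is a hypothetical callable that returns the next-free pointer
    stored in the object at the given address (0 terminates the list).
    """
    seen = set()
    ptr = head
    while ptr:
        if ptr in seen:
            # Cycle (e.g. an object pointing at itself): stop instead of spinning.
            break
        seen.add(ptr)
        ptr = read_next(ptr)
    return seen


# Toy "memory": address -> next-free pointer. 0x1080 points at itself.
memory = {0x1000: 0x1040, 0x1040: 0x1080, 0x1080: 0x1080}
freelist = walk_freelist(memory.__getitem__, 0x1000)

# A well-formed list terminated by 0 is walked to the end as before.
healthy = walk_freelist({0x2000: 0x2040, 0x2040: 0}.__getitem__, 0x2000)
```

Without the membership check, the first walk would never terminate, which matches the hang seen with `identify_address()`.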

Co-developed-with: Leandro Silva [email protected]

Signed-off-by: Breno Leitao <[email protected]>
@osandov
Owner

osandov commented Oct 10, 2024

I discussed this with Breno offline. We're going to use this fix internally as a quick mitigation for our automated crash dump analysis system, but I'm going to look into reporting this as a corrupted free list in some way as the proper fix.

@brenns10
Contributor

We definitely encountered this on some core dumps as well, and did our own sort of workaround at oracle-samples/drgn-tools#110. I've had it on my agenda to work on a proper fix that allows us to report corrupted freelist pointers, and circular freelists, and continue operating, so that we can format that information for later display.

@osandov
Owner

osandov commented Oct 10, 2024

Thanks, @brenns10, that looks pretty similar to what I had in mind. I'll take a stab at it today or tomorrow.

@osandov
Owner

osandov commented Nov 22, 2024

Just got back to this, and I'm weighing the options for handling corrupted freelists (freelists with either a cycle or a pointer that can't be dereferenced). We want to allow recovering as much data as possible without returning misleading information. The problem is that when the freelist is corrupted, we can't know for sure what's free and what's allocated.

slab_cache_for_each_allocated_object() is probably hopeless in this case; it'll just have to fail hard with an informative error message.

slab_object_info() can partially work: we can still reliably return slab_cache, slab, and address. allocated is trickier. If the object is on the portion of the freelist that we were able to access, then we can be reasonably sure that it's actually free. But if it's not, then we don't know whether it's allocated or free. So I'm thinking of turning allocated into a tri-state variable: True is "allocated", False is "free", None is "corrupted so we don't know". It's not the prettiest interface, but it's reasonably backwards-compatible. identify_address() would of course also be updated to display this state.
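A rough sketch of the tri-state idea, for illustration only. The names `walk_freelist_checked` and `allocated_state` are hypothetical and are not drgn APIs; a `KeyError` from the toy `memory` dict stands in for a pointer that can't be dereferenced.

```python
from typing import Callable, Optional, Set, Tuple


def walk_freelist_checked(
    read_next: Callable[[int], int], head: int
) -> Tuple[Set[int], bool]:
    """Return (addresses reached on the freelist, whether corruption was hit)."""
    free: Set[int] = set()
    ptr = head
    while ptr:
        if ptr in free:
            return free, True  # cycle
        try:
            nxt = read_next(ptr)
        except KeyError:
            return free, True  # pointer that can't be dereferenced (toy stand-in)
        free.add(ptr)
        ptr = nxt
    return free, False


def allocated_state(addr: int, free: Set[int], corrupted: bool) -> Optional[bool]:
    """True = allocated, False = free, None = corrupted so we don't know."""
    if addr in free:
        return False  # found on the readable part of the freelist
    if corrupted:
        return None  # might be past the corruption point
    return True


# Toy memory: 0x1040 points at an unreadable address.
memory = {0x1000: 0x1040, 0x1040: 0xDEAD}
free, corrupted = walk_freelist_checked(memory.__getitem__, 0x1000)
on_list = allocated_state(0x1000, free, corrupted)
off_list = allocated_state(0x1080, free, corrupted)
```

Callers that only test truthiness (`if info.allocated:`) would treat `None` like `False`, so the sketch suggests comparing with `is True`, `is False`, and `is None` explicitly.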

@brenns10, @leitao, any thoughts?

@brenns10
Contributor

I agree that slab_cache_for_each_allocated_object() is more or less a bust. If the CPU freelists have any corruption, then you can never be certain about whether an individual object is actually allocated.

For slab_object_info(), I think that makes a good amount of sense. If we detect CPU freelist corruption, then I think we could still return a False for free, because if we find the object on a freelist before the corruption, we can be reasonably confident that the object is free. But if it's not on the (corrupted) freelist then None would be the only option to return.

The last thing that seems useful, in the case of slab corruption, is some sort of API to give information about the slab corruption that we have identified. Right now drgn-tools has an API for giving summary info on an entire slab cache, and it includes reports related to which CPUs have which kinds of corruption. I think this could be pretty useful in general. I guess what I'm saying is that I owe a pull request to implement some of that. If you wanted, I could put these fixes in there as well.
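To make the shape of such a report concrete, here is a hypothetical sketch. It does not reflect the drgn-tools API; `Corruption`, `classify_freelist`, and `summarize` are invented names, and the toy `memory` dict stands in for per-CPU freelists read from a dump.

```python
from collections import Counter
from enum import Enum


class Corruption(Enum):
    NONE = "none"
    CYCLE = "cycle"
    BAD_POINTER = "bad pointer"


def classify_freelist(read_next, head):
    """Walk one freelist and report which kind of corruption, if any, was hit."""
    seen = set()
    ptr = head
    while ptr:
        if ptr in seen:
            return Corruption.CYCLE
        try:
            nxt = read_next(ptr)
        except KeyError:  # toy stand-in for an unreadable address
            return Corruption.BAD_POINTER
        seen.add(ptr)
        ptr = nxt
    return Corruption.NONE


def summarize(read_next, cpu_heads):
    """Map each CPU to its freelist status and count how often each status occurs."""
    per_cpu = {cpu: classify_freelist(read_next, head) for cpu, head in cpu_heads.items()}
    return per_cpu, Counter(per_cpu.values())


# Toy data: CPU 0 is healthy, CPU 1 has a self-loop, CPU 2 has a wild pointer.
memory = {0x1000: 0, 0x2000: 0x2000, 0x3000: 0xDEAD}
per_cpu, counts = summarize(memory.__getitem__, {0: 0x1000, 1: 0x2000, 2: 0x3000})
```

Keeping the classification separate from the walk that collects free objects lets a caller keep going after corruption and format the findings later.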

@osandov osandov closed this in 120428f Dec 16, 2024