-
Notifications
You must be signed in to change notification settings - Fork 12.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
excessive stack usage with kernel address sanitizer #38157
Comments
If I am reading this correctly for fn7 we do 100+x worse than gcc: 10048 for clang, not showing with limit 100 on gcc. I am also confused by difference between -fsantize=address and -fsanitize=kernel-address. Shouldn't the code be the same except for shadow base? |
Looking at the assembler output, the difference here appears to be that this code from the reduced test case void fn1(struct list_head *new) {
union {
typeof(&a) __c[1];
} b = {{new}};
*(volatile long *)b.__c;
}
void fn2(struct list_head *new) { fn1(new); }
void fn3(struct spi_transfer *t) { fn2(&t->transfer_list); } gets completely optimized away by gcc, but not by clang, which produces a complex stack check from any call to fn3(). This is what it looked like in the original code: static inline __attribute__((always_inline, unused)) __attribute__((no_instrument_function)) __attribute__((always_inline)) void __write_once_size(volatile void *p, void *res, int size)
{
switch (size) {
case 1: *(volatile __u8 *)p = *(__u8 *)res; break;
case 2: *(volatile __u16 *)p = *(__u16 *)res; break;
case 4: *(volatile __u32 *)p = *(__u32 *)res; break;
case 8: *(volatile __u64 *)p = *(__u64 *)res; break;
default:
__asm__ __volatile__("" : : : "memory");
__builtin_memcpy((void *)p, (const void *)res, size);
__asm__ __volatile__("" : : : "memory");
}
#define WRITE_ONCE(x, val) \
({ \
union { typeof(x) __val; char __c[1]; } __u = \
{ .__val = (__force typeof(x)) (val) }; \
__write_once_size(&(x), __u.__c, sizeof(x)); \
__u.__val; \
})
static inline __attribute__((always_inline, unused)) __attribute__((no_instrument_function)) bool __list_add_valid(struct list_head *new,
struct list_head *prev,
struct list_head *next)
{
return true;
}
static inline __attribute__((always_inline, unused)) __attribute__((no_instrument_function)) void __list_add(struct list_head *new,
struct list_head *prev,
struct list_head *next)
{
if (!__list_add_valid(new, prev, next))
return;
next->prev = new;
new->next = next;
new->prev = prev;
WRITE_ONCE(prev->next, new);
}
static inline __attribute__((always_inline, unused)) __attribute__((no_instrument_function)) void list_add_tail(struct list_head *new, struct list_head *head)
{
__list_add(new, head->prev, head);
}
static inline __attribute__((always_inline, unused)) __attribute__((no_instrument_function)) void
spi_message_add_tail(struct spi_transfer *t, struct spi_message *m)
{
list_add_tail(&t->transfer_list, &m->transfers);
} |
Kostya, can you help to find somebody to work on this? |
I've done a new (guided) reduction of the original source code now, which got me to this test case:
This uses 10944 bytes stack frame with clang -fsanitize=kernel-address, compare with 416 bytes with gcc, or 96 bytes using clang -fsanitize=address. |
So this is the old problem of stack slot reuse not working with ASan. Simply disabling use-after-scope detection would not help, this is a fundamental part of asan implementation in LLVM. |
But why does it only affect -fsanitize=kernel-address and not -fsanitize=address then? Disabling "asan-stack" entirely seems to work, but may be a bit heavy-handed. Are there other sub-options of asan-stack that can be left enabled while we turn off use-after-scope and other options? For gcc, the kernel already turns off use-after-scope detection to work around the problem, but the normal asan-stack stays enabled. |
Oh, sorry I misunderstood the question. |
[preprocessed linuxhttps://user-images.githubusercontent.com/92601431/143758042-76d92e65-601e-4ca0-8fce-b0a8087b6c97.gz) clang-8 -Wno-gnu -Wno-unused-argument -Wno-unused-value -Wno-sign-compare -Wframe-larger-than=1000 -O2 -fsanitize=kernel-address -mllvm -asan-stack=0 -c go7007-fw.i --target=aarch64-linux -Wno-unused-value -Wno-pointer-sign Marking one function (do_special()) as attribute((no_inline)) avoids the issue in this file. |
I am not confident we can fix this in clang/llvm. |
The problem with ltv350qv_write_reg could be fixed by tuning inlining heuristics. Right now they assume that inlining an alloca is basically free, while with asan it has a cost that is proportional to its size (for poisoning and unpoisoning, both in code size and run time). This does not address stack frame size directly, but it could be a decent proxy. |
I don't see an alloca() in the function, in fact we don't allow alloca() to be used in the kernel at all (we build with -Wvla).
I tried using attribute((always_inline)) but do not see the gcc bug you are referring to: We had a related bug in older versions of gcc that would cause stack slots to not be reused when inlining certain functions, but I added workarounds for all instances in the kernel that hit that particular gcc bug, and newer gcc versions all contain that trivial fix (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c18) |
Here is another example (without inline functions) that I reduced from linux/drivers/crypto/ccp/ccp-ops.c: struct ccp_op {
int jobid[50];
};
struct ccp_op fn1(void), a, b, c;
void fn2(struct ccp_op);
void fn3(int d) {
fn2(a);
fn1();
fn2(b);
fn2(b);
fn1();
fn2(a);
fn1();
fn2(b);
fn2(b);
fn1();
fn1();
} $ clang-8 -std=gnu89 -Wframe-larger-than=2048 --target=aarch64-linux -Wall -fsanitize=kernel-address -O2 -c ccp-ops.i |
Another idea: when use-after-scope is disabled, we could merge allocas with the same size and non-overlapping lifetimes before creating the mega-alloca. This would make asan reports a bit more ambiguous - stack frame dumps will need to mention multiple possibilities for certain offset ranges. This will take a non-trivial amount of work. Re: #11, "alloca" stands for any stack allocation. There are static allocas (regular variables) and dynamic allocas (actual alloca() calls). |
I would hope something simpler could be done, where the reporting does not have to be made more complicated. Trying to come up with a relevant example that is more like the kernel code I see require much more stack on clang vs gcc, take this program: // https://godbolt.org/z/Ws2qLZ
struct s { int count; int a[500]; };
int __attribute__((noinline)) may_overflow(struct s *s)
{
return s->a[s->count];
}
static inline __attribute__((always_inline)) int largestack(int i)
{
struct s s;
s.count = i;
return may_overflow(&s);
}
int main(void)
{
return largestack(2) +
largestack(500) +
largestack(-2) +
largestack(0);
} Building this with
Doing the same thing with
But since the four objects are all conceptually the same variable in the inline function, and I still have the line information, it doesn't seem to help much that there are now four objects listed. Is this something that can perhaps be changed more easily? Note1: the -Wframe-larger-than= warning message seems wrong for "clang --sanitize=address" (it reports 88 bytes instead of 8KB), but it looks correct with either --sanitize=kernel-address or without asan. Note2: gcc also merges distinct objects of the same size, and reports a random one in the output if they are of the same type. I assume you'd consider that a bug in gcc, but I also think that it's better than using more stack space, at least in the context of the Linux kernel, where the available stack space is very constrained. |
Would it be worthwhile for us to limit inlining when building with LLVM? I see a few options in llvm/lib/Analysis/InlineCost.cpp. Perhaps by playing with If we find that that is generally helpful for this very painful and constant source of bugreports, would it be possible for LLVM to simply update that value when using -fsanitize=address? |
Based on #4 and #5 the problem is with stack slot reuse. |
If it is the warning that's a problem, not the actual stack usage, we could use the dynamic alloca approach in the kernel-address mode, too. Tweaking the inlining model does not sound very scary to me, and it might help. Worth a try. |
As discussed in a video meeting yesterday, I have done a few builds with my 'randconfig' kernel tree that has bugfixes for known problems, to isolate files that exceed the warning limit with clang but not gcc when using KASAN_STACK, nor when using clang using KASAN but no KASAN_STACK. For x86_64 and arm64 allmodconfig builds, I got a relatively compact list of affected files that should be easily reproducible using clang-13 on linux-5.15-rc2 or newer with the allmodconfig builds. All warnings that are not in this list have a pending kernel fix or workaround. Listing only one warning per file for the worst cases, anything with more than 3KB compared to less than 2KB for the other builds: crypto/ecc.c:1276:13: error: stack frame size (3456) exceeds limit (2048) in function 'ecc_point_mult' [-Werror,-Wframe-drivers/gpu/drm/i915/gt/intel_workarounds.c:1608:1: error: stack frame size (3512) exceeds limit (2048) in function 'rcs_engine_wa_init' [-Werror,-Wframe-larger-than] For 32-bit arm, the list was much longer even when applying the same 3KB limit. Looking only at files that are not already in the list above, the worst ones I see are drivers/media/dvb-frontends/drxd_hard.c:2857:12: error: stack frame size (7584) exceeds limit (1400) in function 'drxd_set_frontend' [-Werror,-Wframe-larger-than] I also happened to run some more arm32 randconfig builds, which showed additional issues. I can provide .config files for these if anyone wants to take a look, the worst ones here are: drivers/base/test/property-entry-test.c:117:13: error: stack frame size 5600 exceeds limit 1400 in function 'pe_test_uint_arrays' [-Werror,-Wframe-larger-than] |
To broadly classify the ones I've listed in comment #18:
int cx25840_write(struct i2c_client *client, u16 addr, u8 value)
{
u8 buffer[3];
buffer[0] = addr >> 8;
buffer[1] = addr & 0xff;
buffer[2] = value;
return i2c_master_send(client, buffer, 3);
} I think this is a case of adding way much more padding around each 3-byte array than gcc does, so this could be addressed for reducing the amount of padding for small stack variables, or it could be addressed by not inlining cx25840_write(). |
mentioned in issue #4440 |
Cross referencing downstream issue: ClangBuiltLinux/linux#39 |
The difference in #38157 (comment) between |
Oh, it seems neither compiler warns for conditional alloca (with EDIT: seems GCC has |
@arndb also mentioned |
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: #38157 Link: #41896 Link: #43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
I reapplied https://reviews.llvm.org/rGe698695fbbf62e6676f8907665187f2d2c4d814b to clang-18 which should help slightly (specifically, the cases where structs were passed by value). |
reverted (But I don't plan to reopen all of the bugs with the same root cause): #43598 (comment) |
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: llvm#38157 Link: llvm#41896 Link: llvm#43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: llvm#38157 Link: llvm#41896 Link: llvm#43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: llvm#38157 Link: llvm#41896 Link: llvm#43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: llvm#38157 Link: llvm#41896 Link: llvm#43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: llvm#38157 Link: llvm#41896 Link: llvm#43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: llvm#38157 Link: llvm#41896 Link: llvm#43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
…argument allocas This reverts commit e26c24b. These temporaries are only used in the callee, and their memory can be reused after the call is complete. rdar://58552124 Link: llvm#38157 Link: llvm#41896 Link: llvm#43598 Link: ClangBuiltLinux/linux#39 Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 Reviewed By: rjmccall Differential Revision: https://reviews.llvm.org/D74094
Extended Description
Building the Linux kernel with clang KASAN enabled shows many warnings about possible stack overflow (we limit the frame size per function to 1024 to 2048 byte, depending on configuration, because the per-thread stack is very limited).
I created a reduced test case from one of the scarier warnings:
I tested this using clang-8.0.0-svn341106-1
exp1+020180830200353.1747~1.gbp19b9f6 on Ubuntu, but an old clang-3.9 shows the same behavior.With gcc, the same function is fine with "asan-use-after-scope" disabled:
but turning on asan-use-after-scope makes gcc as bad as clang, which is expected from the source code (the kernel turns it off by default for this reason):
Using -fsantize=address in place of -fsanitize=kernel-address completely avoids the high stack usage with clang, even with asan-use-after-scope enabled:
The text was updated successfully, but these errors were encountered: