# -Wframe-larger-than= in drivers/gpu/drm/amd/display/dc/calcs/dce_calcs.c #1455
This was worked around in 6f6cb17, moving the description from #1918 to track this here; 6f6cb17 mentions 5k+ stack usage. After clang-18's https://reviews.llvm.org/rGe698695fbbf62e6676f8907665187f2d2c4d814b, I only see: […]

So there's a slight improvement in clang-18, but more is needed to get below the -Wframe-larger-than= limit.

Tangentially related to #1752. See also: llvm/llvm-project#41896
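As background, here's a minimal stand-alone sketch of the warning class involved (my own example, not from the thread; the struct name and the 1024-byte limit are illustrative — the kernel derives its limit from CONFIG_FRAME_WARN):

```c
/* Hypothetical reproducer: clang -c -Wframe-larger-than=1024 frame.c */
struct bw_fixed { long long value; };

int big_frame(void)
{
	/* 256 * 8 = 2048 bytes of locals; volatile keeps the array live,
	 * so the frame exceeds the 1024-byte limit and clang warns. */
	volatile struct bw_fixed scratch[256];

	scratch[0].value = 1;
	return (int)scratch[0].value;
}
```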
---

FWICT, it looks like SROA is failing to decompose the many little struct bw_fixed temporaries. So if I build this driver for 32b x86 […]. For example, on x86_64: […] but 32b arm: […]
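The C pattern behind this is a one-member 8-byte struct returned by value from small helpers and stored into arrays inside struct bw_calcs_data. A sketch reconstructed from the IR below (field layout and the 24-bit shift are simplifications, not verbatim kernel code):

```c
struct bw_fixed { long long value; };

struct bw_calcs_data {
	/* ... many other bw_fixed fields and arrays ... */
	struct bw_fixed compression_rate[12];
};

static struct bw_fixed bw_int_to_fixed(long long value)
{
	struct bw_fixed res = { value << 24 };

	return res;
}

void example(struct bw_calcs_data *data)
{
	/* Each assignment like this materializes a struct bw_fixed
	 * temporary; whether it survives onto the stack depends on how
	 * the target ABI returns the 8-byte aggregate. */
	data->compression_rate[1] = bw_int_to_fixed(1);
}
```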
By the time we get to the initial SROA pass on calculate_bandwidth though, on x86_64 the temporary aggregates have no users (I think SROA just deletes them as dead). But on arm32 we have a bunch of memcpys:

```llvm
call void @llvm.lifetime.start.p0(i64 8, ptr %tmp235) #17
call void @bw_int_to_fixed(ptr sret(%struct.bw_fixed) align 8 %tmp235, i64 noundef 1)
call void @llvm.memcpy.p0.p0.i32(ptr align 8 %arrayidx234, ptr align 8 %tmp235, i32 8, i1 false)
call void @llvm.lifetime.end.p0(i64 8, ptr %tmp235) #17
```

Or, it seems that for x86_64 SROA is able to boil away the above somehow. x86_64 version:

```llvm
call void @llvm.lifetime.start.p0(i64 8, ptr %tmp236) #17
%call237 = call i64 @bw_int_to_fixed(i64 noundef 1) #19
%coerce.dive238 = getelementptr inbounds %struct.bw_fixed, ptr %tmp236, i32 0, i32 0
store i64 %call237, ptr %coerce.dive238, align 8
call void @llvm.memcpy.p0.p0.i64(ptr align 8 %arrayidx235, ptr align 8 %tmp236, i64 8, i1 false)
call void @llvm.lifetime.end.p0(i64 8, ptr %tmp236) #17
```

Oh, is clang unwrapping the struct for x86_64 but not arm32? Yeah: https://godbolt.org/z/hzxbsKbc3 — the struct gets unwrapped by clang's codegen to LLVM IR for x86, x86_64, and aarch64, but not arm32...
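A stand-alone sketch of that godbolt experiment (mine; the make_fixed helper is hypothetical): compile at -O2 for --target=armv7-linux-gnueabi and --target=x86_64-linux-gnu and compare the emitted IR.

```c
struct bw_fixed { long long value; };

/* An 8-byte aggregate returned by value: on arm32 clang lowers this to
 * an sret hidden-pointer call, while for x86, x86_64, and aarch64 it
 * coerces the struct to a plain i64 return value. */
struct bw_fixed make_fixed(long long v)
{
	struct bw_fixed res = { v };
	return res;
}
```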
---

From what I've tracked down so far, there seems to be a difference between how clang lowers the struct return on arm32 versus the other targets. I think […]
---

Looking at this further today, I think it's actually down to how each ABI returns the 8-byte struct.

x86 (before SROA):

```llvm
%compression_rate234 = getelementptr inbounds %struct.bw_calcs_data, ptr %162, i32 0, i32 162
%arrayidx235 = getelementptr [12 x %struct.bw_fixed], ptr %compression_rate234, i32 0, i32 1
call void @llvm.lifetime.start.p0(i64 8, ptr %tmp236) #17
%call237 = call i64 @bw_int_to_fixed(i64 inreg noundef 1) #19
%coerce.dive238 = getelementptr inbounds %struct.bw_fixed, ptr %tmp236, i32 0, i32 0
store i64 %call237, ptr %coerce.dive238, align 4
call void @llvm.memcpy.p0.p0.i32(ptr align 4 %arrayidx235, ptr align 4 %tmp236, i32 8, i1 false)
call void @llvm.lifetime.end.p0(i64 8, ptr %tmp236) #17
```

x86 (after SROA):

```llvm
%compression_rate234 = getelementptr inbounds %struct.bw_calcs_data, ptr %data, i32 0, i32 162
%arrayidx235 = getelementptr [12 x %struct.bw_fixed], ptr %compression_rate234, i32 0, i32 1
%call237 = call i64 @bw_int_to_fixed(i64 inreg noundef 1) #18
store i64 %call237, ptr %arrayidx235, align 4
```

arm (before SROA; after SROA it's the same):

```llvm
%compression_rate233 = getelementptr inbounds %struct.bw_calcs_data, ptr %data, i32 0, i32 162
%arrayidx234 = getelementptr [12 x %struct.bw_fixed], ptr %compression_rate233, i32 0, i32 1
call void @llvm.lifetime.start.p0(i64 8, ptr %tmp235) #18
call void @bw_int_to_fixed(ptr sret(%struct.bw_fixed) align 8 %tmp235, i64 noundef 1)
call void @llvm.memcpy.p0.p0.i32(ptr align 8 %arrayidx234, ptr align 8 %tmp235, i32 8, i1 false)
call void @llvm.lifetime.end.p0(i64 8, ptr %tmp235) #18
```

ARM AAPCS32 §6.4 "Result Return" mentions: […]
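In C terms, the AAPCS32 rule means each call site is effectively rewritten as below (a sketch; the _sret helper name and callsite shape are mine):

```c
struct bw_fixed { long long value; };

/* How arm32 sees the helper after ABI lowering: the result goes
 * through a hidden pointer instead of a register. */
void bw_int_to_fixed_sret(struct bw_fixed *ret, long long v);

void callsite(struct bw_fixed *dst)
{
	/* What `*dst = bw_int_to_fixed(1);` becomes under AAPCS32: a
	 * caller-allocated temporary, a call writing through the hidden
	 * pointer, then a copy into place -- the lifetime markers and
	 * memcpy seen in the arm32 IR above. */
	struct bw_fixed tmp;

	bw_int_to_fixed_sret(&tmp, 1);
	*dst = tmp;
}
```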
---

https://godbolt.org/z/vvMoza9es demonstrates the issue more. When returning a value larger than the word size, the IR isn't great for structures/aggregates/composite types. 32b x86 is saved by clang unwrapping the single-member struct to a plain i64 return instead of going through sret.

It's really hard to understand what is getting placed in which stack slot, though. I wonder if stack-slot-coloring is having some kind of issue; maybe it's time to beef up the debug output from that pass.
---

I was able to claw back tens of bytes by rewriting calculate_bandwidth a bit (removing the two stupid […]). Or move most of drivers/gpu/drm/amd/display/dc/dml/calcs/bw_fixed.c into drivers/gpu/drm/amd/display/dc/dml/calcs/dce_calcs.c, since that's pretty much the only user. See the sketch below.
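A sketch of that second suggestion (hypothetical; assumes a 24-bit fractional part as in the earlier sketch): with the helper visible as static inline in dce_calcs.c, the inliner removes the out-of-line sret call on arm32 and SROA can scalarize the temporary just as it already does on x86_64.

```c
struct bw_fixed { long long value; };

/* Hypothetical: bw_int_to_fixed folded into dce_calcs.c as static
 * inline, so no cross-TU sret call is needed on arm32. */
static inline struct bw_fixed bw_int_to_fixed(long long value)
{
	struct bw_fixed res = { value << 24 };

	return res;
}
```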
---

Duplicating this to #39. Once the two issues identified by #39 (comment) are fixed in clang, we can reopen this if necessary. One additional thing is conspiring here (and it's unavoidable): […]

Dunno about 390 though... but one of the two issues in clang is very much a result of dce_calcs.
---

Initial report: In several different configurations, I see an excessive amount of stack usage in certain functions within drivers/gpu/drm/amd/display/dc/calcs/dce_calcs.c.