-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LV] Fix runtime-VF logic when generating RT-checks #130118
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,6 @@ target triple = "aarch64-unknown-linux-gnu" | |
|
||
; Test case where the minimum profitable trip count due to runtime checks | ||
; exceeds VF.getKnownMinValue() * UF. | ||
; FIXME: The code currently incorrectly is missing a umax(VF * UF, 28). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this FIXME is wrong. We already generate the umax considering UF * VF. This is for the minimum iteration check, and UF*VF is needed for correctness. the constant is the minimum iteration count needed to be profitable, but not for correctness. AFAICT the only reason the constant gets changed is due to the runtime checks being more expensive now, due to duplicated RT VF computations? the actual memory checks seem to use the same runtime VF as originally? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The constant is MinProfTC in the creation of the binary intrinsic in emitIterationCountCheck?
Yes, it's the profitability check that seems to be broken. |
||
define void @min_trip_count_due_to_runtime_checks_1(ptr %dst.1, ptr %dst.2, ptr %src.1, ptr %src.2, i64 %n) { | ||
; CHECK-LABEL: @min_trip_count_due_to_runtime_checks_1( | ||
; CHECK-NEXT: entry: | ||
|
@@ -16,7 +15,7 @@ define void @min_trip_count_due_to_runtime_checks_1(ptr %dst.1, ptr %dst.2, ptr | |
; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1) | ||
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4 | ||
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 20, i64 [[TMP1]]) | ||
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 28, i64 [[TMP1]]) | ||
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[UMAX]], [[TMP2]] | ||
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]] | ||
; CHECK: vector.memcheck: | ||
|
@@ -25,21 +24,29 @@ define void @min_trip_count_due_to_runtime_checks_1(ptr %dst.1, ptr %dst.2, ptr | |
; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 16 | ||
; CHECK-NEXT: [[TMP6:%.*]] = sub i64 [[DST_21]], [[DST_12]] | ||
; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], [[TMP5]] | ||
; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP4]], 16 | ||
; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP18:%.*]] = mul i64 [[TMP7]], 2 | ||
; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP18]], 16 | ||
; CHECK-NEXT: [[TMP8:%.*]] = sub i64 [[DST_12]], [[SRC_13]] | ||
; CHECK-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP7]] | ||
; CHECK-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP8]], [[TMP9]] | ||
; CHECK-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]] | ||
; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP4]], 16 | ||
; CHECK-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP11]], 2 | ||
; CHECK-NEXT: [[TMP13:%.*]] = mul i64 [[TMP22]], 16 | ||
; CHECK-NEXT: [[TMP10:%.*]] = sub i64 [[DST_12]], [[SRC_25]] | ||
; CHECK-NEXT: [[DIFF_CHECK6:%.*]] = icmp ult i64 [[TMP10]], [[TMP9]] | ||
; CHECK-NEXT: [[DIFF_CHECK6:%.*]] = icmp ult i64 [[TMP10]], [[TMP13]] | ||
; CHECK-NEXT: [[CONFLICT_RDX7:%.*]] = or i1 [[CONFLICT_RDX]], [[DIFF_CHECK6]] | ||
; CHECK-NEXT: [[TMP11:%.*]] = mul i64 [[TMP4]], 16 | ||
; CHECK-NEXT: [[TMP24:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP26:%.*]] = mul i64 [[TMP24]], 2 | ||
; CHECK-NEXT: [[TMP38:%.*]] = mul i64 [[TMP26]], 16 | ||
; CHECK-NEXT: [[TMP12:%.*]] = sub i64 [[DST_21]], [[SRC_13]] | ||
; CHECK-NEXT: [[DIFF_CHECK8:%.*]] = icmp ult i64 [[TMP12]], [[TMP11]] | ||
; CHECK-NEXT: [[DIFF_CHECK8:%.*]] = icmp ult i64 [[TMP12]], [[TMP38]] | ||
; CHECK-NEXT: [[CONFLICT_RDX9:%.*]] = or i1 [[CONFLICT_RDX7]], [[DIFF_CHECK8]] | ||
; CHECK-NEXT: [[TMP13:%.*]] = mul i64 [[TMP4]], 16 | ||
; CHECK-NEXT: [[TMP19:%.*]] = call i64 @llvm.vscale.i64() | ||
; CHECK-NEXT: [[TMP20:%.*]] = mul i64 [[TMP19]], 2 | ||
; CHECK-NEXT: [[TMP21:%.*]] = mul i64 [[TMP20]], 16 | ||
; CHECK-NEXT: [[TMP14:%.*]] = sub i64 [[DST_21]], [[SRC_25]] | ||
; CHECK-NEXT: [[DIFF_CHECK10:%.*]] = icmp ult i64 [[TMP14]], [[TMP13]] | ||
; CHECK-NEXT: [[DIFF_CHECK10:%.*]] = icmp ult i64 [[TMP14]], [[TMP21]] | ||
; CHECK-NEXT: [[CONFLICT_RDX11:%.*]] = or i1 [[CONFLICT_RDX9]], [[DIFF_CHECK10]] | ||
; CHECK-NEXT: br i1 [[CONFLICT_RDX11]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]] | ||
; CHECK: vector.ph: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how this fixes the issue described in the commit message. The lambda should always be called with the same VF and arguments, unless I am missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is two-fold: first, the runtime check is generated as as an llvm.vscale expression: that is llvm.vscale multiplied by a constant. Next, addDiffRuntimeChecks creates a mul of this result, and an SCEV expansion of
SinkStart - SrcStart
, and looks up this pair in a map, eliminating redundant compares. Now, if my understanding is correct, it is not safe to cache a llvm.vscale call.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fhahn I thought that at first too, but when I looked into this it does look like a bug to me, but perhaps I'm missing something. The issue here is in
addDiffRuntimeChecks
this code:we are potentially passing in different values for the
Bits
argument, i.e. for each value ofBits
we should be creating a different runtime VF. See the bit in the lambda functiongetRuntimeVF(B, B.getIntNTy(Bits), VF)
. So if on the first instance Bits=32, we create a 32-bit runtime VF and cache that. Then on the second iteration we might pass in Bits=64 and still return the 32-bit runtime VF. I don't even know how theChkBuilder.CreateMul
worked given the mismatched types.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it gets worse with this code:
because even though
SrcStart, SinkStart, AccessSize, NeedsFreeze
may all be different we still perform the lookup based on the first runtime VF, potentially getting the valueIsConflict
wrong.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So any time you have different types for
SinkStart
in your list you potentially end up with incorrect runtime diff checks.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps now I think about this you explicitly do want to cache the runtime VF because the code in
addDiffRuntimeChecks
expects that, however there are no comments to that affect. And also, we should be caching a single value for all values ofBits
in that case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing
Value *RuntimeVF
to a map such asDenseMap<unsigned, Value*> RuntimeVFs
?