Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Martien.pre2post #213

Merged
merged 11 commits into from
Oct 22, 2024
Merged

Martien.pre2post #213

merged 11 commits into from
Oct 22, 2024

Conversation

martien-de-jong
Copy link
Collaborator

Some tweaking to get more leverage from the PostPipeliner:

  • Pre-allocate resources of future operations to take more heed of possible conflicts
  • Avoid unrolling for very high MinTripCount loops
  • PrePipeliner doesn't increase II if MaxStages is exceeded, instead it stops searching and rejects the schedule in favour of pipelining
  • Allow larger II's in PostPipeliner; compute ResMII to take away some compile time

HR.emitInScoreboard(Scoreboard, MI->getDesc(), MemoryBanks,
MI->operands(), MI->getMF()->getRegInfo(), Cycle);
Cycle += II;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noice!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you think of maybe adding a small test that benefits from "pre-booking modulo resources"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see whether this tipped the conv2d balamce.

@@ -141,7 +141,7 @@ static cl::opt<int> SwpForceII("pipeliner-force-ii",
static cl::opt<int>
SwpMaxStages("pipeliner-max-stages",
cl::desc("Maximum stages allowed in the generated scheduled."),
cl::Hidden, cl::init(3));
cl::Hidden, cl::init(-1));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might have trouble upstreaming that

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end, this option seems strange. This decision should always delegated to the target.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, we can push this option to -1 from the frontend.

Copy link
Collaborator Author

@martien-de-jong martien-de-jong Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this predates shouldUseSchedule. I agree with Andreu that it should be decided there.
When we upstream, I'll repair machine-pipeliner users, e.g. by implementing it in the base class of PipelinerLoopInfo rather than in the pipeliner itself.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I will do that in this PR.

Copy link
Collaborator Author

@martien-de-jong martien-de-jong Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It's more complicated than that, because there's a MaxStages test in the main II loop. That means it could be moved to canAcceptII, but that's not upstreamed either. I propose to upstream both at the same time, moving the MaxStages check to the default implementation of canAcceptII

@@ -272,6 +272,9 @@ namespace llvm {
MachineFunction *CurrFunc;
MachineFunction::iterator NextMBB;

// Supply specific memory disambiguation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this flexibility!

@martien-de-jong martien-de-jong marked this pull request as ready for review October 17, 2024 11:10
// PostPipeliner can do nothing without tripcount > 1
if (MinTripCount <= 1) {
return false;
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It's a ZOL, Prepipeliner can't do anything either.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? We can add guards I think

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that is hard to add guards in post, it will change the CFG and will require a new register for the constant...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a loop counter GPR?

@@ -55,7 +55,7 @@ static cl::opt<bool>
"iterative loop scheduling"));

static cl::opt<int> PostPipelinerMaxII(
"aie-postpipeliner-maxii", cl::init(10),
"aie-postpipeliner-maxii", cl::init(40),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably too high. Maybe revisit when we hit noticable compile time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay, if makes me feel better about "aie-loop-aware-expensive-iterations", cl::init(35) 😆

; CHECK-TC-2: PLI: Accepting II=5

; CHECK-TC-1: Not interesting MinTripCount (<=1)!
; CHECK-TC-1-NEXT: Loop rejected
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questionable auto-update.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think the # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4 header should be removed. Could you maybe manually update the test?

AddrSpace ? ASI.getMemoryBanksFromAddressSpace(AddrSpace)
: AddressSpaceNoneIsSafe ? 0
: ASI.getDefaultMemoryBank();
MemoryBankUsed |= MemoryBank;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is more than an added option. We start at none, and collect the memory banks from all MMO's. We used to start at 'all banks' and intersect. The latter is incompatible with multi-memory ops like VLDB.4x

@@ -54,6 +54,9 @@ static cl::opt<int>
static cl::opt<unsigned>
MaxUnrollCost("aie-unroll-max-cost", cl::Hidden, cl::init(200),
cl::desc("Maximum partial unroll cost for loops"));
static cl::opt<unsigned> PreferSwpOverUnroll(
"aie-prefer-swp-over-unroll", cl::Hidden, cl::init(9),
cl::desc("Aim for pipelining if MinIterCount is at least this value."));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is inspired by the fact that all ReduceABC's regress for 8 and lower.

if (BS.getRegions().size() == 1) {
auto &PostSWP = BS.getPostSWP();
if (PostSWP.canAccept(*BS.TheBlock)) {
BS.FixPoint.II = PostSWP.getResMII(*BS.TheBlock);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This saves a lot of work. II divides the linear schedule length to determine the number of copies in the DAG, and DAG construction is expensive and worse than linear in the number of nodes. Something like Conv2D immediately hits the target II, so we would also save 7 iterations of the main loop.

while (C < NInstr && (Scoreboard[C] & Slots)) {
C++;
}
if (C >= NInstr) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: C == NInstr. with C > NInstr means that we need k+1 slots for k instructions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitter: This is the reverse of the first part of the loop condition, without having to know how C is updated.

C++;
}
if (C >= NInstr) {
MII = NInstr;
Copy link
Collaborator

@andcarminati andcarminati Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this happens, do we try to run the pipeliner anyway? Should we return PostPipelinerMaxII (provided we share this option)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This computes ResMII, not how it it used. I don't want to tie too many strings to it.
As we compute it now, it will never exceed NInstr. but even if II == NInstr, the latency path through those NInstrs can allow a 4 stage pipeline to be useful. And if the DAG is a straight chain of dependences, pipelining can still be cheap.

@@ -270,8 +300,15 @@ bool PostPipeliner::scheduleFirstIteration() {
LLVM_DEBUG(dbgs() << " Emit in " << -Depth + LocalCycle << "\n");
int Cycle = -Depth + LocalCycle;
LLVM_DEBUG(dbgs() << " Emit in " << Cycle << "\n");
HR.emitInScoreboard(Scoreboard, MI->getDesc(), MemoryBanks, MI->operands(),
MI->getMF()->getRegInfo(), Cycle);
for (int N = 0; N < NCopies; N++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would use unsigned here ;-).

Copy link
Collaborator Author

@martien-de-jong martien-de-jong Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I solemnly pledged only to use unsigned for bitmasks (and perhaps for very small types like char for lookup tables)

if (NewLatest < Info[P].Latest) {
Info[P].Latest = NewLatest;
Changed = true;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an overview of what this commit is doing?

Copy link
Collaborator

@gbossu gbossu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🎉

@martien-de-jong martien-de-jong enabled auto-merge (rebase) October 22, 2024 13:22
@martien-de-jong martien-de-jong merged commit 9fcbc8e into aie-public Oct 22, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants