
VLSU Support #183

Closed
wants to merge 16 commits into from

Conversation

aarongchan (Collaborator)

PR Goal:
Implement VLSU instruction support. The current implementation reuses the LSU design and adds vector iterations based on VLEN and the VLSU data width.

@aarongchan aarongchan added the enhancement New feature or request label Jul 18, 2024
@aarongchan aarongchan self-assigned this Jul 18, 2024
@aarongchan (Collaborator, Author)

@kathlenemagnus
When you get a chance, please look at VLSU.cpp (specifically completeInst_) for context on the rest of this comment. I've updated my code to work around how the LSU handles stores; the issue I ran into and had to work around can be summarized as:

  • Stores get retired before "completing" in the LSU. This is an issue because:
    • We need to send multiple memory requests through. If one vector store needs 16 passes, i.e. 16 memory operations, we need to send that many requests through. However, the LoadStoreInfo and MemoryAccessInfo wrappers, which carry the cache, TLB, and MMU status, include the InstPtr. With multiple requests this becomes a problem: the InstPtr is already marked retired, but each subsequent request tries to set the status again, resulting in an error.
    • The workaround was to create a separate status per LoadStoreInfo wrapper, so each memory request has its own status, separate from the instruction's.
      A couple of questions:
  • Currently in my design, memory requests are sent through the VLSU serially, so we have to wait for the first pass to finish before the second pass can begin for the same vector instruction. This is due to the current LSU design, which assumes one memory request per instruction; that holds for the LSU but not the VLSU. The benefit is that multiple VLSU instructions fully utilize the pipeline: while instruction A's memory request is waiting for the cache, instruction B's request is at a different stage of the VLSU pipeline. Does this type of design make sense, or should I rearchitect it to work more like the uop generator in the VLSU that Knute suggested?
  • The VLSU uop generator idea would process only one VLSU instruction at a time, but generate, say, 16 memory requests that then queue up in the VLSU. The flow would look like:
vlsu_uop_generator -> generates 16 memory requests for instruction A
vlsu_ldst_queue_ -> takes the memory requests, queues them up, begins processing them through the pipeline
vlsu -> once all memory requests are through, we officially retire instruction A
(repeat for all vlsu instructions)
vlsu_uop_generator -> generates 16 memory requests for instruction B
...

I wanted to check all this with you and Knute before proceeding, because it will change how the VLSU works compared to the LSU, especially around setting the instruction status versus creating a LoadStoreInfo status, given the nature of VLSU instructions.

@@ -336,6 +336,9 @@ namespace olympia
// uint32_t unfusedInstsSize = insts->size();

// Decrement internal Uop Queue credits
ILOG(uop_queue_credits_)

Did you mean to check these in?

core/Inst.hpp Outdated
@@ -485,6 +506,12 @@ namespace olympia
1; // start at 1 because the uop count includes the parent instruction
uint64_t uop_count_ = 0;
VCSRs VCSRs_;
uint32_t eew_;

I would move these values into the VCSRs structure and rename it to something like VectorConfig or VectorState.

uint32_t mop_;
uint32_t stride_;

uint32_t vlsu_total_iters_ = 0;
@kathlenemagnus Jul 27, 2024

Also might want to consider a decorator like RenameData for LSUData.

@@ -237,7 +240,7 @@ namespace olympia
"pipe. Did you define it in the yaml properly?");
// so we have a map here that checks for which valid dispatchers for that
// instruction target pipe map needs to be: "int": [exe0, exe1, exe2]
if (target_pipe != InstArchInfo::TargetPipe::LSU)
if (target_pipe != InstArchInfo::TargetPipe::LSU && target_pipe != InstArchInfo::TargetPipe::VLSU)

Can you just check isLoadStoreInst() instead?

if(uop->isLoadStoreInst()){
// set base address according to LMUL, i.e if we're on the 3rd
// LMUL Uop, it's base address should be base address + 3 * EEW
uop->setTargetVAddr(uop->getTargetVAddr() + uop->getEEW() * uop->getUOpID());

This is fine for now since this PR only supports unit stride, but eventually we should create an "address unroller" class that generates all of the addresses for a vector uop. I think it makes sense to be a part of this class, but it could also be a part of the VLSU.

}
}

void setVectorIter(uint32_t vec_iter){

Open bracket should be on a new line.

private:
MemoryAccessInfoPtr mem_access_info_ptr_;
sparta::State<IssuePriority> rank_;
sparta::State<IssueState> state_;
bool in_ready_queue_;
uint32_t vector_iterations_ = 0;

Is this really "completed vector iterations"? Could be renamed to be more clear.

core/LSU.cpp Outdated
@@ -258,17 +259,18 @@ namespace olympia
{
sparta_assert(inst_ptr->getStatus() == Inst::Status::RETIRED,
"Get ROB Ack, but the store inst hasn't retired yet!");
if(!inst_ptr->isVector()){

You should assert if the inst is a vector since you have separate ports for scalar and vector.

core/LSU.cpp Outdated
for (auto & inst_info_ptr : ldst_inst_queue_)
{
if (inst_info_ptr->getInstPtr() == inst_ptr)
if(!inst_ptr->isVector()){

Same comment as above. Just assert if vector since vector insts should never end up here.

@@ -100,6 +100,13 @@ namespace olympia
return inst_ptr == nullptr ? 0 : inst_ptr->getUniqueID();
}

// This function will be added to the SPARTA_ADDPAIRs API.
uint64_t getInstUOpID() const

In what scenario is the inst ptr not set?

@@ -151,6 +158,8 @@ namespace olympia
replay_queue_iterator_ = iter;
}

void setIsVector(bool is_vector){ is_vector_ = is_vector; }
bool isVector(){ return is_vector_; }
@kathlenemagnus Jul 27, 2024

Can we check the inst ptr for isVector() instead?

core/ROB.cpp Outdated
@@ -73,19 +73,9 @@ namespace olympia
{
for (auto & i : *in_reorder_buffer_write_.pullData())
{
if (!i->isUOp())

I changed this code in my most recent PR so you may have conflicts whenever you sync your fork.

@kathlenemagnus (Collaborator)

The workaround was to create a separate status per LoadStoreInfo wrapper, so each memory request has its own status, separate from the instruction's.

I think this is the right solution. Each memory access should be completed separately and then a vector instruction can be marked completed if all of its memory accesses have completed.

Currently in my design, memory requests are sent through the VLSU serially, so we have to wait for the first pass to finish before the second pass can begin for the same vector instruction. This is due to the current LSU design, which assumes one memory request per instruction; that holds for the LSU but not the VLSU. The benefit is that multiple VLSU instructions fully utilize the pipeline: while instruction A's memory request is waiting for the cache, instruction B's request is at a different stage of the VLSU pipeline. Does this type of design make sense, or should I rearchitect it to work more like the uop generator in the VLSU that Knute suggested?

I would rearchitect like Knute suggested. What we want is for a vector instruction to be able to generate multiple memory accesses. They can be executed serially, but they should be able to be pipelined. So 1 vector instruction should be able to send a uop down the LSU pipeline every cycle. Similarly to how we used to track whether a parent instruction was "done" in the ROB, a vector load or store should expect multiple memory accesses to be completed before it can be marked as done. One thing to be careful of though is that there should only be 1 writeback to the vector destination. So you will need some sort of structure for collecting all of the data returned to the LSU.

The VLSU uop generator idea would process only one VLSU instruction at a time, but generate, say, 16 memory requests that then queue up in the VLSU. The flow would look like:

vlsu_uop_generator -> generates 16 memory requests for instruction A
vlsu_ldst_queue_ -> takes the memory requests, queues them up, begins processing them through the pipeline
vlsu -> once all memory requests are through, we officially retire instruction A
(repeat for all vlsu instructions)
vlsu_uop_generator -> generates 16 memory requests for instruction B

I wanted to check all this with you and Knute before proceeding, because it will change how the VLSU works compared to the LSU, especially around setting the instruction status versus creating a LoadStoreInfo status, given the nature of VLSU instructions.

Let's talk about this more on Monday. I see two paths forward here:

  1. We could do as you suggest and "sequence" the vector load store instruction when it gets to the LSU and generate a memory request that is stored in the LDST queue. These memory requests would behave like scalar instructions in the queue. The question here is when to trigger WB. It's inefficient to writeback to the vector destination 16 times and it is not always possible to do partial writes to the register file.
  2. Another option would be to keep the vector instruction in the LDST queue and send it down the LSU pipe multiple times. Each time it gets sent down the pipe it would generate a different memory access. Again, we need a way to track all of the data that has come back and make sure it's all ready before sending the vector instruction down the pipe a final time to do writeback.

@aarongchan aarongchan marked this pull request as ready for review August 1, 2024 14:50
@aarongchan (Collaborator, Author)

@klingaard @kathlenemagnus this should be ready for review/merge.

@klingaard klingaard self-requested a review August 8, 2024 21:23
@klingaard left a comment

I'll preemptively approve this as Kathlene will be leading the charge on this PR for now.

Thank you for your contributions to the project, @aarongchan! Best of luck to you at Intel.

core/Inst.hpp Outdated
Comment on lines 510 to 494
uint32_t mop_;
uint32_t stride_;
Not initialized.

@aarongchan (Collaborator, Author)

Closing due to @kathlenemagnus's PR

@aarongchan aarongchan closed this Nov 8, 2024