Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x87 Stack Optimization #3547

Merged
merged 3 commits into from
Jul 22, 2024
Merged

x87 Stack Optimization #3547

merged 3 commits into from
Jul 22, 2024

Conversation

pmatos
Copy link
Collaborator

@pmatos pmatos commented Apr 2, 2024

This effort tries to implement a framework to optimize the x87 stack.

Currently draft, it's a work in progress and many, many tests are still failing. However, it feels like there's a direction so I am opening a PR.

I have been writing a sort of thoughts/design document about this which you can access through:
https://docs.google.com/document/d/1ZIWNuu6h6EuAkMTL70PxfMwx9cZxOZ_OAK849e8y4tM/edit?usp=sharing

@pmatos pmatos force-pushed the wip_x87_stack branch 2 times, most recently from 08d106d to cdfb112 Compare April 17, 2024 09:22
@pmatos pmatos marked this pull request as ready for review May 20, 2024 10:05
@pmatos
Copy link
Collaborator Author

pmatos commented May 20, 2024

This is not ready for merging.

Having said that the non-reduced precision work is complete. Things that need to happen before a merge:

  • Testing in places other than Psychonauts.
    • HalfLife and Oblivion not running atm under FEX (independently of this branch).
  • Integrating reduced precision mode into pass.
  • Code cleanup - code has quite a bit of logs and so on that can be removed once the work has come to a close.
    • Still wondering if it's ok to do some simplifications like transforming a pure memcopy into a load/store pair, therefore bypassing the stack. This might not exactly duplicate what's happening natively due to native->80bit and 80bit->native conversions that happen if you go through the stack.
    • There are surely other possibilities for improvement but this is already quite a speedup in my experience.

@Sonicadvance1
Copy link
Member

Still wondering if it's ok to do some simplifications like transforming a pure memcopy into a load/store pair, therefore bypassing the stack. This might not exactly duplicate what's happening natively due to native->80bit and 80bit->native conversions that happen if you go through the stack.

Should be viable since x87 is lossless (in 80-bit precision mode). Will need to double check if precision mode effects loadstores or only ALU operations but worst case that means some code invalidation on operating mode change and disallowing memcpy implementation on lower precision.

@CallumDev
Copy link
Contributor

CallumDev commented May 20, 2024

Double precision mode in master affects 80-bit and 32-bit loadstores, as it basically switches the "native" operating mode to 64-bit. Conceivably if the stack optimization eliminates stack operations for memcpy-style operations, it could be extended to reduced precision by simply eliminating the stack operation/conversion pairs? As an FLD 80-bit plus FST 80-bit translates to (very basically):

  • F80 to Double

  • Stack store

  • Stack load

  • Double to F80

However memcopy on lower precision is borked by design currently so it's not a showstopper by any means if it can't be achieved on lower precision

@pmatos
Copy link
Collaborator Author

pmatos commented Jun 24, 2024

Update: Reduced precision integrated into pass and currently all asm_tests are passing! ✔️

There are still a few issues. The plan is:

  • Benchmark a few games;
  • Code cleanup / refactoring where necessary;
  • Land;

It would be interesting to have a few more pairs of eyes over the code.
@alyssarosenzweig you offered to look at the code a few weeks ago. I think if you have any cleanup/refactoring suggestions, it would be great. I know there's quite a few things that can be improved. We have sort of 4 branches in the stack optimization patch like:

if (SlowPath) {
  if (ReducedPrecisionMode) {
    ...
  } else {
    ...
  }
} else {
  if (ReducedPrecisionMode) {
    ...
  } else {
    ...
  }
}

We should be able to improve the readability here. Any specific suggestions are very welcome! :) I also have a few TODO/FIXME comments in-code that I need to sort out but otherwise it's going in the right direction.

@alyssarosenzweig
Copy link
Collaborator

alyssarosenzweig commented Jun 24, 2024

Happy to hear you're making good progress! I'll definitely be reviewing this, although I might not get to it for a bit. (I'm prioritizing AVX review, and I'm out-of-office early next week - Monday is a Canadian holiday.)

@Sonicadvance1
Copy link
Member

Woo! Good job!

@pmatos pmatos force-pushed the wip_x87_stack branch 3 times, most recently from ae0efa5 to 198f088 Compare June 25, 2024 08:20
@pmatos
Copy link
Collaborator Author

pmatos commented Jun 26, 2024

While testing this branch on a few games, I noticed a bug on this branch with reduced precision. Currently trying to figure out why we are having a segfault in generated code.

Ref Value2 = LoadStackValue(ValueOffset2, StackOffset2);

Ref StackNode = IREmit->_VBSL(16, VecCond, Value1, Value2);
StoreStackValue(StackNode, 0, StackOffset1 && StackOffset2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the && here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's StackOffset1 != 0 && StackOffset2 != 0. The point is, we are storing to the top of the stack. We only really need to mark it as valid if none of the values come from the top of the stack. Because if they did, we would know it's already valid and there's no point in re-marking it.

@alyssarosenzweig
Copy link
Collaborator

It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended?

@pmatos
Copy link
Collaborator Author

pmatos commented Jul 19, 2024

It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended?

I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcounci results, it looked good to me.

@neobrain
Copy link
Member

I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcounci results, it looked good to me.

GitHub often drops conversations if you update the underlying code. There's nothing really you can do about it (other than enabling email notifications so you can look things up manually), annoyingly.

@pmatos
Copy link
Collaborator Author

pmatos commented Jul 19, 2024

It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended?

I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcounci results, it looked good to me.

I know what happened here. I add the memcpy insts in the commit where I add the tests and the third commit with instcounci modifies those. A recent change correctly removed the fcvt. I think I will add these instcountci tests in a separate PR instead.

@pmatos
Copy link
Collaborator Author

pmatos commented Jul 19, 2024

It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended?

I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcounci results, it looked good to me.

I know what happened here. I add the memcpy insts in the commit where I add the tests and the third commit with instcounci modifies those. A recent change correctly removed the fcvt. I think I will add these instcountci tests in a separate PR instead.

Now in #3880

@alyssarosenzweig
Copy link
Collaborator

For FixedSizeStack, why are we using a fextl::vector and not simply a static array? We know it will always be 8 elements - no need for the dynamic allocation, right?

Copy link
Collaborator

@alyssarosenzweig alyssarosenzweig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. There are still little things that could be improved, but at this point, the code looks structurally sound and we're hitting diminishing returns with prolonging the review process versus the benefit of getting this in-tree and soak-tested. Let's go 👍

@pmatos
Copy link
Collaborator Author

pmatos commented Jul 21, 2024 via email

@pmatos
Copy link
Collaborator Author

pmatos commented Jul 21, 2024 via email

Comment on lines +64 to +65
rotate();
buffer.front() = {StackSlot::VALID, Value};
Copy link
Member

@neobrain neobrain Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're using vector, we can invert the storage order of the elements to simplify the implementation. Instead of rotate + assign to front(), we can just use this:

Suggested change
rotate();
buffer.front() = {StackSlot::VALID, Value};
buffer.emplace_back(StackSlot::VALID, Value);

Similarly, pop becomes pop_back(), and StackSlot::UNUSED can be entirely removed.

Internally, the buffer won't be fixed-size anymore then, but that's fine. Calling reserve(size) on construction ensures there will never be more than one heap allocation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the suggestion but the location where the items are valid correspond to an offset to TOP, which will stop being the case. I understand that this optimization is possible, but I think it's going to complicate the code readability elsewhere. I will take a look to see exactly what such a change would look like and report back. Thanks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing is that we cannot really rmeove StackSlot::UNUSED. This is because the stack doesn't need to be contiguous, i.e. some elements might be unused. So you do need to mark that. For example:

fld qword [addr] ; 1.0
fst st2

In this case the stack will look like:

MM1  1.0      ST2
MM0  UNUSED   ST1
MM7  1.0      ST0 <- TOP

So we need to mark these in-between slots as unused.

Copy link
Member

@neobrain neobrain Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we baking pointers within the FixedSizeStack into the generated assembly code? If not then I don't understand why you're bringing up assembly code here.

The change I'm proposing is an implementation detail and shouldn't affect any of the users of the FixedSizeStack interface.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am confused about your suggestion, but I cannot see how your suggestion would work. The "stack" is not really a stack, which makes it harder to implement as you suggest. This is because I can push an element to the stack but then set the element that's two elements above that. With your proposed solution where the vector starts empty a push would be an emplace_back but then setting ST2, would mean a rotate right by one and emplacing back. It's certainly possible but I am not sure it's an improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "stack" is not really a stack

I think that's what confused me here. I see now that with the offset-parameter in top/setTop, we really always need 8 valid elements (no more, no less), which means you can't just push elements without also dropping the last element.

Bummer :(

@neobrain
Copy link
Member

I don't really have the expertise to review this beyond what I already commented on, so Alyssa's +1 is good enough.

Good job on having continued pushing this forward despite how tricky it was to get working!

@neobrain neobrain merged commit d507f4c into FEX-Emu:main Jul 22, 2024
12 checks passed
@pmatos pmatos deleted the wip_x87_stack branch July 26, 2024 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants