x87 Stack Optimization #3547

pmatos · 2024-04-02T13:05:25Z

This effort tries to implement a framework to optimize the x87 stack.

Currently draft, it's a work in progress and many, many tests are still failing. However, it feels like there's a direction so I am opening a PR.

I have been writing a sort of thoughts/design document about this which you can access through:
https://docs.google.com/document/d/1ZIWNuu6h6EuAkMTL70PxfMwx9cZxOZ_OAK849e8y4tM/edit?usp=sharing

pmatos · 2024-05-20T10:11:29Z

This is not ready for merging.

Having said that, the non-reduced precision work is complete. Things that need to happen before a merge:

- Testing in places other than Psychonauts.
  - HalfLife and Oblivion not running atm under FEX (independently of this branch).
- Integrating reduced precision mode into pass.
- Code cleanup - code has quite a bit of logs and so on that can be removed once the work has come to a close.
  - Still wondering if it's ok to do some simplifications like transforming a pure memcopy into a load/store pair, therefore bypassing the stack. This might not exactly duplicate what's happening natively due to native->80bit and 80bit->native conversions that happen if you go through the stack.
  - There are surely other possibilities for improvement but this is already quite a speedup in my experience.

Sonicadvance1 · 2024-05-20T10:23:52Z

Still wondering if it's ok to do some simplifications like transforming a pure memcopy into a load/store pair, therefore bypassing the stack. This might not exactly duplicate what's happening natively due to native->80bit and 80bit->native conversions that happen if you go through the stack.

Should be viable since x87 is lossless (in 80-bit precision mode). Will need to double check if precision mode affects loadstores or only ALU operations, but worst case that means some code invalidation on operating mode change and disallowing the memcpy implementation on lower precision.

CallumDev · 2024-05-20T12:17:58Z

Double precision mode in master affects 80-bit and 32-bit loadstores, as it basically switches the "native" operating mode to 64-bit. Conceivably if the stack optimization eliminates stack operations for memcpy-style operations, it could be extended to reduced precision by simply eliminating the stack operation/conversion pairs? As an FLD 80-bit plus FST 80-bit translates to (very basically):

F80 to Double
Stack store
Stack load
Double to F80

However, memcopy on lower precision is borked by design currently, so it's not a showstopper by any means if it can't be achieved on lower precision.
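
To illustrate why the conversion pairs matter, here is a small standalone sketch. It is only an analogy, not FEX code: it uses `double -> long double -> double` as a stand-in for a widening round trip (which cannot lose bits) and `double -> float -> double` as a stand-in for a round trip through a narrower "native" format (which can). The function names are made up for this example.

```cpp
#include <cstring>

// Widening round trip: every double is exactly representable as a long double,
// so converting up and back down returns the same bits.
bool WideRoundTripIsExact(double In) {
  long double Wide = In;
  double Out = static_cast<double>(Wide);
  return std::memcmp(&In, &Out, sizeof(double)) == 0;
}

// Narrowing round trip: most doubles are not exactly representable as a float,
// so converting down and back up generally changes the bits.
bool NarrowRoundTripIsExact(double In) {
  float Narrow = static_cast<float>(In);
  double Out = Narrow;
  return std::memcmp(&In, &Out, sizeof(double)) == 0;
}
```

Under this analogy, replacing a load/store pair with a plain copy is only behaviour-preserving in the first case. In the second, the copy is exact, which differs from what the conversion pair would have produced.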

pmatos · 2024-06-24T16:50:42Z

Update: Reduced precision integrated into pass and currently all asm_tests are passing! ✔️

There are still a few issues. The plan is:

Benchmark a few games;
Code cleanup / refactoring where necessary;
Land;

It would be interesting to have a few more pairs of eyes over the code.
@alyssarosenzweig you offered to look at the code a few weeks ago. I think if you have any cleanup/refactoring suggestions, it would be great. I know there's quite a few things that can be improved. We have sort of 4 branches in the stack optimization patch like:

if (SlowPath) {
  if (ReducedPrecisionMode) {
    ...
  } else {
    ...
  }
} else {
  if (ReducedPrecisionMode) {
    ...
  } else {
    ...
  }
}

We should be able to improve the readability here. Any specific suggestions are very welcome! :) I also have a few TODO/FIXME comments in-code that I need to sort out but otherwise it's going in the right direction.
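
One possible way to flatten the nesting above, offered only as a sketch: fold the two booleans into a single index and dispatch through a small table. All names here (`X87Path`, `SelectPath`, `Dispatch`, the handlers) are hypothetical and do not come from the FEX sources; the handlers return labels as placeholders for the real per-path logic.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical: one enumerator per combination of the two flags.
enum class X87Path : std::size_t {
  FastFull = 0,
  FastReduced = 1,
  SlowFull = 2,
  SlowReduced = 3,
};

// Bit 1 encodes SlowPath, bit 0 encodes ReducedPrecisionMode.
constexpr X87Path SelectPath(bool SlowPath, bool ReducedPrecisionMode) {
  return static_cast<X87Path>((SlowPath ? 2u : 0u) | (ReducedPrecisionMode ? 1u : 0u));
}

// Placeholder handlers standing in for the four "..." bodies.
constexpr std::string_view HandleFastFull() { return "fast/full"; }
constexpr std::string_view HandleFastReduced() { return "fast/reduced"; }
constexpr std::string_view HandleSlowFull() { return "slow/full"; }
constexpr std::string_view HandleSlowReduced() { return "slow/reduced"; }

using Handler = std::string_view (*)();

// Table order must match the enumerator values above.
constexpr std::array<Handler, 4> Handlers = {
  HandleFastFull, HandleFastReduced, HandleSlowFull, HandleSlowReduced,
};

inline std::string_view Dispatch(bool SlowPath, bool ReducedPrecisionMode) {
  return Handlers[static_cast<std::size_t>(SelectPath(SlowPath, ReducedPrecisionMode))]();
}
```

Whether a table like this reads better than the nested `if` depends on how much code the four bodies share; if they mostly overlap, pulling the differing parts into small helpers may be the simpler change.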

alyssarosenzweig · 2024-06-24T16:55:31Z

Happy to hear you're making good progress! I'll definitely be reviewing this, although I might not get to it for a bit. (I'm prioritizing AVX review, and I'm out-of-office early next week - Monday is a Canadian holiday.)

Sonicadvance1 · 2024-06-24T16:56:19Z

Woo! Good job!

pmatos · 2024-06-26T16:15:59Z

While testing this branch on a few games, I noticed a bug on this branch with reduced precision. Currently trying to figure out why we are having a segfault in generated code.

alyssarosenzweig · 2024-07-18T13:39:06Z

FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp

+        Ref Value2 = LoadStackValue(ValueOffset2, StackOffset2);
+
+        Ref StackNode = IREmit->_VBSL(16, VecCond, Value1, Value2);
+        StoreStackValue(StackNode, 0, StackOffset1 && StackOffset2);


I don't understand the && here.

I guess it's StackOffset1 != 0 && StackOffset2 != 0. The point is, we are storing to the top of the stack. We only really need to mark it as valid if none of the values come from the top of the stack. Because if they did, we would know it's already valid and there's no point in re-marking it.
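
A minimal sketch of what the expression evaluates to, assuming the offsets are plain integers and the last argument of `StoreStackValue` is a "mark as valid" flag (the thread does not show its signature). `NeedsValidMark` is a hypothetical name used only here.

```cpp
#include <cstdint>

// Integers convert to bool as "non-zero", so A && B on two offsets is the same
// as (A != 0) && (B != 0): true only when neither value came from the top slot.
constexpr bool NeedsValidMark(std::uint8_t StackOffset1, std::uint8_t StackOffset2) {
  return StackOffset1 && StackOffset2;
}

static_assert(NeedsValidMark(1, 2) == ((1 != 0) && (2 != 0)));
static_assert(NeedsValidMark(0, 2) == ((0 != 0) && (2 != 0)));
```

Spelling out the `!= 0` comparisons, or naming the result as above, would make the intent visible at the call site.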

FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp

alyssarosenzweig · 2024-07-18T13:43:25Z

It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended?

pmatos · 2024-07-19T06:26:16Z

It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended?

I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcountci results, it looked good to me.

neobrain · 2024-07-19T06:36:46Z

I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcountci results, it looked good to me.

GitHub often drops conversations if you update the underlying code. There's nothing really you can do about it (other than enabling email notifications so you can look things up manually), annoyingly.

pmatos · 2024-07-19T07:03:43Z

It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended?

I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcountci results, it looked good to me.

I know what happened here. I add the memcpy insts in the commit where I add the tests, and the third commit with instcountci modifies those. A recent change correctly removed the fcvt. I think I will add these instcountci tests in a separate PR instead.

pmatos · 2024-07-19T07:11:09Z

Now in #3880

alyssarosenzweig · 2024-07-20T18:30:50Z

For FixedSizeStack, why are we using a fextl::vector and not simply a static array? We know it will always be 8 elements - no need for the dynamic allocation, right?

alyssarosenzweig

Approved. There are still little things that could be improved, but at this point, the code looks structurally sound and we're hitting diminishing returns with prolonging the review process versus the benefit of getting this in-tree and soak-tested. Let's go 👍

pmatos · 2024-07-21T12:15:26Z

To be honest, I started by attempting to use a std::array but had issues with allocation due to being unable to construct an initial array with the values I needed. I then decided to move to a std::vector where I could use reserve and fill methods.
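
For reference, a minimal sketch of how a fixed-size array of slots could be given its initial values, assuming a simple aggregate slot type. `SlotState`, `Slot` and `MakeEmptySlots` are stand-ins invented for this example, and the thread does not say which constraint made this awkward in the real code, so it may not apply there.

```cpp
#include <array>
#include <cstddef>

// Stand-in types; the real slot holds an IR reference rather than an int.
enum class SlotState { UNUSED, VALID };

struct Slot {
  SlotState State;
  int Value;
};

// Builds an N-element array with every slot marked UNUSED, without any heap allocation.
template <std::size_t N>
std::array<Slot, N> MakeEmptySlots() {
  std::array<Slot, N> Slots{};            // value-initialize first
  Slots.fill(Slot{SlotState::UNUSED, 0}); // then overwrite with the wanted initial value
  return Slots;
}
```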

pmatos · 2024-07-21T12:16:25Z

Nice! Thanks for all the time you spent carefully reviewing this and providing great feedback.

neobrain · 2024-07-22T07:13:08Z

FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp

+    rotate();
+    buffer.front() = {StackSlot::VALID, Value};


Now that we're using vector, we can invert the storage order of the elements to simplify the implementation. Instead of rotate + assign to front(), we can just use this:

Suggested change

-    rotate();
-    buffer.front() = {StackSlot::VALID, Value};
+    buffer.emplace_back(StackSlot::VALID, Value);

Similarly, pop becomes pop_back(), and StackSlot::UNUSED can be entirely removed.

Internally, the buffer won't be fixed-size anymore then, but that's fine. Calling reserve(size) on construction ensures there will never be more than one heap allocation.
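
The claim about `reserve` can be checked with a small standalone sketch. It uses a plain `std::vector` and a made-up `Entry` type rather than `fextl::vector` and the real slot type, on the assumption that the two containers behave the same way here.

```cpp
#include <cstddef>
#include <vector>

struct Entry {
  int Value;
};

// Returns true if, after reserve(Size), appending Size elements never moved the storage.
// Size must be at least 1.
bool StorageStaysPut(std::size_t Size) {
  std::vector<Entry> Buffer;
  Buffer.reserve(Size);
  const std::size_t CapacityBefore = Buffer.capacity();

  Buffer.emplace_back(Entry{0});
  const Entry* First = &Buffer.front();

  for (std::size_t i = 1; i < Size; ++i) {
    Buffer.emplace_back(Entry{static_cast<int>(i)});
  }

  // No reallocation is allowed while size() stays within the reserved capacity.
  return &Buffer.front() == First && Buffer.capacity() == CapacityBefore && Buffer.size() == Size;
}
```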

I understand the suggestion, but the locations where items are valid correspond to offsets from TOP, which would stop being the case. I understand that this optimization is possible, but I think it's going to hurt code readability elsewhere. I will take a look to see exactly what such a change would look like and report back. Thanks.

Another thing is that we cannot really remove StackSlot::UNUSED. This is because the stack doesn't need to be contiguous, i.e. some elements might be unused. So you do need to mark that. For example:

fld qword [addr] ; 1.0
fst st2

In this case the stack will look like:

MM1  1.0     ST2
MM0  UNUSED  ST1
MM7  1.0     ST0 <- TOP

So we need to mark these in-between slots as unused.
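
The example above can be reproduced with a toy model. This is only a sketch of the idea, not the pass's `FixedSizeStack`: it keeps eight physical slots plus a TOP index, whereas the code under review rotates a buffer instead. `ToyX87Stack` and its member names are invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class SlotState { UNUSED, VALID };

struct Slot {
  SlotState State = SlotState::UNUSED;
  double Value = 0.0;
};

// Toy model: 8 physical slots (MM0..MM7) addressed relative to TOP.
class ToyX87Stack {
  static constexpr std::size_t Size = 8;
  std::array<Slot, Size> Slots{};
  std::size_t Top = 0;

public:
  // fld-like: TOP moves down by one (wrapping 0 -> 7) and the new ST0 becomes valid.
  void Push(double Value) {
    Top = (Top + Size - 1) % Size;
    Slots[Top] = {SlotState::VALID, Value};
  }

  // "fst stN"-like: writes the slot N places above TOP, leaving the slots in between untouched.
  void SetRelative(std::size_t Offset, double Value) {
    assert(Offset < Size);
    Slots[(Top + Offset) % Size] = {SlotState::VALID, Value};
  }

  const Slot& AtRelative(std::size_t Offset) const {
    assert(Offset < Size);
    return Slots[(Top + Offset) % Size];
  }

  std::size_t TopIndex() const { return Top; }
};
```

After `Push(1.0)` followed by `SetRelative(2, 1.0)`, ST0 and ST2 are valid while ST1 is still unused, which is the gap that a purely contiguous container cannot express without some kind of marker.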

Are we baking pointers within the FixedSizeStack into the generated assembly code? If not then I don't understand why you're bringing up assembly code here.

The change I'm proposing is an implementation detail and shouldn't affect any of the users of the FixedSizeStack interface.

Maybe I am confused about your suggestion, but I cannot see how it would work. The "stack" is not really a stack, which makes it harder to implement as you suggest. This is because I can push an element to the stack but then set the element that's two elements above that. With your proposed solution, where the vector starts empty, a push would be an emplace_back, but then setting ST2 would mean a rotate right by one and emplacing back. It's certainly possible but I am not sure it's an improvement.

The "stack" is not really a stack

I think that's what confused me here. I see now that with the offset-parameter in top/setTop, we really always need 8 valid elements (no more, no less), which means you can't just push elements without also dropping the last element.

Bummer :(

neobrain · 2024-07-22T12:07:58Z

I don't really have the expertise to review this beyond what I already commented on, so Alyssa's +1 is good enough.

Good job on having continued pushing this forward despite how tricky it was to get working!

pmatos force-pushed the wip_x87_stack branch 2 times, most recently from 08d106d to cdfb112 Compare April 17, 2024 09:22

pmatos force-pushed the wip_x87_stack branch from 51cd94d to 49645b9 Compare May 15, 2024 13:51

pmatos marked this pull request as ready for review May 20, 2024 10:05

This was referenced Jun 3, 2024

Fallout: New Vegas slow in-game due to x87 soft float #3670

Closed

Tohou: Luna Nights tile x87 precision issue #3685

Closed

pmatos force-pushed the wip_x87_stack branch 14 times, most recently from ef28fb7 to 3d92caa Compare June 24, 2024 16:44

pmatos force-pushed the wip_x87_stack branch 3 times, most recently from ae0efa5 to 198f088 Compare June 25, 2024 08:20