-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x87 Stack Optimization #3547
x87 Stack Optimization #3547
Conversation
08d106d
to
cdfb112
Compare
This is not ready for merging. Having said that the non-reduced precision work is complete. Things that need to happen before a merge:
|
Should be viable since x87 is lossless (in 80-bit precision mode). Will need to double check if precision mode effects loadstores or only ALU operations but worst case that means some code invalidation on operating mode change and disallowing memcpy implementation on lower precision. |
Double precision mode in master affects 80-bit and 32-bit loadstores, as it basically switches the "native" operating mode to 64-bit. Conceivably if the stack optimization eliminates stack operations for memcpy-style operations, it could be extended to reduced precision by simply eliminating the stack operation/conversion pairs? As an FLD 80-bit plus FST 80-bit translates to (very basically):
However memcopy on lower precision is borked by design currently so it's not a showstopper by any means if it can't be achieved on lower precision |
ef28fb7
to
3d92caa
Compare
Update: Reduced precision integrated into pass and currently all asm_tests are passing! ✔️ There are still a few issues. The plan is:
It would be interesting to have a few more pairs of eyes over the code.
We should be able to improve the readability here. Any specific suggestions are very welcome! :) I also have a few TODO/FIXME comments in-code that I need to sort out but otherwise it's going in the right direction. |
Happy to hear you're making good progress! I'll definitely be reviewing this, although I might not get to it for a bit. (I'm prioritizing AVX review, and I'm out-of-office early next week - Monday is a Canadian holiday.) |
Woo! Good job! |
ae0efa5
to
198f088
Compare
While testing this branch on a few games, I noticed a bug on this branch with reduced precision. Currently trying to figure out why we are having a segfault in generated code. |
FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp
Outdated
Show resolved
Hide resolved
FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp
Outdated
Show resolved
Hide resolved
FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp
Outdated
Show resolved
Hide resolved
FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp
Outdated
Show resolved
Hide resolved
Ref Value2 = LoadStackValue(ValueOffset2, StackOffset2); | ||
|
||
Ref StackNode = IREmit->_VBSL(16, VecCond, Value1, Value2); | ||
StoreStackValue(StackNode, 0, StackOffset1 && StackOffset2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the && here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's StackOffset1 != 0 && StackOffset2 != 0
. The point is, we are storing to the top of the stack. We only really need to mark it as valid if none of the values come from the top of the stack. Because if they did, we would know it's already valid and there's no point in re-marking it.
FEXCore/Source/Interface/IR/Passes/x87StackOptimizationPass.cpp
Outdated
Show resolved
Hide resolved
It seems like a bunch of InterpretAsFloats might've gotten lost in the latest refactor. The memcpy instcountci blocks, for example, had a bunch of conversions in them yesterday that are not there now. Was that intended? |
I simplified a few things yesterday but I am surprised you say that some conversions are gone. I will take a closer look but when I skimmed through the instcounci results, it looked good to me. |
GitHub often drops conversations if you update the underlying code. There's nothing really you can do about it (other than enabling email notifications so you can look things up manually), annoyingly. |
I know what happened here. I add the memcpy insts in the commit where I add the tests and the third commit with instcounci modifies those. A recent change correctly removed the fcvt. I think I will add these instcountci tests in a separate PR instead. |
Now in #3880 |
For FixedSizeStack, why are we using a fextl::vector and not simply a static array? We know it will always be 8 elements - no need for the dynamic allocation, right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. There are still little things that could be improved, but at this point, the code looks structurally sound and we're hitting diminishing returns with prolonging the review process versus the benefit of getting this in-tree and soak-tested. Let's go 👍
To be honest, it's I started by attempting to use an std::array but had issues with allocation due to being unable to construct an initial array with the values I needed. I then decided to move to an std vector where I could use reserve and fill methods.
…On July 20, 2024 8:31:10 PM GMT+02:00, Alyssa Rosenzweig ***@***.***> wrote:
For FixedSizeStack, why are we using a fextl::vector and not simply a static array? We know it will always be 8 elements - no need for the dynamic allocation, right?
--
Reply to this email directly or view it on GitHub:
#3547 (comment)
You are receiving this because you authored the thread.
Message ID: ***@***.***>
|
Nice! Thanks for all the time you spent carefully reviewing this and providing great feedback.
…On July 20, 2024 8:41:56 PM GMT+02:00, Alyssa Rosenzweig ***@***.***> wrote:
@alyssarosenzweig approved this pull request.
Approved. There are still little things that could be improved, but at this point, the code looks structurally sound and we're hitting diminishing returns with prolonging the review process versus the benefit of getting this in-tree and soak-tested. Let's go 👍
--
Reply to this email directly or view it on GitHub:
#3547 (review)
You are receiving this because you authored the thread.
Message ID: ***@***.***>
|
rotate(); | ||
buffer.front() = {StackSlot::VALID, Value}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we're using vector
, we can invert the storage order of the elements to simplify the implementation. Instead of rotate
+ assign to front()
, we can just use this:
rotate(); | |
buffer.front() = {StackSlot::VALID, Value}; | |
buffer.emplace_back(StackSlot::VALID, Value); |
Similarly, pop becomes pop_back()
, and StackSlot::UNUSED
can be entirely removed.
Internally, the buffer won't be fixed-size anymore then, but that's fine. Calling reserve(size)
on construction ensures there will never be more than one heap allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the suggestion but the location where the items are valid correspond to an offset to TOP, which will stop being the case. I understand that this optimization is possible, but I think it's going to complicate the code readability elsewhere. I will take a look to see exactly what such a change would look like and report back. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing is that we cannot really rmeove StackSlot::UNUSED. This is because the stack doesn't need to be contiguous, i.e. some elements might be unused. So you do need to mark that. For example:
fld qword [addr] ; 1.0
fst st2
In this case the stack will look like:
MM1 1.0 ST2
MM0 UNUSED ST1
MM7 1.0 ST0 <- TOP
So we need to mark these in-between slots as unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we baking pointers within the FixedSizeStack into the generated assembly code? If not then I don't understand why you're bringing up assembly code here.
The change I'm proposing is an implementation detail and shouldn't affect any of the users of the FixedSizeStack interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am confused about your suggestion, but I cannot see how your suggestion would work. The "stack" is not really a stack, which makes it harder to implement as you suggest. This is because I can push an element to the stack but then set the element that's two elements above that. With your proposed solution where the vector starts empty a push would be an emplace_back but then setting ST2, would mean a rotate right by one and emplacing back. It's certainly possible but I am not sure it's an improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "stack" is not really a stack
I think that's what confused me here. I see now that with the offset-parameter in top/setTop, we really always need 8 valid elements (no more, no less), which means you can't just push elements without also dropping the last element.
Bummer :(
I don't really have the expertise to review this beyond what I already commented on, so Alyssa's +1 is good enough. Good job on having continued pushing this forward despite how tricky it was to get working! |
This effort tries to implement a framework to optimize the x87 stack.
Currently draft, it's a work in progress and many, many tests are still failing. However, it feels like there's a direction so I am opening a PR.
I have been writing a sort of thoughts/design document about this which you can access through:
https://docs.google.com/document/d/1ZIWNuu6h6EuAkMTL70PxfMwx9cZxOZ_OAK849e8y4tM/edit?usp=sharing