Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure predicate cache is reset when control flow leaves block #4274

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pmatos
Copy link
Collaborator

@pmatos pmatos commented Jan 13, 2025

Whenever the control float leaves the block, it might clobber the
predicate register so we reset the cache whenever that happens.

The difficulty here is that the cache is valid only during IR generation
so we need to make sure we catch all the cases during this pass where
the execution might leave the block.

Fixes #4264

@pmatos
Copy link
Collaborator Author

pmatos commented Jan 13, 2025

To be fair, this is not my favourite fix because it risks missing places where the cache should be reset but isn't and it will be a pain to debug cases that fail if you don't keep this issue in the back of your mind.

But the predicate cache is only valid during IR generation so we cannot use any of the backend stuff like Spill functions.

@pmatos pmatos force-pushed the EnsurePredCacheReset branch 3 times, most recently from 38e7ebb to 6045a0d Compare January 13, 2025 16:13
Copy link
Member

@Sonicadvance1 Sonicadvance1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting way to fix it, but I'm fine with it.

@@ -164,6 +168,9 @@ void OpDispatchBuilder::FADD(OpcodeArgs, IR::OpSize Width, bool Integer, OpDispa
if (Op->Src[0].IsNone()) { // Implicit argument case
auto Offset = Op->OP & 7;
auto St0 = 0;
if (!ReducedPrecisionMode) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand why this check is here but I don't quite get it, won't this code only run for full precision anyway? I would have thought reduced would take

void OpDispatchBuilder::FADDF64(OpcodeArgs, IR::OpSize Width, bool Integer, OpDispatchBuilder::OpResult ResInST0) {

Also just thinking through perhaps it would be cleaner to do this in FlushRegisterCache (!SRAOnly case) and to save the predicate when spilling/filling regs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - that's a great point. My intention was to merge (FADD and FADDF64) those and I didn't but thought I did.

I did attempt to do this through FlushRegisterCache but this one is called at _Break(). This seems to be called by INTOp and we don't want to reset the cache in these cases - when I attempted this, it was basically resetting the cache too often, which is why the commit calls ResetInitPredicateCache() alongside FlushRegisterCache mostly everywhere but not on Break().

I am happy to hear suggestions on how to improve this if you have any ideas. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you mention saving the predicate reg, but if we do managed to reset the predicate register in FlushRegisterCache then there was no need to save the predicate in spilling/filling regs. The reason for not spilling/filling the predicate register is because afaiu doing this for predicate registers takes a few instructions and would most likely void the optimization that we are attempting to do, so it's better to simply reset the cache and then regenerate the predicate with a single ptrue.

Copy link
Collaborator

@bylaws bylaws Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you mention saving the predicate reg, but if we do managed to reset the predicate register in FlushRegisterCache then there was no need to save the predicate in spilling/filling regs.

Hehe, yeah this is true :)

The reason for not spilling/filling the predicate register is because afaiu doing this for predicate registers takes a few instructions and would most likely void the optimization that we are attempting to do, so it's better to simply reset the cache and then regenerate the predicate with a single ptrue.

Right yeah, I missed that we'd have to save all the predicate registers since there's no way to know which have changed - though looking through only one predicate type is ever cached, so the cache could be replaced with statically allocating a single extra predicate register. Then that is always spilled, even if a few such registers are needed in the future I think the cost would be negligible.

I did have some ideas of having the predicate register set to spill be a required argument for Spill/FillSRA and to pass that in as an argument to IROps that end up calling them, but this is definitely beyond what is necessary here

I did attempt to do this through FlushRegisterCache but this one is called at _Break(). This seems to be called by INTOp and we don't want to reset the cache in these cases - when I attempted this, it was basically resetting the cache too often, which is why the commit calls ResetInitPredicateCache() alongside FlushRegisterCache mostly everywhere but not on Break().

I'm quite surprised break/INT is being called that often in x87 code, what ends up causing this? If that can be figured then clearing the predicate cache for all !JITDispatch ops in the IR json (I think this is what you are trying to achieve by manually annotating x87 ops? Correct me if I'm wrong) could work, there's also cases like cpuid, ludiv etc that I think would need to clear the cache but don't currently. I personally prefer the former solution I mentioned since it's nice to have the guarantee that spilling and filling registers in a handler won't cause issues but will leave the choice up to you

Copy link
Collaborator Author

@pmatos pmatos Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was looking at what's causing INTOp to be called and I am a little bit mistified myself about this. Take the test:

%ifdef CONFIG
{
  "RegData": {
    "MM7":  ["0x8000000000000000", "0x4000"]
  }
}
%endif

lea rdx, [rel data]
fld tword [rdx + 8 * 0]

lea rdx, [rel data2]
lea rax, [rdx + 8 * 0]
fstp tword [rax]
fld tword [rdx + 8 * 0]

hlt

align 8
data:
  dt 2.0
  dq 0
data2:
  dt 0.0
  dq 0

This calls fld twice (with fstp in-between), you should only need to reset the predicate cache at the beginning of the block. However, here if you call the reset function from FlushRegisterCache, you end up resetting it a couple of times due to INTOp calling it. I have not yet figured out why it's being called but it's related to threading. It comes from Core.cpp:668:

          std::invoke(Fn, Thread->OpDispatcher, DecodedInfo);

The backtrace from ResetInitPredicateCache shows:

Breakpoint 1, FEXCore::IR::IREmitter::ResetInitPredicateCache (this=0x7ff6ef7000)
    at /home/steamvr/dev/FEX/FEXCore/Source/Interface/IR/IREmitter.h:77
77	    InitPredicateCache.clear();
(gdb) bt
#0  FEXCore::IR::IREmitter::ResetInitPredicateCache (this=0x7ff6ef7000)
    at /home/steamvr/dev/FEX/FEXCore/Source/Interface/IR/IREmitter.h:77
#1  0x0000005555723324 in FEXCore::IR::OpDispatchBuilder::FlushRegisterCache (this=0x7ff6ef7000, SRAOnly=false)
    at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/OpcodeDispatcher.h:1179
#2  0x0000005555794a34 in FEXCore::IR::OpDispatchBuilder::INTOp (this=0x7ff6ef7000, Op=0x7fe2e00300)
    at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/OpcodeDispatcher.cpp:4693
#3  0x000000555573eb6c in std::__invoke_impl<void, void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
    __f=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>, __t=..., 
    __args=@0x7fffffdc20: 0x7fe2e00300) at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/bits/invoke.h:74
#4  0x000000555573eacc in std::__invoke<void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
    __fn=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>, 
    __args=@0x7fffffdc20: 0x7fe2e00300, __args=@0x7fffffdc20: 0x7fe2e00300)
    at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/bits/invoke.h:96
#5  0x0000005555723eb8 in std::invoke<void (FEXCore::IR::OpDispatchBuilder::*&)(FEXCore::X86Tables::DecodedInst const*), FEXCore::Core::NonMovableUniquePtr<FEXCore::IR::OpDispatchBuilder>&, FEXCore::X86Tables::DecodedInst const*&> (
    __fn=@0x7fffffdb50: (void (FEXCore::IR::OpDispatchBuilder::*)(FEXCore::IR::OpDispatchBuilder * const, const FEXCore::X86Tables::DecodedInst *)) 0x5555794860 <FEXCore::IR::OpDispatchBuilder::INTOp(FEXCore::X86Tables::DecodedInst const*)>, 
    __args=@0x7fffffdc20: 0x7fe2e00300, __args=@0x7fffffdc20: 0x7fe2e00300)
    at /usr/lib/gcc/aarch64-linux-gnu/14/../../../../include/c++/14/functional:120
#6  0x000000555572247c in FEXCore::Context::ContextImpl::GenerateIR (this=0x7ff6e57000, Thread=0x7ff6ee3000, GuestRIP=65536, 
    ExtendedDebugInfo=false, MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:668
#7  0x00000055557248f4 in FEXCore::Context::ContextImpl::CompileCode (this=0x7ff6e57000, Thread=0x7ff6ee3000, GuestRIP=65536, 
    MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:801
#8  0x0000005555724e84 in FEXCore::Context::ContextImpl::CompileBlock (this=0x7ff6e57000, Frame=0x7ff6ee3090, GuestRIP=65536, 
    MaxInst=0) at /home/steamvr/dev/FEX/FEXCore/Source/Interface/Core/Core.cpp:846

Do you understand what's happening? I will keep trying to understand this in any case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly marked VPCMPESTRX for now.

Copy link
Collaborator

@bylaws bylaws Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ufff, but we probably shouldn't do it for all !JITDispatch instructions right? Some of them could potentially just lower to other instructions that don't need to reset the cache

Yeah right this is just an issue in general, I think in some sense if an instruction is always lowered into something else it should be marked differently to one that uses a fallback. (No need to fix this though, see below)

Whole this work is to enable a relatively small optimization (see #4166 (comment)) and I am wondering if at some point we might have to say that it's better to go without it.

Yeah - this is why I suggested just doing it statically, it's so awkward with all the edge cases when statically allocating the pred reg would be a 10 or so line change and avoid having to deal with any of this and solve the original issue

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - this is why I suggested just doing it statically, it's so awkward with all the edge cases when statically allocating the pred reg would be a 10 or so line change and avoid having to deal with any of this and solve the original issue

Yep! OK, lets take a step back here and let me refactor this for a statically allocated predicate register solution.

Thanks for your patience and help on this one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference do see loadpf, you basically want to copy that exactly (Inc ra changes) but taking in predicate init params

Copy link
Collaborator Author

@pmatos pmatos Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will do that.
I will have to travel today but I will work on this through the weekend.

@pmatos
Copy link
Collaborator Author

pmatos commented Jan 14, 2025

Marking this as draft to avoid merging by accident- need to think about what @bylaws said.

@pmatos pmatos marked this pull request as draft January 14, 2025 15:36
@pmatos pmatos force-pushed the EnsurePredCacheReset branch 2 times, most recently from 6d46c75 to 5e0886e Compare January 15, 2025 16:39
@pmatos pmatos marked this pull request as ready for review January 15, 2025 16:41
@pmatos pmatos force-pushed the EnsurePredCacheReset branch 4 times, most recently from fa9ff97 to c8594ba Compare January 17, 2025 09:00
Whenever the control float leaves the block, it might clobber the
predicate register so we reset the cache whenever that happens.

The difficulty here is that the cache is valid only during IR generation
so we need to make sure we catch all the cases during this pass where
the execution might leave the block.

Fixes FEX-Emu#4264
@pmatos pmatos force-pushed the EnsurePredCacheReset branch from c8594ba to 9aea249 Compare January 17, 2025 10:35
@pmatos pmatos marked this pull request as draft January 17, 2025 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cached predicate register being clobbered by softfp lib call
3 participants