-
Notifications
You must be signed in to change notification settings - Fork 7.8k
tracing JIT floating point register clobbering on Windows and ARM64 #18136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I get an assertion failure on 8.4.5 on Windows+JIT (when built in debug mode): |
Reduced to a standalone reproducer: <?php
namespace Foo;
function diff($point1, $point2)
{
$a = deg2rad($point1); // Prefixing these with \ also makes the issue go away
$b = deg2rad($point2);
return $a - $b;
}
function getRawDistance()
{
$distance = 0;
for ($p = 0; $p < 200; $p++) {
// Needs to be a namespaced call_user_func call to reproduce the issue (i.e. adding \ at front makes the issue go away)
$distance += call_user_func('Foo\diff', 0, $p);
}
return $distance;
}
var_dump(getRawDistance()); |
The above reproducer outputs |
Hi,
What kind of details would be usefull to understand this one? (the script is quite complex...) |
I found the root cause, and your last comment is likely related to the same root cause. The problem is due to (global) register allocation. https://github.com/dstogov/ir/blob/8d17022fb61ebfed9f6be81a8182ea31202697ed/ir_ra.c#L4099-L4100 The reason this doesn't happen on Linux is because on Linux all XMM registers are volatile, while on Windows only xmm0-xmm5 are volatile while xmm6-xmm15 are preserved. @nono303 As a workaround, try setting the R component of |
Many Thx @nielsdos for your time on it! ✅ I confirm that changing
⏩ If you need some quick (like with patch on php-src before making a PR) tests on Windows, don’t hesitate as I can easily compile binary and launch different tests |
Thanks for testing. Unfortunately, I will have difficulties making a patch for this as I'm not too familiar with the backend component of IR and it would take me some time learning that. I also think that if this means we diverge from the fixed stack frame then we get issues with the unwinding code that was added in #17095 |
The fair fix is simple. Could you check that it fixes your problem (I wasn't able to reproduce the failure with reproduction case. Only the assertion). diff --git a/ext/opcache/jit/ir/ir_ra.c b/ext/opcache/jit/ir/ir_ra.c
index 0c0e8dec3b4..fa8f639c0c3 100644
--- a/ext/opcache/jit/ir/ir_ra.c
+++ b/ext/opcache/jit/ir/ir_ra.c
@@ -4097,7 +4097,7 @@ static void assign_regs(ir_ctx *ctx)
if (IR_REGSET_DIFFERENCE(IR_REGSET_INTERSECTION(used_regs, IR_REGSET_PRESERVED),
ctx->used_preserved_regs)) {
// TODO: Preserved reg and fixed frame conflict ???
- // IR_ASSERT(0 && "Preserved reg and fixed frame conflict");
+ IR_ASSERT(0 && "Preserved reg and fixed frame conflict");
}
} else {
ctx->used_preserved_regs = IR_REGSET_UNION((ir_regset)ctx->fixed_save_regset,
diff --git a/ext/opcache/jit/zend_jit_ir.c b/ext/opcache/jit/zend_jit_ir.c
index 4d006c19d34..ff0abd8e1a0 100644
--- a/ext/opcache/jit/zend_jit_ir.c
+++ b/ext/opcache/jit/zend_jit_ir.c
@@ -33,12 +33,10 @@
# define ZREG_IP 15 /* IR_REG_R15 */
# define ZREG_FIRST_FPR 16
# if defined(_WIN64)
-# define IR_REGSET_PRESERVED ((1<<3) | (1<<5) | (1<<6) | (1<<7) | (1<<12) | (1<<13) | (1<<14) | (1<<15))
-/*
+//# define IR_REGSET_PRESERVED ((1<<3) | (1<<5) | (1<<6) | (1<<7) | (1<<12) | (1<<13) | (1<<14) | (1<<15))
# define IR_REGSET_PRESERVED ((1<<3) | (1<<5) | (1<<6) | (1<<7) | (1<<12) | (1<<13) | (1<<14) | (1<<15) | \
(1<<(16+6)) | (1<<(16+7)) | (1<<(16+8)) | (1<<(16+9)) | (1<<(16+10)) | \
(1<<(16+11)) | (1<<(16+12)) | (1<<(16+13)) | (1<<(16+14)) | (1<<(16+15)))
-*/
# define IR_SHADOW_ARGS 32
# else
# define IR_REGSET_PRESERVED ((1<<3) | (1<<5) | (1<<12) | (1<<13) | (1<<14) | (1<<15)) /* all preserved registers */
@@ -2696,7 +2694,8 @@ static void zend_jit_init_ctx(zend_jit_ctx *jit, uint32_t flags)
jit->ctx.flags |= IR_USE_FRAME_POINTER;
jit->ctx.fixed_stack_frame_size = sizeof(void*) * 16; /* 10 saved registers and 6 spill slots (8 bytes) */
#elif defined(_WIN64)
- jit->ctx.fixed_stack_frame_size = sizeof(void*) * 11; /* 8 saved registers and 3 spill slots (8 bytes) */
+// jit->ctx.fixed_stack_frame_size = sizeof(void*) * 11; /* 8 saved registers and 3 spill slots (8 bytes) */
+ jit->ctx.fixed_stack_frame_size = sizeof(void*) * 21; /* 18 saved registers and 3 spill slots (8 bytes) */
#elif defined(IR_TARGET_X86_64)
jit->ctx.fixed_stack_frame_size = sizeof(void*) * 9; /* 6 saved registers and 3 spill slots (8 bytes) */
#else /* IR_TARGET_x86 */ The drawback of the fix, that it significantly increases code of each prologue and epilogue. This is the reason it wasn't applied in the first place. After all, I think that it may make sense to avoid "fixed stack frame" for CALL VM JIT, but this would require inlining of most the JIT stubs. Probably, this is the right way, but it's not a simple fix. @nielsdos may be you have other ideas? |
@arnaud-lb please also think about this problem |
If I understand correctly, If that's the case, I think we could achieve a similar result by storing a bitset of saved registers on the top of the stack frame, so that stubs know the size of the frame and which registers they have to restore. Stub epilogues would be slower, but normal functions will usually have smaller epilogue/prologue. An alternative is to not emit an epilogue in stubs, and to call them:
NB: I don't understand the difference between CALL and HYBRID in this context. |
@dstogov That works and the patch looks right: now xmm6-xmm15 are preserved. I had tried something like this (but more hacked) but didn't pursue it because of the overhead. |
This won't work at least for linking between traces. (prologue in root trace, then few branches to side traces, then epilogue and return).
HYBRID handlers are called through C computed goto, inside the execute_ex() stack frame, and "return" through indirect jmp. They require "fixed stack frame" and may avoid saving/restoring preserved registers. |
Hi @dstogov
I'll check if i see some regression on other use-cases |
I though about this problem a bit more, and I think it may be reduced... The problems arises when JIT-ed code that uses XMM6-XMM15 calls some engine function, that calls interpreter, that calls the other JIT-ed code that modifies XMM6-XMM15. So, a wrapper for In long terms I would think about using the single HYBRID VM for all targets that supports JIT. |
Ok, I was missing that HYBRID JIT does not preserve registers at all. r14/r15 are shared by execute_ex and JIT'ed code so don't need to be preserved (but must be preserved when reentering execute_ex, so they are preserved by execute_ex). rbx/r12/r13 are considered cloberred when calling opcode handlers in execute_ex due to HYBRID_JIT_GUARD.
I think this can also happen in the following scenario: Internal function that uses XMM6-15 calls JIT'ed code that modifies XMM6-15. We may need to save XMM6-15 when entering/leaving execute_ex. |
You are right. It seems you have found an extremely simple and efficient solution! :) |
I'll be able to test a patch on x64 MSVC 14.2 (vs16) & 14.3 (vs17) |
It's an interesting idea. Unfortunately MSVC does not support inline assembly on x64, so the usual trick to set the registers in the clobber list does not work unfortunately; maybe there's another simple way to let MSVC do this. EDIT: I'll think about how to use an external assembler file to do this thing. |
I've worked on patching this, and I had two ideas (only for x64 for now so far):
|
Thx @nielsdos!
|
@nono303 Did you regenerate the VM? Try rerunning |
Nop... Better when regenerating the VM ;)
🔴
|
@nono303 Thanks a lot for testing. I finally figured out what was wrong in my other patch (notworking.patch). Turns out I needed to reserve shadow space. I'll try to make a PR soon and will ping you for testing it if you'd still like to do that. |
On win64, xmm6-xmm15 are preserved registers, but the prologues and epilogues of JITted code don't handle these. The issue occurs when calling into the JIT code again via an internal handler (like call_user_func). Therefore, we want to save/restore xmm registers upon entering/leaving execute_ex. Since MSVC x64 does not support inline assembly, we create an assembly wrapper around the real execute_ex function. The alternative is to always save/restore these xmm registers into the fixed call frame, but this causes unnecessary overhead.
…ndows and ARM64 On win64, xmm6-xmm15 are preserved registers, but the prologues and epilogues of JITted code don't handle these. The issue occurs when calling into the JIT code again via an internal handler (like call_user_func). Therefore, we want to save/restore xmm registers upon entering/leaving execute_ex. Since MSVC x64 does not support inline assembly, we create an assembly wrapper around the real execute_ex function. The alternative is to always save/restore these xmm registers into the fixed call frame, but this causes unnecessary overhead. The same issue occurs for ARM64 platforms for floating point register 8 to 15. However, there we can use inline asm to fix this.
* PHP-8.4: Fix GH-18136: tracing JIT floating point register clobbering on Windows and ARM64
Hi @nielsdos,
|
Thanks for checking! |
Hi,
I'm facing a strange bug on a "simple" distance calculation (addition in a loop) with PHP cli & opcache JIT enabled.
with
opcache.enable_cli = 1
andopcache.jit >= 1252
(optimization level between 2 and 5) at some time in the loop the addition (+=) give erratic results.I've extracted the distance calculation function for standalone & callback tests with same data without any issue.
So, might be related to namespace, object type or whatever...
Here are "ready to test" files to reproduce it (like on linux)
⏩ Results:
Adding a debug line to DistanceCalculator.php
results, failing @ iteration 128:
Available to perform further tests and provide details if needed
PHP Version
PHP 8.4.5 TS vs17 x64
Operating System
Windows 11 x64
The text was updated successfully, but these errors were encountered: