Improve internal function argument parsing performance by reducing code bloat
The goal of this PR is to reduce some of the code bloat induced by fast-ZPP.
Reduced code bloat results in fewer instruction cache misses (and better DSB,
i.e. µop cache, coverage), and fewer instructions executed.
If we take a look at a simple function:
```c
PHP_FUNCTION(twice) {
zend_long foo;
ZEND_PARSE_PARAMETERS_START(1, 1)
Z_PARAM_LONG(foo)
ZEND_PARSE_PARAMETERS_END();
RETURN_LONG(foo * 2);
}
```
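For context, a `Z_PARAM_LONG` check conceptually expands to an inlined type check with an out-of-line fallback, roughly like this (a simplified sketch; the real expansion in `Zend/zend_API.h` carries more bookkeeping around argument counting and error reporting):
```c
zval *arg = ZEND_CALL_ARG(execute_data, 1);
zend_long foo;
if (EXPECTED(Z_TYPE_P(arg) == IS_LONG)) {
    /* Fast path, fully inlined: the argument is already a long. */
    foo = Z_LVAL_P(arg);
} else if (UNEXPECTED(!zend_parse_arg_long_slow(arg, &foo, 1))) {
    /* Slow path failed; error reporting lives in the cold block. */
    RETURN_THROWS();
}
```
The out-of-line `zend_parse_arg_long_slow` call is the interesting part: `&foo` is an out-pointer, so `foo` must live in a stack slot.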
We obtain the following assembly on x86-64 in the non-cold blocks:
```s
<+0>: push %r12
<+2>: push %rbp
<+3>: push %rbx
<+4>: sub $0x10,%rsp
<+8>: mov %fs:0x28,%r12   # load stack canary
<+17>: mov %r12,0x8(%rsp)  # spill canary to the stack frame
<+22>: mov 0x2c(%rdi),%r12d
<+26>: cmp $0x1,%r12d
<+30>: jne 0x22beaf <zif_twice-3212257>
<+36>: cmpb $0x4,0x58(%rdi)
<+40>: mov %rsi,%rbp
<+43>: jne 0x53c2f0 <zif_twice+96>
<+45>: mov 0x50(%rdi),%rax
<+49>: add %rax,%rax
<+52>: movl $0x4,0x8(%rbp)
<+59>: mov %rax,0x0(%rbp)
<+63>: mov 0x8(%rsp),%rax  # stack canary check
<+68>: sub %fs:0x28,%rax
<+77>: jne 0x53c312 <zif_twice+130>  # canary mismatch -> __stack_chk_fail
<+79>: add $0x10,%rsp
<+83>: pop %rbx
<+84>: pop %rbp
<+85>: pop %r12
<+87>: ret
<+88>: nopl 0x0(%rax,%rax,1)
<+96>: lea 0x50(%rdi),%rbx
<+100>: mov %rsp,%rsi  # out-pointer: parsed value lands on the stack
<+103>: mov $0x1,%edx
<+108>: mov %rbx,%rdi
<+111>: call 0x620240 <zend_parse_arg_long_slow>
<+116>: test %al,%al
<+118>: je 0x22be96 <zif_twice.cold>
<+124>: mov (%rsp),%rax  # reload parsed value from the stack
<+128>: jmp 0x53c2c1 <zif_twice+49>
<+130>: call 0x201050 <__stack_chk_fail@plt>
```
Notice how we get stack protector overhead in this function
and also have to reload the parsed value on the slow path.
This happens because the parsed value is returned via a pointer,
which forces it through a stack slot.
If we instead return a struct pairing the value with a success flag
(similar to `std::optional` in C++ or `Option` in Rust), both are returned
in registers. This means we no longer have the stack protector overhead,
and we also don't need to reload the value, resulting in better register usage.
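Concretely, the change boils down to a signature change of the following shape (a minimal sketch; the struct and function names below are illustrative, not necessarily the identifiers used in the patch):
```c
#include <stdbool.h>

/* Before: the value is written through an out-pointer, forcing the
 * caller to materialize a stack slot (and, with -fstack-protector,
 * a canary guarding it): */
bool zend_parse_arg_long_slow(const zval *arg, zend_long *dest, uint32_t arg_num);

/* After: value and success flag travel together in a small struct.
 * On the x86-64 System V ABI, a 16-byte struct of two scalar members
 * is returned in %rax/%rdx, which is why the assembly below tests %dl
 * and never touches the stack. (Hypothetical names.) */
typedef struct {
    zend_long value;
    bool ok;
} zend_long_result;

zend_long_result parse_arg_long_slow(const zval *arg, uint32_t arg_num);
```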
This is the resulting assembly for the sample function after this patch:
```s
<+0>: push %r12
<+2>: push %rbp
<+3>: push %rbx
<+4>: mov 0x2c(%rdi),%r12d
<+8>: cmp $0x1,%r12d
<+12>: jne 0x22d482 <zif_twice-3205454>
<+18>: cmpb $0x4,0x58(%rdi)
<+22>: mov %rsi,%rbp
<+25>: jne 0x53be08 <zif_twice+56>
<+27>: mov 0x50(%rdi),%rax
<+31>: add %rax,%rax
<+34>: movl $0x4,0x8(%rbp)
<+41>: mov %rax,0x0(%rbp)
<+45>: pop %rbx
<+46>: pop %rbp
<+47>: pop %r12
<+49>: ret
<+50>: nopw 0x0(%rax,%rax,1)
<+56>: lea 0x50(%rdi),%rbx
<+60>: mov $0x1,%esi
<+65>: mov %rbx,%rdi
<+68>: call 0x61e7b0 <zend_parse_arg_long_slow>
<+73>: test %dl,%dl  # success flag comes back in a register
<+75>: je 0x22d46a <zif_twice.cold>
<+81>: jmp 0x53bdef <zif_twice+31>
```
The following uses the default benchmark programs we use in CI.
Each program is run on php-cgi with the appropriate `-T` argument, repeated 15 times.
It shows a small performance improvement on Symfony both with and without JIT,
and on WordPress with JIT.
For WordPress, the difference is small, as my CPU is bottlenecked elsewhere as well.
| Test | Old Mean (s) | Old Stddev | New Mean (s) | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50) | 0.5324 | 0.0050 | 0.5272 | 0.0042 |
| Symfony, tracing JIT (-T10,50) | 0.5301 | 0.0029 | 0.5264 | 0.0036 |
| WordPress, no JIT (-T5,25) | 0.7408 | 0.0049 | 0.7404 | 0.0048 |
| WordPress, tracing JIT (-T5,25) | 0.6814 | 0.0052 | 0.6770 | 0.0055 |
I was not able to measure any meaningful difference for our micro benchmarks
`Zend/bench.php` and `Zend/micro_bench.php`.
The Valgrind instruction counts also show a decrease:
-0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).