Skip to content

Improve fast-ZPP performance #126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

nielsdos
Copy link
Owner

No description provided.

@nielsdos nielsdos force-pushed the zend-parse-arg-improve-1 branch 2 times, most recently from 51c64a4 to 9806445 Compare April 23, 2025 15:56
@nielsdos nielsdos changed the title wip Improve fast-ZPP performance Apr 23, 2025
@nielsdos nielsdos force-pushed the zend-parse-arg-improve-1 branch 3 times, most recently from 12ad48c to b3dec4a Compare April 26, 2025 17:31
…de bloat

The goal of this PR is to reduce some of the code bloat induced by fast-ZPP.
Reduced code bloat results in fewer cache misses (and better DSB coverage),
and fewer instructions executed.

If we take a look at a simple function:
```c
PHP_FUNCTION(twice) {
    zend_long foo;
    ZEND_PARSE_PARAMETERS_START(1, 1)
        Z_PARAM_LONG(foo)
    ZEND_PARSE_PARAMETERS_END();
    RETURN_LONG(foo * 2);
}
```

We obtain the following assembly on x86-64 in the non-cold blocks:
```s
<+0>:	push   %r12
<+2>:	push   %rbp
<+3>:	push   %rbx
<+4>:	sub    $0x10,%rsp
<+8>:	mov    %fs:0x28,%r12
<+17>:	mov    %r12,0x8(%rsp)
<+22>:	mov    0x2c(%rdi),%r12d
<+26>:	cmp    $0x1,%r12d
<+30>:	jne    0x22beaf <zif_twice-3212257>
<+36>:	cmpb   $0x4,0x58(%rdi)
<+40>:	mov    %rsi,%rbp
<+43>:	jne    0x53c2f0 <zif_twice+96>
<+45>:	mov    0x50(%rdi),%rax
<+49>:	add    %rax,%rax
<+52>:	movl   $0x4,0x8(%rbp)
<+59>:	mov    %rax,0x0(%rbp)
<+63>:	mov    0x8(%rsp),%rax
<+68>:	sub    %fs:0x28,%rax
<+77>:	jne    0x53c312 <zif_twice+130>
<+79>:	add    $0x10,%rsp
<+83>:	pop    %rbx
<+84>:	pop    %rbp
<+85>:	pop    %r12
<+87>:	ret
<+88>:	nopl   0x0(%rax,%rax,1)
<+96>:	lea    0x50(%rdi),%rbx
<+100>:	mov    %rsp,%rsi
<+103>:	mov    $0x1,%edx
<+108>:	mov    %rbx,%rdi
<+111>:	call   0x620240 <zend_parse_arg_long_slow>
<+116>:	test   %al,%al
<+118>:	je     0x22be96 <zif_twice.cold>
<+124>:	mov    (%rsp),%rax
<+128>:	jmp    0x53c2c1 <zif_twice+49>
<+130>:	call   0x201050 <__stack_chk_fail@plt>
```

Notice how we get the stack protector overhead in this function
and also have to reload the parsed value on the slow path.
This happens because the parsed value is returned via a pointer.
If instead we were to return struct with a value pair
(similar to optional in C++ / Option in Rust), then the values are returned
via registers. This means that we no longer have stack protector overhead
and we also don't need to reload a value, resulting in better register usage.
This is the resulting assembly for the sample function after this patch:
```s
<+0>:	push   %r12
<+2>:	push   %rbp
<+3>:	push   %rbx
<+4>:	mov    0x2c(%rdi),%r12d
<+8>:	cmp    $0x1,%r12d
<+12>:	jne    0x22d482 <zif_twice-3205454>
<+18>:	cmpb   $0x4,0x58(%rdi)
<+22>:	mov    %rsi,%rbp
<+25>:	jne    0x53be08 <zif_twice+56>
<+27>:	mov    0x50(%rdi),%rax
<+31>:	add    %rax,%rax
<+34>:	movl   $0x4,0x8(%rbp)
<+41>:	mov    %rax,0x0(%rbp)
<+45>:	pop    %rbx
<+46>:	pop    %rbp
<+47>:	pop    %r12
<+49>:	ret
<+50>:	nopw   0x0(%rax,%rax,1)
<+56>:	lea    0x50(%rdi),%rbx
<+60>:	mov    $0x1,%esi
<+65>:	mov    %rbx,%rdi
<+68>:	call   0x61e7b0 <zend_parse_arg_long_slow>
<+73>:	test   %dl,%dl
<+75>:	je     0x22d46a <zif_twice.cold>
<+81>:	jmp    0x53bdef <zif_twice+31>
```

The following uses the default benchmark programs we use in CI.
Each program is ran on php-cgi with the appropriate `-T` argument, then repeated 15 times.
It shows a small performance improvement on Symfony both with and without JIT, and a small improvement
on WordPress with JIT.
For WordPress, the difference is small as my CPU is bottlenecked on some other stuff as well.

| Test                            | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50)       | 0.5324   | 0.0050     | 0.5272   | 0.0042     |
| Symfony, tracing JIT (-T10,50)  | 0.5301   | 0.0029     | 0.5264   | 0.0036     |
| WordPress, no JIT (-T5,25)      | 0.7408   | 0.0049     | 0.7404   | 0.0048     |
| WordPress, tracing JIT (-T5,25) | 0.6814   | 0.0052     | 0.6770   | 0.0055     |

I was not able to measure any meaningful difference for our micro benchmarks
`Zend/bench.php` and `Zend/micro_bench.php`.

The Valgrind instruction counts also show a decrease:
-0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).
@nielsdos
Copy link
Owner Author

Now open on php-src

@nielsdos nielsdos closed this Apr 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant