Segfault on random requests #89
Comments
Hi, Thanks for the report. We might need to have some more information to figure out what is going on. Is there any chance of you sharing a reproducible case in a docker image? If not, please could you give us some more information:
Thanks |
Hi Dan, thanks for your time. Creating a docker image will be hard, as this is a server with several different PHP applications running on it. The application we are trying to profile is a shopware instance. The list of extensions:
xdebug is not installed on the server. We can try to remove opcache tomorrow and see what happens. Again, thanks for your time. Best, |
Actually, another thought - even though you might not be able to upgrade to the latest version of PHP right now, please could you try reproducing the issue against PHP 7.3? A number of segfaults have been fixed since 7.1; one of them may be lurking in your version and only showing up when the xhprof extension is also running. |
If disabling opcache does make the segfault stop happening, you could then look at turning off the particular optimisation that is likely causing the issue. If you have the default setting, you probably have the opcache optimization level set as:
Looking at the code, turning off the final bit would disable an optimisation that has caused a similar issue: https://bugs.php.net/bug.php?id=73654
Or you could turn more of the optimisations off and try to find the particular setting that is causing the issue.
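The bisection idea above can be sketched with shell arithmetic. This is only an illustration: the starting mask `0x7FFFBFFF` and the helper name `clear_bit` are assumptions made for the example, so substitute the value from your own configuration.

```shell
#!/bin/sh
# Sketch: list opcache.optimization_level candidates with one bit cleared.
# LEVEL is an illustrative mask, not necessarily your server's setting.
LEVEL=0x7FFFBFFF

# clear_bit MASK BIT -> prints MASK with bit number BIT (0-based) switched off
clear_bit() {
    printf '0x%08X\n' $(( $1 & ~(1 << $2) ))
}

# Print one candidate per bit that is currently enabled, so that each can be
# tried in turn until the segfault stops happening.
bit=0
while [ "$bit" -lt 31 ]; do
    if [ $(( ($LEVEL >> $bit) & 1 )) -eq 1 ]; then
        echo "bit $bit off: opcache.optimization_level=$(clear_bit "$LEVEL" "$bit")"
    fi
    bit=$((bit + 1))
done
```

Trying one candidate at a time (and restarting PHP-FPM after each change) narrows the problem down to a single optimisation pass, if opcache is involved at all.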
|
Hi Dan, thanks for your time! We disabled opcache completely, unfortunately the segfaults continued. So it seems like opcache is not the (only) problem here. Any other thoughts? |
It would definitely be worth trying against PHP 7.3. Your application might not fully work against that version of PHP, but there were quite a few known issues in 7.1 similar to this one (which means there are likely still some unknown issues), so figuring out whether it's still happening in 7.3 would be useful information. My suspicion is that it would still be present in 7.3.
Other than that, someone will need to do some debugging with GDB to figure out which variable is being corrupted and causing the segfault. Please focus on the 5.0.1 version. I've put some notes below on how to do that.
Looking at the stack trace, it seems that the problem is occurring when loading a class, through a userland registered autoloader,
Which is weird, on multiple levels. First, this code is used everywhere in every PHP server, so any bugs in it are normally found very quickly. Second, most of the time the code isn't doing anything interesting, so there's very little scope for edge cases.
I think if you follow the notes below, you should be able to find either the exact cause of the bug or some decent clues as to what the problem is. Or possibly, if you look at the data in the frame for zend_fetch_class_by_name, you'll be able to see what class is being loaded, and look in your app to see if there's anything out of the ordinary about loading that code.
Instructions for GDB to investigate what is causing the problem.
The instructions below won't completely identify the cause of that corruption, but should provide more information that will be the basis of further investigation.
This step is optional, but it should make life easier as it means that PHP-FPM will only have 1 process running that is processing requests, which is easier than trying to pick out the right process.
Using something like top or For me, that gives the output of:
The 'master process' is the process that spawns the worker processes. The 'pool www' process is the one that we want the PID for, so in this case it is 16581.
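Picking the worker PID out of the process list can also be scripted. The following is a minimal sketch, assuming `ps aux`-style output where the PID is the second column; the helper name `fpm_worker_pid` and the sample lines (apart from the worker PID 16581 mentioned above) are made up for illustration.

```shell
#!/bin/sh
# Sketch: extract the PID of the "pool www" worker from ps-style output.
# Reads lines on stdin; the PID is assumed to be the second column.
# The [w] bracket stops awk from matching its own command line in ps output.
fpm_worker_pid() {
    awk '/php-fpm: pool [w]ww/ { print $2 }'
}

# Hypothetical sample in the shape of `ps aux` output (not real data).
sample='root     16580  0.0  0.5 ... php-fpm: master process
www-data 16581  0.0  0.4 ... php-fpm: pool www'

printf '%s\n' "$sample" | fpm_worker_pid
# On a live system: ps aux | fpm_worker_pid
```

If more than one worker is running, this prints one PID per line, which is another reason to reduce the pool to a single worker first.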
If you see an error about
When GDB attaches to a process, it stops execution of the program, so that you can do things like set up breakpoints. To make PHP run again, so that we can trigger the breakpoint, we need to tell it to continue. So run the command:
That should be picked up by GDB which will be back in interactive mode, and you'll be able to use GDB to look at the data.
Assuming that you can set all the above up and it doesn't magically make the bug disappear, you should be left with the debugger stopped in a place where we can get useful information. For all of the remaining steps that display values, please can you save them to a file, so that they can be shared with Tideways. First import some GDB scripts, which will allow nicer formatting of PHP values that are held in memory. This can be done by 'sourcing' the gdb scripts file inside GDB. The file for that can be downloaded from https://github.com/php/php-src/blob/PHP-7.1.30/.gdbinit
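Saving the output and loading the helper macros can be combined in a small GDB command file. This is a sketch only: all three paths are hypothetical, and it assumes the .gdbinit file from the link above has already been saved locally.

```shell
#!/bin/sh
# Sketch: write a GDB command file that logs the session to a file and
# loads PHP's helper macros. All paths below are hypothetical.
GDBINIT_PATH=/tmp/php-src.gdbinit     # where you saved php-src's .gdbinit
LOG_PATH=/tmp/gdb-session.txt         # output to review before sharing
COMMANDS_PATH=/tmp/tideways-debug.gdb # the command file written below

# Note: newer GDB releases spell the third command `set logging enabled on`.
write_gdb_commands() {
    cat > "$1" <<EOF
set pagination off
set logging file $LOG_PATH
set logging on
source $GDBINIT_PATH
EOF
}

write_gdb_commands "$COMMANDS_PATH"
# Then, inside the attached GDB session:
#   (gdb) source /tmp/tideways-debug.gdb
```

Everything printed after that point is also written to the log file, which makes it easier to check for private data before sharing.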
*** Heads up *** After this point you'll need to figure out what steps to do from looking at the output. I have a strong suspicion still that this is going to be a bug in PHP which is only showing up due to changes in the memory layout caused by loading the Tideways extension, rather than it being a bug in the extension itself.
So for the segfault in 5.0.1, the function that it is occurring in is That has a single parameter of You should be able to see exactly which line the problem is occurring on, and so may be able to figure out what part of that data structure is apparently invalid.
And then to print just the individual parts of the data structure would be for example:
function_name is a pointer to a zend_string, and this command would print the pointer as a pointer.
Would print the function name as a string. The command 'print_zstr' is defined in the .gdbinit file, so will only be available if that has been sourced into GDB.
The error is possibly originating in one of the previous function calls, and only showing as a problem in the final function. The command to show the list of frames is:
To switch to a particular frame, you can run the GDB command
You can also get some more information about the frames.
Prints a brief description of the currently selected stack frame.
This command prints a verbose description of the selected stack frame.
Next steps
We'll need to look at the output from that debugging to figure out the next step, as how to proceed will depend on what we see.
The data corruption is in the same memory location each time.
This is the best case scenario. Setting a breakpoint on the exact memory location that is being changed after starting GDB but before triggering the error with:
Should allow us to find what is causing the corruption by running
The data corruption is obviously a string being written to the wrong place.
There's a small chance the data that is being written to the wrong place will be a string that contains enough information by itself to work back to where it's being written from.
You can use the following commands to inspect the memory
Display the 32 bytes of memory at the address as a string
WARNING: this has the capability to include private data/API keys. You should check it's nothing private before sharing it with us.
Notes
(This probably also applies to other container systems, not just Docker.) By default, Docker does not give permissions to run certain operations inside a container. For Docker-compose this can be added in the docker-compose.yml file, by adding:
to the service entries that are running the process you wish to debug. |
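For containers started with plain `docker run` rather than Docker Compose, the same kind of permission can usually be granted on the command line. A minimal sketch, assuming the blocked operation is ptrace (which GDB needs in order to attach); the helper name and the image name are hypothetical.

```shell
#!/bin/sh
# Sketch: docker run flags that allow a debugger to attach to a process
# inside the container. Assumes ptrace is what is being blocked.
debug_flags() {
    echo "--cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
}

# "my-php-fpm-image" is a hypothetical image name; this only prints the
# command so that it can be reviewed before running it.
echo docker run $(debug_flags) my-php-fpm-image
```

These flags loosen the container's isolation, so they are better kept to a debugging environment than left on a production service.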
@liwo Can we help you out in any way to find the cause for this segfault? |
Hi Ben, sorry for the late reply. I was a bit caught up in other stuff the past couple days and my colleague managed to get enough profiling data to work with (as only every 3rd request died it was possible to profile, but not convenient ;-)), so I didn't have pressure. I'm planning to debug the issue with the info provided by Dan next week. |
Hi Dan, Ben, it took me some time and I had trouble reproducing the issue today on the same server. I only got error messages In the end, I managed to reliably reproduce a completely weird behavior in a small docker environment: https://gist.github.com/liwo/807106b9577087380152bd8bf8bd3012. Run it with I experience the following behavior:
Now replace
Shouldn't make a difference in behavior, it just checks which version (4 or 5) of tideways is installed and uses the appropriate function to enable xhprof. Well, no. Instead of segfaulting, this time the
I have no idea why this small code change prevents the segfault. I also tried to use 7.3 in the As it's getting late I'm not going to start gdb in the docker container and try my luck. Instead I hope this helps you to reproduce the segfault. Thanks again for your time! |
@liwo thank you for this, we were able to see the crash as well. This is extremely helpful. We are getting back to you as soon as we know what the culprit is. |
We have tried to use tideways xhprof on one of our servers, which for reasons outside our immediate control still runs on PHP 7.1.
The extension runs, but on every third request (as far as we could determine) we get a segfault. PHP is set up with FPM, and as far as I can see we had two children running. So every third request could well be the second request to a given FPM child. Apparently it doesn't matter which script we are calling: the same script can run fine and produce a profile and then segfault the next time.
We tried with version 4.1.7 (from ubuntu PPA of Ondrej Sury) and 5.0.1 downloaded from github release page. Both result in a segfault, although with different stack traces.
Stack trace from 4.1.7:
Stack trace from 5.0.1:
As the stack traces are quite different I am not entirely sure it is the same issue or not, but from the outside view it shows the exact same symptoms.
My knowledge of gdb is pretty much exhausted at this point, so if we can provide more useful details, please advise us. I will leave the relevant stack traces lying around for the time being.