From 16586ca56bdad4b42f4889b1942effb04bdbfd98 Mon Sep 17 00:00:00 2001
From: CF Bolz-Tereick
Date: Sun, 24 Mar 2024 12:23:51 +0100
Subject: [PATCH 01/18] blog post draft

---
 posts/2024/03/fixing-bug-incremental-gc.md | 478 +++++++++++++++++++++
 1 file changed, 478 insertions(+)
 create mode 100644 posts/2024/03/fixing-bug-incremental-gc.md

diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md
new file mode 100644
index 000000000..4944ae496
--- /dev/null
+++ b/posts/2024/03/fixing-bug-incremental-gc.md
@@ -0,0 +1,478 @@

# Introduction

Since last summer, I've been looking on and off into a weird and hard to
reproduce [crash bug in PyPy](https://github.com/pypy/pypy/issues/3959). It was
manifesting only on CI, and it seemed to always happen in the AST rewriting
phase of [pytest](https://pytest.org), the symptoms being that PyPy would crash
with a segfault. All my attempts to reproduce it locally failed, and my
attempts to try to understand the problem by dumping the involved ASTs led
nowhere.

A few weeks ago, we got [two more](https://github.com/PyO3/pyo3/issues/3766)
[bug reports](https://github.com/orgs/pypy/discussions/4923), the last one by
the authors of the [nanobind](https://nanobind.readthedocs.io/) binding
generator, with the same symptoms: crash in AST rewriting, only on CI. I
decided to make a more serious push to try to find the bug this time.
Ultimately the problem turned out to be several bugs in PyPy's garbage
collector (GC) that had been there since its inception in
[2013](https://www.pypy.org/posts/2013/10/incremental-garbage-collector-in-pypy-8956893523842234676.html)
(started by [Andrew Chambers](https://acha.ninja/)). Understanding the
situation turned out to be quite involved, additionally complicated by this
being the first time that I was working on this particular aspect of PyPy's GC.
Since the bug was so much work to find, I thought I'd write a blog post about
it.
The blog post consists of three parts: first a chronological description of
what I did to find the bug, then a technical explanation of what goes wrong,
and some reflections on the bug (and then a bonus bug I also found in the
process).

# Finding the Bug

I was starting from the failing [nanobind
CI](https://github.com/wjakob/nanobind/actions/runs/8234561874/job/22516568891)
runs that ended with a segfault of the PyPy interpreter. This was only an
intermittent problem; not every run was failing. When I tried to just run the
test suite locally, I couldn't get it to fail. Therefore at first I tried to
learn more about what was happening by looking at the CI runners.

## Running on CI

I forked the nanobind repo and hacked the CI script in order to get it to use a
PyPy build with full debug information and more assertions turned on. In order
to increase the probability of seeing the crash I added an otherwise unused
[matrix](https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs)
variable to the CI script that just contained 32 parameters. This means every
build is done 32 times (sorry Github for wasting your CPUs 😕). With that
amount of repetition, one of the builds was quite reliably crashing.

Then I added the `-Xfaulthandler` option to the PyPy command, which uses the
[faulthandler](https://docs.python.org/3.11/library/faulthandler.html) module
to try to print a Python stacktrace if the VM segfaults. This confirmed that
PyPy was indeed crashing in the
[AST](https://docs.python.org/3/library/ast.html)
[rewriting
phase](https://github.com/pytest-dev/pytest/blob/main/src/_pytest/assertion/rewrite.py)
of pytest, which pytest uses for [nicer
assertions](https://docs.pytest.org/en/7.1.x/how-to/assert.html#asserting-with-the-assert-statement).
I experimented with hacking our faulthandler implementation to also give me a
C-level callstack, but that didn't work as well as I hoped.
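(As a small illustration of what faulthandler gives you, the module can also be asked to dump the current Python stack programmatically; this snippet isn't part of the debugging session described here, it just shows the kind of report faulthandler produces:)

```python
import faulthandler
import tempfile

# faulthandler writes directly to a file descriptor (it has to stay
# async-signal-safe), so it needs a real file, not e.g. an io.StringIO
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    report = f.read()

# the report lists one "File ..., line ..." entry per Python stack frame
print(report)
```

On a segfault, the same kind of stack dump is printed for the state the interpreter crashed in.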
Then I tried to run [gdb](https://sourceware.org/gdb/) on CI to try to get it
to print a C callstack at the crash point. You can get gdb to execute commands
as if typed at the prompt with the `-ex` commandline option; I used something
like this:

```
gdb -ex "set confirm off" -ex "set pagination off" -ex "set debuginfod enabled off" -ex run -ex where -ex quit --args
```

But unfortunately the crash never occurred when running in gdb.

Afterwards I tried the next best thing, which was configuring the CI runner to
[dump a core file and upload it as a build
artifact](https://github.com/itamarst/gha-upload-cores), which worked. Looking
at the cores locally only sort of worked, because I am running a different
version of Ubuntu than the CI runners. So I used
[tmate](https://mxschmitt.github.io/action-tmate/) to be able to log into the
CI runner after a crash and interactively used gdb there. Unfortunately what I
learned from that was that the bug was some kind of **memory corruption**,
which is always incredibly unpleasant to debug. Basically the header word of a
Python object had been corrupted somehow at the point of the crash, which means
that its [vtable](https://en.wikipedia.org/wiki/Virtual_method_table) wasn't
usable any more.

(Sidenote: [PyPy doesn't really use a vtable
pointer](https://www.pypy.org/posts/2009/10/gc-improvements-6174120095428192954.html#unifying-the-vtable-ptr-with-the-gc-header),
instead it uses half a word in the header for the vtable, and the other half
for flags that the GC needs to keep track of the state of the object.
Corrupting all this is still bad.)

## Reproducing Locally

At that point it was clear that I had to push to reproduce the problem on my
laptop, to allow me to work on the problem more directly and not to always have
to go via the CI runner.
Memory corruption bugs often have a lot of randomness
(depending on which part of memory gets modified, things might crash or more
likely just happily keep running). Therefore I decided to try to brute-force
reproducing the crash by simply running the tests many many times. Since the
crash happened in the AST rewriting phase of pytest, and that happens only if
no [pyc
files](https://stackoverflow.com/questions/2998215/if-python-is-interpreted-what-are-pyc-files)
of the bytecode-compiled rewritten ASTs exist, I made sure to delete them
before every test run. To repeat the test runs I used
[multitime](https://tratt.net/laurie/src/multitime/), which is a simple program
that runs a command repeatedly. It's meant for lightweight benchmarking
purposes, but it also halts the execution of the command if that command exits
with an error (and it sleeps a small random time between runs, which might help
with randomizing the situation, maybe).

Thankfully, running the tests repeatedly eventually led to a crash, solving my
"only happens on CI" problem. I then tried various variants to exclude possible
sources of errors. The first source of errors to exclude in PyPy bugs is the
just-in-time compiler, so I reran the tests with `--jit off` to see whether I
could still get it to crash, and thankfully I could (JIT bugs are often very
annoying).

The next source of bugs to exclude were C-extensions. Since those were the tests
of nanobind, a framework for creating C-extension modules, I was a bit worried
that the bug might be in our emulation of CPython's C-API. But running PyPy
with the `-v` option (which will print all the imports as they happen)
confirmed that at the point of crash no C-extension had been imported yet.

## Using `rr`

I still couldn't get the bug to happen in GDB, so the tool I tried next was
[rr](https://rr-project.org/). rr can record the execution of a program and
later replay it arbitrarily often.
This gives you a time-traveling debugger
that allows you to execute the program backwards in addition to forwards.
Eventually I managed to get the crash to happen when running the tests with `rr
record --chaos`.

Using rr well is quite hard, and I'm not very good at it. The main approach I
use with rr to debug memory corruption is to replay the crash, then set a
[watchpoint](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Set-Watchpoints.html)
for the corrupted memory location, then use the command `reverse-continue` to
find the place in the code that mutated the memory location.

Doing this for my bug revealed that the object that was being corrupted was
erroneously collected by the garbage collector. For some reason the GC had
wrongly decided that the object was no longer reachable and therefore put the
object into a freelist by writing a pointer to the next entry in the freelist
into the first word of the object, overwriting the object's header. The next
time the object was used, things crashed.

## Side-quest: wrong GC assertions

At this point in the process, I got massively side-tracked. PyPy's GC has a
number of debug modes that you can optionally turn on. Those slow down the
program execution a lot, but they should in theory help to understand why the
GC goes wrong. When I turned them on, I was getting a failing assertion really
early in the test execution, complaining about an invariant violation in the GC
logic. At first this made me very happy. I thought that this would help me fix
the bug more quickly.

Extremely frustratingly, after two days of work I concluded that the assertion
logic itself was wrong. I have fixed that in the meantime too; the gory details
of that are in the bonus section at the end of the post.
## Using GDB scripting to find the real bug

After that disaster I went back to the earlier rr recording without assertions
and tried to understand in more detail why the GC decided to free an object
that was still being referenced. To be able to do that I used the [GDB Python
scripting
API](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Python-API.html) to
write some helper commands to understand the state of the GC heap (rr is an
extension of GDB, so the GDB scripting API works in rr too).

The first (small) helper command I wrote with the GDB scripting API was a way
to pretty-print the currently active GC flags of a random PyPy object, starting
just from the pointer. The more complex command I wrote was an object tracer,
which follows pointers to GC objects starting from a root object to explore the
object graph. The object tracer isn't complete; it doesn't deal with all the
complexities of PyPy's GC. But it was good enough to help me with my problem: I
found out that the corrupted object was stored in an array.

The next step was to understand why the array content wasn't being correctly
traced by the GC, which I eventually managed with some [conditional
breakpoints](https://www.fayewilliams.com/2011/07/13/gdb-conditional-breakpoints/),
more watchpoints, and using `reverse-continue`. It turned out to be a bug that
occurs when the content of one array is memcopied into another array. The
technical details of why the array wasn't traced correctly are described in
detail in the next section.

## Writing a unit test

To try to make sure I really understood the bug correctly I then wrote a GC
unit test that shows the problem. Like most of PyPy, our GC is written in
RPython, a (somewhat strange) subset/dialect of Python2, which can be compiled
to C code.
However, since it is also valid Python2 code, it can be [unit-tested
on top of a Python2
implementation](https://www.pypy.org/posts/2022/04/how-is-pypy-tested.html)
(which is one of the reasons why we keep maintaining PyPy2).

In the GC unit tests you have a lot of control over what order things happen
in, e.g. how objects are allocated, when garbage collection phases happen, etc.
After some trying I managed to write a test that crashes with the same kind of
memory corruption that my original crash exhibited: an object that is still
reachable via an array is collected by the GC. To give you a flavor of what
this kind of test looks like, here's an (edited for clarity) version of the
test I eventually managed to write:

```python
def test_incrementality_bug_arraycopy(self):
    source = self.malloc(VAR, 8) # first array
    # the stackroots list emulates the C stack
    self.stackroots.append(source)
    target = self.malloc(VAR, 8) # second array
    self.stackroots.append(target)
    node = self.malloc(S) # unrelated object, will be collected
    node.x = 5
    self.writearray(source, 0, node) # store reference into source array, calling the write barrier
    val = self.gc.collect_step()
    source = self.stackroots[0] # reload arrays, they might have moved
    target = self.stackroots[1]
    # this GC step traces target
    val = self.gc.collect_step()

    # emulate what a memcopy of arrays does
    addr_src = llmemory.cast_ptr_to_adr(source)
    addr_dst = llmemory.cast_ptr_to_adr(target)
    res = self.gc.writebarrier_before_copy(addr_src, addr_dst, 0, 0, 2)
    assert res
    target[0] = source[0] # copy two elements of the arrays
    target[1] = source[1]
    self.writearray(source, 0, lltype.nullptr(S))
    # this GC step traces source
    self.gc.collect_step()
    # some more collection steps, crucially target isn't traced again
    # but node is deleted
    for i in range(3):
        self.gc.collect_step()
    # used to crash, node got collected
    assert target[0].x == 5

```

One of the fun
properties of testing our GC that way is that all the memory is
emulated. The crash in the last line of the test isn't a segfault at all;
instead you get a nice exception saying that you tried to access a freed chunk
of memory and you can then debug this with a python2 debugger.

## Fixing the Bug

With the unit test in hand, fixing the bug was relatively straightforward (the
diff in its simplest form is anyway only a single line change). After my fix, I
[talked to Armin
Rigo](https://github.com/pypy/pypy/issues/4925#issuecomment-2014459454) who
helped me find a different case that was still wrong, in the same area of the
code.

TODO: check details of this paragraph. I also got help from the developers at ...
who are using PyPy on their servers and had seen some mysterious PyPy crashes
in recent months. They did test deployments of my fixes in their various stages
to their servers to try to see whether stability improved for them. The fact
that their crash rate went down a lot provided great extra confirmation for me
that I was doing something right.

## Writing a GC fuzzer/property based test

Finding bugs in the GC is always extremely disconcerting, particularly since
this one managed to hide for so long (more than ten years!). Therefore I wanted
to use these bugs as motivation to try to find more problems in PyPy's GC. Given
the ridiculous effectiveness of fuzzing, I used
[hypothesis](https://hypothesis.readthedocs.io/en/latest/) to write a
property-based test.
Every test performs a sequence of randomly chosen steps
from the following list:

- allocate an object
- read a random field from a random object
- write a random reference into a random object
- drop a random stack reference
- perform one GC step
- allocate an array
- read a random index from a random array
- write to an array
- memcopy between two arrays

This approach of doing a sequence of steps is pretty close to the [stateful
testing](https://hypothesis.readthedocs.io/en/latest/stateful.html) approach of
hypothesis, but I just implemented it manually with the [data
strategy](https://hypothesis.readthedocs.io/en/latest/data.html#drawing-interactively-in-tests).

Every one of those steps is always performed on both the tested GC and on some
regular Python objects. The Python objects provide the "ground truth" of what
the heap should look like, so that we can compare the state of the GC objects
with the state of the Python objects to find out whether the GC made a mistake.

In order to check whether the test is actually useful, I reverted my bug fixes
and made sure that the test re-finds both the spurious assertion error and the
problems with memcopying an array.

After running for a few hours, the test actually found an unrelated crash in
the GC! So far I have not yet spent time on debugging that; it's what I'll work
on next. I also plan on adding a bunch of other GC features as steps in the
test to stress them too (for example weakrefs, pinning, maybe finalization).


# The Bug

In order to understand the technical details of the bug, I need to give some
background explanations about PyPy's GC.

## PyPy's incremental GC

PyPy uses an incremental generational mark-sweep GC.
It's +[generational](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Generational_GC_(ephemeral_GC)) +and therefore has minor collections (where only young objects get collected) +and major collections (collecting long-lived objects eventually, using a +mark-sweep algorithm). Young objects are allocated in a nursery using a +bump-pointer allocator, which makes allocation quite efficient. They are moved +out of the nursery by minor collections. In order to find references from old +to young objects the GC uses a write barrier to detect writes into old objects. + +The GC is also +[incremental](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Stop-the-world_vs._incremental_vs._concurrent), +which means that its major collections aren't done all at once (which would +lead to long pauses). Instead, major collections are sliced up into small +steps, which are done directly after a minor collection (the GC isn't +*concurrent* though, which would mean that the GC does work in a separate +thread). + +The incremental GC uses [tri-color +marking](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Tri-color_marking) +to reason about the reachable part of the heap during the marking phase, where +old objects can be: + +- black: already marked, reachable, definitely survives the collection +- grey: will survive, but still needs to be marked +- white: potentially dead The color of every object is encoded by setting flags + in the object header. + +The GC maintains the **invariant** that black objects must never point to white +objects. At the start of a major collection cycle the stack roots are turned +gray. During a major collection cycle, the GC will trace gray objects, until +none are left. To trace a gray object, all the objects it references have to be +marked grey if they are white so far. After a grey object is traced, it can be +marked black (because all the referenced objects are now either black or gray). 
Eventually, there are no gray objects left. At that point (because no white
object can be reached from a black one) all the white objects are known to be
unreachable and can therefore be freed.

The GC is incremental because every collection step will only trace a limited
amount of gray object, before giving control back to the program. This leads to
a problem: if an already traced (black) object is changed between to marking
steps of the GC, the program can mutate that object and write a new reference
into one of its field. This could lead to an invariant violation, if the
referenced object is white. Therefore, the GC uses the write barrier (which it
needs anyway to find references from old to young objects) to mark all black
objects that are written into gray, and then trace them again at one of the
later collection steps.

## The special write barrier of memcopy

Arrays use a different kind of write barrier than normal objects. Since they
can be arbitrarily large, tracing them can take a long time. Therefore it's
potentially wasteful to trace them fully at a minor collection. To fix this,
the array write barrier keeps more granular information about which parts of
the array have been written to since the last collection step. Then only the
modified parts of the array need to be traced, not the whole array.

In addition, there is another optimization for arrays, which is that memcopy is
treated specially by the GC. If memcopy is implemented by simply writing a loop
that copies the content of one array to the other, that will invoke the write
barrier every single loop iteration for the writes of every array element,
costing a lot of overhead. Therefore the GC has a special memcopy-specific
write barrier that will perform the GC logic once before the memcopy loop, and
then use a regular (typically SIMD-optimized) memcopy implementation from
`libc`.

## The bug

The bugs turned out to be precisely in this memcopy write barrier.
When we +implemented the current GC, we adapted our previous GC, which was a +generational mark-sweep GC but *not* incremental. We started with most of the +previous GC's code, including the write barriers. The regular write barriers +were adapted to the new incremental assumptions, in particular the need for the +write barrier to also turn black objects back to gray when they are modified +during a marking phase. This was simply not done at all for the memcopy write +barrier, at least in two of the code paths. Fixing this problem fixes the unit +tests and stops the crashes. + +# Reflections + +The way the bug was introduced is really typical. A piece of code (the memcopy +write barrier) was written under a set of assumptions. Then those assumptions +changed later. Not all the code pieces that relied on these assumptions to be +correct were updated. It's pretty hard to prevent this in all situations. + +I still think we could have done more to prevent the bug occurring. Writing a +property-based test for the GC would have been a good idea given the complexity +of the GC, and definitely something we did in other parts of our code at the +time (just using the `random` module mostly, we started using hypothesis +later). + +It's a bit of a mystery to me why this bug managed to be undetected for so +long. Memcopy happens in a lot of pretty core operations of e.g. lists in +Python (`list.extend`, to name just one example). To speculate, I would suspect +that all the other preconditions for the bug occurring made it pretty rare: +- the contend of an old list that is not yet marked needs to be copied into + another old list that is marked already +- the source of the copy needs to also store an object that has no other + references +- the source of the copy then needs to be overwritten with other data +- then the next collection steps need to be happening at the right points +- ... 
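The failure mode behind those preconditions can be made concrete with a deliberately simplified toy model of tri-color marking. The few dozen lines of Python below have nothing to do with PyPy's actual implementation; they just show that if copying a reference into an already-black object skips the step that turns the black object gray again, a live object gets swept:

```python
# Toy tri-color marking model (illustration only, far simpler than PyPy's GC)
WHITE, GRAY, BLACK = "white", "gray", "black"

class ToyGC:
    def __init__(self):
        self.objects = []        # every old object is a dict: color + references
        self.gray_stack = []     # gray objects that still need tracing

    def new(self):
        obj = {"color": WHITE, "refs": []}
        self.objects.append(obj)
        return obj

    def shade(self, obj):
        # turn a white object gray and remember to trace it
        if obj["color"] == WHITE:
            obj["color"] = GRAY
            self.gray_stack.append(obj)

    def trace_one(self):
        # trace a single gray object, then blacken it
        obj = self.gray_stack.pop()
        for child in obj["refs"]:
            self.shade(child)
        obj["color"] = BLACK

    def finish_and_sweep(self):
        while self.gray_stack:
            self.trace_one()
        # everything still white is considered unreachable
        return [obj for obj in self.objects if obj["color"] == WHITE]

gc = ToyGC()
source, target, node = gc.new(), gc.new(), gc.new()
source["refs"] = [node]               # node is only reachable via source
gc.shade(source)                      # both "arrays" are roots
gc.shade(target)
gc.trace_one()                        # incremental step: target is blackened first
assert target["color"] == BLACK

# buggy "memcopy": move the reference into the black target without
# turning target gray again (the missing write barrier logic)
target["refs"] = list(source["refs"])
source["refs"] = []

collected = gc.finish_and_sweep()
assert node in collected              # node is live via target, yet it was swept!
```

A correct memcopy write barrier would turn `target` gray again before the copy, so it gets traced once more and `node` survives.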
Given the complexity of the GC logic I also wonder whether some lightweight
formal methods would have been a good idea. Formalizing some of the core
invariants in B or TLA+ and then model checking them up to some number of
objects would have found this problem pretty soon. There are also correctness
proofs for GC algorithms in some research papers, but I don't have a good
overview of the literature to point to any that are particularly good or bad.
Going down such a more formal route might have fixed this and probably a whole
bunch of other bugs, but of course it's a pretty expensive (and tedious)
approach.

While it was super annoying to track this down, it was definitely good to learn
a bit more about how to use rr and the GDB scripting interface.

# Bonus Section: The Wrong Assertion

Some more technical information about the wrong assertion follows.

## Background: pre-built objects

PyPy's VM-building bootstrapping process can "freeze" a bunch of heap objects
into the final binary. This allows the VM to start up quickly, because those
frozen objects are loaded by the OS as part of the binary.

Those frozen pre-built objects are part of the 'roots' of the garbage
collector and need to be traced. However, tracing all the pre-built objects at
every collection would be very expensive, because there are a lot of them
(about 150,000 in a PyPy 3.10 binary). Tracing them all is also not necessary,
because most of them are never modified. Therefore they only have references to
other pre-built objects, which can never be deallocated anyway. Because of this,
we have an optimization that uses the write barrier (which we need anyway to
find old-to-young pointers) to notice when a pre-built object gets modified for
the very first time. If that happens, it gets added to the set of pre-built
objects that gets counted as a root, and is therefore traced as a root at
collections from then on.
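A toy sketch of that optimization (again illustrative Python only, not PyPy's real code) shows the mechanism: the write barrier registers a pre-built object as an extra root the first time it is written to, and only once.

```python
# Toy model of the pre-built-object write barrier (illustration only)
class Obj:
    def __init__(self, prebuilt=False):
        self.prebuilt = prebuilt
        self.modified = False    # has the write barrier seen this object yet?
        self.fields = {}

class ToyHeap:
    def __init__(self):
        self.extra_roots = []    # pre-built objects mutated at least once

    def write(self, obj, field, value):
        # write barrier: the first write to a pre-built object makes it a root
        if obj.prebuilt and not obj.modified:
            obj.modified = True
            self.extra_roots.append(obj)
        obj.fields[field] = value

heap = ToyHeap()
frozen = Obj(prebuilt=True)
assert heap.extra_roots == []    # untouched pre-built objects aren't roots
heap.write(frozen, "x", 1)
heap.write(frozen, "x", 2)       # further writes don't re-register it
assert heap.extra_roots == [frozen]
```

In the real GC this of course happens on raw object headers and flag bits, not Python attributes, but the first-write registration logic is the relevant part for the wrong assertion below.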
+ +## The wrong assertion + +The assertion that triggered when I turned on the GC debug mode was saying that +the GC found a reference from a black to a white object, violating its +invariant. pre-built objects count as always black, as long as they have never +been mutated and aren't roots, because they can only ever reference other +pre-built objects. However, after they have been mutated for the first time, +they become part of the root set and will be marked gray at the beginning of a +new collection cycle (which works fine). + +The wrong assertion triggers if a pre-built object is mutated for the very +first time in the middle of an incremental marking phase. While the pre-built +object gets added to the root set just fine, and it will get traced before the +marking phase ends, this is encoded slightly differently for pre-built objects, +compared to "regular" old objects. Therefore, the invariant checking code +wrongly reported a black->white pointer in this situation. + +To fix it I also wrote a unit test checking the problem, made sure that the GC +fuzzer also found the bug, and then fixed the wrong assertion to take the color +encoding of pre-built objects into account. + +The bug managed to be invisible because we don't tend to turn on the GC +assertions very often only when we find a GC bug, which is of course also when +we need it the most to be correct. 
From f966a0afd712370c1963854426f3fe79fc075e0c Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 24 Mar 2024 17:09:47 +0100 Subject: [PATCH 02/18] Apply suggestions from Matti's review Co-authored-by: Matti Picus --- posts/2024/03/fixing-bug-incremental-gc.md | 26 +++++++++++++--------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index 4944ae496..dde482295 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -28,7 +28,7 @@ decided to make a more serious push to try to find the bug this time. Ultimately the problem turned out to be several bugs in PyPy's garbage collector (GC) that had been there since its inception in [2013](https://www.pypy.org/posts/2013/10/incremental-garbage-collector-in-pypy-8956893523842234676.html) -(started by [Andrew Chambers](https://acha.ninja/)). Understanding the +Understanding the situation turned out to be quite involved, additionally complicated by this being the first time that I was working on this particular aspect of PyPy's GC. Since the bug was so much work to find, I thought I'd write a blog post about @@ -40,7 +40,7 @@ reflections on the bug (and then a bonus bug I also found in the process). # Finding the Bug -I was starting from the failing [nanobind +I started from the failing [nanobind CI](https://github.com/wjakob/nanobind/actions/runs/8234561874/job/22516568891) runs that ended with a segfault of the PyPy interpreter. This was only an intermittent problem, not every run was failing. When I tried to just run the @@ -50,7 +50,7 @@ learn more about what was happening by looking on the CI runners. ## Running on CI I forked the nanobind repo and hacked the CI script in order to get it to use a -PyPy build with full debug information and more assertions turned on. 
In order +PyPy build with [full debug information and more assertions turned on](https://doc.pypy.org/en/latest/build.html#making-a-debug-build-of-pypy). In order to increase the probability of seeing the crash I added an otherwise unused [matrix](https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs) variable to the CI script that just contained 32 parameters. This means every @@ -74,7 +74,9 @@ as if typed at the prompt with the `-ex` commandline option, I used something like this: ``` -gdb -ex "set confirm off" -ex "set pagination off" -ex "set debuginfod enabled off" -ex run -ex where -ex quit --args +gdb -ex "set confirm off" -ex "set pagination off" -ex \ + "set debuginfod enabled off" -ex run -ex where -ex quit \ + --args ``` But unfortunately the crash never occurred when running in gdb. @@ -340,7 +342,9 @@ old objects can be: - black: already marked, reachable, definitely survives the collection - grey: will survive, but still needs to be marked -- white: potentially dead The color of every object is encoded by setting flags +- white: potentially dead + +The color of every object is encoded by setting flags in the object header. The GC maintains the **invariant** that black objects must never point to white @@ -354,10 +358,10 @@ object can be reached from a black one) all the white objects are known to be unreachable and can therefore be freed. The GC is incremental because every collection step will only trace a limited -amount of gray object, before giving control back to the program. This leads to -a problem: if an already traced (black) object is changed between to marking +number of gray objects, before giving control back to the program. This leads to +a problem: if an already traced (black) object is changed between two marking steps of the GC, the program can mutate that object and write a new reference -into one of its field. This could lead to an invariant violation, if the +into one of its fields. 
This could lead to an invariant violation, if the referenced object is white. Therefore, the GC uses the write barrier (which it needs anyway to find references from old to young objects) to mark all black objects that are written into gray, and then trace them again at one of the @@ -369,7 +373,7 @@ Arrays use a different kind of write barrier than normal objects. Since they can be arbitrarily large, tracing them can take a long time. Therefore it's potentially wasteful to trace them fully at a minor collection. To fix this. the array write barrier keeps more granular information about which parts of -the array have been written to since the last collection step. Then only the +the array have been modified since the last collection step. Then only the modified parts of the array need to be traced, not the whole array. In addition, there is another optimization for arrays, which is that memcopy is @@ -410,7 +414,7 @@ It's a bit of a mystery to me why this bug managed to be undetected for so long. Memcopy happens in a lot of pretty core operations of e.g. lists in Python (`list.extend`, to name just one example). To speculate, I would suspect that all the other preconditions for the bug occurring made it pretty rare: -- the contend of an old list that is not yet marked needs to be copied into +- the content of an old list that is not yet marked needs to be copied into another old list that is marked already - the source of the copy needs to also store an object that has no other references @@ -421,7 +425,7 @@ that all the other preconditions for the bug occurring made it pretty rare: Given the complexity of the GC logic I also wonder whether some lightweight formal methods would have been a good idea. Formalizing some of the core invariants in B or TLA+ and then model checking them up to some number of -objects would have found this problem pretty soon. There are also correctness +objects would have found this problem pretty quickly. 
There are also correctness proofs for GC algorithms in some research papers, but I don't have a good overview of the literature to point to any that are particularly good or bad. Going such a more formal route might have fixed this and probably a whole bunch From 7f91f9ab5600855b96e573936ad23fa03da60f90 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 24 Mar 2024 17:15:36 +0100 Subject: [PATCH 03/18] some fixes --- posts/2024/03/fixing-bug-incremental-gc.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index dde482295..cc0986f67 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -364,7 +364,7 @@ steps of the GC, the program can mutate that object and write a new reference into one of its fields. This could lead to an invariant violation, if the referenced object is white. Therefore, the GC uses the write barrier (which it needs anyway to find references from old to young objects) to mark all black -objects that are written into gray, and then trace them again at one of the +objects that are modified gray, and then trace them again at one of the later collection steps. ## The special write barrier of memcopy @@ -460,15 +460,14 @@ from then on. The assertion that triggered when I turned on the GC debug mode was saying that the GC found a reference from a black to a white object, violating its -invariant. pre-built objects count as always black, as long as they have never -been mutated and aren't roots, because they can only ever reference other -pre-built objects. However, after they have been mutated for the first time, -they become part of the root set and will be marked gray at the beginning of a -new collection cycle (which works fine). +invariant. 
Unmodified pre-built objects count as black, and they aren't roots, +because they can only ever reference other pre-built objects. However, when a +pre-built object gets modified for the first time, it becomes part of the root +set and will be marked gray. This logic works fine. The wrong assertion triggers if a pre-built object is mutated for the very first time in the middle of an incremental marking phase. While the pre-built -object gets added to the root set just fine, and it will get traced before the +object gets added to the root set just fine, and will get traced before the marking phase ends, this is encoded slightly differently for pre-built objects, compared to "regular" old objects. Therefore, the invariant checking code wrongly reported a black->white pointer in this situation. @@ -478,5 +477,5 @@ fuzzer also found the bug, and then fixed the wrong assertion to take the color encoding of pre-built objects into account. The bug managed to be invisible because we don't tend to turn on the GC -assertions very often only when we find a GC bug, which is of course also when -we need it the most to be correct. +assertions very often. We only do that when we find a GC bug, which is of +course also when we need it the most to be correct. From 27bce36fe0153ee5d94b4d9d8a410809f178dece Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 24 Mar 2024 19:58:58 +0100 Subject: [PATCH 04/18] some more edits --- posts/2024/03/fixing-bug-incremental-gc.md | 95 ++++++++++++---------- 1 file changed, 51 insertions(+), 44 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index cc0986f67..6214dc356 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -2,10 +2,10 @@ .. title: Fixing a Bug in PyPy's Incremental GC .. slug: fixing-bug-incremental-gc .. date: 2024-03-28 12:55:09 UTC -.. tags: -.. category: -.. link: -.. description: +.. tags: +.. 
category: +.. link: +.. description: .. type: text .. author: Carl Friedrich Bolz-Tereick --> @@ -14,7 +14,7 @@ Since last summer, I've been looking on and off into a weird and hard to reproduce [crash bug in PyPy](https://github.com/pypy/pypy/issues/3959). It was -manifesting only on CI, and it seemed to always happening in the AST rewriting +manifesting only on CI, and it seemed to always happen in the AST rewriting phase of [pytest](https://pytest.org), the symptoms being that PyPy would crash with a segfault. All my attempts to reproduce it locally failed, and my attempts to try to understand the problem by dumping the involved ASTs lead @@ -27,7 +27,7 @@ generator, with the same symptoms: crash in AST rewriting, only on CI. I decided to make a more serious push to try to find the bug this time. Ultimately the problem turned out to be several bugs in PyPy's garbage collector (GC) that had been there since its inception in -[2013](https://www.pypy.org/posts/2013/10/incremental-garbage-collector-in-pypy-8956893523842234676.html) +[2013](https://www.pypy.org/posts/2013/10/incremental-garbage-collector-in-pypy-8956893523842234676.html). Understanding the situation turned out to be quite involved, additionally complicated by this being the first time that I was working on this particular aspect of PyPy's GC. @@ -40,9 +40,9 @@ reflections on the bug (and then a bonus bug I also found in the process). # Finding the Bug -I started from the failing [nanobind -CI](https://github.com/wjakob/nanobind/actions/runs/8234561874/job/22516568891) -runs that ended with a segfault of the PyPy interpreter. This was only an +I started from the failing [nanobind CI +runs](https://github.com/wjakob/nanobind/actions/runs/8234561874/job/22516568891) +that ended with a segfault of the PyPy interpreter. This was only an intermittent problem, not every run was failing. When I tried to just run the test suite locally, I couldn't get it to fail. 
Therefore at first I tried to learn more about what was happening by looking on the CI runners. @@ -55,7 +55,7 @@ to increase the probability of seeing the crash I added an otherwise unused [matrix](https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs) variable to the CI script that just contained 32 parameters. This means every build is done 32 times (sorry Github for wasting your CPUs 😕). With that -amount of repetition, one of the builds was quite reliably crashing. +amount of repetition, I got at least one job of every build that was crashing. Then I added the `-Xfaulthandler` option to the PyPy command which will use the [faulthandler](https://docs.python.org/3.11/library/faulthandler.html) module @@ -85,7 +85,7 @@ Afterwards I tried the next best thing, which was configuring the CI runner to [dump a core file and upload it as a build artifact](https://github.com/itamarst/gha-upload-cores), which worked. Looking at the cores locally only sort of worked, because I am running a different -version of Ubuntu than the the CI runners. So I used +version of Ubuntu than the CI runners. So I used [tmate](https://mxschmitt.github.io/action-tmate/) to be able to log into the CI runner after a crash and interactively used gdb there. Unfortunately what I learned from that was that the bug was some kind of **memory corruption**, @@ -100,7 +100,7 @@ instead it uses half a word in the header for the vtable, and the other half for flags that the GC needs to keep track of the state of the object. Corrupting all this is still bad.) -## Reproducing Locally +## Reproducing Locally At that point it was clear that I had to push to reproduce the problem on my laptop, to allow me to work on the problem more directly and not to always have @@ -123,8 +123,8 @@ Thankfully, running the tests repeatedly eventually lead to a crash, solving my "only happens on CI" problem. I then tried various variants to exclude possible sources of errors. 
The first source of errors to exclude in PyPy bugs is the
 just-in-time compiler, so I reran the tests with `--jit off` to see whether I
-could still get it to crash, and thankfully I could (JIT bugs are often very
-annoying).
+could still get it to crash, and thankfully I eventually could (JIT bugs are
+often very annoying).
 
 Next source of bugs to exclude were C-extensions. Since those were the tests
 of nanobind, a framework for creating C-extension modules I was a bit worried
@@ -145,7 +145,8 @@ Using rr well is quite hard, and I'm not very good at it. The main approach I
 use with rr to debug memory corruption is to replay the crash, then set a
 [watchpoint](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Set-Watchpoints.html)
 for the corrupted memory location, then use the command `reverse-continue` to
-find the place in the code that mutated the memory location.
+find the place in the code that mutated the memory location. Here's a little
+demo of this:
 
 
@@ -168,12 +169,12 @@ logic. At first this made me very happy. I thought that this would help me fix
 the bug more quickly.
 
 Extremely frustratingly, after two days of work I concluded that the assertion
-logic itself was wrong. I have fixed that in the meantime too, the gory details
+logic itself was wrong. I have fixed that in the meantime too, the details
 of that are in the bonus section at the end of the post.
 
 ## Using GDB scripting to find the real bug
 
-After that disaster I went back to the earlier rr recording without assertions
+After that disaster I went back to the earlier rr recording without GSC assertions
 and tried to understand in more detail why the GC decided to free an object
 that was still being referenced. 
To be able to do that I used the [GDB Python scripting @@ -224,20 +225,20 @@ def test_incrementality_bug_arraycopy(self): self.stackroots.append(target) node = self.malloc(S) # unrelated object, will be collected node.x = 5 - self.writearray(source, 0, node) # store reference into source array, calling the write barrier - val = self.gc.collect_step() # - source = self.stackroots[0] # reload arrays, they might have move + # store reference into source array, calling the write barrier + self.writearray(source, 0, node) + val = self.gc.collect_step() + source = self.stackroots[0] # reload arrays, they might have moved target = self.stackroots[1] # this GC step traces target val = self.gc.collect_step() # emulate what a memcopy of arrays does - addr_src = llmemory.cast_ptr_to_adr(source) - addr_dst = llmemory.cast_ptr_to_adr(target) - res = self.gc.writebarrier_before_copy(addr_src, addr_dst, 0, 0, 2) + res = self.gc.writebarrier_before_copy(source, target, 0, 0, 2) assert res target[0] = source[0] # copy two elements of the arrays target[1] = source[1] + # now overwrite the reference to node in source self.writearray(source, 0, lltype.nullptr(S)) # this GC step traces source self.gc.collect_step() @@ -250,7 +251,7 @@ def test_incrementality_bug_arraycopy(self): ``` -One of the fun properties of testing our GC that way is that all the memory is +One of the good properties of testing our GC that way is that all the memory is emulated. The crash in the last line of the test isn't a segfault at all, instead you get a nice exception saying that you tried to access a freed chunk of memory and you can then debug this with a python2 debugger. @@ -275,7 +276,7 @@ that I was doing something right. Finding bugs in the GC is always extremely disconcerting, particularly since this one manged to hide for so long (more than ten years!). Therefore I wanted -to use these bugs as motivation to try to find more problems in PyP's GC. 
Given +to use these bugs as motivation to try to find more problems in PyPy's GC. Given the ridiculous effectiveness of fuzzing, I used [hypothesis](https://hypothesis.readthedocs.io/en/latest/) to write a property-based test. Every test performs a sequence of randomly chosen steps @@ -298,20 +299,21 @@ strategy](https://hypothesis.readthedocs.io/en/latest/data.html#drawing-interact Every one of those steps is always performed on both the tested GC, and on some regular Python objects. The Python objects provide the "ground truth" of what -the heap should look like, so the we can compare the state of the GC objects +the heap should look like, so we can compare the state of the GC objects with the state of the Python objects to find out whether the GC made a mistake. In order to check whether the test is actually useful, I reverted my bug fixes -and made sure that the test re-finds both the spurious assertion error and the +and made sure that the test re-finds both the spurious GC assertion error and the problems with memcopying an array. After running for a few hours, the test actually found an unrelated crash in the GC! So far I have not yet spent time on debugging that, it's what I'll work on next. I also plan on adding a bunch of other GC features as steps in the -test to stress them too (for example weakrefs, pinning, maybe finalization). +test to stress them too (for example weakrefs, identity hashes, pinning, maybe +finalization). -# The Bug +# The technical details of the bug In order to understand the technical details of the bug, I need to give some background explanations about PyPy's GC. @@ -322,7 +324,8 @@ PyPy uses an incremental generational mark-sweep GC. It's [generational](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Generational_GC_(ephemeral_GC)) and therefore has minor collections (where only young objects get collected) and major collections (collecting long-lived objects eventually, using a -mark-sweep algorithm). 
Young objects are allocated in a nursery using a +[mark-and-sweep](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Na%C3%AFve_mark-and-sweep) +algorithm). Young objects are allocated in a nursery using a bump-pointer allocator, which makes allocation quite efficient. They are moved out of the nursery by minor collections. In order to find references from old to young objects the GC uses a write barrier to detect writes into old objects. @@ -338,18 +341,19 @@ thread). The incremental GC uses [tri-color marking](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Tri-color_marking) to reason about the reachable part of the heap during the marking phase, where -old objects can be: +every old object can be: - black: already marked, reachable, definitely survives the collection - grey: will survive, but still needs to be marked -- white: potentially dead +- white: potentially dead The color of every object is encoded by setting flags - in the object header. +in the object header. The GC maintains the **invariant** that black objects must never point to white objects. At the start of a major collection cycle the stack roots are turned -gray. During a major collection cycle, the GC will trace gray objects, until +gray. During the mark phase of a major collection cycle, the GC will trace gray +objects, until none are left. To trace a gray object, all the objects it references have to be marked grey if they are white so far. After a grey object is traced, it can be marked black (because all the referenced objects are now either black or gray). @@ -371,7 +375,7 @@ later collection steps. Arrays use a different kind of write barrier than normal objects. Since they can be arbitrarily large, tracing them can take a long time. Therefore it's -potentially wasteful to trace them fully at a minor collection. To fix this. +potentially wasteful to trace them fully at a minor collection. 
To fix this, the array write barrier keeps more granular information about which parts of the array have been modified since the last collection step. Then only the modified parts of the array need to be traced, not the whole array. @@ -379,13 +383,13 @@ modified parts of the array need to be traced, not the whole array. In addition, there is another optimization for arrays, which is that memcopy is treated specially by the GC. If memcopy is implemented by simply writing a loop that copies the content of one array to the other, that will invoke the write -barrier every single loop iteration for the writes of every array element, +barrier every single loop iteration for the write of every array element, costing a lot of overhead. Therefore the GC has a special memcopy-specific write barrier that will perform the GC logic once before the memcopy loop, and then use a regular (typically SIMD-optimized) memcopy implementation from `libc`. -## The bug +## The bug The bugs turned out to be precisely in this memcopy write barrier. When we implemented the current GC, we adapted our previous GC, which was a @@ -415,16 +419,19 @@ long. Memcopy happens in a lot of pretty core operations of e.g. lists in Python (`list.extend`, to name just one example). To speculate, I would suspect that all the other preconditions for the bug occurring made it pretty rare: - the content of an old list that is not yet marked needs to be copied into - another old list that is marked already + another old list that is marked already - the source of the copy needs to also store an object that has no other - references -- the source of the copy then needs to be overwritten with other data + references +- the source of the copy then needs to be overwritten with other data - then the next collection steps need to be happening at the right points - ... Given the complexity of the GC logic I also wonder whether some lightweight formal methods would have been a good idea. 
Formalizing some of the core -invariants in B or TLA+ and then model checking them up to some number of +invariants in [B](https://en.wikipedia.org/wiki/B-Method) or +[TLA+](https://en.wikipedia.org/wiki/TLA%2B) and then [model +checking](https://en.wikipedia.org/wiki/Model_checking) them up to some number +of objects would have found this problem pretty quickly. There are also correctness proofs for GC algorithms in some research papers, but I don't have a good overview of the literature to point to any that are particularly good or bad. @@ -436,7 +443,7 @@ a bit more about how to use rr and the GDB scripting interface. # Bonus Section: The Wrong Assertion -Some more technical information about the wrong assertion is here. +Some more technical information about the wrong assertion is in this section. ## Background: pre-built objects @@ -473,8 +480,8 @@ compared to "regular" old objects. Therefore, the invariant checking code wrongly reported a black->white pointer in this situation. To fix it I also wrote a unit test checking the problem, made sure that the GC -fuzzer also found the bug, and then fixed the wrong assertion to take the color -encoding of pre-built objects into account. +hypothesis test also found the bug, and then fixed the wrong assertion to take +the color encoding of pre-built objects into account. The bug managed to be invisible because we don't tend to turn on the GC assertions very often. 
We only do that when we find a GC bug, which is of From 7401d545dde552de422ed7412ebf0b589b74f134 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 24 Mar 2024 20:18:17 +0100 Subject: [PATCH 05/18] add a multitime demo --- posts/2024/03/fixing-bug-incremental-gc.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index 6214dc356..bc0504be1 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -114,10 +114,12 @@ files](https://stackoverflow.com/questions/2998215/if-python-is-interpreted-what of the bytecode-compiled rewritten ASTs exist, I made sure to delete them before every test run. To repeat the test runs I used [multitime](https://tratt.net/laurie/src/multitime/), which is a simple program -that runs a command repeatedly. It's meant for leightweight benchmarking +that runs a command repeatedly. It's meant for lightweight benchmarking purposes, but it also halts the execution of the command if that command exits with an error (and it sleeps a small random time between runs, which might help -with randomizing the situation, maybe). +with randomizing the situation, maybe). Here's a demo: + + Thankfully, running the tests repeatedly eventually lead to a crash, solving my "only happens on CI" problem. 
I then tried various variants to exclude possible From 638703fe7f0ce321c1a599c982f3786a741fd7b3 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 24 Mar 2024 22:28:23 +0100 Subject: [PATCH 06/18] show some gdb api code, link to all of it, another recording --- posts/2024/03/fixing-bug-incremental-gc.md | 35 ++++++++++++++++++++++ 1 file changed, 35 insertions(+) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index bc0504be1..47a54898b 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -192,6 +192,41 @@ object graph. The object tracer isn't complete, it doesn't deal with all the complexities of PyPy's GC. But it was good enough to help me with my problem, I found out that the corrupted object was stored in an array. +As an example, here's a function that uses the GDB API to walk one of the +helper data structures of the GC, a stack of pointers: + +```python +def walk_addr_stack(obj): + """ walk an instance of the AddressStack class (which is a linked list of + arrays of 1019 pointers). + + the first of the arrays is only partially filled with used_in_last_chunk + items, all the other chunks are full.""" + if obj.type.code == gdb.TYPE_CODE_PTR: + obj = obj.dereference() + used_in_last_chunk = lookup(obj, "used_in_last_chunk") + chunk = lookup(obj, "inst_chunk").dereference() + while 1: + items = lookup(chunk, "items") + for i in range(used_in_last_chunk): + yield items[i] + chunk = lookup(chunk, "next") + if not chunk: + break + chunk = chunk.dereference() + used_in_last_chunk = 1019 +``` + +The full file of supporting code I wrote can be found in [this +gist](https://gist.github.com/cfbolz/13cadcbbef321d93fc9790dff6f60a6a). This is +pretty rough throw-away code, however. + +In the following recording I show a staged debugging session with some of the +extra commands I wrote with the Python API. 
The details aren't important, I +just wanted to give a bit of a flavor of what inspecting objects looks like: + + + The next step was to understand why the array content wasn't being correctly traced by the GC, which I eventually managed with some [conditional breakpoints](https://www.fayewilliams.com/2011/07/13/gdb-conditional-breakpoints/), From ffa5716d5a943a5dd09d53a8352510765c6df10d Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 24 Mar 2024 22:32:23 +0100 Subject: [PATCH 07/18] fix asciinema embeds --- posts/2024/03/fixing-bug-incremental-gc.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index 47a54898b..59e45f7e5 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -119,7 +119,7 @@ purposes, but it also halts the execution of the command if that command exits with an error (and it sleeps a small random time between runs, which might help with randomizing the situation, maybe). Here's a demo: - + Thankfully, running the tests repeatedly eventually lead to a crash, solving my "only happens on CI" problem. I then tried various variants to exclude possible @@ -150,8 +150,7 @@ for the corrupted memory location, then use the command `reverse-continue` to find the place in the code that mutated the memory location. Here's a little demo of this: - + Doing this for my bug revealed that the object that was being corrupted was erroneously collected by the garbage collector. For some reason the GC had @@ -225,7 +224,7 @@ In the following recording I show a staged debugging session with some of the extra commands I wrote with the Python API. 
The details aren't important, I just wanted to give a bit of a flavor of what inspecting objects looks like: - + The next step was to understand why the array content wasn't being correctly traced by the GC, which I eventually managed with some [conditional From 399381ca9429f2823412635b1eda67706188a8c4 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Mon, 25 Mar 2024 09:59:21 +0100 Subject: [PATCH 08/18] start working on Max' feedback --- posts/2024/03/fixing-bug-incremental-gc.md | 50 +++++++++++++++++++--- 1 file changed, 44 insertions(+), 6 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index 59e45f7e5..1afb8b1d4 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -112,7 +112,9 @@ crash happened in the AST rewriting phase of pytest, and that happens only if no [pyc files](https://stackoverflow.com/questions/2998215/if-python-is-interpreted-what-are-pyc-files) of the bytecode-compiled rewritten ASTs exist, I made sure to delete them -before every test run. To repeat the test runs I used +before every test run. + +To repeat the test runs I used [multitime](https://tratt.net/laurie/src/multitime/), which is a simple program that runs a command repeatedly. It's meant for lightweight benchmarking purposes, but it also halts the execution of the command if that command exits @@ -121,6 +123,10 @@ with randomizing the situation, maybe). Here's a demo: +([Max](https://bernsteinbear.com/) pointed out +[autoclave](https://github.com/silentbicycle/autoclave) to me when reviewing +this post, which is a more dedicated tool for this job.) + Thankfully, running the tests repeatedly eventually lead to a crash, solving my "only happens on CI" problem. I then tried various variants to exclude possible sources of errors. 
The first source of errors to exclude in PyPy bugs is the @@ -141,7 +147,8 @@ I still couldn't get the bug to happen in GDB, so the tool I tried next was later replay it arbitrarily often. This gives you a time-traveling debugger that allows you to execute the program backwards in addition to forwards. Eventually I managed to get the crash to happen when running the tests with `rr -record --chaos`. +record --chaos` (`--chaos` randomizes some decisions that rr takes, to try to +increase the chance of reproducing bugs). Using rr well is quite hard, and I'm not very good at it. The main approach I use with rr to debug memory corruption is to replay the crash, then set a @@ -175,7 +182,7 @@ of that are in the bonus section at the end of the post. ## Using GDB scripting to find the real bug -After that disaster I went back to the earlier rr recording without GSC assertions +After that disaster I went back to the earlier rr recording without GC assertions and tried to understand in more detail why the GC decided to free an object that was still being referenced. To be able to do that I used the [GDB Python scripting @@ -295,7 +302,9 @@ of memory and you can then debug this with a python2 debugger. ## Fixing the Bug With the unit test in hand, fixing the test was relatively straightforward (the -diff in its simplest form is anyway only a single line change). After my fix, I +diff in its simplest form is anyway only a [single line +change](https://github.com/pypy/pypy/commit/78bbeb93471b5f38438004e971f4b4f84ab17a84)). +After my fix, I [talked to Armin Rigo](https://github.com/pypy/pypy/issues/4925#issuecomment-2014459454) who helped me find different case that was still wrong, in the same area of the @@ -420,10 +429,31 @@ In addition, there is another optimization for arrays, which is that memcopy is treated specially by the GC. 
If memcopy is implemented by simply writing a loop
 that copies the content of one array to the other, that will invoke the write
 barrier every single loop iteration for the write of every array element,
-costing a lot of overhead. Therefore the GC has a special memcopy-specific
+costing a lot of overhead. Here's some pseudo-code:
+
+```python
+def arraycopy(source, dest, source_start, dest_start, length):
+    for i in range(length):
+        value = source[source_start + i]
+        dest[dest_start + i] = value # <- write barrier inserted here
+```
+
+Therefore the GC has a special memcopy-specific
 write barrier that will perform the GC logic once before the memcopy loop, and
 then use a regular (typically SIMD-optimized) memcopy implementation from
-`libc`.
+`libc`. Roughly like this:
+
+```python
+def arraycopy(source, dest, source_start, dest_start, length):
+    gc_writebarrier_before_array_copy(source, dest, source_start, dest_start, length)
+    raw_memcopy(cast_to_voidp(source) + source_start,
+                cast_to_voidp(dest) + dest_start,
+                sizeof(itemtype(source)) * length)
+```
+
+(this is really a rough sketch. The [real
+code](https://github.com/pypy/pypy/blob/789f964fff59c722b0872abcdc56d2b1373a9f3b/rpython/rlib/rgc.py#L365)
+is much more complicated.)
 
 ## The bug
 
@@ -454,6 +484,7 @@ It's a bit of a mystery to me why this bug managed to be undetected for so
 long. Memcopy happens in a lot of pretty core operations of e.g. lists in
 Python (`list.extend`, to name just one example). To speculate, I would suspect
 that all the other preconditions for the bug occurring made it pretty rare:
+
 - the content of an old list that is not yet marked needs to be copied into
   another old list that is marked already
 - the source of the copy needs to also store an object that has no other
@@ -522,3 +553,10 @@ the color encoding of pre-built objects into account.
 The bug managed to be invisible because we don't tend to turn on the GC
 assertions very often. 
We only do that when we find a GC bug, which is of course also when we need it the most to be correct. + +# Acknowledgements + +Thanks to Matti Picus, Max Bernstein for giving me feedback on drafts of the +post. Thanks to Armin Rigo for reviewing the code and pointing out holes in my +thinking. Thanks to the original reporters of the various forms of the bug, +Lily Foote, David Hewitt, Wenzel Jakob. From 47fde6ee020e9d66b314264d10d6c58edc7f4817 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Mon, 25 Mar 2024 10:11:04 +0100 Subject: [PATCH 09/18] explain what reverse-continue is --- posts/2024/03/fixing-bug-incremental-gc.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index 1afb8b1d4..31584f317 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -154,8 +154,9 @@ Using rr well is quite hard, and I'm not very good at it. The main approach I use with rr to debug memory corruption is to replay the crash, then set a [watchpoint](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Set-Watchpoints.html) for the corrupted memory location, then use the command `reverse-continue` to -find the place in the code that mutated the memory location. Here's a little -demo of this: +find the place in the code that mutated the memory location. `reverse-continue` +is like `continue`, except that it will execute the program backwards from the +current point. 
Here's a little demo of this: From 34239a86a5b55ecd5ec0e2de49a3c1b840656517 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Mon, 25 Mar 2024 10:11:16 +0100 Subject: [PATCH 10/18] link to crashes issue --- posts/2024/03/fixing-bug-incremental-gc.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index 31584f317..ba1b858b6 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -312,8 +312,9 @@ helped me find different case that was still wrong, in the same area of the code. TODO: check details of this paragraph I also got help by the developers at ... -who are using PyPy on their servers and had seen some mysterious PyPy crashes -in recent months. They did test deployments of my fixes in their various stages +who are using PyPy on their servers and had seen some [mysterious PyPy +crashes](https://github.com/pypy/pypy/issues/4900) +recently. They did test deployments of my fixes in their various stages to their servers to try to see whether stability improved for them. The fact that their crash rate went down a lot provided great extra confirmation for me that I was doing something right. From ac7072b849f961ea4540ee2175582ff21c255c66 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Mon, 25 Mar 2024 10:40:55 +0100 Subject: [PATCH 11/18] mention company, seems there crashes are really gone --- posts/2024/03/fixing-bug-incremental-gc.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index ba1b858b6..c7b9bd798 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -311,13 +311,13 @@ Rigo](https://github.com/pypy/pypy/issues/4925#issuecomment-2014459454) who helped me find different case that was still wrong, in the same area of the code. 
-TODO: check details of this paragraph I also got help by the developers at ... +I also got help by the developers at [PortaOne](https://portaone.com/) who are using PyPy on their servers and had seen some [mysterious PyPy crashes](https://github.com/pypy/pypy/issues/4900) recently. They did test deployments of my fixes in their various stages to their servers to try to see whether stability improved for them. The fact -that their crash rate went down a lot provided great extra confirmation for me -that I was doing something right. +that their crashes seem to have gone away provided great extra confirmation for +me that I was doing something right. ## Writing a GC fuzzer/property based test From 97403cec18986b616d68553a46c86ff2a9644b19 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Mon, 25 Mar 2024 19:51:35 +0100 Subject: [PATCH 12/18] ouch, the portaone problem is a different GC bug :-( --- posts/2024/03/fixing-bug-incremental-gc.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md index c7b9bd798..c6f24d7b8 100644 --- a/posts/2024/03/fixing-bug-incremental-gc.md +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -311,14 +311,6 @@ Rigo](https://github.com/pypy/pypy/issues/4925#issuecomment-2014459454) who helped me find different case that was still wrong, in the same area of the code. -I also got help by the developers at [PortaOne](https://portaone.com/) -who are using PyPy on their servers and had seen some [mysterious PyPy -crashes](https://github.com/pypy/pypy/issues/4900) -recently. They did test deployments of my fixes in their various stages -to their servers to try to see whether stability improved for them. The fact -that their crashes seem to have gone away provided great extra confirmation for -me that I was doing something right. 
-
 
 ## Writing a GC fuzzer/property based test
 
 Finding bugs in the GC is always extremely disconcerting, particularly since

From ef818c2add3ca4f89163245cbfdfe4d3906ab2d1 Mon Sep 17 00:00:00 2001
From: CF Bolz-Tereick
Date: Mon, 25 Mar 2024 19:52:20 +0100
Subject: [PATCH 13/18] there were more people involved in these various issues

---
 posts/2024/03/fixing-bug-incremental-gc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md
index c6f24d7b8..fe71fd40d 100644
--- a/posts/2024/03/fixing-bug-incremental-gc.md
+++ b/posts/2024/03/fixing-bug-incremental-gc.md
@@ -553,4 +553,4 @@ course also when we need it the most to be correct.
 Thanks to Matti Picus, Max Bernstein for giving me feedback on drafts of the
 post. Thanks to Armin Rigo for reviewing the code and pointing out holes in my
 thinking. Thanks to the original reporters of the various forms of the bug,
-Lily Foote, David Hewitt, Wenzel Jakob.
+including Lily Foote, David Hewitt, Wenzel Jakob.

From f842593aa6c07c63c4479c2f171e8422912d48a8 Mon Sep 17 00:00:00 2001
From: CF Bolz-Tereick
Date: Tue, 26 Mar 2024 16:00:06 +0100
Subject: [PATCH 14/18] current situation

---
 posts/2024/03/fixing-bug-incremental-gc.md | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md
index fe71fd40d..b36ad61a6 100644
--- a/posts/2024/03/fixing-bug-incremental-gc.md
+++ b/posts/2024/03/fixing-bug-incremental-gc.md
@@ -305,12 +305,22 @@ of memory and you can then debug this with a python2 debugger.
 
 With the unit test in hand, fixing the test was relatively straightforward (the
 diff in its simplest form is anyway only a [single line
 change](https://github.com/pypy/pypy/commit/78bbeb93471b5f38438004e971f4b4f84ab17a84)).
 
-After my fix, I
+After this first version of my fix, I
 [talked to Armin
 Rigo](https://github.com/pypy/pypy/issues/4925#issuecomment-2014459454) who
 helped me find different case that was still wrong, in the same
 area of the code.
 
+I also got help by the developers at [PortaOne](https://portaone.com/)
+who are using PyPy on their servers and had seen some [mysterious PyPy
+crashes](https://github.com/pypy/pypy/issues/4900)
+recently, that looked related to the GC. They did test deployments of my fixes
+in their various stages to their servers to try to see whether stability
+improved for them. Unfortunately in the end it turned out that their crashes
+are an unrelated GC bug related to object pinning, which we haven't resolved
+yet.
+
+
 ## Writing a GC fuzzer/property based test
 
 Finding bugs in the GC is always extremely disconcerting, particularly since
@@ -345,12 +355,15 @@ In order to check whether the test is actually useful, I reverted my bug fixes
 and made sure that the test re-finds both the spurious GC assertion error and
 the problems with memcopying an array.
 
-After running for a few hours, the test actually found an unrelated crash in
-the GC! So far I have not yet spent time on debugging that, it's what I'll work
-on next. I also plan on adding a bunch of other GC features as steps in the
+In addition, the test also found corner cases in my fix. There was a situation
+that I hadn't accounted for, which the test found eventually.
+I also plan on adding a bunch of other GC features as steps in the
 test to stress them too (for example weakrefs, identity hashes, pinning, maybe
 finalization).
 
+At the point of publishing this post, the fixes got merged to the 2.7/3.9/3.10
+branches of PyPy, and we'll have to do a release with the fixes at some point.
+
 
 # The technical details of the bug

From 9d04ece13724c34b6225d7cfe09f521a9cf3adac Mon Sep 17 00:00:00 2001
From: CF Bolz-Tereick
Date: Tue, 26 Mar 2024 20:14:26 +0100
Subject: [PATCH 15/18] fix date before merging

---
 posts/2024/03/fixing-bug-incremental-gc.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md
index b36ad61a6..a09bc1a23 100644
--- a/posts/2024/03/fixing-bug-incremental-gc.md
+++ b/posts/2024/03/fixing-bug-incremental-gc.md
@@ -1,7 +1,7 @@