diff --git a/posts/2024/03/fixing-bug-incremental-gc.md b/posts/2024/03/fixing-bug-incremental-gc.md new file mode 100644 index 000000000..d99952c7e --- /dev/null +++ b/posts/2024/03/fixing-bug-incremental-gc.md @@ -0,0 +1,569 @@ + + +# Introduction + +Since last summer, I've been looking on and off into a weird and hard to +reproduce [crash bug in PyPy](https://github.com/pypy/pypy/issues/3959). It was +manifesting only on CI, and it seemed to always happen in the AST rewriting +phase of [pytest](https://pytest.org), the symptoms being that PyPy would crash +with a segfault. All my attempts to reproduce it locally failed, and my +attempts to try to understand the problem by dumping the involved ASTs lead +nowhere. + +A few weeks ago, we got [two more](https://github.com/PyO3/pyo3/issues/3766) +[bug reports](https://github.com/orgs/pypy/discussions/4923), the last one by +the authors of the [nanobind](https://nanobind.readthedocs.io/) binding +generator, with the same symptoms: crash in AST rewriting, only on CI. I +decided to make a more serious push to try to find the bug this time. +Ultimately the problem turned out to be several bugs in PyPy's garbage +collector (GC) that had been there since its inception in +[2013](/posts/2013/10/incremental-garbage-collector-in-pypy-8956893523842234676.html). +Understanding the +situation turned out to be quite involved, additionally complicated by this +being the first time that I was working on this particular aspect of PyPy's GC. +Since the bug was so much work to find, I thought I'd write a blog post about +it. + +The blog post consists of three parts: first a chronological description of +what I did to find the bug, a technical explanation of what goes wrong, some +reflections on the bug (and then a bonus bug I also found in the process). + +# Finding the Bug + +I started from the failing [nanobind CI +runs](https://github.com/wjakob/nanobind/actions/runs/8234561874/job/22516568891) +that ended with a segfault of the PyPy interpreter. This was only an +intermittent problem, not every run was failing. When I tried to just run the +test suite locally, I couldn't get it to fail. Therefore at first I tried to +learn more about what was happening by looking on the CI runners. + +## Running on CI + +I forked the nanobind repo and hacked the CI script in order to get it to use a +PyPy build with [full debug information and more assertions turned on](https://doc.pypy.org/en/latest/build.html#making-a-debug-build-of-pypy). In order +to increase the probability of seeing the crash I added an otherwise unused +[matrix](https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs) +variable to the CI script that just contained 32 parameters. This means every +build is done 32 times (sorry Github for wasting your CPUs 😕). With that +amount of repetition, I got at least one job of every build that was crashing. + +Then I added the `-Xfaulthandler` option to the PyPy command which will use the +[faulthandler](https://docs.python.org/3.11/library/faulthandler.html) module +try to print a Python stacktrace if the VM segfaults to confirm that PyPy was +indeed crashing in the [AST](https://docs.python.org/3/library/ast.html) +[rewriting +phase](https://github.com/pytest-dev/pytest/blob/main/src/_pytest/assertion/rewrite.py) +of pytest, which pytest uses for [nicer +assertions](https://docs.pytest.org/en/7.1.x/how-to/assert.html#asserting-with-the-assert-statement). +I experimented with hacking our faulthandler implementation to also give me a +C-level callstack, but that didn't work as well as I hoped. + +Then I tried to run [gdb](https://sourceware.org/gdb/) on CI to try to get it +to print a C callstack at the crash point. You can get gdb to execute commands +as if typed at the prompt with the `-ex` commandline option, I used something +like this: + +``` +gdb -ex "set confirm off" -ex "set pagination off" -ex \ + "set debuginfod enabled off" -ex run -ex where -ex quit \ + --args +``` + +But unfortunately the crash never occurred when running in gdb. + +Afterwards I tried the next best thing, which was configuring the CI runner to +[dump a core file and upload it as a build +artifact](https://github.com/itamarst/gha-upload-cores), which worked. Looking +at the cores locally only sort of worked, because I am running a different +version of Ubuntu than the CI runners. So I used +[tmate](https://mxschmitt.github.io/action-tmate/) to be able to log into the +CI runner after a crash and interactively used gdb there. Unfortunately what I +learned from that was that the bug was some kind of **memory corruption**, +which is always incredibly unpleasant to debug. Basically the header word of a +Python object had been corrupted somehow at the point of the crash, which means +that it's [vtable](https://en.wikipedia.org/wiki/Virtual_method_table) wasn't +usable any more. + +(Sidenote: [PyPy doesn't really use a vtable +pointer](/posts/2009/10/gc-improvements-6174120095428192954.html#unifying-the-vtable-ptr-with-the-gc-header), +instead it uses half a word in the header for the vtable, and the other half +for flags that the GC needs to keep track of the state of the object. +Corrupting all this is still bad.) + +## Reproducing Locally + +At that point it was clear that I had to push to reproduce the problem on my +laptop, to allow me to work on the problem more directly and not to always have +to go via the CI runner. Memory corruption bugs often have a lot of randomness +(depending on which part of memory gets modified, things might crash or more +likely just happily keep running). Therefore I decided to try to brute-force +reproducing the crash by simply running the tests many many times. Since the +crash happened in the AST rewriting phase of pytest, and that happens only if +no [pyc +files](https://stackoverflow.com/questions/2998215/if-python-is-interpreted-what-are-pyc-files) +of the bytecode-compiled rewritten ASTs exist, I made sure to delete them +before every test run. + +To repeat the test runs I used +[multitime](https://tratt.net/laurie/src/multitime/), which is a simple program +that runs a command repeatedly. It's meant for lightweight benchmarking +purposes, but it also halts the execution of the command if that command exits +with an error (and it sleeps a small random time between runs, which might help +with randomizing the situation, maybe). Here's a demo: + + + +([Max](https://bernsteinbear.com/) pointed out +[autoclave](https://github.com/silentbicycle/autoclave) to me when reviewing +this post, which is a more dedicated tool for this job.) + +Thankfully, running the tests repeatedly eventually lead to a crash, solving my +"only happens on CI" problem. I then tried various variants to exclude possible +sources of errors. The first source of errors to exclude in PyPy bugs is the +just-in-time compiler, so I reran the tests with `--jit off` to see whether I +could still get it to crash, and thankfully I eventually could (JIT bugs are +often very annoying). + +Next source of bugs to exclude where C-extensions. Since those were the tests +of nanobind, a framework for creating C-extension modules I was a bit worried +that the bug might be in our emulation of CPython's C-API. But running PyPy +with the `-v` option (which will print all the imports as they happen) +confirmed that at the point of crash no C-extension had been imported yet. + +## Using `rr` + +I still couldn't get the bug to happen in GDB, so the tool I tried next was +[rr, the "reverse debugger"](https://rr-project.org/). rr can record the execution of a program and +later replay it arbitrarily often. This gives you a time-traveling debugger +that allows you to execute the program backwards in addition to forwards. +Eventually I managed to get the crash to happen when running the tests with +`rr record --chaos` (`--chaos` randomizes some decisions that rr takes, to try to +increase the chance of reproducing bugs). + +Using rr well is quite hard, and I'm not very good at it. The main approach I +use with rr to debug memory corruption is to replay the crash, then set a +[watchpoint](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Set-Watchpoints.html) +for the corrupted memory location, then use the command `reverse-continue` to +find the place in the code that mutated the memory location. `reverse-continue` +is like `continue`, except that it will execute the program backwards from the +current point. Here's a little demo of this: + + + +Doing this for my bug revealed that the object that was being corrupted was +erroneously collected by the garbage collector. For some reason the GC had +wrongly decided that the object was no longer reachable and therefore put the +object into a freelist by writing a pointer to the next entry in the freelist +into the first word of the object, overwriting the object's header. The next +time the object was used things crashed. + +## Side-quest: wrong GC assertions + +At this point in the process, I got massively side-tracked. PyPy's GC has a +number of debug modes that you can optionally turn on. Those slow down the +program execution a lot, but they should in theory help to understand why the +GC goes wrong. When I turned them on, I was getting a failing assertion really +early in the test execution, complaining about an invariant violation in the GC +logic. At first this made me very happy. I thought that this would help me fix +the bug more quickly. + +Extremely frustratingly, after two days of work I concluded that the assertion +logic itself was wrong. I have fixed that in the meantime too, the details +of that are in the bonus section at the end of the post. + +## Using GDB scripting to find the real bug + +After that disaster I went back to the earlier rr recording without GC assertions +and tried to understand in more detail why the GC decided to free an object +that was still being referenced. To be able to do that I used the [GDB Python +scripting +API](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Python-API.html) to +write some helper commands to understand the state of the GC heap (rr is an +extension of GDB, so the GDB scripting API works in rr too). + +The first (small) helper command I wrote with the GDB scripting API was a way +to pretty-print the currently active GC flags of a random PyPy object, starting +just from the pointer. The more complex command I wrote was an object tracer, +which follows pointers to GC objects starting from a root object to explore the +object graph. The object tracer isn't complete, it doesn't deal with all the +complexities of PyPy's GC. But it was good enough to help me with my problem, I +found out that the corrupted object was stored in an array. + +As an example, here's a function that uses the GDB API to walk one of the +helper data structures of the GC, a stack of pointers: + +```python +def walk_addr_stack(obj): + """ walk an instance of the AddressStack class (which is a linked list of + arrays of 1019 pointers). + + the first of the arrays is only partially filled with used_in_last_chunk + items, all the other chunks are full.""" + if obj.type.code == gdb.TYPE_CODE_PTR: + obj = obj.dereference() + used_in_last_chunk = lookup(obj, "used_in_last_chunk") + chunk = lookup(obj, "inst_chunk").dereference() + while 1: + items = lookup(chunk, "items") + for i in range(used_in_last_chunk): + yield items[i] + chunk = lookup(chunk, "next") + if not chunk: + break + chunk = chunk.dereference() + used_in_last_chunk = 1019 +``` + +The full file of supporting code I wrote can be found in [this +gist](https://gist.github.com/cfbolz/13cadcbbef321d93fc9790dff6f60a6a). This is +pretty rough throw-away code, however. + +In the following recording I show a staged debugging session with some of the +extra commands I wrote with the Python API. The details aren't important, I +just wanted to give a bit of a flavor of what inspecting objects looks like: + + + +The next step was to understand why the array content wasn't being correctly +traced by the GC, which I eventually managed with some [conditional +breakpoints](https://www.fayewilliams.com/2011/07/13/gdb-conditional-breakpoints/), +more watchpoints, and using `reverse-continue`. It turned out to be a bug that +occurs when the content of one array was memcopied into another array. The +technical details of why the array wasn't traced correctly are described in +detail in the next section. + +## Writing a unit test + +To try to make sure I really understood the bug correctly I then wrote a GC +unit test that shows the problem. Like most of PyPy, our GC is written in +RPython, a (somewhat strange) subset/dialect of Python2, which can be compiled +to C code. However, since it is also valid Python2 code, it can be [unit-tested +on top of a Python2 +implementation](/posts/2022/04/how-is-pypy-tested.html) +(which is one of the reasons why we keep maintaining PyPy2). + +In the GC unit tests you have a lot of control about what order things happen +in, e.g. how objects are allocated, when garbage collection phases happen, etc. +After some trying I managed to write a test that crashes with the same kind of +memory corruption that my original crash exhibited: an object that is still +reachable via an array is collected by the GC. To give you a flavor of what +this kind of test looks like, here's an (edited for clarity) version of the +test I eventually managed to write + +```python +def test_incrementality_bug_arraycopy(self): + source = self.malloc(VAR, 8) # first array + # the stackroots list emulates the C stack + self.stackroots.append(source) + target = self.malloc(VAR, 8) # second array + self.stackroots.append(target) + node = self.malloc(S) # unrelated object, will be collected + node.x = 5 + # store reference into source array, calling the write barrier + self.writearray(source, 0, node) + val = self.gc.collect_step() + source = self.stackroots[0] # reload arrays, they might have moved + target = self.stackroots[1] + # this GC step traces target + val = self.gc.collect_step() + + # emulate what a memcopy of arrays does + res = self.gc.writebarrier_before_copy(source, target, 0, 0, 2) + assert res + target[0] = source[0] # copy two elements of the arrays + target[1] = source[1] + # now overwrite the reference to node in source + self.writearray(source, 0, lltype.nullptr(S)) + # this GC step traces source + self.gc.collect_step() + # some more collection steps, crucially target isn't traced again + # but node is deleted + for i in range(3): + self.gc.collect_step() + # used to crash, node got collected + assert target[0].x == 5 + +``` + +One of the good properties of testing our GC that way is that all the memory is +emulated. The crash in the last line of the test isn't a segfault at all, +instead you get a nice exception saying that you tried to access a freed chunk +of memory and you can then debug this with a python2 debugger. + +## Fixing the Bug + +With the unit test in hand, fixing the test was relatively straightforward (the +diff in its simplest form is anyway only a [single line +change](https://github.com/pypy/pypy/commit/78bbeb93471b5f38438004e971f4b4f84ab17a84)). +After this first version of my fix, I +[talked to Armin +Rigo](https://github.com/pypy/pypy/issues/4925#issuecomment-2014459454) who +helped me find different case that was still wrong, in the same area of the +code. + +I also got help by the developers at [PortaOne](https://portaone.com/) +who are using PyPy on their servers and had seen some [mysterious PyPy +crashes](https://github.com/pypy/pypy/issues/4900) +recently, that looked related to the GC. They did test deployments of my fixes +in their various stages to their servers to try to see whether stability +improved for them. Unfortunately in the end it turned out that their crashes +are an unrelated GC bug related to object pinning, which we haven't resolved +yet. + + +## Writing a GC fuzzer/property based test + +Finding bugs in the GC is always extremely disconcerting, particularly since +this one manged to hide for so long (more than ten years!). Therefore I wanted +to use these bugs as motivation to try to find more problems in PyPy's GC. Given +the ridiculous effectiveness of fuzzing, I used +[hypothesis](https://hypothesis.readthedocs.io/en/latest/) to write a +property-based test. Every test performs a sequence of randomly chosen steps +from the following list: + +- allocate an object +- read a random field from a random object +- write a random reference into a random object +- drop a random stack reference +- perform one GC step +- allocate an array +- read a random index from a random array +- write to an array +- memcopy between two arrays + +This approach of doing a sequence of steps is pretty close to the [stateful +testing](https://hypothesis.readthedocs.io/en/latest/stateful.html) approach of +hypothesis, but I just implemented it manually with the [data +strategy](https://hypothesis.readthedocs.io/en/latest/data.html#drawing-interactively-in-tests). + +Every one of those steps is always performed on both the tested GC, and on some +regular Python objects. The Python objects provide the "ground truth" of what +the heap should look like, so we can compare the state of the GC objects +with the state of the Python objects to find out whether the GC made a mistake. + +In order to check whether the test is actually useful, I reverted my bug fixes +and made sure that the test re-finds both the spurious GC assertion error and the +problems with memcopying an array. + +In addition, the test also found corner cases in my fix. There was a situation +that I hadn't accounted for, which the test found after eventually. +I also plan on adding a bunch of other GC features as steps in the +test to stress them too (for example weakrefs, identity hashes, pinning, maybe +finalization). + +At the point of publishing this post, the fixes got merged to the 2.7/3.9/3.10 +branches of PyPy, and will be part of the next release (v7.3.16). + + +# The technical details of the bug + +In order to understand the technical details of the bug, I need to give some +background explanations about PyPy's GC. + +## PyPy's incremental GC + +PyPy uses an incremental generational mark-sweep GC. It's +[generational](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Generational_GC_(ephemeral_GC)) +and therefore has minor collections (where only young objects get collected) +and major collections (collecting long-lived objects eventually, using a +[mark-and-sweep](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Na%C3%AFve_mark-and-sweep) +algorithm). Young objects are allocated in a nursery using a +bump-pointer allocator, which makes allocation quite efficient. They are moved +out of the nursery by minor collections. In order to find references from old +to young objects the GC uses a write barrier to detect writes into old objects. + +The GC is also +[incremental](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Stop-the-world_vs._incremental_vs._concurrent), +which means that its major collections aren't done all at once (which would +lead to long pauses). Instead, major collections are sliced up into small +steps, which are done directly after a minor collection (the GC isn't +*concurrent* though, which would mean that the GC does work in a separate +thread). + +The incremental GC uses [tri-color +marking](https://en.wikipedia.org/wiki/Tracing_garbage_collection#Tri-color_marking) +to reason about the reachable part of the heap during the marking phase, where +every old object can be: + +- black: already marked, reachable, definitely survives the collection +- grey: will survive, but still needs to be marked +- white: potentially dead + +The color of every object is encoded by setting flags +in the object header. + +The GC maintains the **invariant** that black objects must never point to white +objects. At the start of a major collection cycle the stack roots are turned +gray. During the mark phase of a major collection cycle, the GC will trace gray +objects, until +none are left. To trace a gray object, all the objects it references have to be +marked grey if they are white so far. After a grey object is traced, it can be +marked black (because all the referenced objects are now either black or gray). +Eventually, there are no gray objects left. At that point (because no white +object can be reached from a black one) all the white objects are known to be +unreachable and can therefore be freed. + +The GC is incremental because every collection step will only trace a limited +number of gray objects, before giving control back to the program. This leads to +a problem: if an already traced (black) object is changed between two marking +steps of the GC, the program can mutate that object and write a new reference +into one of its fields. This could lead to an invariant violation, if the +referenced object is white. Therefore, the GC uses the write barrier (which it +needs anyway to find references from old to young objects) to mark all black +objects that are modified gray, and then trace them again at one of the +later collection steps. + +## The special write barrier of memcopy + +Arrays use a different kind of write barrier than normal objects. Since they +can be arbitrarily large, tracing them can take a long time. Therefore it's +potentially wasteful to trace them fully at a minor collection. To fix this, +the array write barrier keeps more granular information about which parts of +the array have been modified since the last collection step. Then only the +modified parts of the array need to be traced, not the whole array. + +In addition, there is another optimization for arrays, which is that memcopy is +treated specially by the GC. If memcopy is implemented by simply writing a loop +that copies the content of one array to the other, that will invoke the write +barrier every single loop iteration for the write of every array element, +costing a lot of overhead. Here's some pseudo-code: + +```python +def arraycopy(source, dest, source_start, dest_start, length): + for i in range(length): + value = source[source_start + i] + dest[dest_start + i] = value # <- write barrier inserted here +``` + +Therefore the GC has a special memcopy-specific +write barrier that will perform the GC logic once before the memcopy loop, and +then use a regular (typically SIMD-optimized) memcopy implementation from +`libc`. Roughly like this: + +```python +def arraycopy(source, dest, source_start, dest_start, length): + gc_writebarrier_before_array_copy(source, dest, source_start, dest_start, length) + raw_memcopy(cast_to_voidp(source) + source_start, + cast_to_voidp(dest) + dest_start, + sizeof(itemtype(source)) * length) +``` + +(this is really a rough sketch. The [real +code](https://github.com/pypy/pypy/blob/789f964fff59c722b0872abcdc56d2b1373a9f3b/rpython/rlib/rgc.py#L365) +is much more complicated.) + +## The bug + +The bugs turned out to be precisely in this memcopy write barrier. When we +implemented the current GC, we adapted our previous GC, which was a +generational mark-sweep GC but *not* incremental. We started with most of the +previous GC's code, including the write barriers. The regular write barriers +were adapted to the new incremental assumptions, in particular the need for the +write barrier to also turn black objects back to gray when they are modified +during a marking phase. This was simply not done at all for the memcopy write +barrier, at least in two of the code paths. Fixing this problem fixes the unit +tests and stops the crashes. + +# Reflections + +The way the bug was introduced is really typical. A piece of code (the memcopy +write barrier) was written under a set of assumptions. Then those assumptions +changed later. Not all the code pieces that relied on these assumptions to be +correct were updated. It's pretty hard to prevent this in all situations. + +I still think we could have done more to prevent the bug occurring. Writing a +property-based test for the GC would have been a good idea given the complexity +of the GC, and definitely something we did in other parts of our code at the +time (just using the `random` module mostly, we started using hypothesis +later). + +It's a bit of a mystery to me why this bug managed to be undetected for so +long. Memcopy happens in a lot of pretty core operations of e.g. lists in +Python (`list.extend`, to name just one example). To speculate, I would suspect +that all the other preconditions for the bug occurring made it pretty rare: + +- the content of an old list that is not yet marked needs to be copied into + another old list that is marked already +- the source of the copy needs to also store an object that has no other + references +- the source of the copy then needs to be overwritten with other data +- then the next collection steps need to be happening at the right points +- ... + +Given the complexity of the GC logic I also wonder whether some lightweight +formal methods would have been a good idea. Formalizing some of the core +invariants in [B](https://en.wikipedia.org/wiki/B-Method) or +[TLA+](https://en.wikipedia.org/wiki/TLA%2B) and then [model +checking](https://en.wikipedia.org/wiki/Model_checking) them up to some number +of +objects would have found this problem pretty quickly. There are also correctness +proofs for GC algorithms in some research papers, but I don't have a good +overview of the literature to point to any that are particularly good or bad. +Going such a more formal route might have fixed this and probably a whole bunch +of other bugs, but of course it's a pretty expensive (and tedious) approach. + +While it was super annoying to track this down, it was definitely good to learn +a bit more about how to use rr and the GDB scripting interface. + +# Bonus Section: The Wrong Assertion + +Some more technical information about the wrong assertion is in this section. + +## Background: pre-built objects + +PyPy's VM-building bootstrapping process can "freeze" a bunch of heap objects +into the final binary. This allows the VM to start up quickly, because those +frozen objects are loaded by the OS as part of the binary. + +Those frozen pre-built objects are parts of the 'roots' of the garbage +collector and need to be traced. However, tracing all the pre-built objects at +every collection would be very expensive, because there are a lot of them +(about 150,000 in a PyPy 3.10 binary). Tracing them all is also not necessary, +because most of them are never modified. Unmodified pre-built objects can only reference +other pre-built objects, which can never be deallocated anyway. Therefore we +have an optimization that uses the write barrier (which we need anyway to find +old-to-young pointers) to notice when a pre-built object gets modified for the +very first time. If that happens, it gets added to the set of pre-built objects +that gets counted as a root, and is traced as a root at collections +from then on. + +## The wrong assertion + +The assertion that triggered when I turned on the GC debug mode was saying that +the GC found a reference from a black to a white object, violating its +invariant. Unmodified pre-built objects count as black, and they aren't roots, +because they can only ever reference other pre-built objects. However, when a +pre-built object gets modified for the first time, it becomes part of the root +set and will be marked gray. This logic works fine. + +The wrong assertion triggers if a pre-built object is mutated for the very +first time in the middle of an incremental marking phase. While the pre-built +object gets added to the root set just fine, and will get traced before the +marking phase ends, this is encoded slightly differently for pre-built objects, +compared to "regular" old objects. Therefore, the invariant checking code +wrongly reported a black->white pointer in this situation. + +To fix it I also wrote a unit test checking the problem, made sure that the GC +hypothesis test also found the bug, and then fixed the wrong assertion to take +the color encoding of pre-built objects into account. + +The bug managed to be invisible because we don't tend to turn on the GC +assertions very often. We only do that when we find a GC bug, which is of +course also when we need it the most to be correct. + +# Acknowledgements + +Thanks to Matti Picus, Max Bernstein, Wouter van Heyst for giving me feedback on drafts of the +post. Thanks to Armin Rigo for reviewing the code and pointing out holes in my +thinking. Thanks to the original reporters of the various forms of the bug, +including Lily Foote, David Hewitt, Wenzel Jakob.