aggregator_core: Allow dumping query plans to stdout #3347

inahga · 2024-07-31T18:54:26Z

Makes it easier to inspect query plans while working locally. These query plans do have to be taken with a grain of salt, because they're always running against an empty database and with enable_seqscan=false, so they're not entirely representative.

We dump to stdout because the output from tracing is fairly incomprehensible, since each node of the query plan is on its own tracing event.

inahga · 2024-07-31T18:56:07Z

aggregator_core/README.md

+`JANUS_TEST_DUMP_POSTGRESQL_LOGS`. The test database is set to log all query plans. You should only
+run this option with one test at a time, otherwise you might not get all the logs.


I suspect this is a testcontainers bug--if running multiple tests at at time, the output is incomplete, and perhaps interspersed with other containers. IMO it's not a big deal, since we'd only care to run this against one test at a time, so I don't care to spend too much time on it.

Actually, we run multiple tests against the same container, to save on startup time, and separate by using different databases within the same process. Thus, database logs from queries in unrelated test aggregators will be commingled. It's also possible that we may shut down one database, and start another, if all the relevant Arcs go out of scope simultaneously. I wonder if there's some sort of race in the testcontainers library where deleting a container interrupts log forwarding, and if killing the container first, then deleting it once the log stream is done, would fix this.

Good point about sharing a container (I had thought it was 1:1, since I didn't read the code closely, and because I think the phenomena you describe about Arcs going out of scope applies, showing as multiple postgres containers in docker ps).

It looks like the log consumer task is spawned via tokio::spawn, and the corresponding handle is never joined, so testcontainers never waits for the log stream to be drained before the process is exited or the container is dropped. So I think a bunch of log lines get queued up, but never get serviced because the process/container ends too quickly.

Indeed, if I tack a sleep() at the end of an affected test, I see a much more plausible volume of output.

inahga · 2024-07-31T19:04:18Z

Eh, I think there's a more subtle synchronization error here, as sometimes I see incomplete results even when doing one test.

inahga · 2024-08-01T00:10:53Z

Yeesh, async drop is a hard problem. I've written an attempt at fixing synchronization without sleeps, done by spawning all testcontainers work in its own thread with its own tokio runtime.

I don't think it's possible for testcontainers to be in the same main thread and wait on logs, because blocking waiting for logs starves the logic that would terminate the log loop, i.e. a deadlock.

I think this hack is small enough to avoid pulling in something like the async_drop crate. I do think a bug report for testcontainers is justified over synchronization of logs, and I think they have some logic to make it a (hopefully) easy enough change.