-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
aggregator_core: Allow dumping query plans to stdout #3347
Conversation
aggregator_core/README.md
Outdated
`JANUS_TEST_DUMP_POSTGRESQL_LOGS`. The test database is set to log all query plans. You should only | ||
run this option with one test at a time, otherwise you might not get all the logs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this is a testcontainers bug--if running multiple tests at at time, the output is incomplete, and perhaps interspersed with other containers. IMO it's not a big deal, since we'd only care to run this against one test at a time, so I don't care to spend too much time on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we run multiple tests against the same container, to save on startup time, and separate by using different databases within the same process. Thus, database logs from queries in unrelated test aggregators will be commingled. It's also possible that we may shut down one database, and start another, if all the relevant Arcs go out of scope simultaneously. I wonder if there's some sort of race in the testcontainers library where deleting a container interrupts log forwarding, and if killing the container first, then deleting it once the log stream is done, would fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about sharing a container (I had thought it was 1:1, since I didn't read the code closely, and because I think the phenomena you describe about Arcs going out of scope applies, showing as multiple postgres containers in docker ps
).
It looks like the log consumer task is spawned via tokio::spawn
, and the corresponding handle is never joined, so testcontainers never waits for the log stream to be drained before the process is exited or the container is dropped. So I think a bunch of log lines get queued up, but never get serviced because the process/container ends too quickly.
Indeed, if I tack a sleep()
at the end of an affected test, I see a much more plausible volume of output.
Eh, I think there's a more subtle synchronization error here, as sometimes I see incomplete results even when doing one test. |
Yeesh, async drop is a hard problem. I've written an attempt at fixing synchronization without sleeps, done by spawning all testcontainers work in its own thread with its own tokio runtime. I don't think it's possible for testcontainers to be in the same main thread and wait on logs, because blocking waiting for logs starves the logic that would terminate the log loop, i.e. a deadlock. I think this hack is small enough to avoid pulling in something like the |
// Enable logging of query plans. | ||
"shared_preload_libraries=auto_explain", | ||
"log_min_messages=LOG", | ||
"auto_explain.log_min_duration=0", | ||
"auto_explain.log_analyze=true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to conditionally enable these settings when the JANUS_TEST_DUMP_POSTGRESQL_LOGS
environment variable is set. Both enabling auto-explain and raising the logging level could add overhead to tests.
// databases are not seeded with data, the query planner will often | ||
// choose a sequential scan because it would be faster than hitting an | ||
// index. | ||
"enable_seqscan=false", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't disable sequential scans here. This is better used in a targeted fashion, for what-if analysis and comparisons between different query plans. Plans made with enable_seqscan=false
cannot reallly tell us if our indexes are appropriate, because the plan will by necessity choose some sort of index scan, even if it would use a sequential scan in normal testing, or even in production. I think the only way to get a useful query plan is with realistic statistics and tuple counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly, spot checking a few tests, postgres still largely prefers index scans, so I think enable_seqscan
was too presumptuous on my part.
No complaints from me, it points to these empty DB query plans still having a reasonable amount of signal.
|
||
// Wait for shutdown to be signalled. | ||
shutdown_rx.await.unwrap(); | ||
drop(db_container); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend trying out calling ContainerAsync::stop()
first, then waiting for the log consumers to finish, then dropping to delete the container. If the log consumer streams stop once the container has stopped, then that should work , and it will prevent race conditions leading to lost logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, that appears to work and is much cleaner, insofar as I still see the database system is shut down
at the end of logs.
// Hack: run testcontainer logic under its own thread with its own tokio runtime, to avoid | ||
// deadlocking the main runtime when waiting for logs to finish in the Drop implementation. | ||
let db_thread = std::thread::spawn(move || { | ||
tokio::runtime::Builder::new_current_thread() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach to resolving the deadlock LGTM. I suspect another way to solve it would be with [tokio::test(flavor = "multi_threaded")]
everywhere, but that would be very intrusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, if all tests run multithreaded, then maybe we can use block_in_place()
. But yeah having to decorate every test would be irritating--we would probably be better off writing our own macro wrapper in that case. (I wonder why tokio tests default to the current thread runtime anyways.)
testcontainers/testcontainers-rs#719 for the testcontainers problem. |
Makes it easier to inspect query plans while working locally. These query plans do have to be taken with a grain of salt, because they're always running against an empty database and with
enable_seqscan=false
, so they're not entirely representative.We dump to stdout because the output from tracing is fairly incomprehensible, since each node of the query plan is on its own tracing event.