
aggregator_core: Allow dumping query plans to stdout #3347

Merged
merged 6 commits into main from inahga/log-query-plans on Aug 8, 2024

Conversation

@inahga (Contributor) commented Jul 31, 2024

Makes it easier to inspect query plans while working locally. These query plans do have to be taken with a grain of salt, because they're always generated against an empty database and with enable_seqscan=false, so they're not entirely representative.

We dump to stdout because the output from tracing is fairly incomprehensible, since each node of the query plan is on its own tracing event.

@inahga inahga requested a review from a team as a code owner July 31, 2024 18:54
Comment on lines 12 to 13
`JANUS_TEST_DUMP_POSTGRESQL_LOGS`. The test database is set to log all query plans. You should only
run this option with one test at a time, otherwise you might not get all the logs.
@inahga (Contributor Author)

I suspect this is a testcontainers bug: if running multiple tests at a time, the output is incomplete, and perhaps interspersed with output from other containers. IMO it's not a big deal, since we'd only care to run this against one test at a time, so I don't care to spend too much time on it.

Collaborator

Actually, we run multiple tests against the same container, to save on startup time, and separate by using different databases within the same process. Thus, database logs from queries in unrelated test aggregators will be commingled. It's also possible that we may shut down one database, and start another, if all the relevant Arcs go out of scope simultaneously. I wonder if there's some sort of race in the testcontainers library where deleting a container interrupts log forwarding, and if killing the container first, then deleting it once the log stream is done, would fix this.

@inahga (Contributor Author) Jul 31, 2024

Good point about sharing a container (I had thought it was 1:1, since I didn't read the code closely, and because I think the phenomenon you describe about Arcs going out of scope applies, showing up as multiple postgres containers in docker ps).

It looks like the log consumer task is spawned via tokio::spawn, and the corresponding handle is never joined, so testcontainers never waits for the log stream to be drained before the process exits or the container is dropped. So I think a bunch of log lines get queued up but never get serviced, because the process/container ends too quickly.

Indeed, if I tack a sleep() onto the end of an affected test, I see a much more plausible volume of output.
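
A minimal sketch of that workaround, for illustration only (the test body and the sleep duration are hypothetical, not code from this PR):

    use std::time::Duration;

    #[tokio::test]
    async fn some_affected_test() {
        // ... set up the ephemeral database and run the queries under test ...

        // Workaround: give the un-joined log consumer task time to drain its
        // buffered log lines before the container is dropped at end of test.
        tokio::time::sleep(Duration::from_secs(2)).await;
    }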

@inahga (Contributor Author) commented Jul 31, 2024

Eh, I think there's a more subtle synchronization error here, as sometimes I see incomplete results even when running a single test.

@inahga inahga marked this pull request as draft July 31, 2024 19:04
@inahga (Contributor Author) commented Aug 1, 2024

Yeesh, async drop is a hard problem. I've written an attempt at fixing the synchronization without sleeps, by spawning all testcontainers work on its own thread with its own tokio runtime.

I don't think it's possible for testcontainers to live on the main test thread and still wait on logs, because blocking while waiting for logs starves the logic that would terminate the log loop, i.e. a deadlock.

I think this hack is small enough that we can avoid pulling in something like the async_drop crate. I do think a bug report to testcontainers is justified for the log synchronization issue, and I think they have enough existing machinery to make it a (hopefully) easy change.
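
A minimal sketch of that pattern, under assumed names (illustrative, not the actual Janus test_util code):

    use tokio::sync::oneshot;

    // All testcontainers work lives on a dedicated thread with its own
    // single-threaded runtime, so container shutdown can block on the log
    // stream without starving the main test runtime.
    fn spawn_container_thread() -> (oneshot::Sender<()>, std::thread::JoinHandle<()>) {
        let (shutdown_tx, shutdown_rx) = oneshot::channel::<()>();
        let handle = std::thread::spawn(move || {
            tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .unwrap()
                .block_on(async move {
                    // Start the postgres container here (omitted), then park
                    // until the owning test signals shutdown.
                    let _ = shutdown_rx.await;
                    // The container is dropped on this runtime, where it is
                    // safe for Drop to wait on log draining.
                });
        });
        (shutdown_tx, handle)
    }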

@inahga inahga marked this pull request as ready for review August 1, 2024 00:12
Comment on lines +79 to +83
// Enable logging of query plans.
"shared_preload_libraries=auto_explain",
"log_min_messages=LOG",
"auto_explain.log_min_duration=0",
"auto_explain.log_analyze=true",
Collaborator

I think it would be better to conditionally enable these settings when the JANUS_TEST_DUMP_POSTGRESQL_LOGS environment variable is set. Both enabling auto-explain and raising the logging level could add overhead to tests.
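
Something like the following rough sketch (the surrounding argument-building code and variable names are assumptions, not code from this PR):

    // Only pay for auto_explain and verbose logging when log dumping is requested.
    let mut postgres_args: Vec<&str> = Vec::new();
    if std::env::var("JANUS_TEST_DUMP_POSTGRESQL_LOGS").is_ok() {
        postgres_args.extend([
            "-c", "shared_preload_libraries=auto_explain",
            "-c", "log_min_messages=LOG",
            "-c", "auto_explain.log_min_duration=0",
            "-c", "auto_explain.log_analyze=true",
        ]);
    }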

// databases are not seeded with data, the query planner will often
// choose a sequential scan because it would be faster than hitting an
// index.
"enable_seqscan=false",
Collaborator

I think we shouldn't disable sequential scans here. This is better used in a targeted fashion, for what-if analysis and comparisons between different query plans. Plans made with enable_seqscan=false cannot really tell us if our indexes are appropriate, because the plan will by necessity choose some sort of index scan, even if it would use a sequential scan in normal testing, or even in production. I think the only way to get a useful query plan is with realistic statistics and tuple counts.
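
For illustration, one way such a targeted what-if comparison could look with tokio_postgres (a hypothetical helper, not part of this PR):

    // Flip enable_seqscan for this session only and print both plans for
    // side-by-side comparison.
    async fn explain_with_and_without_seqscan(
        client: &tokio_postgres::Client,
        query: &str,
    ) -> Result<(), tokio_postgres::Error> {
        for setting in ["on", "off"] {
            client
                .batch_execute(&format!("SET enable_seqscan = {setting}"))
                .await?;
            for row in client.query(format!("EXPLAIN {query}").as_str(), &[]).await? {
                let line: &str = row.get(0);
                println!("enable_seqscan={setting}: {line}");
            }
        }
        Ok(())
    }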

@inahga (Contributor Author)

Surprisingly, spot-checking a few tests, postgres still largely prefers index scans, so I think setting enable_seqscan=false was too presumptuous on my part.

No complaints from me; it suggests these empty-DB query plans still carry a reasonable amount of signal.

aggregator_core/src/datastore/test_util.rs (resolved)

// Wait for shutdown to be signalled.
shutdown_rx.await.unwrap();
drop(db_container);
Collaborator

I'd recommend trying out calling ContainerAsync::stop() first, then waiting for the log consumers to finish, then dropping to delete the container. If the log consumer streams stop once the container has stopped, then that should work, and it will prevent race conditions leading to lost logs.
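
A minimal sketch of that ordering, reusing the shutdown_rx/db_container names from the snippet above (illustrative only; the final code in the PR may differ):

    // Wait for shutdown to be signalled.
    shutdown_rx.await.unwrap();
    // Stop postgres first, so it flushes and the log stream reaches EOF,
    // letting the log consumers finish.
    let _ = db_container.stop().await;
    // Then drop, which deletes the (already stopped) container.
    drop(db_container);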

@inahga (Contributor Author)

Nice, that appears to work and is much cleaner, insofar as I still see "database system is shut down" at the end of the logs.

// Hack: run testcontainer logic under its own thread with its own tokio runtime, to avoid
// deadlocking the main runtime when waiting for logs to finish in the Drop implementation.
let db_thread = std::thread::spawn(move || {
    tokio::runtime::Builder::new_current_thread()
Collaborator

This approach to resolving the deadlock LGTM. I suspect another way to solve it would be with #[tokio::test(flavor = "multi_thread")] everywhere, but that would be very intrusive.

@inahga (Contributor Author)

I think you're right: if all tests ran on the multi-threaded runtime, then maybe we could use block_in_place(). But having to decorate every test would be irritating; we would probably be better off writing our own macro wrapper in that case. (I wonder why tokio tests default to the current-thread runtime anyway.)
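
For illustration, a sketch of the block_in_place() idea; the type name and the waiting logic are hypothetical, and this only works when every test uses the multi-threaded runtime:

    // Hypothetical Drop impl: only valid under a multi_thread runtime, because
    // block_in_place panics on a current_thread runtime.
    struct EphemeralPostgres {
        // ... container handle, shutdown channel, etc. ...
    }

    impl Drop for EphemeralPostgres {
        fn drop(&mut self) {
            tokio::task::block_in_place(|| {
                tokio::runtime::Handle::current().block_on(async {
                    // Stop the container and wait for the log stream to drain here.
                });
            });
        }
    }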

@inahga (Contributor Author) commented Aug 7, 2024

testcontainers/testcontainers-rs#719 for the testcontainers problem.

@inahga inahga merged commit 42d3609 into main Aug 8, 2024
8 checks passed
@inahga inahga deleted the inahga/log-query-plans branch August 8, 2024 17:49