Option to print SQL statements for debug/perf reasons #5

Open · aaronsteers opened this issue Oct 24, 2022 · 7 comments

aaronsteers (Contributor) commented Oct 24, 2022

@edgarrmondragon and @kgpayne - Is there any option that you know of to print the SQL statements (commands or SELECT queries) as they are executing?

I vaguely recall that SQLAlchemy supports something like this, and it could be helpful to enable it in scenarios where we want to debug which commands are being sent and/or observe the performance of specific commands.

edgarrmondragon (Member) commented:

@aaronsteers AFAIK SQLAlchemy emits logs for all queries via the sqlalchemy.engine.Engine logger, so it's a matter of setting that logger's level to INFO.
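
For example (a minimal sketch, assuming the standard-library logging module already has a handler configured):

```python
import logging

# Raise the SQLAlchemy engine logger to INFO so every SQL statement it emits
# (plus parameters and timing) is written to the configured handlers.
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Equivalently, passing echo=True to create_engine() enables the same
# per-engine query logging.
```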

aaronsteers (Contributor, Author) commented Oct 24, 2022

@edgarrmondragon - Great, thanks! Turns out I didn't need it in this pass. At first I was getting weird results because discovery ran first and took a minute on its own, but I mitigated that by running meltano select EXTRACTOR --list between executions to make sure the catalog is up to date before running the 'real' sync operation.

It would probably still be nice to have a docs entry or an env var that can toggle this on or off, but I don't think it's very high priority as of now.

kgpayne (Collaborator) commented Oct 25, 2022

@aaronsteers @edgarrmondragon I ran into performance issues with SQLAlchemy inspect and Snowflake during testing today too. TL;DR: our default implementation of catalog discovery assumes it is cheap to query DB metadata, which in Snowflake it really isn't.

The Snowflake dialect is implemented using a mix of show <schema/tables> in <database/schema> and select ... from information_schema where table_schema=<schema>, and in general it steps sequentially through discovered entities, fetching metadata for each. This results in many unnecessary calls to the same information_schema tables, filtered to individual schema/table names. We would be much better off reading whole information_schema tables once and filtering on the client side 🤷‍♂️
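
For illustration, a rough sketch of the read-once, filter-client-side idea against a plain SQLAlchemy engine (the query and grouping here are illustrative, not the dialect's actual internals):

```python
from collections import defaultdict

import sqlalchemy as sa


def discover_columns(engine: sa.engine.Engine) -> dict:
    """Read information_schema.columns in a single query and group the rows
    client-side, instead of issuing one metadata query per discovered table."""
    query = sa.text(
        "SELECT table_schema, table_name, column_name, data_type "
        "FROM information_schema.columns"
    )
    columns_by_table = defaultdict(list)
    with engine.connect() as conn:
        for schema, table, column, dtype in conn.execute(query):
            columns_by_table[(schema, table)].append((column, dtype))
    return dict(columns_by_table)
```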

Also, during pytest testing we set stream._MAX_RECORDS_LIMIT = 1, however this is not pushed down into the SQL SELECT statements (as far as I can see). This means the entire record set has to be compiled by the server and sent to the client (taking time and warehouse resources) for nothing 🤦‍♂️ I was running tests on the SNOWFLAKE_SAMPLE_DATA and watching the XS warehouse try and chug through the largest table. It would have taken hours, if not days.

Will create issues for both 👍

edgarrmondragon (Member) commented:

> TL;DR: our default implementation of catalog discovery assumes it is cheap to query DB metadata, which in Snowflake it really isn't.

@kgpayne https://github.com/transferwise/pipelinewise-tap-snowflake gets around that by having a tables setting to pre-select tables for discovery.

aaronsteers (Contributor, Author) commented Oct 25, 2022

> Also, during pytest testing we set stream._MAX_RECORDS_LIMIT = 1, however this is not pushed down into the SQL SELECT statements

Yeah, that tracks — it isn't referencing stream._MAX_RECORDS_LIMIT as of now. Honoring that in the SDK with (hopefully) a generic SQLAlchemy LIMIT n implementation would totally make sense.
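
Purely as a hypothetical sketch (none of these names are actual SDK APIs), the push-down could look something like:

```python
from typing import Optional

import sqlalchemy as sa


def build_query(table: sa.Table, max_records: Optional[int]):
    """Apply the test record limit server-side, so the warehouse only returns
    the first N rows instead of compiling the whole result set."""
    query = sa.select(table)
    if max_records is not None:
        # LIMIT goes into the generated SQL, rather than truncating client-side.
        query = query.limit(max_records)
    return query
```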

> TL;DR: our default implementation of catalog discovery assumes it is cheap to query DB metadata, which in Snowflake it really isn't.

Definitely worth raising and worth logging. Thanks for calling this out. Can you say a bit more about your execution times for discovery? I saw discovery finish in about 1.5 minutes on the Snowflake sample DB, which is not great but doesn't seem horrible. I was also running from Codespaces, so round-trip network latency may have been significantly better than running on home WiFi.

kgpayne (Collaborator) commented Oct 26, 2022

Doing some performance testing using the query history in the Snowflake UI, my first discovery query arrives at 11:53:38 AM and the first stream starts to load at 11:56:58 AM. So I guess performance isn't so terrible, and I agree it probably isn't a priority in this iteration. Three minutes still feels like a long time for the terminal to hang, though, and there is also inflated cost associated with constantly hitting the metadata tables, which we could look to improve in the future 🙂

aaronsteers (Contributor, Author) commented Oct 26, 2022

> Doing some performance testing using the query history in the Snowflake UI, my first discovery query arrives at 11:53:38 AM and the first stream starts to load at 11:56:58 AM. So I guess performance isn't so terrible, and I agree it probably isn't a priority in this iteration. Three minutes still feels like a long time for the terminal to hang, though, and there is also inflated cost associated with constantly hitting the metadata tables, which we could look to improve in the future 🙂

Agreed, and thanks for the additional detail. Having waited upwards of 10-15 minutes in the past for Salesforce to run discovery, I'm definitely tolerant of a 3-minute runtime for now, but I agree it makes sense to come back and tune this.

This Meltano-side issue would actually be a big help here, since visibility/observability is part of the problem:

On thinking through this a bit more... does the SDK print an INFO-level log when discovery is starting and finishing? (I don't think so.) If not, that's probably another quick win that wouldn't require a Meltano-side update.
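
Something along these lines, purely as a hypothetical sketch (run_discovery stands in for whatever the tap actually calls internally):

```python
import logging
import time

logger = logging.getLogger(__name__)


def timed_discovery(run_discovery):
    """Hypothetical wrapper: log when catalog discovery starts and finishes,
    so the time spent is visible in the tap's output."""
    logger.info("Beginning catalog discovery...")
    start = time.perf_counter()
    catalog = run_discovery()
    logger.info("Catalog discovery finished in %.1fs", time.perf_counter() - start)
    return catalog
```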
