[Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit #9600
Thanks for raising this @joellabes! Agreed, it makes sense that we'd want to respect order by clauses. We'd also want to respect any limit clause that is already part of the query.

Workaround

The standard definition is:

```jinja
{% macro default__get_limit_subquery_sql(sql, limit) %}
select *
from (
{{ sql }}
) as model_limit_subq
limit {{ limit }}
{% endmacro %}
```

You can get the behavior you're after by overriding this macro in a file within your project's macros directory:

```jinja
{% macro default__get_limit_subquery_sql(sql, limit) %}
{{ sql }}
limit {{ limit }}
{% endmacro %}
```
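With that override in place, a preview like the following (just an example invocation) appends the limit directly instead of wrapping the query in a subquery:

```shell
dbt show --inline "select 1 as id order by id" --limit 5
```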
ooooh i like that workaround. on my way to propose a PR in internal-analytics 👀
@dbeatty10 can you clarify what you mean here? Aren't preview queries by definition ephemeral (not in the materialization sense) and unbreakable?
Like you mentioned originally, any query that uses a limit won't work with the macro override above. Here's an example that doesn't work:

```shell
dbt show --inline "select 1 as id limit 3"
```
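With the override above in place, that presumably renders to invalid SQL along these lines (two limit clauses, using dbt show's default limit of 5):

```sql
select 1 as id limit 3
limit 5
```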
OK got it! I guess my pitch is that previews don't have to be protected from that happening - if you accidentally wind up issuing a query with two limit clauses, you get the error back from the warehouse and remove one of them.
Hi @dbeatty10, splitting hairs here :-) I don't know if I would say this is a bug, a regression, or a feature request.
Very good points @dsmdavid -- we could reasonably categorize it as any of those. We've taken a couple swings at this already, and the latter has interactions with other parts of dbt.

Ideal

In the most ideal scenario, we'd just do the following:

1. Add a limit if none exists in the SQL.
2. If a limit already exists in the SQL, and it is less than the provided CLI/Cloud limit, then there's nothing to do.
3. Otherwise, replace the preexisting SQL limit with the CLI/Cloud limit.

The devil would be in the implementation details due to the query re-writing, so this could be a large lift (a rough sketch of what that rewriting might look like follows the variant list below). We'd probably opt for a more practical option that is easier to implement, like the one below.

Less ideal, but more practical

See below for a not-quite-perfect workaround that seems reasonably practical. There are two variants we could choose between when applying the limit:

1. Variant 1: wrap the query in a limiting subquery unless the compiled code contains an order by; otherwise append the limit directly.
2. Variant 2: wrap the query in a limiting subquery only when the compiled code already contains a limit; otherwise append the limit directly.
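To make the "ideal" strategy concrete, here's a minimal sketch of what that rewriting could look like as a macro. rewrite_limit is a hypothetical name, and the naive trailing-limit regex is exactly the implementation detail that makes this a large lift (comments, subqueries, and string literals would all defeat it):

```jinja
{% macro rewrite_limit(compiled_code, cli_limit) %}
{# look for a bare "limit N" at the very end of the compiled SQL #}
{%- set match = modules.re.search('\\blimit\\s+(\\d+)\\s*$', compiled_code.lower()) -%}
{%- if match is none -%}
{# 1. no limit in the SQL: append one #}
{{ compiled_code }}
limit {{ cli_limit }}
{%- elif match.group(1) | int <= cli_limit -%}
{# 2. the existing limit is already within bounds: nothing to do #}
{{ compiled_code }}
{%- else -%}
{# 3. replace the preexisting limit with the CLI/Cloud limit #}
{{ modules.re.sub('(?i)\\blimit\\s+\\d+\\s*$', 'limit ' ~ cli_limit, compiled_code) }}
{%- endif -%}
{% endmacro %}
```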
Is there any way that we get the best of both worlds without re-writing the query? I don't think so. But we can do better than just choosing one variant to apply to all scenarios: rather, we can dynamically choose one or the other depending on the situation.

Proposal

I'd propose that we choose between these two (imperfect!) strategies dynamically, based on what the compiled code contains.
Both variants work great in all scenarios except the following edge case:

```sql
select 1 as id
order by id
limit 3
```

Either one will break down in this scenario, but each in a different way: variant 1 will lead to a syntax error (two limit clauses), whereas with variant 2 the underlying database may clobber the order by clause.

Code

Here's the code for the proposal, including the use_limit_subquery heuristic that picks between the two strategies:
```jinja
{% macro get_show_sql(compiled_code, sql_header, limit) -%}
{%- if sql_header -%}
{{ sql_header }}
{%- endif -%}
{%- if limit is not none -%}
{%- if use_limit_subquery(compiled_code) is sameas true -%}
{{ get_limit_subquery_sql(compiled_code, limit) }}
{%- else -%}
{{ get_limit_sql(compiled_code, limit) }}
{%- endif -%}
{%- else -%}
{{ compiled_code }}
{%- endif -%}
{% endmacro %}

{% macro get_limit_subquery_sql(sql, limit) %}
{{ adapter.dispatch('get_limit_subquery_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_subquery_sql(sql, limit) %}
select *
from (
{{ sql }}
) as model_limit_subq
limit {{ limit }}
{% endmacro %}

{% macro get_limit_sql(sql, limit) %}
{{ adapter.dispatch('get_limit_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_sql(sql, limit) %}
{{ sql }}
limit {{ limit }}
{% endmacro %}

{% macro use_limit_subquery(compiled_code) %}
{{ adapter.dispatch('use_limit_subquery', 'dbt')(compiled_code) }}
{% endmacro %}

{# variant 1: wrap in a subquery unless the compiled code contains an order by #}
{% macro default__use_limit_subquery(compiled_code) %}
{%- do return('order by' not in compiled_code.lower()) -%}
{% endmacro %}
```

Examples for variant 1

Please bear with the "LaTeX" casing I adopted in the examples below. This mixed casing is included to show that the proposed solution is case-insensitive. The examples below demonstrate variant 1. To switch them to variant 2, just replace the macro above with:

```jinja
{# variant 2: wrap in a subquery only when the compiled code already contains a limit #}
{% macro use_limit_subquery(compiled_code) %}
{%- do return('limit' in compiled_code.lower()) -%}
{% endmacro %}
```

Happy path 👍

The following models work as intended when using variant 1. In each example, the model code is shown first, followed by the query that dbt show --limit 5 would issue:
Model:

```sql
select 1 as id
```

Rendered show query:

```sql
select *
from (
select 1 as id
) as model_limit_subq
limit 5
```

Model:

```sql
select 1 as id
LiMiT 3
```

Rendered show query:

```sql
select *
from (
select 1 as id
LiMiT 3
) as model_limit_subq
limit 5
```

Model:

```sql
select 1 as id
OrDeR bY id
```

Rendered show query:

```sql
select 1 as id
OrDeR bY id
limit 5
```

Model:

```sql
select 1 as id
OrDeR bY id
-- LiMiT 3
```

Rendered show query:

```sql
select 1 as id
OrDeR bY id
-- LiMiT 3
limit 5
```

Unhappy path / edge case 👎

Model:

```sql
select 1 as id
OrDeR bY id
LiMiT 3
```

Rendered show query (syntax error: two limit clauses):

```sql
select 1 as id
OrDeR bY id
LiMiT 3
limit 5
```

Model:

```sql
select 1 as id
-- OrDeR bY id
LiMiT 3
```

Rendered show query (syntax error again, because the commented-out order by still trips the heuristic and the limit is appended directly):

```sql
select 1 as id
-- OrDeR bY id
LiMiT 3
limit 5
```

Making each variant configurable

We could consider making a flag so that users can toggle which trade-off they want to accept: a limit that may clobber the order by (variant 2) versus a potential syntax error from a duplicate limit (variant 1).
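One hypothetical shape for such a toggle on the CLI; the --show-limit-strategy flag and the model name are made up, not an existing dbt option:

```shell
# hypothetical flag, not implemented today:
dbt show -s my_model --limit 5 --show-limit-strategy append   # or: subquery
```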
Enthusiastic about a flag where you can choose your tradeoff! I assume the sample above would still break one way or the other, depending on which value the flag is set to?
Good point @joellabes! Yep, this is a perfect example of two competing edge cases: variant 1 would be best for one of them, and variant 2 for the other. As mentioned earlier, the code provided above comes with trade-offs that we can't fully get around without re-writing the query itself to either overwrite an existing limit or leave it in place. To avoid bewilderment/frustration while wrestling with this, I have to remind myself of the true root cause: once a query is wrapped in a subquery, the database is free to ignore the inner order by.

So all these gymnastics are just trying to do our best to work around that undesirable behavior. Here are a few follow-up ideas that are not mutually exclusive and could be used in combination.

Idea 1

One idea would be making the "show strategy" configurable somehow on a model-by-model basis, e.g., with a model config like the one sketched below.
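As a sketch, with a hypothetical config name (show_limit_strategy and its value are illustrative, not an existing dbt config):

```sql
{{ config(show_limit_strategy="append_limit") }}

select 1 as id
order by id
```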
The available strategies could be named like this (or alternatives): something like subquery_limit, append_limit, and no_limit. The last one is to not apply any additional limit at all and just trust that the user is handling limits within the model itself.

Idea 2

Another idea is to allow users to make assertions about their model via a model-level config, e.g., a config like the one sketched below.
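Sketching that idea with a hypothetical config shape (the name show_code_contains and its keys are made up):

```sql
{{ config(show_code_contains={"order_by": true, "limit": false}) }}

select 1 as id
order by id
```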
If we added a model-level config that allows users to be explicit, then we could raise appropriate warnings whenever our inference disagrees with what the user asserted.

Idea 3

The real edge case is when both an order by and a limit are present. For users that have a limit clause at the end of their model code, we could encourage a pattern like the following that uses either the hard-coded limit in the model or the CLI limit (whichever is least). Suppose the user had this before:

```sql
limit 17
```

Then we'd encourage them to use this instead:

```sql
limit {{ [17, invocation_args_dict.get("limit", 999)] | min }}
```

Gnarly! 😬 But if the model has this code, then we can bypass adding another limit altogether because the model will already have it. But like in the other scenarios discussed above, we'd probably need the user to explicitly say that they don't need/want anything added to the show query (for this model specifically), i.e., the kind of model-level config described in the ideas above. A fuller model example follows below.
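For illustration, here's the pattern in a fuller model; the model name events and its columns are hypothetical:

```sql
-- with dbt show --limit 5, the final line renders to "limit 5";
-- outside of dbt show (no "limit" in the invocation args), it falls back to 17
select id, created_at
from {{ ref('events') }}
order by created_at
limit {{ [17, invocation_args_dict.get("limit", 999)] | min }}
```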
Can somebody put this in terms that a non-developer can use? We do not employ a senior technical developer and found DBT easy to use before, but this recent change has caused many of our models to fail and we are unable to provide reporting to end-users. We ROUTINELY work with large data sets and have hundreds of models. It seems DBT is overcomplicating something that should be EASY and is BASIC SQL language (i.e. Limit 100000). Can you please specify EXACTLY what we need to write/change to get our previews working again?
I have tried to create a macro, but no matter what I do, I cannot get this to run on a model preview OR even a simple SQL query. Should I just consider migrating away from DBT at this point? As a note: we just upgraded our environment version since 1.5 is out-of-life and dbt is moving towards versionless, so we are just now seeing this issue.
Hey @kristinguyle! Sorry that this has broken your workflow. Here's the recap:
The Cloud IDE team are not going to re-enable arbitrarily large query previews or downloads, but that doesn't mean you need to move off of dbt. dbt is an excellent tool for building and testing your tables (such as the ones you've described). Instead of relying on previews for reporting, there are a few approaches you could consider for getting large result sets to your end users.
One other clarification: all of the limits are only applied during preview queries in development. If you have a billion row model, all billion rows will still be built into your warehouse when you run dbt build or dbt run.
@joellabes your response highlights a key oversight by DBT in its shift away from supporting centralized data teams. This change now forces us to introduce additional tools into our existing data ecosystem, which undermines both the efficiency and cost-effectiveness of our operations. The cascading impact on our ability to deliver timely and accurate data to our stakeholders is pushing us to consider alternative platforms. We've relied on DBT for a long time, particularly appreciating its ability to serve as an integrated, all-in-one solution for data transformation and modeling. However, it now appears that this model was not truly supported in the long run. That is the breakdown of the challenges we're facing with this transition under your proposed solutions.

This fragmentation of tools and processes is a significant step back for data professionals who need agility and cohesion in their workbench. As such, the shift DBT is making not only affects technical efficiency but also the strategic alignment of data delivery with business needs.
@kristinguyle the specifics of the problems you're describing make me think that we're slightly misaligned. Put another way, I sympathise with the high level goals you have, but I disagree with your assessment that dbt is shifting away from supporting centralised data teams, and the approaches you're describing don't mesh with my understanding of most dbt projects. Instead of going back and forth on a GitHub issue, could we discuss this live? Here's a meeting booking link - I think I'll be able to be more helpful that way. (Heads up: after this week I'm travelling for our conference and PTO, so most of October doesn't show availability.)
Is this a new bug in dbt-core?
Current Behavior
At the moment, the limit flag on dbt show is achieved by wrapping the main query in a subquery rather than appending the limit directly; both shapes are shown below.
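These are the two shapes in question, quoted from the limit macros discussed above, with {{ sql }} standing in for the compiled model code:

```sql
select *
from (
{{ sql }}
) as model_limit_subq
limit {{ limit }}
```

instead of the more idiomatic

```sql
{{ sql }}
limit {{ limit }}
```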
Because of this, the output isn't always correctly sorted (demonstrated in a screenshot from the Cloud IDE, not reproduced here; behaviour is consistent across Cloud IDE, Cloud CLI and Core CLI).
Expected Behavior
Sort order to be respected. The easiest way to do this would be to not wrap in a subquery and instead append the limit directly, as shown above.
I imagine that we're doing it this way to abstract over differences in syntax across adapters (lookin at you, SQL Server)
(Potentially we're also doing it to avoid

```shell
dbt show --inline "select * from my_table limit 10" --limit 100
```

from rendering two limit clauses and causing an error. But IMO if you do that, you should get the error from the DWH and then remove one of your limits.)

Steps To Reproduce
Unfortunately it reproduces sporadically - I'd hoped to provide you a basic query that generates a series with dbt_utils.generate_series and sorts by it (a sketch of such a query is below), but I think there's something to do with how each DWH distributes the results that makes this tricky to force into existence.
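For reference, an attempt along those lines might look like this (a hypothetical model; as noted above, it only misbehaves sporadically, depending on how the warehouse happens to distribute rows):

```sql
-- preview with: dbt show -s this_model --limit 100
with numbers as (
    {{ dbt_utils.generate_series(upper_bound=10000) }}
)
select generated_number
from numbers
order by generated_number desc
```

Relevant log output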
No response
Environment
- dbt: 1.6.latest and 1.7.0
Which database adapter are you using with dbt?
snowflake
Additional Context
No response