[Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit #9600

Open
joellabes opened this issue Feb 19, 2024 · 12 comments
Labels: enhancement (New feature or request), Impact: Exp, Refinement (Maintainer input needed)

Comments

@joellabes
Contributor

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

At the moment, the limit flag on dbt show is implemented by wrapping the main query in a subquery, so:

select *
    from (
        with summarised as (select * from analytics_dev.dbt_jlabes.messages)
        
        select 
            len(text), 
            *  
        from summarised 
        order by len(text) asc
    ) as model_limit_subq
    limit 500

instead of the more idiomatic

with summarised as (select * from analytics_dev.dbt_jlabes.messages)

select 
    len(text), 
    *  
from summarised 
order by len(text) asc
limit 500

Because of this, the output isn't always correctly sorted. (Screenshot from the Cloud IDE, but the behaviour is consistent across the Cloud IDE, Cloud CLI and Core CLI.)
[screenshot of incorrectly sorted preview results omitted]

Expected Behavior

Sort order to be respected. The easiest way to do this would be to not wrap in a subquery and instead append the limit directly, as shown above.

I imagine that we're doing it this way to abstract over differences in syntax across adapters (lookin at you, SQL Server).

(Potentially we're also doing it to avoid dbt show --inline "select * from my_table limit 10" --limit 100 from rendering two limit clauses and causing an error. But IMO if you do that, you should get the error from the DWH and then remove one of your limits.)
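For reference, appending the limit naively to that inline query would render something like the following sketch, which is presumably the failure mode being guarded against:

select * from my_table limit 10
limit 100 -- two limit clauses in a row: a syntax error on most warehouses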

Steps To Reproduce

Unfortunately it reproduces sporadically - I'd hoped to provide a basic query like dbt_utils.generate_series and sort by that, but I think it has something to do with how each DWH distributes the results, which makes this tricky to force into existence.

Relevant log output

No response

Environment

- dbt: 1.6.latest and 1.7.0

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

joellabes added the bug (Something isn't working) and triage labels Feb 19, 2024
@dbeatty10
Contributor

Thanks for raising this @joellabes !

Agreed, it makes sense that we'd want to respect order by within the underlying query.

We'd also want to respect limit within the underlying query, though. If we make the proposed change as-is, I think we'd be at risk of breaking preview queries that currently work. So I'm going to re-categorize this as a feature request.

Workaround

The standard definition is:

{% macro default__get_limit_subquery_sql(sql, limit) %}
    select *
    from (
        {{ sql }}
    ) as model_limit_subq
    limit {{ limit }}
{% endmacro %}

You can get the behavior you're after by creating a file within your project named something like macros/overrides/get_limit_subquery_sql.sql:

{% macro default__get_limit_subquery_sql(sql, limit) %}
    {{ sql }}
    limit {{ limit }}
{% endmacro %}
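
With that override in place, the limit is appended inline, so an order by at the end of the query is respected; for example (a sketch, assuming dbt show's default limit of 5):

-- dbt show --inline "select 1 as id order by id"
select 1 as id order by id
    limit 5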

dbeatty10 added the enhancement (New feature or request) label and removed the bug (Something isn't working) label Feb 20, 2024
@dbeatty10 dbeatty10 changed the title [Bug] dbt show --limit doesn't respect order by clauses and should avoid wrapping in subqueries where possible [Feature] Avoid wrapping in subqueries where possible and respect order by clauses with dbt show --limit Feb 20, 2024
@joellabes
Contributor Author

joellabes commented Feb 20, 2024

ooooh i like that workaround. on my way to propose a PR in internal-analytics 👀

I think we'd be at risk of breaking preview queries that are working currently.

@dbeatty10 can you clarify what you mean here? Aren't preview queries by definition ephemeral (not in the materialization sense) and unbreakable?

@dbeatty10
Contributor

Like you mentioned originally, any query that uses a limit won't work with the macro override above.

Here's an example that doesn't work:

dbt show --inline "select 1 as id limit 3"
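
With the override above in place, that renders roughly as follows (a sketch, assuming the default preview limit of 5):

select 1 as id limit 3
    limit 5 -- second limit clause: the database raises a syntax error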

@joellabes
Contributor Author

OK got it! I guess my pitch is that previews don't have to be protected from that happening - if you accidentally wind up issuing select 1 as id limit 3 limit 100, you solve it by going "oops!" and removing the --limit flag (or the limit statement from your actual query, as appropriate)

dbeatty10 added the Refinement (Maintainer input needed) label and removed the triage label Feb 27, 2024
@dsmdavid

dsmdavid commented Mar 15, 2024

Hi @dbeatty10, splitting hairs here :-) I don't know if I would call this a feature request or a bug fix. Yes, I hear you that some queries currently working would stop working (the double-limit queries); however, I could also frame it as: there are queries that used to work as expected that now return incorrect results without throwing errors (e.g. if you downgrade the environment to 1.6 they work as expected), so I would consider this more of a regression bug than a new feature 😅
Appreciate the workarounds. There's also another one (at least for Snowflake, reported by Jack Nguyen https://getdbt.slack.com/archives/CBSQTAPLG/p1698024014827109?thread_ts=1698004749.652379&cid=CBSQTAPLG ):
using an explicit limit inside the query instead of relying on dbt's wrapping.
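
For completeness, that pattern looks roughly like this (a sketch; the behaviour was reported for Snowflake and may be engine-specific):

select
    len(text),
    *
from summarised
order by len(text) asc
limit 500 -- explicit limit in the model; reportedly keeps Snowflake from discarding the ordering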

@dbeatty10
Contributor

dbeatty10 commented Mar 16, 2024

Very good points @dsmdavid -- we could reasonably categorize it as any of those.

We've taken a couple swings at this:

The latter has interactions with ORDER BY clauses that are the subject of this issue.

Ideal

In the most ideal scenario, we'd just do the following:

Add a limit if none exists in the SQL. If a limit already exists in the SQL, and it is less than the provided CLI/Cloud limit, then there's nothing to do. Otherwise, replace the preexisting SQL limit with the CLI/Cloud limit.

The devil would be in the implementation details due to the query re-writing, so this could be a large lift. We'd probably opt for a more practical option that is easier to implement, like below.
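
For a taste of what that might look like, here's a rough sketch as a Jinja macro (hypothetical macro name; the regex only understands a bare trailing "limit N" and would be fooled by comments or string literals, which is exactly why real query re-writing needs a SQL parser):

{% macro get_show_limit_sql(sql, cli_limit) %}
  {# look for a bare "limit N" at the very end of the compiled SQL #}
  {%- set pattern = '(?i)\\blimit\\s+(\\d+)\\s*;?\\s*$' -%}
  {%- set match = modules.re.search(pattern, sql) -%}
  {%- if match is none -%}
    {# no pre-existing limit: append the CLI/Cloud limit #}
    {{ sql }}
    limit {{ cli_limit }}
  {%- else -%}
    {# pre-existing limit: keep whichever of the two limits is smaller #}
    {%- set effective = [match.group(1) | int, cli_limit] | min -%}
    {{ modules.re.sub(pattern, 'limit ' ~ effective, sql) }}
  {%- endif -%}
{% endmacro %}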

Less ideal, but more practical

See below for a not-quite-perfect workaround that seems reasonably practical.

There are two variants we could choose between when applying LIMIT to the end of the query, each with its own trade-offs:

  1. Don't use a subquery -- accepting the trade-off that a pre-existing LIMIT clause will lead to a syntax error.
  2. Use a subquery -- accepting the trade-off that any pre-existing ORDER BY clause may be clobbered silently.

Is there any way to get the best of both worlds without re-writing the query? I don't think so.

But we can do better than choosing one to apply to all scenarios. Rather, we can dynamically choose one or the other depending on the situation.

Proposal

I'd propose that we choose between these two (imperfect!) strategies:

  1. IFF there is a pre-existing ORDER BY clause, don't use a subquery.
  2. IFF there is a pre-existing LIMIT clause, use a subquery.

Both variants work well in all scenarios except the following edge case:

select 1 as id
order by id
limit 3

Either one will break down in this scenario, but each in a different way: using variant 1 will lead to a syntax error, whereas with variant 2, the underlying database may clobber the order by clause.

Code

Code to override get_show_sql, etc:

macros/overrides/show.sql

{% macro get_show_sql(compiled_code, sql_header, limit) -%}
  {%- if sql_header -%}
  {{ sql_header }}
  {%- endif -%}
  {%- if limit is not none -%}
    {%- if use_limit_subquery(compiled_code) is sameas true -%}
      {{ get_limit_subquery_sql(compiled_code, limit) }}
    {%- else -%}
      {{ get_limit_sql(compiled_code, limit) }}
    {%- endif -%}
  {%- else -%}
  {{ compiled_code }}
  {%- endif -%}
{% endmacro %}

{% macro get_limit_subquery_sql(sql, limit) %}
  {{ adapter.dispatch('get_limit_subquery_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_subquery_sql(sql, limit) %}
    select *
    from (
        {{ sql }}
    ) as model_limit_subq
    limit {{ limit }}
{% endmacro %}

{% macro get_limit_sql(sql, limit) %}
  {{ adapter.dispatch('get_limit_sql', 'dbt')(sql, limit) }}
{% endmacro %}

{% macro default__get_limit_sql(sql, limit) %}
    {{ sql }}
    limit {{ limit }}
{% endmacro %}

{% macro use_limit_subquery(compiled_code) %}
  {# return() keeps the boolean intact; rendering the dispatch result with {{ }} would turn it into a string, so the "is sameas true" check above would never pass #}
  {%- do return(adapter.dispatch('use_limit_subquery', 'dbt')(compiled_code)) -%}
{% endmacro %}

{% macro default__use_limit_subquery(compiled_code) %}
  {# naive substring check: matches any "order by", including inside window functions, strings, or comments #}
  {%- do return('order by' not in compiled_code.lower()) -%}
{% endmacro %}

Examples for variant 1

Please bear with the "LaTeX" casing I adopted in the example below. This mixed-casing is included to ensure the proposed solution is case-insensitive.

The example below demonstrates variant 1. To switch these examples to variant 2, just replace the use_limit_subquery macro with this definition instead:

{% macro use_limit_subquery(compiled_code) %}
  {%- do return('limit' in compiled_code.lower()) -%}
{% endmacro %}

Happy path 👍

The following models work as intended when using dbt show along with variant 1 as described above.

models/without_limit_or_order_by.sql

select 1 as id

dbt show runs this query:

    select *
    from (
        select 1 as id
    ) as model_limit_subq
    limit 5

models/without_order_by.sql

select 1 as id
LiMiT 3

dbt show runs this query:

    select *
    from (
        select 1 as id
LiMiT 3
    ) as model_limit_subq
    limit 5

models/without_limit.sql

select 1 as id
OrDeR bY id

dbt show runs this query:

    select 1 as id
OrDeR bY id
    limit 5

models/with_commented_limit.sql

select 1 as id
OrDeR bY id
-- LiMiT 3

dbt show runs this query:

    select 1 as id
OrDeR bY id
-- LiMiT 3
    limit 5

Unhappy path / edge case 👎

models/with_limit_and_order_by.sql

select 1 as id
OrDeR bY id
LiMiT 3

dbt show runs this query and generates a database syntax error:

    select 1 as id
OrDeR bY id
LiMiT 3
    limit 5

models/with_commented_order_by.sql

select 1 as id
-- OrDeR bY id
LiMiT 3

dbt show runs this query and generates a database syntax error:

    select 1 as id
-- OrDeR bY id
LiMiT 3
    limit 5

Making each variant configurable

We could consider adding a flag so that users can toggle which trade-off they want to accept (a limit that clobbers order by vs. a double-limit syntax error). If we surfaced a flag, we could name it something like --limit-subquery / --no-limit-subquery.

@joellabes
Contributor Author

Enthusiastic about a flag where you can choose your tradeoff!

I assume the above sample use_limit_subquery macros are intended as a proof of concept, but for the sake of completeness... code like row_number() over (partition by some_id order by date desc) and is_limited_liability_corp precludes us from using them as-is without Scunthorping ourselves.

@dbeatty10
Contributor

Good point @joellabes ! Yep, this is a perfect example of two competing edge cases.

Variant 1 would be best for the is_limited_liability_corp edge case, whereas variant 2 would be better for the row_number() over (partition by some_id order by date desc) edge case.
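
One incremental improvement: a word-boundary regex would at least stop identifiers that merely contain the keyword from matching (a hedged sketch using modules.re; it still can't tell a window-function order by, or a commented-out limit, from the real thing):

{% macro use_limit_subquery(compiled_code) %}
  {# variant 2 with word boundaries: is_limited_liability_corp no longer matches "limit",
     but a limit inside a comment or string literal still would #}
  {%- do return(modules.re.search('(?i)\\blimit\\b', compiled_code) is not none) -%}
{% endmacro %}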

As mentioned, the code provided above comes with trade-offs that we can't get around without re-writing the query itself to either overwrite an existing LIMIT clause or append one (which is ideal, but not trivial).

To avoid bewilderment/frustration while wrestling with this, I have to remind myself of the true root cause:

  • for some reason I don't understand, some data platforms don't preserve order by clauses when the query is wrapped in a subquery.

So all these gymnastics are just us doing our best to work around that undesirable behavior.

Here's a few follow-up ideas that are not mutually exclusive and could be used in combination.

Idea 1

One idea would be making the "show strategy" configurable somehow on a model-by-model basis.

e.g., add a model config like the following:

  • limit_strategy

The available strategies could be named like this (or alternatives):

  • append_limit
  • subquery_limit / potentially_clobber_order_by
  • none / empty

The last one is to not apply any additional limit at all and just trust that the user is handling limits within the model itself.
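
As a hypothetical illustration (assuming a limit_strategy config existed), a model that manages its own limit could opt out entirely:

-- models/my_model.sql
{{ config(limit_strategy='none') }}

select 1 as id
order by id
limit 3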

Idea 2

Another idea is to allow users to make assertions about their model via a model-level config:

e.g., add a model config like the following:

  • has_limit and/or has_order_by

If we added a model-level config that allows users to be explicit, then we could raise appropriate warnings during inference, like:

Warning: It looks like the `abc` model might include a `LIMIT` clause at the end of the query.
If not, set the `has_limit` model config to `false`.
Warning: It looks like the `abc` model might include an `ORDER BY` clause at the end of the query.
If not, set the `has_order_by` model config to `false`.
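
Hypothetically, a model with a trailing LIMIT but no ORDER BY could then declare that up front (assuming these configs existed):

-- models/my_model.sql
{{ config(has_limit=true, has_order_by=false) }}

select 1 as id
limit 3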

Idea 3

The real edge case is when both LIMIT and ORDER BY are detected. The crux of the issue is that we want to limit the number of results in dbt show so that we don't use an excessive amount of memory or screen real estate when the model has a ton of rows.

For users that have a limit clause at the end of their model code, we could encourage a pattern like the following that uses either the hard-coded limit in the model or the CLI limit, whichever is smaller.

Suppose the user had this before:

limit 17

Then we'd encourage them to use the invocation_args_dict and the min filter to dynamically choose the lesser:

limit {{ [17, invocation_args_dict.get("limit", 999)] | min }}

Gnarly! 😬

If the model has this code, then we can bypass adding another limit altogether, because the model already handles it. But as in the other scenarios discussed above, we'd probably need the user to explicitly say that they don't want anything added to the show query for this model specifically, i.e., the none / empty strategy described above.
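
To illustrate how that expression resolves (assuming invocation_args_dict carries the CLI limit when one is passed):

-- dbt show --limit 5  ->  limit {{ [17, 5] | min }}   ->  limit 5
-- no CLI limit        ->  limit {{ [17, 999] | min }} ->  limit 17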

@kristinguyle

kristinguyle commented Sep 26, 2024

Can somebody put this in terms that a non-developer can use? We do not employ a senior technical developer and found DBT easy to use before, but this recent change has caused many of our models to fail and we are unable to provide reporting to end-users. We ROUTINELY work with large data sets and have hundreds of models. It seems DBT is overcomplicating something that should be EASY and is BASIC SQL language (i.e. Limit 100000). Can you please specify EXACTLY what we need to write/change to:

  1. make our existing models with LIMIT queryable
  2. run a new file query with a higher limit that we specify (I do not want to have to change this macro constantly when we need larger data sets.)

I have tried to create a macro:

{%- macro get_limit_subquery_sql(sql, limit) -%}
  {{ adapter.dispatch('get_limit_sql', 'dbt')(sql, 100000) }}
{%- endmacro -%}

But no matter what I do, I cannot get this to run on a model preview OR a simple SQL Query like Select * From stg_syspro__inventory_master

Or should I just consider migrating away from DBT at this point? As a note: We just upgraded our environment version since 1.5 is out-of-life and dbt is moving towards versionless. So we are just now seeing this as an issue.

@joellabes
Contributor Author

Hey @kristinguyle! Sorry that this has broken your workflow. Here's the recap:

  • The dbt Cloud IDE is designed for developing models, and part of that tooling includes the ability to preview a query and download the results to a spreadsheet.
  • For performance reasons (both your browser tab and our servers), we limit the number of rows returned by those previews to 10,000 (which is plenty for its intended purpose of checking "does my query behave the way I expect?")
  • dbt core v1.5 and earlier had a bug which meant that it was unintentionally possible to retrieve larger rowcounts. Later versions implement filtering on both the client side (your browser) and server side, so even overriding the macros will not enable you to retrieve infinite rows.
  • The change described in this issue and its corresponding PR impacts how dbt core applies the limit (to fix a bug where rows were sometimes incorrectly sorted), but it doesn't have anything to do with whether dbt Cloud should ask core to apply the limit.

The Cloud IDE team are not going to re-enable arbitrarily large query previews or downloads, but that doesn't mean you need to move off of dbt. dbt is an excellent tool for building and testing your tables (such as stg_syspro__inventory_master), but it's not intended to be used for data exports.

Instead, you should consider one of the following approaches:

  • Most BI tools support scheduled exports of data; it's likely that you could instead download the raw data from inside your BI tool
  • Your data warehouse's query tool might support arbitrarily large downloads. For example, the old Snowflake web experience had a limit of 100MB but my understanding is that the new Snowsight experience can download any size file. (You can read more about Snowflake options here)
  • If neither of those web-based experiences are possible, you could also consider downloading a SQL workbench such as DBeaver which lets you export the results of a query to CSV.

One other clarification: all of the limits are only applied during preview queries in development. If you have a billion row model, all billion rows will be built into your warehouse with dbt run, not just the first 10k.

@kristinguyle

@joellabes your response highlights a key oversight by DBT in its shift away from supporting centralized data teams. This change now forces us to introduce additional tools into our existing data ecosystem, which undermines both the efficiency and cost-effectiveness of our operations. The cascading impact on our ability to deliver timely and accurate data to our stakeholders is pushing us to consider alternative platforms.

We've relied on DBT for a long time, particularly appreciating its ability to serve as an integrated, all-in-one solution for data transformation and modeling. However, it now appears that this model was not truly supported in the long run.

Here’s a breakdown of the challenges we're facing with this transition under your proposed solutions:

  • Extracting Raw Data from BI Tools: We are now back to relying on Excel as a quasi-database solution, which introduces manual processes and increases the likelihood of errors. In 2024, this is an impractical and outdated approach for enterprise-level data management.

  • Querying in Snowflake: While it’s technically feasible, Snowflake’s native querying capabilities are less streamlined than DBT’s approach, especially when managing complex data transformations. Additionally, direct interaction with Snowflake for this purpose increases security risks, as it opens up access points for potentially sensitive datasets.

  • Introducing a Third SQL Workbench: Adding another SQL interface to our stack creates redundancy and confusion within the team. It dilutes operational efficiency, complicates workflows, and introduces friction in collaborative data development.

This fragmentation of tools and processes is a significant step back for data professionals who need agility and cohesion in their workbench. As such, the shift DBT is making not only affects technical efficiency but also the strategic alignment of data delivery with business needs.

@joellabes
Contributor Author

@kristinguyle the specifics of the problems you're describing make me think that we're slightly misaligned. Put another way, I sympathise with the high level goals you have, but I disagree with your assessment that dbt is shifting away from supporting centralised data teams, and the approaches you're describing don't mesh with my understanding of most dbt projects.

Instead of going back and forth on a GitHub issue, could we discuss this live? Here's a meeting booking link - I think I'll be able to be more helpful that way. (Heads up: after this week I'm travelling for our conference and PTO so most of October doesn't show availability).
