CTE Considerations #1787
Replies: 3 comments 2 replies
-
Thank you for posting this example. I'd add, it is always a good idea to try both approaches and observe the query plan for both. When we reference a CTE (your first example of the last two), you might get more query result caching benefit. With a CTE, Snowflake tries to read the orders table only once, then split it into separate streams (WithReference) in the logical plan. This can be great at reducing I/O in some cases, but definitely not all.
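A sketch of what "referencing the CTE" means here (the table name and filters are made up): both downstream CTEs read from the same `orders` CTE, so the planner can scan the underlying table once and feed both branches.

```sql
with orders as (

    -- hypothetical source table; both branches below reference this CTE,
    -- so the planner may scan it once (WithReference in the logical plan)
    select * from analytics.orders

),

recent_orders as (
    select * from orders where order_date >= '2023-01-01'
),

older_orders as (
    select * from orders where order_date < '2023-01-01'
)

select 'recent' as bucket, count(*) as n from recent_orders
union all
select 'older' as bucket, count(*) as n from older_orders
```

Whether the single scan actually helps depends on the plan, which is why it's worth checking the query profile for both variants.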
-
Another consideration with CTEs is the use of `select *` in import CTEs.

tl;dr: Snowflake can occasionally determine on its own which columns it actually needs from a source table when the query is simple, even if the import CTE does a `select *`. That column pruning does not hold up in more complex queries, so it's still safer to select only the columns you need.

Narrow table example

Snowflake comes with a free benchmarking dataset (the TPC-H sample data used below) that makes this easy to test. In the example below, I do a very simple query with two import CTEs that both use a `select *`:

```sql
with customers as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER"

),

orders as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."ORDERS"

),

customer_orders as (

    select
        c_custkey,
        count(distinct o_orderkey)
    from customers
    join orders on customers.c_custkey = orders.o_custkey
    group by 1

)

select *
from customer_orders
```

However, this pattern doesn't hold when you get into more complex queries. As you can see, I added a third table (`nations`), which requires another join:

```sql
with customers as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER"

),

orders as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."ORDERS"

),

nations as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."NATION"

),

customer_orders as (

    select
        c_custkey,
        count(distinct o_orderkey) as unique_orders
    from customers
    left join orders on customers.c_custkey = orders.o_custkey
    group by 1

),

combined as (

    select
        customers.*,
        customer_orders.unique_orders,
        nations.*
    from customers
    join customer_orders on customers.c_custkey = customer_orders.c_custkey
    left join nations on customers.c_nationkey = nations.n_nationkey

)

select
    c_custkey,
    n_name as nation_name,
    unique_orders
from combined
```

I ran two versions of this query: one with the `select *` import CTEs above, and one selecting only the columns actually used downstream.

Testing parameters:
Row Count: 150 million rows in every table, 8 columns in the `customer` table
Since that record set is a very narrow table, I decided to add more columns to the `customer` table:

```sql
create or replace table development.dbt_bregenold.customer as
select
    c.*,
    RANDSTR(1000, random()) as rand_string1,
    ...
    RANDSTR(1000, random()) as rand_string30,
    RANDOM() as rand_number1,
    ...
    RANDOM() as rand_number30
from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER" as c
```

Row Count: 150 million rows in every table, 68 columns in the `customer` table
As you can see, there are situations where this makes a difference. In general, I think the recommendation to select only the columns you need in import CTEs holds up, especially on wide tables.
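For reference, a sketch of what the explicit-column variant of those import CTEs could look like (the column choices are my assumption, based on which columns the query actually uses):

```sql
with customers as (

    -- only the join key and nation key are needed downstream
    select c_custkey, c_nationkey
    from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER"

),

orders as (

    select o_orderkey, o_custkey
    from "RAW_TPCH"."TPCH_SF1000"."ORDERS"

),

nations as (

    select n_nationkey, n_name
    from "RAW_TPCH"."TPCH_SF1000"."NATION"

),

customer_orders as (

    select
        c_custkey,
        count(distinct o_orderkey) as unique_orders
    from customers
    left join orders on customers.c_custkey = orders.o_custkey
    group by 1

)

select
    customers.c_custkey,
    nations.n_name as nation_name,
    customer_orders.unique_orders
from customers
join customer_orders on customers.c_custkey = customer_orders.c_custkey
left join nations on customers.c_nationkey = nations.n_nationkey
```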
-
There has been some more discussion on this topic in this Slack thread: https://getdbt.slack.com/archives/C2JRRQDTL/p1677082245391779
-
CTE = common table expression. It is defined using the `WITH` clause, and can be thought of as a temporary named view that exists only in the query that is running the CTE.

Read through pretty much any dbt article about how we do work in dbt and you'll see guidance or actual examples of code using a ton of CTEs. They're great for breaking up code into logical units of work, and they make code much easier to read. However, as with all things in life, a good thing used in a bad way will lead to bad results. While CTEs generally do not impact performance when compared to a subquery, there are certain scenarios where they can be misused to poor effect.
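A minimal sketch of the idea (the table and column names are made up):

```sql
-- recent_orders behaves like a temporary named view: it exists only
-- for the duration of this one query
with recent_orders as (

    select order_id, customer_id
    from orders
    where order_date >= '2023-01-01'

)

select
    customer_id,
    count(order_id) as order_count
from recent_orders
group by 1
```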
Quick note: I am well versed in Snowflake, but lack the same depth of experience in BigQuery and Redshift. I hope others in the community will contribute their knowledge on those two technologies.
Snowflake
First up, let's tackle import CTEs. These are the CTEs at the top of a SQL statement that make it easy to see all the dependencies of the model you're working in. In that pattern, `customers` and `orders` are the import CTEs, where we do a `select *` on the `ref()` to show that we're using this table in the query. They're later referenced in `customer_order_cnt` and `final`, which do the actual logic of the query.

Modern query optimizers are typically smart enough to handle this structure and pull from the table in the way you'd expect. You can also compare a query written this way to the same logic written with subqueries, and you'll see the same query plan.
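A sketch of that import-CTE pattern as it might appear in a dbt model (the `customers`/`orders`/`customer_order_cnt`/`final` names come from the text above; the columns and join logic are illustrative):

```sql
-- import CTEs: every upstream dependency is visible at the top
with customers as (
    select * from {{ ref('customers') }}
),

orders as (
    select * from {{ ref('orders') }}
),

-- logical CTEs: the actual work of the model
customer_order_cnt as (

    select
        customers.customer_id,
        count(orders.order_id) as order_count
    from customers
    left join orders on customers.customer_id = orders.customer_id
    group by 1

),

final as (
    select * from customer_order_cnt
)

select * from final
```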
This is fantastic, but it's important to remember that you should always filter early in your queries. This has some pretty big implications for import CTEs, especially when you need to use two different slices from the same dataset. If you run into an instance where you're performing two pieces of logic on different sets of the same table, you should import the CTE twice to avoid an `OR` join condition.

Here's an example to show that point. First, let's query using an import CTE that we filter 2 different ways:
Then, we'll do the same thing, but this time we'll filter in the CTEs with the direct table call:
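A sketch of the second pattern, using the same made-up names: the source is imported twice, and each slice is filtered right at the table call.

```sql
-- import the source twice, filtering each slice at the source
with card_payments as (
    select * from {{ ref('payments') }} where method = 'card'
),

cash_payments as (
    select * from {{ ref('payments') }} where method = 'cash'
),

final as (

    select
        card_payments.customer_id,
        card_payments.amount as card_amount,
        cash_payments.amount as cash_amount
    from card_payments
    join cash_payments on card_payments.customer_id = cash_payments.customer_id

)

select * from final
```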
As you can see, the second query plan is much simpler, and you'll generally get faster results with this approach.