CTE Considerations #1787
Replies: 3 comments 2 replies
-
Thank you for posting this example. I'd add, it is always a good idea to try both approaches and observe the query plan for both. When we reference a CTE (your first example of the last two), you might get more query result caching benefit. With a CTE, Snowflake tries to read the orders table only once, then split it into separate streams (WithReference) in the logical plan. This can be great at reducing I/O in some cases, but definitely not all.
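A sketch of what "referencing the CTE" means here (the table name and filters are made up): both downstream CTEs read from the same `orders` CTE, so the planner can scan the underlying table once and feed both branches.

```sql
with orders as (

    -- hypothetical source table; both branches below reference this CTE,
    -- so the planner may scan it once (WithReference in the logical plan)
    select * from analytics.orders

),

recent_orders as (
    select * from orders where order_date >= '2023-01-01'
),

older_orders as (
    select * from orders where order_date < '2023-01-01'
)

select 'recent' as bucket, count(*) as n from recent_orders
union all
select 'older' as bucket, count(*) as n from older_orders
```

Whether the single scan actually helps depends on the plan, which is why it's worth checking the query profile for both variants.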
-
Another consideration with CTEs is the use of `select *` in import CTEs.

tl;dr: Snowflake can occasionally determine on its own which columns it actually needs from a source table when the query is simple, even if the import CTE does a `select *`. That column pruning does not hold up in more complex queries, so it's still safer to select only the columns you need.

Narrow table example

Snowflake comes with a free benchmarking dataset (the TPC-H sample data used below) that makes this easy to test. In the example below, I do a very simple query with two import CTEs that both use a `select *`:

```sql
with customers as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER"

),

orders as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."ORDERS"

),

customer_orders as (

    select
        c_custkey,
        count(distinct o_orderkey)
    from customers
    join orders on customers.c_custkey = orders.o_custkey
    group by 1

)

select *
from customer_orders
```

However, this pattern doesn't hold when you get into more complex queries. As you can see, I added a third table (`nations`), which requires another join:

```sql
with customers as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER"

),

orders as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."ORDERS"

),

nations as (

    select *
    from "RAW_TPCH"."TPCH_SF1000"."NATION"

),

customer_orders as (

    select
        c_custkey,
        count(distinct o_orderkey) as unique_orders
    from customers
    left join orders on customers.c_custkey = orders.o_custkey
    group by 1

),

combined as (

    select
        customers.*,
        customer_orders.unique_orders,
        nations.*
    from customers
    join customer_orders on customers.c_custkey = customer_orders.c_custkey
    left join nations on customers.c_nationkey = nations.n_nationkey

)

select
    c_custkey,
    n_name as nation_name,
    unique_orders
from combined
```

I ran two versions of this query: one with the `select *` import CTEs above, and one selecting only the columns actually used downstream.

Testing parameters:
Row Count: 150 million rows in every table, 8 columns in the `customer` table
Since that record set is a very narrow table, I decided to add more columns to the `customer` table:

```sql
create or replace table development.dbt_bregenold.customer as
select
    c.*,
    RANDSTR(1000, random()) as rand_string1,
    ...
    RANDSTR(1000, random()) as rand_string30,
    RANDOM() as rand_number1,
    ...
    RANDOM() as rand_number30
from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER" as c
```

Row Count: 150 million rows in every table, 68 columns in the `customer` table
As you can see, there are situations where this makes a difference. In general, I think the recommendation to select only the columns you need in import CTEs holds up, especially on wide tables.
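For reference, a sketch of what the explicit-column variant of those import CTEs could look like (the column choices are my assumption, based on which columns the query actually uses):

```sql
with customers as (

    -- only the join key and nation key are needed downstream
    select c_custkey, c_nationkey
    from "RAW_TPCH"."TPCH_SF1000"."CUSTOMER"

),

orders as (

    select o_orderkey, o_custkey
    from "RAW_TPCH"."TPCH_SF1000"."ORDERS"

),

nations as (

    select n_nationkey, n_name
    from "RAW_TPCH"."TPCH_SF1000"."NATION"

),

customer_orders as (

    select
        c_custkey,
        count(distinct o_orderkey) as unique_orders
    from customers
    left join orders on customers.c_custkey = orders.o_custkey
    group by 1

)

select
    customers.c_custkey,
    nations.n_name as nation_name,
    customer_orders.unique_orders
from customers
join customer_orders on customers.c_custkey = customer_orders.c_custkey
left join nations on customers.c_nationkey = nations.n_nationkey
```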
-
There has been some more discussion on this topic in this Slack thread: https://getdbt.slack.com/archives/C2JRRQDTL/p1677082245391779
-
CTE = common table expression. It is defined using the `WITH` clause, and can be thought of as a temporary named view that exists only in the query that is running the CTE.

Read through pretty much any dbt article about how we do work in dbt and you'll see guidance or actual examples of code using a ton of CTEs. They're great for breaking up code into logical units of work, and they make code much easier to read. However, as with all things in life, a good thing used in a bad way will lead to bad results. While CTEs generally do not impact performance when compared to a subquery, there are certain scenarios where they can be misused to poor effect.
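A minimal sketch of the idea (the table and column names are made up):

```sql
-- recent_orders behaves like a temporary named view: it exists only
-- for the duration of this one query
with recent_orders as (

    select order_id, customer_id
    from orders
    where order_date >= '2023-01-01'

)

select
    customer_id,
    count(order_id) as order_count
from recent_orders
group by 1
```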
Quick note: I am well versed in Snowflake, but lack the same depth of experience in BigQuery and Redshift. I hope others in the community will contribute their knowledge on those two technologies.
Snowflake
First up, let's tackle import CTEs. These are the CTEs at the top of a SQL statement that make it easy to see all the dependencies of the model you're working in. In that pattern, `customers` and `orders` are the import CTEs, where we do a `select *` on the `ref()` to show that we're using this table in the query. They're later referenced in `customer_order_cnt` and `final`, which do the actual logic of the query.

Modern query optimizers are typically smart enough to handle this structure and pull from the table in the way you'd expect. You can also compare a query written this way to the same logic written with subqueries, and you'll see the same query plan.
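A sketch of that import-CTE pattern as it might appear in a dbt model (the `customers`/`orders`/`customer_order_cnt`/`final` names come from the text above; the columns and join logic are illustrative):

```sql
-- import CTEs: every upstream dependency is visible at the top
with customers as (
    select * from {{ ref('customers') }}
),

orders as (
    select * from {{ ref('orders') }}
),

-- logical CTEs: the actual work of the model
customer_order_cnt as (

    select
        customers.customer_id,
        count(orders.order_id) as order_count
    from customers
    left join orders on customers.customer_id = orders.customer_id
    group by 1

),

final as (
    select * from customer_order_cnt
)

select * from final
```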
This is fantastic, but it's important to remember that you should always filter early in your queries. This has some pretty big implications for import CTEs, especially when you need to use two different slices from the same dataset. If you run into an instance where you're performing two pieces of logic on different sets of the same table, you should import the CTE twice to avoid an `OR` join condition.

Here's an example to show that point. First, let's query using an import CTE that we filter 2 different ways:
Then, we'll do the same thing, but this time we'll filter in the CTEs with the direct table call:
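A sketch of the second pattern, using the same made-up names: the source is imported twice, and each slice is filtered right at the table call.

```sql
-- import the source twice, filtering each slice at the source
with card_payments as (
    select * from {{ ref('payments') }} where method = 'card'
),

cash_payments as (
    select * from {{ ref('payments') }} where method = 'cash'
),

final as (

    select
        card_payments.customer_id,
        card_payments.amount as card_amount,
        cash_payments.amount as cash_amount
    from card_payments
    join cash_payments on card_payments.customer_id = cash_payments.customer_id

)

select * from final
```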
As you can see, the second query plan is much simpler, and you'll generally get faster results with this approach.