Add Table Partitioning Option for PostgreSQL #168
Comments
Thanks for the request @sheyd! I took a look at the docs you've sent over. Table partitioning on pg looks pretty slick, but also very involved. There are a couple of challenges for us to sort out here: in dbt, users don't specify table DDL. Instead, dbt creates tables with the schema described by the model's select statement. One approach would be to create an empty table from the specified query, then create the partitioned table from that table's definition:

```sql
create table dbt_dbanin.tbl_partitioning_test__dbt_tmp as (
    -- model SELECT statement (with a limit 0)
    select
        1 as id,
        'drew'::text as name,
        'green'::text as favorite_color
    limit 0
);

-- create the partitioned table from the __tmp table definition
create table dbt_dbanin.tbl_partitioning_test
(
    like dbt_dbanin.tbl_partitioning_test__dbt_tmp
)
partition by HASH (favorite_color);
```

While this might work for creating the partitioned table, it's really only a tiny part of the overall implementation. The docs make it look to me like dbt would need to create one table per partition, specifying the range of dates in that partition. Further, there are a bunch of examples of custom functions, plus a discussion of triggers and some fairly wild-looking setup code.
In all, this feels like a pretty cumbersome process, and one that's a little outside the realm of how dbt usually operates! I am acutely interested in figuring out how dbt can better create date-partitioned datasets on data warehouses (like redshift / snowflake / bigquery), so I'm super happy to spend the time to explore some approaches on postgres too. As it stands though, I don't imagine this is something we'd implement in dbt-core. Maybe it's a good use for a Custom Materialization? Let me know if I'm overlooking or overcomplicating anything, and thanks again for the suggestion!
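For reference (not part of the original comment), the "one table per partition" requirement mentioned above looks roughly like the following in plain Postgres 10+ declarative partitioning; the table and column names here are made up for illustration:

```sql
-- hypothetical parent table, partitioned by a date range
create table analytics.email_messages (
    id       bigint,
    user_id  bigint,
    sent_at  timestamptz not null
)
partition by range (sent_at);

-- one child table must exist per range before rows can land in it
create table analytics.email_messages_2024_01
    partition of analytics.email_messages
    for values from ('2024-01-01') to ('2024-02-01');

create table analytics.email_messages_2024_02
    partition of analytics.email_messages
    for values from ('2024-02-01') to ('2024-03-01');

-- optional catch-all so unexpected dates don't error out (Postgres 11+)
create table analytics.email_messages_default
    partition of analytics.email_messages default;
```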
Closing this as I don't think it's something we're going to prioritize in the next 6 months. Happy to re-open if anyone feels strongly!
With the recent release of Postgres 14, Postgres further solidifies itself as a great choice for sub-petabyte scale. I would ask you to reconsider the prioritization of this, as it can lead to significant cost savings.
@joeyfezster Thanks for the bump! Do you have a sense of whether the implementation required for Postgres partitions has become any simpler in the years since this issue was originally opened? A lot of the original concerns still seem relevant. In the meantime, there's nothing blocking someone from implementing this as a custom materialization in user-space code (i.e., no fork of dbt necessary). I'd be interested to see the complexity required before deciding where this should live.
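For context on the "custom materialization in user-space code" route: a dbt project can define its own materialization block. The following is only a rough sketch under assumed names (`partitioned_table`, a `partition_by` model config); it is not an implementation from this thread, and it omits full-refresh handling, grants, docs persistence, and failure recovery:

```sql
{% materialization partitioned_table, adapter='postgres' %}

  {%- set target_relation = this.incorporate(type='table') -%}
  {%- set partition_by = config.get('partition_by', 'range (created_at)') -%}
  {%- set shape_table = target_relation.schema ~ '.' ~ target_relation.identifier ~ '__dbt_shape_tmp' -%}

  {{ run_hooks(pre_hooks) }}

  {% call statement('main') -%}
    -- capture the model's column definitions without materializing any rows
    create table {{ shape_table }} as
      select * from ({{ sql }}) as model_subq where false;

    drop table if exists {{ target_relation }} cascade;
    create table {{ target_relation }} (like {{ shape_table }})
      partition by {{ partition_by }};
    drop table {{ shape_table }};

    -- a default partition keeps the insert below from failing (Postgres 11+);
    -- a real implementation would create proper partitions here instead
    create table {{ target_relation.schema }}.{{ target_relation.identifier }}_default
      partition of {{ target_relation }} default;

    insert into {{ target_relation }}
    {{ sql }}
  {%- endcall %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}
```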
Hi, I would also value this functionality. I think Postgres' ability to function effectively as a data warehouse depends on partitioning. It seems that the simplest way to deal with partitions is to use the pg_partman extension. So, when creating the table, dbt would create the table as normal, but with a `partition by` clause.
Also, during each dbt run, before inserting any data into the table, it should run `partman.run_maintenance_proc`, so partman can make sure all the required partitions are created. Would you please consider reopening this issue? Kind regards
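For illustration only (not from the thread), the pg_partman pattern being described looks roughly like this; the table and column names are made up, and the `create_parent` arguments differ between pg_partman versions:

```sql
-- parent table must already be declaratively partitioned
create table analytics.events (
    event_id    bigint,
    created_at  timestamptz not null
)
partition by range (created_at);

-- let pg_partman create and manage the child partitions
-- (argument names follow the 4.x style; newer versions differ)
select partman.create_parent(
    p_parent_table := 'analytics.events',
    p_control      := 'created_at',
    p_type         := 'native',
    p_interval     := 'daily'
);

-- run before each load, e.g. as a dbt pre-hook or on-run-start hook
call partman.run_maintenance_proc();
```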
Also - would it work if I created partitioned tables manually? Would dbt just use them as if they're normal tables?
Hi everyone. @jtcohen6, thank you for the following comment:
Could you be so kind as to point me in the right direction on how to do this? In the dbt docs I did not find anything regarding custom materializations. Thank you
I'm using AlloyDB with list-partitioned tables, and yes, dbt can use partitioned tables on Postgres that are created prior to running dbt. We use incremental loading: you can opt for incremental and load just new rows, or, if you want to replace entire partitions of data, you can choose incremental and set the unique identifier to be the key you are partitioning on.

The default dbt patterns are not very optimal for replacing entire partitions of data (large-scale deletes and inserts can be slow). To combat this you may need to make adjustments to the incremental materialization to enable better partition pruning, particularly when using a composite key for your unique key. The default incremental materialization out of the box will probably work okay for most simple use cases, but you're likely to run into a few performance issues if your workload is anything but simple.

I've been tinkering with a few components/ideas that could eventually go into a new materialisation to enable hot-swapping of entire partitions for Postgres, but don't really have time to build out the materialisation. Happy to share my thoughts if anyone is looking at tackling this.
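A minimal example of the configuration pattern described above (model and column names are assumptions, not from the thread); the unique key is the same column the pre-created target table is partitioned on, so replaced rows stay within their partitions:

```sql
-- the unique_key is also the column the target table is partitioned on
{{ config(
    materialized='incremental',
    unique_key='sent_date'
) }}

select
    sent_date,
    user_id,
    count(*) as messages_sent
from {{ ref('stg_email_messages') }}
{% if is_incremental() %}
  -- only rebuild the most recent partitions' worth of data
  where sent_date >= current_date - interval '7 days'
{% endif %}
group by 1, 2
```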
The way I was looking at approaching this is to have dbt create the table as it normally would; however, this would just be a temporary table. The operation can be pretty fast as long as you add a 1=2 kind of filter in the outer select so that the product is an empty table. Once this is created, you can query the system catalogue and use the materialisation metadata to create a new table, appending the partition by clause with the field you want (assume partition by list).

I was then thinking about hot-swapping partitions, which involved creating a new partition with a dummy partition value (you can't create two partitions with the same partition value). You can then detach the partition and load the table as normal. Once loaded, you would need to detach the previously active partition and attach the new partition.

This is a simplification of the details required to implement it, as you would also normally want the materialization to be safe in the event of a failure/disconnect. That would involve a decent amount of thought around how to run in a transaction, or, if that isn't feasible with the mix of operations, ensuring that the job is recoverable. It's been a few months since I really looked at this, but I think I was also coming across the issue of having to ensure constraint names/indexes were not duplicated.
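A simplified sketch of the detach/attach swap being described, with made-up object names and the failure handling left out:

```sql
-- 1. build the replacement data in a standalone table shaped like the parent
create table analytics.events_2024_01_new
    (like analytics.events including defaults including constraints);

insert into analytics.events_2024_01_new
select *
from analytics.stg_events
where created_at >= '2024-01-01' and created_at < '2024-02-01';

-- 2. swap it in: detach the old partition, attach the new one, drop the old data
begin;
alter table analytics.events detach partition analytics.events_2024_01;
alter table analytics.events attach partition analytics.events_2024_01_new
    for values from ('2024-01-01') to ('2024-02-01');
drop table analytics.events_2024_01;
commit;
```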
I'm interested in this feature too, but I don't have the knowledge to implement it myself. I think there are two use cases here. The second case should be way simpler to implement, at the expense of more end-user care. What does the simple case bring to the table? Basically, I could use a different storage (Citus Columnar) on non-changing partitions. Also, partitioning by date should reduce the need for some indexes.
Just in case it helps anyone, I have a very rough implementation of partitioning by overriding a Postgres macro:
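The commenter's macro is not reproduced here. Purely as an illustration, an override along these lines could hook into dbt-postgres' `postgres__create_table_as` macro; the `partition_by` config name, the temp "shape" table shuffle, and the automatic default partition are all assumptions of this sketch:

```sql
{% macro postgres__create_table_as(temporary, relation, sql) -%}
  {%- set partition_by = config.get('partition_by', none) -%}

  {%- if temporary or partition_by is none -%}
    {# fall back to stock behaviour when no partitioning is requested #}
    {{ default__create_table_as(temporary, relation, sql) }}
  {%- else -%}
    {%- set shape_table = relation.schema ~ '.' ~ relation.identifier ~ '__dbt_shape_tmp' -%}

    -- empty table just to capture column names and types
    create table {{ shape_table }} as
      select * from ({{ sql }}) as model_subq where false;

    create table {{ relation }} (like {{ shape_table }})
      partition by {{ partition_by }};
    drop table {{ shape_table }};

    -- partitions (or a default partition) must exist before the insert succeeds
    create table {{ relation.schema }}.{{ relation.identifier }}_default
      partition of {{ relation }} default;

    insert into {{ relation }}
    {{ sql }}
  {%- endif -%}
{%- endmacro %}
```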
To use it:
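Hypothetically, a model would then opt in via its config block (the `partition_by` value is just an example, mirroring the hash-partitioning snippet earlier in this thread):

```sql
{{ config(
    materialized='table',
    partition_by='hash (favorite_color)'
) }}

select
    1 as id,
    'drew'::text as name,
    'green'::text as favorite_color
```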
I plan to create a pull request for this to the dbt-postgres repo once I fully test it. Next steps would be to create an incremental materialization that drops and recreates whole partitions instead of merging.

PS: with this approach, the select query needs to be run twice (once to get the list of partitions and again to actually insert the data). Another, more lightweight option is to use the pg_partman extension for this.

PS2: Postgres 17 plans to add split and merge commands (https://www.dbi-services.com/blog/postgresql-17-split-and-merge-partitions/), which would also add the option to 1) create just one default partition with all values, 2) select from the just-created table to get the min/max dates or list of values (which might be faster than repeating the original query, especially if the column is indexed), and 3) split it into the required partitions.
For completeness, a very rough implementation of a custom incremental partitioning strategy. The strategy macro:
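The original code block is not preserved here. As an illustration only, dbt's custom incremental strategy hook (a macro named `get_incremental_<strategy>_sql`) could be used roughly like this; the strategy name and the `partition_key` config are invented for the example:

```sql
{% macro get_incremental_partition_replace_sql(arg_dict) %}
  {%- set target        = arg_dict["target_relation"] -%}
  {%- set source        = arg_dict["temp_relation"] -%}
  {%- set dest_columns  = arg_dict["dest_columns"] -%}
  {%- set partition_key = config.get("partition_key") -%}
  {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}

  -- clear out every partition value that appears in the freshly built batch ...
  delete from {{ target }}
  where {{ partition_key }} in (
      select distinct {{ partition_key }} from {{ source }}
  );

  -- ... then reload those partitions wholesale
  insert into {{ target }} ({{ dest_cols_csv }})
  select {{ dest_cols_csv }}
  from {{ source }};
{% endmacro %}
```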
The model:
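And a hypothetical model that opts into that strategy (names are illustrative; the partitioned target table is assumed to be created by an override like the one sketched above, or to exist already):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='partition_replace',
    partition_key='view_date',
    partition_by='list (view_date)'
) }}

select
    view_date,
    page_id,
    count(*) as page_views
from {{ ref('stg_page_views') }}
{% if is_incremental() %}
  where view_date >= current_date - interval '3 days'
{% endif %}
group by 1, 2
```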
Opened a PR on #78. If you are interested in this feature, please show your support and feedback!
Feature
Feature description
Table partitioning has existed in PostgreSQL for a long time, but declarative partitioning (introduced in PostgreSQL 10) and the improvements in PostgreSQL 11 have made it significantly easier to use and more performant.
Who will this benefit?
Anyone who has solid Medium-Sized™ data that doesn't quite need the complexity or resources provided by BigQuery/RedShift/Snowflake.
For instance, in our current environment we have a table of all email messages ever sent to every user, which is 200 million rows. While queryable, it could definitely benefit from using a `range` partition based on dates or even user groups.