[Feature] Allow use of `unique_tmp_table_suffix` flag with Iceberg tables.
#688
Comments
I believe that this type of feature could be implemented and it does make sense in some cases.
Reflecting a bit more on this, could be a nice feature to have for some cases:
Here are some edge cases that could lead to ambiguous situations:
That said, @benbeckingsale, feel free to propose a feature for it.
Hi @nicor88, and thanks for your patience. To explain my case a little more in light of your answer above: I'm using dbt to incrementally update a model whose input is a 'raw' time-series table (a source updated by an external streaming process which often receives late data) and whose output is an aggregated time series. Both input and output tables are partitioned by hour. The end goal is to run the model for the last 'x' hour partitions (currently 2). I'm using a DAG per partition so that the model can be run more frequently for the most recent partition and less frequently for older ones. The DAG runs might overlap, hence the table collision issue, but the data written by each will not, since each DAG run is confined to a single partition in SQL. In other words, this would be a 'concurrent merge on different partitions'. This is a very common use case in my world, so I'd be very interested to know whether you can recommend alternative ways of doing this with dbt-athena, perhaps using a single DAG. This feature is still needed regardless, so I will aim to contribute soon.
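To make the partition layout above concrete, here is a minimal Python sketch of how the hour windows for the last 'x' partitions could be derived so that each DAG run touches exactly one of them. The function name `last_hour_partitions` and the half-open `[start, end)` convention are assumptions for illustration only; they are not part of dbt-athena or of the setup described in this thread.

```python
from datetime import datetime, timedelta, timezone


def last_hour_partitions(now: datetime, x: int = 2) -> list[tuple[datetime, datetime]]:
    """Return half-open [start, end) windows for the x most recent hour partitions, newest first.

    Hypothetical helper: each window would be handed to one DAG run,
    which filters its SQL to that single hour partition.
    """
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    return [
        (current_hour - timedelta(hours=i), current_hour - timedelta(hours=i - 1))
        for i in range(x)
    ]


# Example: at 10:37 UTC the two windows are 10:00-11:00 and 09:00-10:00.
windows = last_hour_partitions(datetime(2024, 1, 1, 10, 37, tzinfo=timezone.utc))
```

Because the windows are half-open and adjacent, two runs never write the same rows, even when the runs themselves overlap in time.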
@benbeckingsale Sorry for the late reply, I was out of office. Out of curiosity, why do you want to run this operation for the last x hours in parallel? Are you dealing with data so large that parallelism is required? Can't you process the last x hours with one query?
Is this your first time submitting a feature request?
Describe the feature
I would like to enable the use of the `unique_tmp_table_suffix` param when using Iceberg tables with a `merge` strategy. Currently this flag only has an effect when using `hive` tables with an `insert_overwrite` strategy.
Describe alternatives you've considered
Using a different `temp_schema` per run to avoid table name collisions, but it is much more desirable to keep all temp tables in the same schema.
Who will this benefit?
This should reduce the risk of temp table name collision for anyone using parallel DAGs/processes to write to the same model (and using Iceberg with a `merge` strategy).
In my case, I have DAGs A and B which write non-overlapping data to the same incremental model (DAG A writes recent data and more frequently; B writes older data and less frequently). If A and B overlap such that the temp table created by A still exists while B tries to create its own, DAG B will fail due to name collision.
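The collision between DAGs A and B can be sketched in a few lines of Python. This is only an illustration of the naming problem, not the adapter's actual code: the `__dbt_tmp` base name, the UUID-based suffix, the model name `agg_hourly`, and the helpers `tmp_table_name` and `create_tmp_table` are all assumptions made for the example.

```python
import uuid

existing_tables = set()  # stands in for the tables currently present in the temp schema


def tmp_table_name(model: str, unique_suffix: bool = False) -> str:
    """Build a temp table name for a model (hypothetical naming scheme)."""
    name = f"{model}__dbt_tmp"
    if unique_suffix:
        # A random per-run suffix; .hex avoids dashes in the table name.
        name += "_" + uuid.uuid4().hex
    return name


def create_tmp_table(name: str) -> None:
    """Simulate CREATE TABLE: fail if a table with this name already exists."""
    if name in existing_tables:
        raise RuntimeError(f"table {name} already exists")
    existing_tables.add(name)


# Without a unique suffix, overlapping runs A and B request the same name.
create_tmp_table(tmp_table_name("agg_hourly"))      # DAG A succeeds
try:
    create_tmp_table(tmp_table_name("agg_hourly"))  # DAG B hits the collision
    collided = False
except RuntimeError:
    collided = True

# With a unique suffix, each run gets its own temp table and both succeed.
name_a = tmp_table_name("agg_hourly", unique_suffix=True)
name_b = tmp_table_name("agg_hourly", unique_suffix=True)
create_tmp_table(name_a)
create_tmp_table(name_b)
```

The point of the sketch is that the fix is purely about naming: the data written by the two runs never overlaps, so giving each run its own temp table name is enough to let them coexist in one schema.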
If `unique_tmp_table_suffix` were supported in this case, the table name collision could be avoided.
Are you interested in contributing this feature?
Yes
Anything else?
Similar to: