Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990
@sgoley When you say "non-csv file types," are you talking only about tabular data with non-comma delimiters (e.g. TSVs)? Or are you also thinking about other formats for structured data (JSON, XML, ...)? If it's just a question of delimiters, could you provide a bit more background on the difficulty you're encountering? Is there a blocker to pre-processing your files, so as to switch from tab/semicolon/other delimiter to commas? We've got another issue already open for newline-delimited JSON support: #2365
Yes, in my case I am specifically talking about non-comma delimiters: TSVs, semicolon-delimited files (common in some European countries, given the decimal-comma convention), or "|" ('VERTICAL LINE', U+007C) delimited files (infrequent, but used within US financial systems). I completely understand that pre-processing is the only workable solution right now; I'm just opening the issue here since I searched and no truly similar issue has been raised or closed. After reading more into the agate CSV reader function, it looks like that particular API does not support a "sep" / "delimiter" parameter, which I assume is why this was not supported natively?
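As a rough illustration of the pre-processing workaround mentioned above, a small standalone script (not part of dbt; the file paths are hypothetical) can rewrite a tab- or semicolon-delimited file as a standard CSV before `dbt seed` runs:

```python
import csv

# Hypothetical file paths; adjust to your project layout.
SOURCE = "data/raw/accounts.tsv"   # tab- (or ";"/"|"-) delimited input
TARGET = "seeds/accounts.csv"      # comma-delimited output that dbt seed expects

with open(SOURCE, newline="", encoding="utf-8") as src, \
        open(TARGET, "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, delimiter="\t")  # change delimiter to ";" or "|" as needed
    writer = csv.writer(dst)                  # csv.writer quotes values that contain commas
    writer.writerows(reader)
```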
@sgoley You're right: we just call agate's CSV reader there without passing any extra keyword arguments. The good news is that, in that method, we have access to the seed's configuration. So I think this could be as simple as:
column_types = self.model.config.column_types
delimiter = self.model.config.delimiter
try:
    table = agate_helper.from_csv(path, text_columns=column_types, delimiter=delimiter)

So, I'm pretty close to tagging this as a
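For context, agate's `Table.from_csv` forwards extra keyword arguments to the underlying CSV reader (and ultimately to Python's csv module), so the helper change sketched above could plausibly look like the following. This is a minimal sketch, not dbt's actual helper; the signature and default value are assumptions for illustration:

```python
import agate

def from_csv(abspath, text_columns, delimiter=","):
    # Force the listed columns to be treated as text; let agate infer the rest.
    type_tester = agate.TypeTester(force={col: agate.Text() for col in text_columns})
    # Extra keyword arguments to Table.from_csv are passed through to the CSV
    # reader, so a non-comma delimiter ("\t", ";", "|", ...) works here.
    return agate.Table.from_csv(abspath, column_types=type_tester, delimiter=delimiter)
```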
Hello @jtcohen6, I'm following this issue along with the JSON-related one (#2365). I understand the intended use cases for dbt seeds outlined here: https://docs.getdbt.com/docs/building-a-dbt-project/seeds. The valid/good use cases are seeding small amounts of data as part of a build process (sampling), writing tests, documentation, etc.

> What if you need custom quoting, escape characters, etc.? Should we seek to add generalized support for all csv.reader() kwargs as configs? Or do we think a configurable delimiter covers 90% of the bases?

I believe the community would benefit from adding generalized CSV support. The comma is a very common character and shows up everywhere; if the data is not escaped or quoted, columns may shift even for simple files. (See the sketch after this comment for what this could look like.)

> Would you expect to be able to define these seed files with other file extensions (e.g. *.tsv)? That change would need to happen in a different part of the codebase.

Another community benefit: sometimes we need to work with small datasets in different formats. CSV support with a configurable delimiter, text enclosure, and escape character covers any character-delimited file (.txt, .tsv, pipe-delimited, etc.).

> Would we be better off saying "no" to all of the above, keeping seed support very simple (dbt isn't a data loader!), even at the cost of pre-processing?

Keeping it simple is great, but this limits the entry point to dbt to comma-separated files only. As the community grows, this is going to be a growing need. dbt seed is still not ideal for loading data into a warehouse; as part of the build process for a data project, sampling (maybe with size limitations to avoid abuse of the functionality) should be recommended as best practice to keep the feature as light as it is today. Yes, dbt is not a data loader, but can it be used to build an ELT pipeline based on small samples of different file formats? Mocking the (EL) process for common data formats and building a pipeline based on a small amount of data: let's not say "no", let's say yes 😄 The community needs this functionality. Pre-processing a file before it enters dbt is a little hectic and expensive for a modern-day tool. Thank you for considering this, and keep up the great work.
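To make the "generalized csv.reader() kwargs" idea above concrete, here is a minimal, hypothetical sketch. The config keys below are not existing dbt configs; they simply mirror keyword arguments that Python's csv.reader already accepts:

```python
import csv

# Hypothetical dialect options for a seed; these keys are not real dbt configs,
# they just mirror keyword arguments that csv.reader accepts.
seed_dialect = {
    "delimiter": ";",
    "quotechar": '"',
    "escapechar": "\\",
}

# "seeds/transactions.csv" is a made-up example file.
with open("seeds/transactions.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, **seed_dialect):
        print(row)
```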
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
I opened a pull request that adds this feature into dbt-core: #7186
Thanks @ramonvermeulen! Reopening, given you've picked this up.
… (#7186)

* Support configurable delimiter for seed files, default to comma (#3990)
* Update Features-20230317-144957.yaml
* Moved "delimiter" to seed config instead of node config
* Update core/dbt/clients/agate_helper.py (Co-authored-by: Cor <[email protected]>)
* Update test_contracts_graph_parsed.py
* fixed integration tests
* Added functional tests for seed files with a unique delimiter
* Added docstrings
* Added a test for an empty string configured delimiter value
* whitespace
* ran black
* updated changie entry
* Update Features-20230317-144957.yaml

Co-authored-by: Cor <[email protected]>
Hello team, when will this issue be fixed? We have both CSV and TSV files in our seeds, and it seems like TSV is not working in seeds. I'm on dbt-databricks 1.8.0. Thanks
@Ram-Dev7 See the call where the delimiter is passed as an argument: dbt-core/core/dbt/context/providers.py, line 1055 at commit bdf28d7.

And the function definition in dbt-common: https://github.com/dbt-labs/dbt-common/blob/65d7ba5d71f4ebea80605408824708bf0e5a99af/dbt_common/clients/agate_helper.py#L152

In that case I think it might be a bit harder to fix: e.g. this would need to be implemented in agate, or a very big change would have to happen in how dbt handles seeds (e.g. using Python internals or another library instead of agate?). But let's first see if we can get a reproducible example, to confirm that this is really the case.
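As a starting point for such a reproducible example, a quick standalone check (independent of dbt; the file path is made up) can confirm whether agate itself parses a tab-delimited file when given an explicit delimiter, since extra keyword arguments to `agate.Table.from_csv` are forwarded to the underlying CSV reader:

```python
import agate

# Hypothetical tab-delimited seed file; replace with a real path to reproduce.
path = "seeds/accounts.tsv"

# Keyword arguments not consumed by Table.from_csv are passed to the CSV reader,
# so delimiter="\t" should split tab-separated data into proper columns.
table = agate.Table.from_csv(path, delimiter="\t")

print(table.column_names)        # expect real column names, not one merged column
table.print_table(max_rows=5)    # quick visual check of the parsed rows
```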
Describe the feature
Use alternative delimiters for seeds besides "," (comma)
Describe alternatives you've considered
Currently, the only option is to load a non-CSV seed into the database as a temp table and then reference it as a "source".
Additional context
No, not database specific.
Who will this benefit?
Anyone who uses seeds.