[CT-260] [Bug] `generate_database_name` does not apply to sources #4753

alexrosenfeld10 · 2022-02-19T04:51:12Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

When using sources, the generate_database_name override doesn't apply to sources.

Expected Behavior

Sources would take the overridden database name, just like models.

This is either a very big oversight, or for some reason it's intentional. It's quite a blocker for me though. I've tried to get around it by adding some jinja in my sources yaml block, but no luck so far. I'm even trying to call macros from there, which I don't think is possible.

I suppose leaving database explicitly unset might be an OK workaround, as it would take the database of the session which most of the time would be correct.

Steps To Reproduce

Set up a dbt project
Configure a source and a model that consumes the source.
Add a generate_database_name override macro. Here's an example:

{% macro generate_database_name(custom_database_name=none, node=none) -%}

  {%- if custom_database_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a database configured. This is not allowed because the it will be derived from the directory structure and target name.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}

  {% set path = node.path %}
  {% set database = path.split('/')[0] %}
  {% if target.name in ['prod_us', 'prod_eu'] %}
    {{ database }}
  {% else %}
    {{ database + "_" + target.name }}
  {% endif %}

{%- endmacro %}

dbt compile
The compiled models will have whatever db is set in your profile, instead of the overridden one from the macro

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.7.9
- dbt: 

❯ dbt --version
installed version: 1.0.2
   latest version: 1.0.2

Up to date!

Plugins:
  - snowflake: 1.0.0 - Up to date!



### What database are you using dbt with?

snowflake

### Additional Context

The text was updated successfully, but these errors were encountered:

alexrosenfeld10 · 2022-02-19T05:09:48Z

After working on this for a while I finally have come to a workaround:

sources:
  - name: the_thing_that_is_in_a_different_db_depending_on_things
    database: "reporting{{ '_' + target.name if not target.name.startswith('prod') }}"
    schema: schema_is_consistent_so_its_hard_coded
    tables: [ ... ]

alexrosenfeld10 · 2022-02-19T05:12:49Z

btw, I'm using target.name because our dbs are named as such:

reporting_dev, reporting_qa, reporting (this one is prod).

We also give each engineer their own, like this: reporting_qa_someonesname.

Is there a better practice for doing this? I know you can use environment variables, but I didn't find it to be much cleaner because I also have another db, let's call it other_reporting that follows the same naming conventions. They interact, and I want to do things to both dbs in the same dbt project in the same run. Because of this, it's pretty much useless to set the database in the profile...

My other option with env vars leaves me having to set the db in each yml file / in each model config, which is what led me to the override macro approach - it's DRYer and less error prone (someone could forget to set the db in the yml file / each model, but not with a macro override).

nathan-protempo · 2022-02-23T01:44:15Z

For my 2c I think this is working as expected. The generate_database_name override controls the database used for dbt models. Sources are not models in dbt, they are references to external data and so not governed by the dbt configuration for target database since source data is generally in a different database than models generated by dbt.

nathan-protempo · 2022-02-23T01:54:38Z

@alexrosenfeld10 probably useful to include the config from your source and model files in the issue since this could just be a config issue. It's not clear from the information given if you are calling the macro in the source config.

alexrosenfeld10 · 2022-02-23T04:27:55Z

source data is generally in a different database than models generated by dbt.

That's quite a generalization to make! It's super common to EL something into one schema and then do transformations on it in another schema. I don't see any reason why the expectation should be that those are in separate databases. Moreover, if they are in separate databases, then that's even more reason for such an override to exist.

@nathan-protempo Here's my source config: #4753 (comment). The idea is that I can DRY the database qualifier up by putting it at the source level, not the model levels. You might suggest to set that in the profile, and I thought about that as well. That's not useful to me because my project selects from multiple databases (controlled by models/<database_name> + the macro I posted above). So my options are either:

specify the database explicitly in every model / source, which is manual and error prone and very non-DRY
do the sort of crazy workaround I have, where directory name == database name + macro to go with, model name follows a schemaName__modelName.sql convention + macro to with, and finally an alias override macro to remove the schemaName__ prefix from models before aliasing.
Maybe a third option you can suggest that I didn't think of? I'd love that!

Thanks for helping to think this through. I realize it's possible I might have missed a good option, if so, I'm all ears. If not, I think this is a change that the dbt Labs team should consider. It doesn't have to be the same override even, a separate one just for sources would work well enough.

nathan-protempo · 2022-02-23T23:15:14Z

@alexrosenfeld10 - Yes in hindsight that is just me generalising my experience and of course there is no reason you couldn't have source tables in a different schema of the same database.

I tried to do some local testing with your custom database macro and it generated errors as the initial show terse objects queries because the target was in the form project_name\folder.default_schema. I'm not sure if this is what you intended as I don't know what your folder structure or profile yml looks like. It could also be an environment issue as I am using Windows.

I note that your macro code uses the "target" function - there's an example on the doc page for that function for using target.name to override the source database, I wonder if that might help?

alexrosenfeld10 · 2022-02-24T03:54:40Z

@nathan-protempo I think the environment-specific problem could potentially be the direction of the '/' character used in my macro. Not sure how to get around that.. I'll have to think about it.

Anyway, I've changed my approach a little bit. It's still the same core ideas though. Here's what I do now (it works pretty well):

{% macro generate_schema_name(custom_schema_name=none, node=none) -%}

  {%- if custom_schema_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a schema configured. This is not allowed because the it will be derived from the directory structure.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}
{#
Because we have many models with the same name in different schemas, we have to follow the
schema_name__model_name.sql naming convention.

dbt enforces that model filenames are unique, and does not support schema names in the ref() function.

See https://discourse.getdbt.com/t/extracting-schema-and-model-names-from-the-filename/575 and
https://discourse.getdbt.com/t/preventing-name-conflicts/221 for more information.

#}
  {{ node.name.split('__')[0] | trim }}
{%- endmacro %}

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
  {%- if custom_alias_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has an alias configured. This is not allowed because the it will be derived from the file name.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}
    {{ node.name.split('__')[1] }}
{%- endmacro %}

{% macro generate_database_name(custom_database_name=none, node=none) -%}

  {%- if custom_database_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a database configured. This is not allowed because the it will be derived from the directory structure and target name.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}

  {% set path = node.path %}
  {% set database = path.split('/')[0] %}
  {{ database }}{{ '_' + target.name if not target.name.startswith('prod') }}
{%- endmacro %}

Note that this requires your databases to follow the convention of db_name_<target_name>. For example, ours are reporting_dev, reporting_qa, and reporting. We also have other_org_reporting_dev, other_org_reporting_qa, and other_org_reporting (the two orgs interact with each other in some places, which is the root of why I really need this, and can't just set it once in my profile.yml). Therefore the targets we configure are dev, qa, and prod_us / prod_eu etc.

With this approach, you can very clearly see why I want the overrides to apply to sources as well. Instead, I have to do this in my sources:

sources:
  - name: blah
    database: "reporting{{ '_' + target.name if not target.name.startswith('prod') }}" # for sources in other_org, I have to change it to "other_org{{ ... same exact conditional here ... }}"
    schema: schema_name
    tables: [ ... ]

alexrosenfeld10 · 2022-02-24T03:57:01Z

That example is cool, I didn't know you could multi-line and use full {% if %} style jinja. I was under the impression it had to be inline, in one {{ evaluation statement }}. This is nice to know about in general, but it doesn't address the core problem here. In fact, I'm effectively doing that to get around this problem, I'm just doing it with a one line if statement + a convenient naming convention.

jtcohen6 · 2022-02-24T09:58:21Z

Jumping in with three quick thoughts, since there's already some great conversation happening here:

@nathan-protempo has the right sense for why dbt doesn't support generate_database_name for sources by default. dbt assumes that (a) sources are preexisting tables with predetermined locations, not things it is tasked with putting in place; and (b) those locations don't vary across environments. The goal behind the default custom schema behavior is to avoid two developers putting the same model in the same namespace location, and stepping on each others' toes. The same logic applies for why snapshots use target_database + target_schema configs instead, and don't use generate_X_name macros—it should be the same for all users. Obviously, in many complex deployments, these objects can vary across environments—sources in dev/staging are sampled, cleansed of PII, perhaps even faked entirely.
Because snapshots are written in dbt-SQL, users can still set target_database = generate_database_name('db') if they so choose. The default is justifiable, and the alternative is still possible. Unfortunately, dbt doesn't support macro calls within .yml configuration today. (Exception: "late-rendered" config, such as hooks.) This is a capability that we've continually worked toward, and we've laid a number of the foundational pieces. The reason we haven't just done this: adding macros to the rendering contexts for configurations will be a significant performance hit at parse time. Could we add support for just generate_X_name macros, since those already have special behavior during parsing? Maybe. That would make this possible, at least:

# not possible today
sources:
  - name: blah
    database: "{{ generate_database_name('reporting') }}"
    schema: schema_name
    tables: [ ... ]

We will soon be working to reconcile configurations for sources, so that they are at parity with other resource types ([CT-201] Reconcile configs + properties for sources #3662). This is work we began ahead of v1.0, but de-prioritized relative to other changes. This will at least make the following behavior possible:

# dbt_project.yml

sources:
  my_project:
    all_my_sources:
      # logic is still duplicated, but compact - not scattered across different files
      reporting_sources:
        +database: "reporting{{ '_' + target.name if not target.name.startswith('prod') }}"
      other_org_sources:
        +database: "other_org{{ '_' + target.name if not target.name.startswith('prod') }}"

alexrosenfeld10 · 2022-02-24T11:40:09Z

Good thoughts, thanks @jtcohen6. Of those three I am most a fan of the third option.

It's not uncommon for people to use the same physical instance of a cloud database with separate logical (_qa kind of thing) databases, to save on $$. Also, it's not uncommon for folks to do what I'm doing where separate sub-orgs within a company go in separate databases. I understand the assumption dbt is currently making that sources are expected to be in a known location, my point here is to call into question the validity of that assumption.

My main question for you is, why isn't there a 4th option, a generate_X_name_sources (where x is database and schema)? Same exact idea as what we do for models, just applied to sources.

Similarly, you did say that dbt "doesn't support generate_x_name for sources by default", which implies it can be enabled. Where can I change that setting (Perhaps there is no way to and I'm just reading in a bit too carefully)?

jtcohen6 · 2022-02-24T12:24:58Z

Similarly, you did say that dbt "doesn't support generate_x_name for sources by default", which implies it can be enabled. Where can I change that setting (Perhaps there is no way to and I'm just reading in a bit too carefully)?

Sorry, yes, no way to do this today. I meant option 2 to be an illustration of how you could someday reenable this, for a particular source, in the same way users have done with snapshots.

So option 4 might be, an optional configuration that "enrolls" sources into generate_x_name macros (or a dedicated source-specific version). Behind the scenes, this would need to look like running update_parsed_node_name over sources when they're parsed.

For the short term, I still like the third option most. We're planning to start work on that soon, with the aim of including it in v1.1. After that capability is unlocked, we can see if there remains a significant unmet need for more dynamic, macro-based configuration.

alexrosenfeld10 · 2022-02-24T12:33:02Z

Cool. The thing I like about having a way to enroll sources into generate_x_name macros is it could apply to more than just sources (right?). Right now I'm not using analyses or snapshots, but soon enough I'm sure I will be, and I'll need the same functionality there too. This kind of thing is a project-structure level thing, so it's going to apply for all pieces. For now, I'll have to come up with workarounds (like what I did for sources) and copy-paste them everywhere.

I'm a fan of 3, but per the above I still think a more global, configurable option would be great

alexrosenfeld10 · 2022-02-24T12:33:58Z

FWIW, I think anyone using something like this: https://discourse.getdbt.com/t/extracting-schema-and-model-names-from-the-filename/575 would be impacted by the same / a similar problem(s), at some level somewhere in their project.

github-actions · 2022-10-20T02:14:53Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

alexrosenfeld10 · 2022-10-20T03:26:38Z

Commenting to keep this open

Callumoco · 2023-02-22T12:27:41Z

I've just come across this behaviour too and was struggling to understand why sources weren't changing their database value even though I configured the macro to do so.

In my case I want any source that uses a PROD_ database to use the DEV_ database instead, when the non-live profile is used.

# generate_database_name.sql

# Does not work for sources

{% macro generate_database_name(custom_database_name, node) -%}

    {%- set default_database = target.database -%}
    {%- if custom_database_name is none -%}

        {{ default_database }}

    {%- elif target.name == "non-live" and custom_database_name[:5] | upper == "PROD_" -%}

        {{ "DEV_" ~ custom_database_name[5:] }}

     {%- else -%}

        {{ custom_database_name | trim }}

    {%- endif -%}

{%- endmacro %}

For now the below Jinja logic works, it's just not very user friendly for developers when it comes to adding a new source. Would love for this to be a flag or setting in a future version.

# sources.yml
version: 2

sources:
  - name: source_name
    database: "{{ 'dev_database' if target.name == 'non-live' else 'prod_database' }}"

alexrosenfeld10 · 2023-02-22T14:27:48Z

I have the same workaround in place. Mine's a tiny bit different, posting in case you prefer:

version: 2

sources:
  - name: foo
    description: The foo data sources
    database: "reporting{{ '_' + target.database if not target.database.startswith('prod') }}"
    schema: foo
    tables:
      - name: a_table

github-actions · 2024-02-25T01:46:15Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2024-03-04T01:48:28Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

avjkumar · 2024-06-14T18:17:22Z

👋 We do need to programmatically define sources for various security and compliance reasons. Is this a feature we can request or is there an update on this?

alexrosenfeld10 added bug Something isn't working triage labels Feb 19, 2022

github-actions bot changed the title ~~[Bug] generate_database_name does not apply to sources~~ [CT-260] [Bug] generate_database_name does not apply to sources Feb 19, 2022

jtcohen6 added the Team:Language label Feb 19, 2022

jtcohen6 removed the triage label Feb 24, 2022

jtcohen6 mentioned this issue Mar 23, 2022

The future of vars #4938

Closed

jtcohen6 added enhancement New feature or request and removed bug Something isn't working labels Apr 20, 2022

jayhale mentioned this issue Oct 12, 2022

[CT-1338] [Bug] generate_schema_name not available for sources #6056

Closed

2 tasks

github-actions bot added the stale Issues that have gone stale label Oct 20, 2022

github-actions bot removed the stale Issues that have gone stale label Oct 21, 2022

jtcohen6 mentioned this issue Oct 30, 2022

[CT-1279] [Bug] Slim CI not working with external tables #5977

Closed

2 tasks

jtcohen6 mentioned this issue Nov 23, 2022

[CT-1544] [Bug] Should get_catalog() respect an overriden source() builtin? #6308

Closed

2 tasks

jtcohen6 removed the Team:Language label Jul 19, 2023

github-actions bot added the stale Issues that have gone stale label Feb 25, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 4, 2024

geo-martino mentioned this issue Oct 17, 2024

[Feature] Improvements to source properties handling #10875

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-260] [Bug] `generate_database_name` does not apply to sources #4753

[CT-260] [Bug] `generate_database_name` does not apply to sources #4753

alexrosenfeld10 commented Feb 19, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 19, 2022

alexrosenfeld10 commented Feb 19, 2022 •

edited

Loading

nathan-protempo commented Feb 23, 2022

nathan-protempo commented Feb 23, 2022

alexrosenfeld10 commented Feb 23, 2022 •

edited

Loading

nathan-protempo commented Feb 23, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 24, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 24, 2022 •

edited

Loading

jtcohen6 commented Feb 24, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 24, 2022 •

edited

Loading

jtcohen6 commented Feb 24, 2022

alexrosenfeld10 commented Feb 24, 2022

alexrosenfeld10 commented Feb 24, 2022

github-actions bot commented Oct 20, 2022

alexrosenfeld10 commented Oct 20, 2022

Callumoco commented Feb 22, 2023

alexrosenfeld10 commented Feb 22, 2023

github-actions bot commented Feb 25, 2024

github-actions bot commented Mar 4, 2024

avjkumar commented Jun 14, 2024 •

edited

Loading

[CT-260] [Bug] generate_database_name does not apply to sources #4753

[CT-260] [Bug] generate_database_name does not apply to sources #4753

Comments

alexrosenfeld10 commented Feb 19, 2022 • edited Loading

Is there an existing issue for this?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

alexrosenfeld10 commented Feb 19, 2022

alexrosenfeld10 commented Feb 19, 2022 • edited Loading

nathan-protempo commented Feb 23, 2022

nathan-protempo commented Feb 23, 2022

alexrosenfeld10 commented Feb 23, 2022 • edited Loading

nathan-protempo commented Feb 23, 2022 • edited Loading

alexrosenfeld10 commented Feb 24, 2022 • edited Loading

alexrosenfeld10 commented Feb 24, 2022 • edited Loading

jtcohen6 commented Feb 24, 2022 • edited Loading

alexrosenfeld10 commented Feb 24, 2022 • edited Loading

jtcohen6 commented Feb 24, 2022

alexrosenfeld10 commented Feb 24, 2022

alexrosenfeld10 commented Feb 24, 2022

github-actions bot commented Oct 20, 2022

alexrosenfeld10 commented Oct 20, 2022

Callumoco commented Feb 22, 2023

alexrosenfeld10 commented Feb 22, 2023

github-actions bot commented Feb 25, 2024

github-actions bot commented Mar 4, 2024

avjkumar commented Jun 14, 2024 • edited Loading

[CT-260] [Bug] `generate_database_name` does not apply to sources #4753

[CT-260] [Bug] `generate_database_name` does not apply to sources #4753

alexrosenfeld10 commented Feb 19, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 19, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 23, 2022 •

edited

Loading

nathan-protempo commented Feb 23, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 24, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 24, 2022 •

edited

Loading

jtcohen6 commented Feb 24, 2022 •

edited

Loading

alexrosenfeld10 commented Feb 24, 2022 •

edited

Loading

avjkumar commented Jun 14, 2024 •

edited

Loading