Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The future of vars #4938

Closed
jtcohen6 opened this issue Mar 23, 2022 · 4 comments
Closed

The future of vars #4938

jtcohen6 opened this issue Mar 23, 2022 · 4 comments
Labels

Comments

@jtcohen6
Copy link
Contributor

Thanks to @gshank @joellabes @ChenyuLInx for a great discussion yesterday about all things vars! I'll do my best to summarize below.

Prompted by requests like:

When to recommend...?

  • vars
  • env vars
  • target-based logic
  • "pure" macros that return static values (available to set configs + template runtime code in .sql files, but not in .yml files)
  • generate_x_name macros (database/schema/alias of "executable" nodes, but not for sources)

Do we have any strong recommendations around when to use vars, and when to use env vars? I don't think so, at least not well-documented. Here are my own strong-ish opinions:

Env vars

  • Secrets — including credentials (if you need to check profiles.yml into version control), or git-related info in packages.yml
  • Any time of environment-based configuration that reliably differs between dev/prod/CI, e.g. a different source database per environment
  • "Run modes" that are going to differ in a predictable way

Vars

  • Configuring packages, using levers that the package maintainer has documented and made available, through well-understood means that you never
  • "Run modes," with sensible defaults that hold >95% of the time, that it may make occasional sense to "override" with arbitrary / unpredictable values (via CLI --vars)

One small note: dbt is smarter about partial parsing today with env vars (an actual partial re-parse) than with vars (will trigger a full re-parse, whether the change is in --vars or in dbt_project.yml). This is a temporary technical limitation only, and we plan to reach eventual parity.

Other types of config

  • target-based Jinja conditional expressions are nice when you want to surface the logic very close to where it's being used: {{ 3 if target.name == 'dev' else 100 }} days of data for a backfill, for instance.
  • generate_x_name macros have a leg up on the others, in that they have access to everything: all vars, env vars, target info, and the full node specification of the model/seed/snapshot/test. So if you have highly specific rules about what goes where (if X config, Y tag, Z materialization type, ...), the macro is the right place to encode that "rule." This logic just can't be (re)used anywhere else

Forward-looking vibes

While some of these changes are not mutually exclusive, I sensed two

Vibe 1: Vars are exactly like env vars.

You can (already) define both on the CLI:

$ dbt run --vars 'MY_VAR: True'
$ MY_VAR=True dbt run

They should available in all the same places and rendering contexts. That includes profiles.yml + packages.yml (places where only --vars are available today).

Vars can be defined in a .yml or .env file, in addition to dbt_project.yml. They're parsed first, regardless of where they're defined—even before profile information.

Vibe 2: Vars are a way to DRY up your project

Vars save and render static Jinja expressions, which can take env vars + target values as their inputs.

Following the example in #4753, imagine you've got:

sources:
  - name: source_one
    database: |
      {%- if 'other_country' in target.database -%} prod__other_country__source_one
      {%- elif target.name == 'prod' -%} prod__main_country__source_one
      {%- else -%} {{ target.database }}__source_one
      {%- endif -%}
    ...
  - name: source_two
    database: |
      {%- if 'other_country' in target.database -%} prod__other_country__source_two
      {%- elif target.name == 'prod' -%} prod__main_country__source_two
      {%- else -%} {{ target.database }}__source_two
      {%- endif -%}

We've proposed solving that with better source configurations (#3662), crucially the ability to configure multiple sources at once. But look again: Those configs are just slightly different.

The immediate instinct is to replace this with a macro, until you realize that macros aren't supported in .yml files today—at this point, primarily for reasons of rendering speed / parsing performance. And we can't tell a "pure" macro (static inputs + outputs) from a "dirty" one.

Instead, what if you could "save" a reusable Jinja expression / static snippet, in the form of a rendered var?

# dbt_project.yml
vars:
  my_project:
    source_database_prefix: |
      {%- if 'other_country' in target.database -%} prod__other_country__source_two
      {%- elif target.name == 'prod' -%} prod__main_country__source_two
      {%- else -%} {{ target.database }}__source_two
      {%- endif -%}
sources:
  - name: source_one
    database: {{ var('source_database_prefix') }}__source_one
    ...
  - name: source_two
    database: {{ var('source_database_prefix') }}__source_two

This feels much more intuitive for vars that are set in dbt_project.yml, versus the ones passed in for --vars. In fact, it feels like a bit of a disjuncture between them. Do we need a third kind of var...?

@joellabes
Copy link
Contributor

joellabes commented Mar 24, 2022

Vibes-wise, I'm most attached to the problems solved by Vibe 2, but I disagree with the approach to get there:

Do we need a third kind of var...?

What about a second type of macro? We could take inspiration from Redshift UDFs (and likely other things) which can attach hints (VOLATILE | STABLE | IMMUTABLE) to indicate whether the planner can recycle the same output for a given input.

It would be cool to be able to do something like this:

{% macro immutable generate_database_prefix() %}
      {%- if 'other_country' in target.database -%} prod__other_country
      {%- elif target.name == 'prod' -%} prod__main_country
      {%- else -%} {{ target.database }}
      {%- endif -%}
{% endmacro %}
sources:
  - name: source_one
    database: {{ generate_database_prefix() | trim }}__source_one
    ...
  - name: source_two
    database: {{ generate_database_prefix() | trim }}__source_two

and only allow this subset of macros in yaml files, where they would have access to target and env_vars/variables defined by --vars. Any macro claiming to be immutable that tried to do anything to the database would be rejected.

Functionally, this would be identical to the variables envisaged above. I think it would aid in a cleaner mental model though - conceptually I think of vars as constant, exported values, not as snippets of code in their own right.

I could swing either way on whether we allow things like {{ run_started_at }} - at that point you're looking at STABLE as opposed to IMMUTABLE in redshiftese - I'm not averse to it but I can't think of a use case.

@rumbin
Copy link

rumbin commented Jul 21, 2022

Regarding Vibe 2 I suggest the following extension on vars:

Folder-level variables

Imagine that we could define variables in the dbt_project.yml in models → <my_project> → <model_folder_name>.
This would be a very useful way to configure, e.g., mart schemas of which many technically similar ones exist by just defining mart/schema-specific variables which are then picked up by the models of this folder/schema.

Example:

vars:
  # global vars:
  test_lake_load_ts_predicate: "(_lake_load_ts > (current_timestamp - interval '3 days'))"
  test_mod_ts_predicate: "(_mod_ts > (current_timestamp - interval '3 days'))"

  # package-specific vars:
  ###### ↓ current implementation ↓ ######
  dbt_artifacts:
    dbt_artifacts_database: "lp_lake"
    dbt_artifacts_schema: "raw_dbt_artifacts"
    dbt_artifacts_table: "lp_prodb"
  ###### ↑ current implementation ↑ ######

models:

  # general model settings:
  +persist_docs:
    relation: true
    columns: true
  +post-hook:
    - "{{ dbt_snow_mask.apply_masking_policy() }}"

  # packages:
  dbt_artifacts:
    # NOTE: To me it feels like a good idea to put package-specific variables
    # rather here than in the global vars block above!
    ###### ↓ proposed change ↓ ######
    vars:
        dbt_artifacts_database: "lp_lake"
        dbt_artifacts_schema: "raw_dbt_artifacts"
        dbt_artifacts_table: "lp_prodb"
    ###### ↑ proposed change ↑ ######
    +schema: core_dbt_artifacts
    staging:
      +schema: orig_dbt_artifacts

  # this project:
  this_dbt_project:

    mart_whatever:
      +schema: mart_whatever
      description:  A generic data mart

    ###### ↓ proposed change ↓ ######
    vars:
        source_prefix: "whatever"
        database_name: "whatever_2"
    ###### ↑ proposed change ↑ ######

Side note: As you can see, I also propose to put package-specific vars rather (or alternatively/additionally) into the models section of the specific package.

What do you all think about this?

@ValdarT
Copy link

ValdarT commented Oct 13, 2022

Would be nice if one could use secret env vars also in models (i.e., outside of profiles.yml or packages.yml).

My use case is creating a temp view over a JDBC connection in a pre-hook to access some data and I would need a way to securely pass the credentials.

@jtcohen6
Copy link
Contributor Author

I'm going to turn this into a GitHub discussion, since we're using those more frequently — for Bigger Improvements, in addition to Very Big Ideas.

We don't have immediate next steps, or concrete plans of when to schedule this work, but it's feedback (and confusion) that we get pretty consistently. By the time of dbt v2, I'd like to see us have better, more consistent, more intuitive answers for conditional configuration.

@dbt-labs dbt-labs locked and limited conversation to collaborators Oct 28, 2022
@jtcohen6 jtcohen6 converted this issue into discussion #6170 Oct 28, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
Projects
None yet
Development

No branches or pull requests

4 participants