Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt 0.20.0 tries to create two tables with the same name when --store-failures is used with similar-looking expression_is_true tests #3626

Closed
1 of 5 tasks
joellabes opened this issue Jul 25, 2021 · 5 comments
Labels
bug Something isn't working dbt tests Issues related to built-in dbt testing functionality

Comments

@joellabes
Copy link
Contributor

joellabes commented Jul 25, 2021

Describe the bug

This schema.yml file:

version: 2

sources:
    - name: crm
      database: reporting
      schema: crm
      loader: Amazon DMS

      tables: 
        - name: schoolattributevalues
          tests:
          - dbt_utils.expression_is_true:
              expression: "not (len(memovalue) > 0 and len(value) > 0)" #only one of memovalue and value can be populated
          - dbt_utils.expression_is_true:
              expression: "not (len(memovalue) = 0 and len(value) = 0)" #... but, one of them does have to be

gives an error when running dbt test --store-failures:

Encountered an error:
Compilation Error
  dbt found two resources with the database representation ""reporting"."dev_jlabes_dbt_test__audit"."dbt_utils_source_expression_is_d74da5f62da07f2223c634b4409dd71d"".
  dbt cannot create two resources with identical database representations. To fix this,
  change the configuration of one of these resources:
  - test.educationperfect.dbt_utils_source_expression_is_true_crm_schoolattributevalues_not_len_memovalue_0_and_len_value_0_.77167f5215 (models\staging\crm\other\crm_sources.yml)
  - test.educationperfect.dbt_utils_source_expression_is_true_crm_schoolattributevalues_not_len_memovalue_0_and_len_value_0_.c19220b2a8 (models\staging\crm\other\crm_sources.yml)

I guess the problem comes from the only difference between the expressions being the > vs =, which both get turned into _s. However, the .77167f5215 and .c19220b2a8 suffixes provide deduplication for the tests' FQNs (I think those are the FQNs?)

Expected behavior

Both tests to be given different DB representations.

Screenshots and log output

See above

System information

Which database are you using dbt with?

  • postgres

  • redshift

  • other (specify: ____________)

  • bigquery

  • snowflake

The output of dbt --version:

installed version: 0.20.0
   latest version: 0.20.0

Up to date!

Plugins:
  - bigquery: 0.18.2
  - postgres: 0.20.0
  - redshift: 0.20.0
  - snowflake: 0.18.2

The operating system you're using:
Win10

The output of python --version:
Python 3.8.0

@joellabes joellabes added bug Something isn't working triage labels Jul 25, 2021
@jtcohen6 jtcohen6 added dbt tests Issues related to built-in dbt testing functionality and removed triage labels Jul 26, 2021
@jtcohen6
Copy link
Contributor

@joellabes Thanks for the report!

dbt does try to create a unique hashed alias of <64 characters for tests with too-long names. The issue is that we're first substituting all special characters for underscores, then hashing the flattened arguments into a unique identifier:

https://github.com/dbt-labs/dbt/blob/72722635f275666b34efc9baf8485be79dfcaa88/core/dbt/parser/schema_test_builders.py#L41-L58

Since the two tests only differ by a special character, both get stringified to dbt_utils_source_expression_is_true_crm_schoolattributevalues_not_len_memovalue_0_and_len_value_0_, then hashed and suffixed to dbt_utils_source_expression_is_d74da5f62da07f2223c634b4409dd71d.

fix

The most obvious fix here is to hash flat_args before flattening into clean_flat_args. I think the code change for that is fairly straightforward, so I'm going to call this a good first issue.

There's one issue I can foresee: If a test name (including all arguments and underscores) is <64 characters, we aren't including a hashed suffix at all. Special characters would still be substituted out for underscores, causing a collision. Should we start adding hash suffixes for all tests with arguments, regardless of length? Frankly, I don't know how common it is for two tests to both have very few/concise arguments and differ by only special characters between them. We will have at least narrowed this edge case into an even rarer occasion.

workaround

In the meantime, you could certainly work around this by adding an extra condition to one of the tests, and "salting" the hash:

          tests:
          - dbt_utils.expression_is_true:
              expression: "not (len(memovalue) > 0 and len(value) > 0)" #only one of memovalue and value can be populated
          - dbt_utils.expression_is_true:
              expression: "not (len(memovalue) = 0 and len(value) = 0) and true" #... but, one of them does have to be

longer-term options

Tests already support an alias config, just not yet as a test-specific modifier (defined in the same .yml file where the test is defined). We could add support for that, giving users the option of overriding whatever default behavior dbt is attempting around argument flattening and hashing:

          tests:
          - dbt_utils.expression_is_true:
              expression: "not (len(memovalue) > 0 and len(value) > 0)" #only one of memovalue and value can be populated
              alias: test_only_one_of_memovalue_and_value
          - dbt_utils.expression_is_true:
              expression: "not (len(memovalue) = 0 and len(value) = 0)" #... but, one of them does have to be
              alias: test_at_least_one_of_memovalue_and_value

We've also got some broader thinking in the works about node naming + identification: #3348, #3548

@joellabes
Copy link
Contributor Author

Thanks for all the options!

Should we start adding hash suffixes for all tests with arguments, regardless of length?

Feels like a lot of uglification on the happy path to protect against an edge case of edge cases, so I'd go 👎 until it becomes a clearer issue.

I worked around it by flipping the order of the clauses on the second test! It was nice to have them in the same order to make it easy to visually diff, but not a big deal. (Made for a fun code review though
image
)

@joellabes
Copy link
Contributor Author

@jtcohen6 I've just spent some time looking into this, and have found another wrinkle:

    columns:
      - name: total_unweighted
        tests:
          - not_null
          - accepted_range:
              where: "is_resub_adjustment is false"
              min_value: 0
              tags: 
                - hubspot_bad_revenue
          - accepted_range:
              where: "is_resub_adjustment is true"
              max_value: 0

The args dict for the latter is

{'max_value': 0, 'column_name': 'total_unweighted', 'model': "{% if config.get('where') %}(select * from {{ ref('hubspot_revenue__revenue_lines') }} where {{config.get('where')}}) hubspot_revenue__revenue_lines{% else %}{{ ref('hubspot_revenue__revenue_lines') }}{% endif %}"}

and the where clause is nowhere to be found. That means that I wind up with total_unweighted__0 as the entirety of even the unclean flat args.

Should I be passing config into get_nice_schema_test_name as well? Is it even available at that point? I haven't gone looking, figured I'd check in before going too far down the rabbit hole.

@jtcohen6
Copy link
Contributor

@joellabes Ah, that's a really good point. We've made it so that config values are not included in the generation of a test's name (or unique_id, I believe), so that dbt is able to detect it's the same test, just with different configs (rather than a deleted test + a new test). That's a pretty subtle difference, and it may not matter much, except for the few folks who use the state:new selection method.

The thing that feels weirder to me here is that both min_value: 0 and max_value: 0 are showing up as just __0. I wonder if we should be including arg names as well as values in the generated name?

As it happens, thanks to the work in #3616, v0.21 will actually include support for the "long-term option" I described above: defining a custom alias under each specific test, which would enable you to avoid collisions with --store-failures. What do you think about that as a workable resolution here?

@joellabes
Copy link
Contributor Author

I wonder if we should be including arg names as well as values in the generated name?

I had the same thought, but decided that it was still papering around the edges of the problem instead of fixing it forever so it didn't really move things forward.

What do you think about that as a workable resolution here?

I strongly prefer this 😍 our use case here is to use a combo of get_relations_by_pattern and union_relations to build a dashboard of failing tests over time. It'll be much easier to get the specific relations by pattern if there's.. a pattern, especially one that we can control.

Let's close this issue and I'll jump on the .21 beta quicksharp!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working dbt tests Issues related to built-in dbt testing functionality
Projects
None yet
Development

No branches or pull requests

2 participants