[CT-3296] [Feature] Option to enforce 2-argument `ref()` when referencing models from other packages/projects #8951

jaklan · 2023-10-31T18:29:55Z

Is this a new bug in dbt-core?

I believe this is a new bug in dbt-core
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

As stated in dbt docs here:

We especially recommend using two-argument ref to avoid ambiguity, in cases where a model name is duplicated across multiple projects or installed packages. If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if it finds none, it will raise an error.

The last sentence is incorrect though - if you use 1-argument ref and a model doesn't exist in the same package (namespace), but exists in a package installed as a dependency - no error is raised.

Expected Behavior

I expect the behaviour specified in docs, so 1-argument ref should only work with models from the same package (namespace), but not with models (even the public ones) from installed packages. Such models should be referenced exclusively with two-argument ref, otherwise - an error should be raised.

If it's not a bug, but a mistake in documentation and you don't want to enable such behaviour by default - there should be an option provided to enforce it.

Steps To Reproduce

create package_a
create model_a (public or protected) in package_a
create package_b
specify package_a as a dependency in package_b
create model_b in package_b as select * from {{ ref('model_a') }}
run dbt compile in package_b
compilation succeeds instead of raising an error

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.10
- dbt: both 1.6 & 1.7

Which database adapter are you using with dbt?

redshift

Additional Context

No response

The text was updated successfully, but these errors were encountered:

jaklan · 2023-10-31T20:04:38Z

cc: @jtcohen6, as we discussed package / project namespaces in the past

dbeatty10 · 2023-11-01T21:14:13Z

@jaklan would you mind double-checking this?

When I tried this with dbt-core 1.6.6, two-argument ref within package_b raised an error when restrict-access: True within package_a. Otherwise, it passed without an error. See below for details.

We should update the docs

We should update the docs for two-argument refto specifically call-out how restrict-access in the upstream project affects two-argument ref in the downstream project.

Project setup

Here's the files that I used:

Directory structure

package_a
├── dbt_project.yml
├── models
│   └── model_a.sql
└── profiles.yml
package_b
├── dbt_project.yml
├── dependencies.yml
├── models
│   └── model_b.sql
└── profiles.yml

Project A

package_a/dbt_project.yml

name: "package_a"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

restrict-access: True  # default is False

package_a/models/model_a.sql

select 1 as id

Project B

package_a/dbt_project.yml

name: "package_b"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

package_b/models/model_b.sql

select * from {{ ref('model_a') }}

package_b/dependencies.yml

packages:
  - local: ../package_a

Try to compile in Project B

cd package_b
dbt deps
dbt compile

Error log output

(postgres_1.6) $ dbt compile

21:01:15  Running with dbt=1.6.6
21:01:16  Registered adapter: postgres=1.6.6
21:01:16  Unable to do partial parsing because saved manifest not found. Starting full parse.
21:01:16  Encountered an error:
Parsing Error
  Node model.package_b.model_b attempted to reference node model.package_a.model_a, which is not allowed because the referenced node is protected to the 'package_a' package.

When I removed restrict-access: True from package_a/dbt_project.yml, then it compiled just fine though.

jaklan · 2023-11-01T21:42:22Z

@dbeatty10 if I understand correctly, you haven't specified model access, so by default it's protected. So in your case it failed not because of dbt enforcing 2-argument ref(), but because of referring to protected model with restrict-access: True. Please try again following that step:

create public model model_a in package_a

Actually that part:

create package_a with restrict-access: True

should be replaced with:

create package_a

as restrict-access: True is not really relevant for the actual issue. Sorry for the confusion, I have just updated the original description - without that config you should be able to refer to both protected and public models from package_a with 1-argument ref() with no error.

dbeatty10 · 2023-11-02T00:01:24Z

Yeah, I see what you are saying!

Indeed, in contraction to the original issue in #7446 and the docs here, the following code in the reprex below works fine instead of erroring.

i.e. this says that it shouldn't:

If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if it finds none, it will raise an error.

And #7446 also says it shouldn't:

model in root (running) project has a ref('model_a')

if it defines its own model_a node → model.root_project.model_a

otherwise: raise an error indicating an ambiguous ref

The first two scenarios already work! But the third is broken because dbt resolves “model_a” to an arbitrary project.

Reprex

Here's the files that I used to demonstrate the issue

Directory structure

package_a
├── dbt_project.yml
├── models
│   └── model_a.sql
└── profiles.yml
package_b
├── dbt_project.yml
├── dependencies.yml
├── models
│   └── model_b.sql
└── profiles.yml

Project A

package_a/dbt_project.yml

name: "package_a"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

package_a/models/model_a.sql

select 1 as id

Project B

package_a/dbt_project.yml

name: "package_b"
version: "1.0.0"
config-version: 2
profile: "sandcastle"

package_b/dependencies.yml

packages:
  - local: ../package_a

package_b/models/model_b.sql

select * from {{ ref('model_a') }}

Compile in Project B

cd package_b
dbt deps
dbt compile --s model_b

Output

(postgres_1.6) $ dbt compile --select models/model_b.sql
00:00:15  Running with dbt=1.6.6
00:00:16  Registered adapter: postgres=1.6.6
00:00:16  Found 2 models, 0 sources, 0 exposures, 0 metrics, 352 macros, 0 groups, 0 semantic models
00:00:16  
00:00:16  Concurrency: 5 threads (target='postgres')
00:00:16  
00:00:16  Compiled node 'model_b' is:
select * from "postgres"."dbt_dbeatty"."model_a"

jaklan · 2023-11-02T01:35:27Z

@dbeatty10 thank you for the answer! I have also done a few additional tests in various combinations, here are the outcomes:

Click to expand test scenarios

package_b installs package_a, package_c installs package_b

package_a with model_a + package_b with model_a:
- select * from {{ ref('model_a') }} works in package_b - model_a is taken from package_b (✅)
package_a with model_a + package_b without model_a:
- select * from {{ ref('model_a') }} works in package_b - model_a is taken from package_a (🚨)
package_a with model_a + package_b with model_a + package_c with model_a:
- select * from {{ ref('model_a') }} works in package_c, model_a is taken from package_c (✅)

package_a with model_a + package_b with model_a + package_c without model_a:

select * from {{ ref('model_a') }} doesn't work in package_c (✅)

Compilation Error in model model_c1 (models/model_c1.sql)
  When referencing 'model_a', dbt found nodes in multiple packages: 'model.project_b.model_a', 'model.project_a.model_a'
  To fix this, use two-argument 'ref', with the package name first: 'project_b' or 'project_a'

package_a with model_a + package_b without model_a + package_c without model_a:
- select * from {{ ref('model_a') }} works in package_c - model_a is taken from package_a (🚨)
package_a without model_a + package_b with model_a + package_c without model_a:
- select * from {{ ref('model_a') }} works in package_c - model_a is taken from package_b (🚨)
package_a without model_a + package_b with model_a + package_c with model_a:
- select * from {{ ref('model_a') }} works in package_c, model_a is taken from package_c (✅)
package_a with model_a + package_b without model_a + package_c with model_a:
- select * from {{ ref('model_a') }} works in package_c, model_a is taken from package_c (✅)

To sum up - an error is only raised in a situation when both conditions are met:

a) model_a is not found in the current project
b) model_a is found in more than one installed package

When model_a is found in one installed package exclusively then there's no ambiguity, so that's why dbt doesn't fail currently. But in my opinion - ambiguity shouldn't be the only reason to raise such error.

Why? Let's imagine the 6. scenario described above - you work on package_c without model_a, but such model is in package_b, so ref('model_a') works with no error. But after some time - maintainers of package_a decide to add model_a to their package. This way our scenario turns into the 4. one - so package_c starts to fail although it wasn't modified.

Another example - we again start with the 6. scenario, but this time model_a was added to package_c by another developer working on that project. This way we get into the 7. scenario. All models which were previously using ref('model_a') to refer to model_a from package_b - now quietly start to refer to the new model from package_c!

The only way to prevent such situations is to use 2-argument ref() from the very beginning. But when you can't enforce it - it's just a convention that developers may simply not follow, we have no real control of that. And that's why I find it really crucial that dbt allows to enforce explicit, 2-argument ref() when using models from installed packages - either as a default behaviour or as a config in package_c.

jtcohen6 · 2023-11-29T10:47:42Z

After reading this a few times, I've finally understood!

The docs here say:

We especially recommend using two-argument ref to avoid ambiguity, in cases where a model name is duplicated across multiple projects or installed packages. If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if it finds none, it will raise an error.

When perhaps we should more clearly say:

In cases where a model name is duplicated across multiple projects or installed packages, it is always recommended to use two-argument ref to avoid ambiguity. If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if dbt finds none in the same namespace, and multiple candidates in other namespaces, dbt will raise an error.

@jaklan It sounds like you want a compiler setting that is: "All references to models in other namespaces (projects/packages) must use two-argument ref." It would need to be opt-in, to avoid breaking existing projects. Would you expect to set this:

As a config within the referenced project (similar to require-access; could be different for different projects/packages being referenced)
As a config within the referencing project (could be different for different referencing projects)
As a runtime config (applies to all projects/packages)

Personally, I'd vote for (3) — it's by far simplest.

I'm going to relabel this from bug to enhancement, and add it to my list of multi-project follow-ups (left over from #7979).

jaklan · 2023-11-29T11:51:00Z

Hi @jtcohen6, thanks for the reply. Yes, we have two issues here:

docs clarification
option to enforce 2-arg ref()

Ad. 1: Looking at your proposal, I believe we can still improve it a bit.

If you use one-argument ref (just the model_name), dbt will look for a model by that name in the same namespace (package or project); if dbt finds none in the same namespace, and multiple candidates in other namespaces, dbt will raise an error.

That part imho doesn't cover the case when given model is not in the same namespace, but one candidate is in another namespace, so that's for example that scenario:

package_a with model_a + package_b without model_a:

select * from {{ ref('model_a') }} works in package_b - model_a is taken from package_a

Maybe we should describe the resolution mechanism in a step-by-step form like:

dbt looks for the model in the current namespace

If it's found - the model is selected

If it's not found, dbt looks for the model in another namespaces:
3a. If dbt finds one candidate - that candidate is selected
3b. If dbt finds multiple candidates - the error is raised

Ad. 2: Exactly, that's the option I'm looking for. And fully agree we can't introduce breaking changes. About the 3 proposed variants - I would definitely vote for the second approach, because defining that as a runtime config could lead to completely different behaviour of given project depending on local settings. Let's look at that scenario:

package_a with model_a + package_b without model_a + package_c without model_a:

select * from {{ ref('model_a') }} works in package_c - model_a is taken from package_a

If you enable the new runtime config - that's gonna fail (as expected). But if some developers in the project don't do that - it would still work. So we end up in a situation when different developers can use different ref() syntax within the same project, which creates a lot of confusion and personally I don't see any added value here. That's why I think the goal of that setting should be to enforce 2-arg ref() at the project level, so all people working on given project have to use it.

jtcohen6 · 2023-11-29T12:03:13Z

That's why I think the goal of that setting should be to enforce 2-arg ref() at the project level, so all people working on given project have to use it.

I agree with this!

I should clarify what I meant in option (3) above: I don't think it should be possible to enforce this differently based on the configs of the referencing project. It should be the same for all models in an invocation. And it should be a config that's defined, in version control, in the project code, so that it does not differ for different users/environments.

jaklan · 2023-11-29T12:42:54Z

@jtcohen6

And it should be a config that's defined, in version control, in the project code, so that it does not differ for different users/environments.

So we agree on that one 😄 I just still miss what's the difference between 2nd and 3rd variant then, could you provide some example?

jaklan added bug Something isn't working triage labels Oct 31, 2023

github-actions bot changed the title ~~[Bug] dbt doesn't enforce 2-argument 'ref()' when using models from installed packages~~ [CT-3296] [Bug] dbt doesn't enforce 2-argument 'ref()' when using models from installed packages Oct 31, 2023

dbeatty10 changed the title ~~[CT-3296] [Bug] dbt doesn't enforce 2-argument 'ref()' when using models from installed packages~~ [CT-3296] [Bug] dbt doesn't enforce 2-argument ref() when using models from installed packages Nov 1, 2023

dbeatty10 self-assigned this Nov 1, 2023

dbeatty10 added awaiting_response and removed triage labels Nov 1, 2023

github-actions bot added triage and removed awaiting_response labels Nov 1, 2023

dbeatty10 removed the triage label Nov 2, 2023

dbeatty10 removed their assignment Nov 2, 2023

jtcohen6 added enhancement New feature or request multi_project and removed bug Something isn't working labels Nov 29, 2023

jtcohen6 changed the title ~~[CT-3296] [Bug] dbt doesn't enforce 2-argument ref() when using models from installed packages~~ [CT-3296] [Enhancement] Option to enforce 2-argument ref() when referencing models from other packages/projects Nov 29, 2023

jtcohen6 changed the title ~~[CT-3296] [Enhancement] Option to enforce 2-argument ref() when referencing models from other packages/projects~~ [CT-3296] [Feature] Option to enforce 2-argument ref() when referencing models from other packages/projects Nov 29, 2023

jtcohen6 mentioned this issue May 10, 2024

[Epic] Multi-project collaboration: one year in #10125

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-3296] [Feature] Option to enforce 2-argument `ref()` when referencing models from other packages/projects #8951

[CT-3296] [Feature] Option to enforce 2-argument `ref()` when referencing models from other packages/projects #8951

jaklan commented Oct 31, 2023 •

edited

Loading

jaklan commented Oct 31, 2023

dbeatty10 commented Nov 1, 2023 •

edited

Loading

Directory structure

Project A

Project B

Try to compile in Project B

jaklan commented Nov 1, 2023 •

edited

Loading

dbeatty10 commented Nov 2, 2023

Directory structure

Project A

Project B

Compile in Project B

Output

jaklan commented Nov 2, 2023 •

edited

Loading

jtcohen6 commented Nov 29, 2023 •

edited

Loading

jaklan commented Nov 29, 2023

jtcohen6 commented Nov 29, 2023

jaklan commented Nov 29, 2023

[CT-3296] [Feature] Option to enforce 2-argument ref() when referencing models from other packages/projects #8951

[CT-3296] [Feature] Option to enforce 2-argument ref() when referencing models from other packages/projects #8951

Comments

jaklan commented Oct 31, 2023 • edited Loading

Is this a new bug in dbt-core?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Which database adapter are you using with dbt?

Additional Context

jaklan commented Oct 31, 2023

dbeatty10 commented Nov 1, 2023 • edited Loading

We should update the docs

Project setup

Directory structure

Project A

Project B

Try to compile in Project B

Error log output

jaklan commented Nov 1, 2023 • edited Loading

dbeatty10 commented Nov 2, 2023

Reprex

Directory structure

Project A

Project B

Compile in Project B

Output

jaklan commented Nov 2, 2023 • edited Loading

jtcohen6 commented Nov 29, 2023 • edited Loading

jaklan commented Nov 29, 2023

jtcohen6 commented Nov 29, 2023

jaklan commented Nov 29, 2023

[CT-3296] [Feature] Option to enforce 2-argument `ref()` when referencing models from other packages/projects #8951

[CT-3296] [Feature] Option to enforce 2-argument `ref()` when referencing models from other packages/projects #8951

jaklan commented Oct 31, 2023 •

edited

Loading

dbeatty10 commented Nov 1, 2023 •

edited

Loading

jaklan commented Nov 1, 2023 •

edited

Loading

jaklan commented Nov 2, 2023 •

edited

Loading

jtcohen6 commented Nov 29, 2023 •

edited

Loading