
[CT-1561] [Feature] Support defining server_side_parameters inside of dbt_project.yml #529

Closed
akashrn5 opened this issue Nov 29, 2022 · 15 comments
Labels: enhancement (New feature or request), refinement (Product or leadership input needed)

Comments

@akashrn5
Contributor

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-spark functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Background

When the connection method is session for the Spark adapter and a user wants to pass specific application-level properties, the only option today is to change those properties in the spark-defaults.conf file under the SPARK_HOME path. The user should have the option to pass these parameters so that they are applied when the SparkSession is initialized.
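
For context, such properties would otherwise live in spark-defaults.conf as plain key/value entries, for example (values are illustrative):

spark.driver.memory            4g
spark.executor.memoryOverhead  512m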

Describe alternatives you've considered

What is currently supported, or a way to achieve this today:

If a user needs to set dynamically configurable Spark properties, they can be passed using a pre_hook in the model configuration (sketched below). But suppose there are multiple dbt projects and the user wants to pass application-specific memory configurations for each of them; in that case, the only option is to change the config in spark-defaults.conf, which is not feasible.
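
A minimal sketch of that pre_hook workaround in dbt_project.yml (the project name and property are illustrative, and only properties that Spark allows to be SET per session will take effect this way):

models:
  my_dbt_project:
    +pre-hook: "SET spark.sql.shuffle.partitions = 8"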

Who will this benefit?

Any Spark user can benefit from this feature. In particular, use cases where many applications are submitted with different configs or memory requirements would benefit, since users would not need to change the cluster or warehouse config directly. This solution can be discussed and generalized for any adapter.

Are you interested in contributing this feature?

Yes. I would love to work on this feature and contribute it back to the community.

Anything else?

What can be done

Currently, server_side_parameters can be set in the profile, per #201; a sketch of that shape is below. Code changes can be made to read this info and apply it when starting the Spark session in sessions.py. But suppose multiple dbt projects refer to the same profile while their application properties or memory requirements differ; in that case, the user again has to intervene and update the profiles.yml file, which is not what we want.
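
For illustration, this is roughly the shape it would take for the session method, following the profile-level approach from #201 (profile name, schema, and values are placeholders):

my_spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: session
      schema: my_schema
      host: NA  # placeholder; not used by the session method
      server_side_parameters:
        spark.driver.memory: 4g
        spark.executor.memoryOverhead: 512m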

Proposal

I am planning to introduce a feature to support the above use case. Please evaluate this so we can discuss it: I am planning to introduce a new, generalized config file that supports adding custom properties specific to each dbt project. That way, the use cases above would be satisfied.

I am planning to contribute this feature to open source, and I need help from the community to evaluate and finalize a solution.

@akashrn5 akashrn5 added enhancement New feature or request triage labels Nov 29, 2022
@github-actions github-actions bot changed the title [Feature] <Support passing application level properties for spark Session connection type.> [CT-1561] [Feature] <Support passing application level properties for spark Session connection type.> Nov 29, 2022
@akashrn5
Contributor Author

akashrn5 commented Dec 9, 2022

@jtcohen6 @dbeatty10 could you, and the core team, have a look at this feature suggestion? This is a very common use case and a much-needed feature.

@akashrn5 akashrn5 changed the title [CT-1561] [Feature] <Support passing application level properties for spark Session connection type.> [CT-1561] [Feature] Support passing application level properties for spark Session connection type. Dec 9, 2022
@Fleid
Contributor

Fleid commented Feb 15, 2023

Hi @akashrn5, I have a feeling this is the same problem for session that is solved for thrift here:

Would it make sense to adopt a similar approach, adding a server_side_parameters configuration to the profile?

@akashrn5
Contributor Author

akashrn5 commented Feb 21, 2023

Hi @Fleid, thank you for responding to the issue; I had been waiting for a long time.
Yes, you are right. I also mentioned in the description that we can reuse server_side_parameters, which can be defined in the profiles.yml file, and make the necessary code changes to apply these properties when initializing the Spark session.

But I also mentioned the issue: if two projects refer to the same profile and have different property or config requirements, which is a valid use case, this approach fails to satisfy it. So it would be better to have the config at the project level, in the dbt_project.yml file, to satisfy all the conditions.

What do you think?

Also, I would love to contribute this feature: I have already made these changes in my custom application with the dbt CLI, and they work. If approved, I would love to contribute the code to the community.

Hoping to get clarity from the community soon. Thank you :)

@Fleid
Contributor

Fleid commented Feb 27, 2023

In #591, we're discussing adding a new parameter to the profile, of type string, so it can be overloaded via an environment variable at each invocation.

But as you mention, we could also move that field into the project file (and use variables if need be).
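
For instance, a hedged sketch of the environment-variable overload in the profile, using dbt's env_var function with a default (the variable name is hypothetical):

server_side_parameters:
  spark.driver.memory: "{{ env_var('DBT_SPARK_DRIVER_MEMORY', '2g') }}"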

The thing is that we're currently looking at the issue pragmatically: where it can fit easily, rather than where it should live in the first place.
Could you please elaborate on the kinds of settings that are intended to be used here? What is the scope of their application? Is it more the "session/connection" level, or the model level?

@akashrn5
Contributor Author

@Fleid sorry for the delay in replying.
Basically, I am talking about the project level. Multiple projects can refer to a single profile, so adding logic with an environment variable would be additional work for the user. If we add the config at the project level, it is easy for the user to set it for a specific project whenever the project is configured.

So there would be model-level, profile-level, and project-level config. Priority could be assigned as model level >> project level >> profile level. This makes more sense and would be the better fix from a long-term perspective.

@Fleid
Contributor

Fleid commented Mar 24, 2023

Hey @akashrn5, could you please check this discussion? I'm trying to regroup all the threads on this topic in one place.
Please let me know if the plan works for you before we decide to move forward with this specific issue.

@akashrn5
Contributor Author

akashrn5 commented Apr 1, 2023

Hey @akashrn5, could you please check this discussion? I'm trying to regroup all the threads on this topic in one place. Please let me know if the plan works for you before we decide to move forward with this specific issue.

Hi @Fleid, yeah, sure, this works. We can discuss it and make a decision.

@Fleid Fleid added refinement Product or leadership input needed and removed triage labels Apr 5, 2023
@JCZuurmond
Collaborator

Hi @akashrn5, could you rewrite (the title of) the issue to be about defining the server_side_parameters inside the dbt_project.yml, with priority given as profile > dbt_project.yml?

The Spark session server_side_parameters are described in #690 and solved in #823.

@dbeatty10 dbeatty10 changed the title [CT-1561] [Feature] Support passing application level properties for spark Session connection type. [CT-1561] [Feature] Support defining server_side_parameters inside of dbt_project.yml Jul 7, 2023
@dbeatty10
Contributor

@JCZuurmond I just changed the title. We can continue to iterate as needed -- just reach out if you think we should change it further.

@akashrn5
Contributor Author

akashrn5 commented Jul 9, 2023

Hi @JCZuurmond @dbeatty10, sorry for the delay in responding. @dbeatty10, thanks for changing the title. Can I take up this feature of bringing the config to the project level, with the profile one taking priority? I wanted to work on making it work with the session connection type, but I see @JCZuurmond has already made those changes and raised a PR.

@JCZuurmond
Collaborator

@akashrn5 it would be great if you implemented this!

An open question I have: where in the dbt_project.yml can we define the server_side_parameters? Maybe someone from dbt labs can help here: @Fleid and @dbeatty10.

If I understand the dbt_project.yml correctly, the fields with a + in front are the fields intended to be passed on to the adapters. The "normal" fields - without the + prefix - are dbt-core fields. I expect that we cannot change the dbt-core fields, so I expect we will use a field with the + prefix.

Furthermore, I have only seen fields with a + prefix defined under the models field, not at the top level of the dbt_project.yml or under any other dbt-core field.

This implies that we have to put the server_side_parameters as a +server_side_parameters under models (or under any subfield of models).
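
For illustration, a hypothetical sketch of what that would look like in dbt_project.yml (project name and property are placeholders):

models:
  my_dbt_project:
    +server_side_parameters:
      spark.sql.shuffle.partitions: 8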

I have some doubts about defining the server_side_parameters at the model level. Some (or most?) Spark configuration parameters are set once per connection, so if you define server_side_parameters for a subset of the models, they may still be in effect for the other models that run downstream, because the server_side_parameters are not reset after the models run.

I would prefer to define the server_side_parameters at the top level of the dbt_project.yml, or maybe under the profile.

@akashrn5
Contributor Author

@JCZuurmond thanks for your comments and concerns. Here is my take on this; I would love your feedback, and also @Fleid's and @dbeatty10's.

You are right about the + sign. But if you define it under the models directory, it is just an alternative to configuring it in the config block of the model files. Also, those would be Spark properties that support dynamic SET. It would not be applied at the whole-project level, we could not define a clear hierarchy, and our use cases would not be satisfied.

So I would suggest a new key like the one below, with an adapter-configs key:

name: 'demo_dbt_project'
config-version: 2
version: 2
profile: 'demo_spark_profile'
adapter-configs:
  spark.driver.memory: 4g
  spark.executor.memoryOverhead: 512m

This way we can make it a bit more generalized. Please give your feedback on this.

@JCZuurmond
Collaborator

Something like that makes sense to me, maybe with a subfield server_side_parameters, like so:

adapter-configs:
  server_side_parameters:
    spark.driver.memory: 4g
    spark.executor.memoryOverhead: 512m

Is that something that dbt-core allows/supports?

A note on the dash - vs the underscore _: I am fine with either, as long as we are consistent.

@dbeatty10
Contributor

What do you think @dataders?

@dataders
Contributor

dataders commented Aug 3, 2023

hey @akashrn5 -- really appreciate both how clearly you've spelled out the use case as well as your diligence in following up and providing more context.

What you're asking for is something we as the Core maintainers would like to happen: namely, that all of these configurations can be set at the model level. But that requires significant investment both within dbt-core and in the various database drivers upon which dbt adapters depend. This might be possible within just dbt-spark, but I'm not sure how that would happen.

If you'll allow me to paraphrase your ask as follows, then perhaps YAML anchors might serve as a workaround:

I want to avoid the redundancy of setting the same server_side_parameters across the various profiles I use

Below is an example where I define an anchor &server_params with my dev target that is re-used in the prd target. You can get fancier, but this should show the potential.

jaffle_shop:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: path/to/driver
      schema: my_schema
      host: myorg.sparkhost.com
      token: abc123
      cluster: '123'
      server_side_parameters: &server_params
        spark.driver.memory: 4g
        spark.executor.memoryOverhead: 512m
    prd:
      type: spark
      method: odbc
      driver: path/to/driver
      schema: my_schema
      host: myorg_PROD.sparkhost.com
      token: abc456
      cluster: '456'
      server_side_parameters: *server_params

For now, I'm going to close this issue, but please feel free to respond and let me know what you think of my proposed workaround.

@dataders dataders closed this as not planned Aug 3, 2023