add dbt_ml_inline_preprocessing #327

Merged Aug 19, 2024 (2 commits)

Conversation

Matts52 (Contributor) commented Aug 16, 2024

Description

This package lets users apply common machine learning preprocessing techniques inline in their SQL select statements, requiring model references only when strictly necessary.

Techniques that can be performed inline include:

  • categorical imputation
  • numerical imputation
  • random imputation
  • label encoding
  • one hot encoding
  • rare category encoding
  • interaction terms
  • k bins discretization
  • log transformation
  • max absolute value scaling
  • min/max scaling
  • numerical binarization
  • robust scaling
  • standardization

This lets those doing their machine learning preprocessing in dbt do so in a way that is easier to follow, more interchangeable, and more flexible.

Link to your package's repository: https://github.com/Matts52/dbt-ml-inline-preprocessing

Checklist

This checklist is a cut-down version of the best practices that we have identified as the package hub has grown. Although meeting these checklist items is not a prerequisite to being added to the Hub, we have found that packages which don't conform provide a worse user experience.

First run experience

  • (Required): The package includes a licence file detectable by GitHub, such as the Apache 2.0 or MIT licence.
  • The package includes a README which explains how to get started with the package and customise its behaviour
  • The README indicates which data warehouses/platforms are expected to work with this package

Customisability

  • The package uses ref or source, instead of hard-coding table references.

Packages for data transformation (delete if not relevant):

  • provide a mechanism (such as variables) to customise the location of source tables.
  • do not assume database/schema names in sources.

Dependencies

Dependencies on dbt Core

  • The package has set a supported require-dbt-version range in dbt_project.yml. Example: A package which depends on functionality added in dbt Core 1.2 should set its require-dbt-version property to [">=1.2.0", "<2.0.0"].
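
As a minimal sketch of the item above, the fragment below shows where the property sits in dbt_project.yml. The version range is the one quoted in the checklist example; it is an illustration, not necessarily the range this package declares.

```yaml
# dbt_project.yml (fragment)
# Range taken from the checklist example above; adjust it to the
# dbt Core features the package actually depends on.
require-dbt-version: [">=1.2.0", "<2.0.0"]
```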

Dependencies on other packages defined in packages.yml:

  • Dependencies are imported from the dbt Package Hub when available, as opposed to a git installation.
  • Dependencies contain the widest possible range of supported versions, to minimise issues in dependency resolution.
  • In particular, dependencies are not pinned to a patch version unless there is a known incompatibility.

Interoperability

  • The package does not override dbt Core behaviour in such a way as to impact other dbt resources (models, tests, etc) not provided by the package.
  • The package uses the cross-database macros built into dbt Core where available, such as {{ dbt.except() }} and {{ dbt.type_string() }}.
  • The package disambiguates its resource names to avoid clashes with nodes that are likely to already exist in a project. For example, packages should not provide a model simply called users.

Versioning

  • (Required): The package's git tags validate against the regex defined in version.py
  • The package's version follows the guidance of Semantic Versioning 2.0.0. (Note in particular the recommendation for production-ready packages to be version 1.0.0 or above)

joellabes (Contributor) previously approved these changes Aug 19, 2024 and left a comment:

Looks good! Please note that in your readme when you say

To import this package into your dbt project, add the following to either the packages.yml or dbt_project.yml file:

It should actually be packages.yml or dependencies.yml files - you can't specify packages in dbt_project.yml.
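
For illustration, a packages.yml entry (the same `packages:` key also works in dependencies.yml) might look like the sketch below. It uses a git installation pointing at the repository URL linked in the description; the `revision` value is a placeholder, since no tag or version is given in this thread.

```yaml
# packages.yml (or dependencies.yml), not dbt_project.yml
packages:
  - git: "https://github.com/Matts52/dbt-ml-inline-preprocessing"
    # Placeholder: replace with a real tag or commit from the repository.
    revision: "REPLACE_WITH_TAG_OR_COMMIT"
```

After editing the file, running `dbt deps` installs the listed packages.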

@joellabes joellabes merged commit 259cefc into dbt-labs:main Aug 19, 2024
3 checks passed
Matts52 (Contributor, Author) commented Aug 20, 2024

Great, thanks for catching that, updated!
