Create dbt package generator using GenAI and PyAirbyte #6

Open
aaronsteers opened this issue May 16, 2024 · 7 comments
@aaronsteers

aaronsteers commented May 16, 2024

Summary

The goal with this application is to take a specific raw data schema for a source being run with PyAirbyte and to auto-generate a simple dbt project for that data.

This could be the foundation of a new type of integration opportunity for Airbyte users.

Definition of Done

These are not specifically related to GenAI, but are the foundation of the code-gen:

  • Solution should be written in Python.
  • Solution can be written as a new feature in PyAirbyte or as a standalone project. (Author's preference, but probably easier/faster as a new project that just calls PyAirbyte.)
  • Solution should be able to generate a basic dbt project scaffold, including a basic "profiles" yaml and "dbt_project.yml". (Okay if these are hard-coded or hand-written as generic boilerplate.)
  • Solution should be able to generate a "sources" yaml file for one or more sources that are being extracted using PyAirbyte. This should describe the tables being used.
  • Solution should be executable with dbt run, as proof of a working solution.
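As a rough illustration of the scaffold items above, here is a minimal sketch that writes hard-coded boilerplate for "dbt_project.yml", a DuckDB "profiles.yml", and a "sources" yaml. The function names, the source name airbyte_raw, and the schema name main are assumptions made for this example and are not part of PyAirbyte or dbt. The profile assumes the dbt-duckdb adapter is installed.

```python
from pathlib import Path

# Generic boilerplate; hard-coded on purpose, as the issue allows.
DBT_PROJECT_YML = """\
name: {project}
version: "1.0.0"
config-version: 2
profile: {project}
model-paths: ["models"]
"""

# Assumes the dbt-duckdb adapter; db_path points at the DuckDB file
# that holds the raw tables.
PROFILES_YML = """\
{project}:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: {db_path}
      schema: main
"""


def render_sources_yml(source_name: str, schema: str, tables: list[str]) -> str:
    """Render a dbt sources file listing the raw tables."""
    lines = [
        "version: 2",
        "sources:",
        f"  - name: {source_name}",
        f"    schema: {schema}",
        "    tables:",
    ]
    lines += [f"      - name: {table}" for table in tables]
    return "\n".join(lines) + "\n"


def write_scaffold(root: Path, project: str, db_path: str, tables: list[str]) -> list[Path]:
    """Write the three scaffold files under root and return their paths."""
    (root / "models").mkdir(parents=True, exist_ok=True)
    files = {
        root / "dbt_project.yml": DBT_PROJECT_YML.format(project=project),
        root / "profiles.yml": PROFILES_YML.format(project=project, db_path=db_path),
        # "airbyte_raw" and "main" are illustrative names, not fixed conventions.
        root / "models" / "sources.yml": render_sources_yml("airbyte_raw", "main", tables),
    }
    for path, text in files.items():
        path.write_text(text, encoding="utf-8")
    return sorted(files)
```

Calling write_scaffold(Path("my_project"), "pyairbyte_demo", "raw.duckdb", ["users", "purchases"]) and then running dbt run --profiles-dir . from inside my_project would be one way to check the scaffold, assuming the DuckDB file already contains those tables.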

The GenAI "code gen" application portion of this project is:

  • Solution should be able to generate a dbt model (a .sql file) performing some basic transforms on top of the source table(s) defined in the "sources" yaml.
    • For instance, if the raw data is a 'sales' table, the LLM may create an aggregate table.
    • The LLM may also create 'stage' tables that take the raw schema and map the raw schema to new column names with conformed naming convention and/or conformed data types.
  • Solution should use an LLM to generate the SQL.
  • Instructions to the LLM can be hard-coded to one particular source, but the LLM should be doing the work of generating the SQL.
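One possible shape for the code-gen step is sketched below: build a prompt from the source table's columns, hand it to an LLM, and save the response as a dbt model. The LLM is passed in as a plain callable so that any provider can be plugged in. The names build_prompt, generate_model, and fake_llm, the stg_ file prefix, and the prompt wording are all assumptions for illustration. fake_llm is a stand-in that returns canned SQL and is not a real model.

```python
from pathlib import Path
from typing import Callable


def build_prompt(source: str, table: str, columns: dict[str, str]) -> str:
    """Describe one source table to the LLM and ask for a staging model."""
    column_lines = "\n".join(f"- {name}: {sql_type}" for name, sql_type in columns.items())
    # dbt resolves this Jinja call to the raw table at compile time.
    source_ref = "{{ source('%s', '%s') }}" % (source, table)
    return (
        "Write a dbt staging model for DuckDB.\n"
        f"Select from {source_ref}.\n"
        "Rename columns to a conformed snake_case convention and cast each "
        "column to the listed type.\n"
        "Return only a single SELECT statement, with no commentary.\n"
        f"Columns:\n{column_lines}\n"
    )


def generate_model(
    llm: Callable[[str], str],
    models_dir: Path,
    source: str,
    table: str,
    columns: dict[str, str],
) -> Path:
    """Ask the LLM for SQL and write it to models_dir as stg_<table>.sql."""
    sql = llm(build_prompt(source, table, columns)).strip()
    # Cheap sanity check only; it does not prove the SQL is valid.
    if not sql.lower().startswith(("select", "with")):
        raise ValueError("LLM response does not look like a SELECT statement")
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / f"stg_{table}.sql"
    path.write_text(sql + "\n", encoding="utf-8")
    return path


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client; returns canned SQL."""
    return "select id as user_id from {{ source('airbyte_raw', 'users') }}"
```

Keeping the LLM behind a callable means the file-writing and validation logic can be exercised without network access, and a real client can be swapped in later without touching the rest of the generator.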

In terms of documentation:

  • A README.md will be required for this project.
  • A walkthrough tutorial explaining usage is also required. The walkthrough can exist within the README.md or can be provided in any other format, such as a blog post.
  • A demo video walkthrough is optional.

Suggestions (Per Author's Discretion)

These are some suggestions - but are not required:

  • We suggest using a simple source like source-faker, source-coin-api, or similar.
  • We suggest using DuckDB as a backend, since it makes results easy to replicate locally, doesn't require a paid account, and has good SQL support.

Resources to Assist

  • PyAirbyte can be used to gather json schema for each stream.
  • (@aaronsteers will add more resources and info here shortly.)
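
To show how a stream's JSON schema could feed the generator, here is a minimal sketch that maps JSON schema properties to column names and DuckDB-style types. The type mapping is an assumption chosen for this example, not something defined by PyAirbyte or dbt. In practice the schema dict would come from PyAirbyte (for example via a call along the lines of source.get_stream_json_schema(stream_name), which is worth checking against the installed version).

```python
from typing import Any

# Illustrative mapping from JSON schema types to DuckDB types (an assumption).
JSON_TO_SQL = {
    "string": "VARCHAR",
    "integer": "BIGINT",
    "number": "DOUBLE",
    "boolean": "BOOLEAN",
    "object": "JSON",
    "array": "JSON",
}


def columns_from_json_schema(schema: dict[str, Any]) -> dict[str, str]:
    """Return {column_name: sql_type} for the top-level properties of a schema."""
    columns: dict[str, str] = {}
    for name, prop in schema.get("properties", {}).items():
        json_type = prop.get("type", "string")
        # Nullable fields are often declared as a list such as ["null", "string"].
        if isinstance(json_type, list):
            non_null = [t for t in json_type if t != "null"]
            json_type = non_null[0] if non_null else "string"
        if json_type == "string" and prop.get("format") == "date-time":
            columns[name] = "TIMESTAMP"
        else:
            # Unknown types fall back to VARCHAR.
            columns[name] = JSON_TO_SQL.get(json_type, "VARCHAR")
    return columns
```

The resulting dict can be used both to list columns in the "sources" yaml and to describe the table to the LLM.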
@Hashcode-Ankit

Hi @aaronsteers, can I have more information on this topic? I would love to contribute to it.

@siddhant3030

Can you give us a summary of this?

@marcosmarxm
Member

@Hashcode-Ankit it is yours

@Hashcode-Ankit

Hi @marcosmarxm, thanks for assigning it to me. @aaronsteers, can I expect some more details on it? I have set up Airbyte and built some connectors using Airbyte.
I need more information about how to approach this so that I can break it down into simpler chunks.

@aaronsteers
Author

@Hashcode-Ankit - I've updated the above with a description. Admittedly, this is a large and ambitious project. Let us know if it is still interesting, and/or if you have any questions or changes you would like to propose.

@Hashcode-Ankit

@aaronsteers Yes, as I mentioned on Slack as well, I am very excited about it, and it matches my previous work on dbt and dbt-clickhouse.

@Hashcode-Ankit

Hi @aaronsteers @marcosmarxm, as discussed on Slack, I have pushed the code for it: https://github.com/Hashcode-Ankit/pyairbyte-dbt
I need some more guidance on the task to complete the exact use case. @aaronsteers, let me know if we can connect.


4 participants