Add sql2dbx: LLM-powered SQL to Databricks notebook converter #399


Open
wants to merge 1 commit into base: main

Conversation

@nakazax nakazax commented Apr 25, 2025

Add sql2dbx tool to databrickslabs/sandbox

Overview

This PR adds sql2dbx to the databrickslabs/sandbox repository. sql2dbx is an automation tool that converts SQL files into Databricks notebooks, using Large Language Models (LLMs) guided by system prompts tailored to specific SQL dialects. The tool itself is implemented as a series of Databricks notebooks.

Features

  • Batch processing workflow for SQL file conversion
  • Extensible prompt-based architecture for SQL dialect handling
  • LLM-powered conversion with syntax validation
  • Automatic error correction and cell splitting
  • Direct output as ready-to-use Databricks notebook files (.py format)
  • Support for multiple language models (Claude, Azure OpenAI, etc.)
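
The features above imply a convert-validate-retry loop: the LLM produces notebook code, a syntax check runs, and any errors are fed back for automatic correction. A minimal sketch of that control flow, with entirely hypothetical names (`convert_sql_to_notebook`, `ask_llm`, and `validate` are illustrations, not sql2dbx's actual API):

```python
# Hypothetical sketch only: these names are NOT sql2dbx's actual API.
def convert_sql_to_notebook(sql_text, ask_llm, validate, max_retries=3):
    """Convert SQL text to notebook source, retrying while validation fails."""
    result = ask_llm(f"Convert this SQL to a Databricks notebook:\n{sql_text}")
    for _ in range(max_retries):
        errors = validate(result)
        if not errors:
            return result
        # Feed validation errors back to the model for automatic correction.
        result = ask_llm(f"Fix these errors:\n{errors}\n\nCode:\n{result}")
    raise RuntimeError("conversion still invalid after retries")

# Toy stand-ins that show the control flow without a real LLM or SQL parser:
def fake_llm(prompt):
    return "# Databricks notebook source\nspark.sql('SELECT 1')"

def fake_validate(code):
    return [] if code.startswith("# Databricks notebook source") else ["missing header"]

notebook = convert_sql_to_notebook("SELECT 1;", fake_llm, fake_validate)
```

In the real tool, the model call would go to whichever endpoint is configured (Claude, Azure OpenAI, etc.), but the retry-on-validation-error shape stays the same.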

Sample SQL Dialect Prompts

The tool includes sample YAML-based conversion prompts for:

  • T-SQL (SQL Server, Azure Synapse)
  • Oracle
  • Teradata
  • MySQL/MariaDB
  • PostgreSQL
  • Snowflake
  • Redshift
  • Netezza

Each prompt file contains a system message and few-shot examples tailored to the specific SQL dialect's syntax and semantics.
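
As a rough illustration (not sql2dbx's actual file schema), a prompt specification of that shape could be expanded into a chat-completion message list, with each few-shot example interleaved as a user/assistant turn ahead of the SQL to convert:

```python
# Invented structure for illustration; the real YAML schema may differ.
prompt_spec = {
    "system_message": "You are an expert at converting T-SQL to Databricks SQL.",
    "few_shots": [
        {"input": "SELECT TOP 5 * FROM t", "output": "SELECT * FROM t LIMIT 5"},
    ],
}

def build_messages(spec, sql_text):
    """Expand a dialect prompt spec into a chat-completion message list."""
    messages = [{"role": "system", "content": spec["system_message"]}]
    for shot in spec["few_shots"]:
        # Few-shot pairs become prior user/assistant turns.
        messages.append({"role": "user", "content": shot["input"]})
        messages.append({"role": "assistant", "content": shot["output"]})
    messages.append({"role": "user", "content": sql_text})
    return messages

msgs = build_messages(prompt_spec, "SELECT TOP 10 name FROM users")
```

This layout is why the prompts are extensible: adding a new dialect only means writing a new system message and example pairs, with no code changes.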

Documentation

The main notebook (00_main) serves as the entry point with documentation on the conversion workflow and instructions for creating custom dialect prompts or extending the existing samples.

@nakazax nakazax requested a review from a team as a code owner April 25, 2025 09:42
@nakazax nakazax requested a review from grusin-db April 25, 2025 09:42
@alexott (Contributor) commented Apr 25, 2025

@nakazax please sign your commits; the PR can't be merged until this condition is fulfilled.

Commits must have verified signatures.

@nakazax (Author) commented Apr 26, 2025

@alexott Thanks for your comment. I've added a verified signature to the commit.

@alexott alexott requested a review from Copilot May 7, 2025 11:33
@Copilot (Copilot AI) left a comment
Pull Request Overview

This pull request adds the sql2dbx tool to the databrickslabs/sandbox repository, which automates the conversion of SQL files into Databricks notebooks using an LLM-powered, prompt-based approach. Key changes include:

  • Adding multiple notebooks for various SQL dialects (PostgreSQL, Oracle, Netezza, MySQL) that are auto-converted from SQL scripts.
  • Implementing batch processing workflows and error handling patterns in each notebook for converting SQL into .py notebooks.
  • Including sample YAML-based system prompts and documentation in the README to guide users.

Reviewed Changes

Copilot reviewed 71 out of 82 changed files in this pull request and generated 1 comment.

Reviewed files:

  • sql2dbx/examples/postgresql/output/postgresql_example2_stored_procedure.py: Creates a notebook with transaction-like logic using separate MERGE operations for backup and value capping.
  • sql2dbx/examples/postgresql/output/postgresql_example1_multi_statement_transformation.py: Sets up a products table with discount-based price updates via MERGE and deletion rules.
  • sql2dbx/examples/oracle/output/oracle_example2_stored_procedure.py: Implements stored procedure logic with parameter widgets and threshold updates using MERGE.
  • sql2dbx/examples/oracle/output/oracle_example1_multi_statement_transformation.py: Demonstrates multi-statement transformations and discount operations on product data.
  • sql2dbx/examples/netezza/output/netezza_example2_stored_procedure.py: Provides a stored procedure for adjusting thresholds with rollback simulation.
  • sql2dbx/examples/netezza/output/netezza_example1_multi_statement_transformation.py: Handles multi-statement data transformations and discount applications.
  • sql2dbx/examples/mysql/output/mysql_example2_stored_procedure.py: Performs threshold checks and updates with rollback simulation in a MySQL context.
  • sql2dbx/examples/mysql/output/mysql_example1_multi_statement_transformation.py: Implements multi-statement order transformations with discount adjustments and cleanup.
  • sql2dbx/README.md: Offers documentation on tool usage and setup instructions for integrating sql2dbx with Databricks notebooks.
Files not reviewed (11)
  • sql2dbx/.gitignore: Language not supported
  • sql2dbx/examples/mysql/input/mysql_example1_multi_statement_transformation.sql: Language not supported
  • sql2dbx/examples/mysql/input/mysql_example2_stored_procedure.sql: Language not supported
  • sql2dbx/examples/netezza/input/netezza_example1_multi_statement_transformation.sql: Language not supported
  • sql2dbx/examples/netezza/input/netezza_example2_stored_procedure.sql: Language not supported
  • sql2dbx/examples/oracle/input/oracle_example1_multi_statement_transformation.sql: Language not supported
  • sql2dbx/examples/oracle/input/oracle_example2_stored_procedure.sql: Language not supported
  • sql2dbx/examples/postgresql/input/postgresql_example1_multi_statement_transformation.sql: Language not supported
  • sql2dbx/examples/postgresql/input/postgresql_example2_stored_procedure.sql: Language not supported
  • sql2dbx/examples/redshift/input/redshift_example1_multi_statement_transformation.sql: Language not supported
  • sql2dbx/examples/redshift/input/redshift_example2_stored_procedure.sql: Language not supported
Comments suppressed due to low confidence (1)

sql2dbx/examples/oracle/output/oracle_example2_stored_procedure.py:16

  • [nitpick] Consider renaming 'p_multiplier' to 'multiplier' (or a similarly clear name) to align with the widget name and improve code clarity.
p_multiplier = float(dbutils.widgets.get("multiplier"))

Comment on lines +54 to +55
# Update forecast table: first backup original values
# Using MERGE instead of UPDATE FROM since Databricks doesn't support UPDATE FROM
Copilot AI commented May 7, 2025

[nitpick] Consider evaluating whether the two separate MERGE statements (one for backing up original values and another for capping forecast values) can be consolidated to reduce duplicate logic and improve maintainability, provided the business logic permits.

Suggested change
# Update forecast table: first backup original values
# Using MERGE instead of UPDATE FROM since Databricks doesn't support UPDATE FROM
# Update forecast table: backup original values and cap forecast values
# Using a single MERGE statement to reduce duplication
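
For context on the pattern this comment discusses: Databricks SQL does not support `UPDATE ... FROM` with a join against another table, so the generated notebooks express join-based updates as `MERGE INTO`. A hedged sketch of the rewrite, with invented table and column names:

```python
# Illustration of the UPDATE ... FROM -> MERGE INTO rewrite the generated
# notebooks rely on. Table/column names (forecast, caps, value, cap) are
# invented for this example; run the SQL through spark.sql(...) on Databricks.

update_from_sql = """
-- Join-based UPDATE, not supported by Databricks SQL:
UPDATE f
SET f.value = c.cap
FROM forecast f
JOIN caps c ON f.id = c.id
WHERE f.value > c.cap
"""

merge_sql = """
-- Equivalent MERGE on Databricks:
MERGE INTO forecast AS f
USING caps AS c
ON f.id = c.id AND f.value > c.cap
WHEN MATCHED THEN UPDATE SET f.value = c.cap
"""
```

Whether the two MERGE statements in this notebook can actually be collapsed into one depends on whether both target the same table; the suggestion above is conditional on the business logic permitting it.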

