
Merge pull request #45 from databrickslabs/feature/issue_41
Feature/issue 41
ravi-databricks authored Apr 15, 2024
2 parents 077c71d + 7b2d6b1 commit 423a776
Showing 20 changed files with 135 additions and 194 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -115,6 +115,7 @@ celerybeat.pid
# Environments
.env
.venv
.venvclear/
env/
venv/
ENV/
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

## [v.0.0.7]
- Added dlt-meta CLI documentation and README with browser support: [PR](https://github.com/databrickslabs/dlt-meta/pull/45)

## [v.0.0.6]
- Migrated to the create streaming table API from create streaming live table: [PR](https://github.com/databrickslabs/dlt-meta/pull/39)

## [v.0.0.5]
- Enabled Unity Catalog support: [PR](https://github.com/databrickslabs/dlt-meta/pull/28)
201 changes: 98 additions & 103 deletions README.md
@@ -2,18 +2,21 @@

<!-- Top bar will be removed from PyPi packaged versions -->
<!-- Dont remove: exclude package -->

[Documentation](https://databrickslabs.github.io/dlt-meta/) |
[Release Notes](CHANGELOG.md) |
[Examples](https://github.com/databrickslabs/dlt-meta/tree/main/examples)

<!-- Dont remove: end exclude package -->

---

<p align="left">
<a href="https://databrickslabs.github.io/dlt-meta/">
<img src="https://img.shields.io/badge/DOCS-PASSING-green?style=for-the-badge" alt="Documentation Status"/>
</a>
<a href="https://pypi.org/project/dlt-meta/">
<img src="https://img.shields.io/badge/PYPI-v%200.0.1-green?style=for-the-badge" alt="Latest Python Release"/>
<img src="https://img.shields.io/badge/PYPI-v%200.0.7-green?style=for-the-badge" alt="Latest Python Release"/>
</a>
<a href="https://github.com/databrickslabs/dlt-meta/actions/workflows/onpush.yml">
<img src="https://img.shields.io/github/workflow/status/databrickslabs/dlt-meta/build/main?style=for-the-badge"
@@ -23,13 +23,6 @@
<img src="https://img.shields.io/codecov/c/github/databrickslabs/dlt-meta?style=for-the-badge&amp;token=2CxLj3YBam"
alt="codecov"/>
</a>
<a href="https://lgtm.com/projects/g/databrickslabs/dlt-meta/alerts">
<img src="https://img.shields.io/lgtm/alerts/github/databricks/dlt-meta?style=for-the-badge" alt="lgtm-alerts"/>
</a>
<a href="https://lgtm.com/projects/g/databrickslabs/dlt-meta/context:python">
<img src="https://img.shields.io/lgtm/grade/python/github/databrickslabs/dbx?style=for-the-badge"
alt="lgtm-code-quality"/>
</a>
<a href="https://pypistats.org/packages/dl-meta">
<img src="https://img.shields.io/pypi/dm/dlt-meta?style=for-the-badge" alt="downloads"/>
</a>
@@ -39,142 +35,141 @@
</a>
</p>

[![lines of code](https://tokei.rs/b1/github/databrickslabs/dlt-meta)]([https://codecov.io/github/databrickslabs/dlt-meta](https://github.com/databrickslabs/dlt-meta))
[![lines of code](https://tokei.rs/b1/github/databrickslabs/dlt-meta)](<[https://codecov.io/github/databrickslabs/dlt-meta](https://github.com/databrickslabs/dlt-meta)>)

---

# Project Overview
```DLT-META``` is a metadata-driven framework based on Databricks [Delta Live Tables](https://www.databricks.com/product/delta-live-tables) (aka DLT) which lets you automate your bronze and silver data pipelines.

With this framework you need to record the source and target metadata in an onboarding json file which acts as the data flow specification aka Dataflowspec. A single generic ```DLT``` pipeline takes the ```Dataflowspec``` and runs your workloads.
`DLT-META` is a metadata-driven framework based on Databricks [Delta Live Tables](https://www.databricks.com/product/delta-live-tables) (aka DLT) that lets you automate your bronze and silver data pipelines.

With this framework, you record the source and target metadata in an onboarding JSON file, which acts as the data flow specification (aka Dataflowspec). A single generic `DLT` pipeline takes the `Dataflowspec` and runs your workloads.
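
To make this concrete, here is a hedged sketch of what a single Dataflowspec entry could look like, expressed as a Python dict. The field names are illustrative assumptions, not the framework's exact schema; the onboarding file linked below defines the real format.

```python
# Illustrative only: these keys are hypothetical, not DLT-META's exact schema.
# The authoritative format is the onboarding file in the examples folder.
dataflow_spec_entry = {
    "data_flow_id": "100",                      # unique id for this flow
    "data_flow_group": "A1",                    # flows in a group run as one pipeline
    "source_format": "cloudFiles",              # e.g. an Auto Loader source
    "source_details": {"path": "s3://example-bucket/customers/"},
    "bronze_table": "customers",                # bronze target table
    "bronze_data_quality_expectations_json": "dqe/customers.json",
    "silver_table": "customers_clean",          # silver target table
    "silver_transformation_json": "silver_transformations.json",
}
```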

### Components:

#### Metadata Interface

- Capture input/output metadata in [onboarding file](https://github.com/databrickslabs/dlt-meta/blob/main/examples/onboarding.json)
- Capture [Data Quality Rules](https://github.com/databrickslabs/dlt-meta/tree/main/examples/dqe/customers/bronze_data_quality_expectations.json)
- Capture processing logic as SQL in [Silver transformation file](https://github.com/databrickslabs/dlt-meta/blob/main/examples/silver_transformations.json); a hedged sketch follows this list
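
For instance, a single silver transformation entry could pair a target table with the SQL expressions applied to it (a minimal sketch; the keys are assumptions, and the linked silver_transformations.json example is authoritative):

```python
# Hypothetical shape of one silver-transformation entry; consult the linked
# silver_transformations.json example for the real keys.
silver_transformation = {
    "target_table": "customers_clean",
    "select_exp": [
        "cast(id as int) as customer_id",  # SQL applied when building silver
        "upper(country) as country",
    ],
}
```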

#### Generic DLT pipeline

- Apply appropriate readers based on input metadata
- Apply data quality rules with DLT expectations
- Apply CDC via DLT apply changes if specified in metadata
- Build the DLT graph based on input/output metadata
- Launch the DLT pipeline (a minimal sketch of this flow follows below)
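
As a rough illustration of how these pieces fit together, the sketch below builds bronze tables from metadata using the Databricks `dlt` Python API. It is a minimal sketch under stated assumptions, not this repo's implementation: `load_dataflow_specs` is a hypothetical helper, and `spark` is assumed to be available as it is inside a DLT pipeline.

```python
# Minimal metadata-driven pipeline sketch -- not DLT-META's actual code.
# Assumes execution inside a Delta Live Tables pipeline, where the `dlt`
# module and a `spark` session are available.
import dlt

def create_bronze_table(spec):
    @dlt.table(name=spec["bronze_table"])                  # target from metadata
    @dlt.expect_all_or_drop(spec.get("expectations", {}))  # data quality rules
    def bronze():
        # Pick the reader from metadata, e.g. Auto Loader for cloudFiles sources
        return (spark.readStream
                .format(spec["source_format"])
                .load(spec["source_details"]["path"]))

for spec in load_dataflow_specs(group="A1"):  # hypothetical metadata loader
    create_bronze_table(spec)
```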

## High-Level Process Flow:

![DLT-META High-Level Process Flow](./docs/static/images/solutions_overview.png)

## Steps

![DLT-META Stages](./docs/static/images/dlt-meta_stages.png)

## Getting Started

Refer to the [Getting Started](https://databrickslabs.github.io/dlt-meta/getting_started) guide.

### Databricks Labs DLT-META CLI lets you run onboarding and deployment from an interactive Python terminal

#### pre-requisites:
- [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/tutorial.html)

- Python 3.8.0+
#### Steps:
- ```git clone dlt-meta```
- ```cd dlt-meta```
- ```python -m venv .venv```
- ```source .venv/bin/activate```
- ```pip install databricks-sdk```
- ```databricks labs dlt-meta onboard```
  - The above command will prompt you to provide onboarding details. If you have cloned the dlt-meta git repo, accept the defaults, which will load the config from the demo folder.

```
Provide onboarding file path (default: demo/conf/onboarding.template):
Provide onboarding files local directory (default: demo/):
Provide dbfs path (default: dbfs:/dlt-meta_cli_demo):
Provide databricks runtime version (default: 14.2.x-scala2.12):
Run onboarding with unity catalog enabled?
[0] False
[1] True
Enter a number between 0 and 1: 1
Provide unity catalog name: ravi_dlt_meta_uc
Provide dlt meta schema name (default: dlt_meta_dataflowspecs_203b9da04bdc49f78cdc6c379d1c9ead):
Provide dlt meta bronze layer schema name (default: dltmeta_bronze_cf5956873137432294892fbb2dc34fdb):
Provide dlt meta silver layer schema name (default: dltmeta_silver_5afa2184543342f98f87b30d92b8c76f):
Provide dlt meta layer
[0] bronze
[1] bronze_silver
[2] silver
Enter a number between 0 and 2: 1
Provide bronze dataflow spec table name (default: bronze_dataflowspec):
Provide silver dataflow spec table name (default: silver_dataflowspec):
Overwrite dataflow spec?
[0] False
[1] True
Enter a number between 0 and 1: 1
Provide dataflow spec version (default: v1):
Provide environment name (default: prod): prod
Provide import author name (default: ravi.gawai):
Provide cloud provider name
[0] aws
[1] azure
[2] gcp
Enter a number between 0 and 2: 0
Do you want to update ws paths, catalog, schema details to your onboarding file?
[0] False
[1] True
```

- Databricks CLI v0.213 or later. See [instructions](https://docs.databricks.com/en/dev-tools/cli/tutorial.html)

- Install Databricks CLI on macOS:
- ![macos_install_databricks](docs/static/images/macos_1_databrickslabsmac_installdatabricks.gif)

- Install Databricks CLI on Windows:
- ![windows_install_databricks.png](docs/static/images/windows_install_databricks.png)

Once you install the Databricks CLI, authenticate your current machine to a Databricks workspace:

```commandline
databricks auth login --host WORKSPACE_HOST
```

To enable debug logs, add the `--debug` flag to any command.
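
As a quick sanity check that authentication worked, you can ask the workspace who you are via the Python SDK (a minimal sketch, assuming `databricks-sdk` is installed and the default profile was created by `databricks auth login`):

```python
# Auth sanity check: prints the authenticated user's name.
# Assumes databricks-sdk is installed and a default profile is configured.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()                 # picks up credentials from the profile/env
print(w.current_user.me().user_name)  # e.g. "first.last@example.com"
```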

### Installing dlt-meta:

- Install dlt-meta via Databricks CLI:

```commandline
databricks labs install dlt-meta
```

### Onboard using dlt-meta CLI:

If you want to run the existing demo files, please follow these steps before running the onboard command:

```commandline
git clone https://github.com/databrickslabs/dlt-meta.git
```

```commandline
cd dlt-meta
```

```commandline
python -m venv .venv
```

```commandline
source .venv/bin/activate
```

```commandline
pip install databricks-sdk
```

```commandline
databricks labs dlt-meta onboard
```

![onboardingDLTMeta.gif](docs/static/images/onboardingDLTMeta.gif)

The above command will prompt you to provide onboarding details. If you have cloned the dlt-meta git repo, accept the defaults, which will load the config from the demo folder.
![onboardingDLTMeta_2.gif](docs/static/images/onboardingDLTMeta_2.gif)


- Go to your Databricks workspace and locate the onboarding job under: Workflows -> Job runs

### Deploy using dlt-meta CLI:

- Once the onboarding job has finished, deploy the `bronze` and `silver` DLT pipelines using the command below:

```commandline
databricks labs dlt-meta deploy
```

- The above command will prompt you for DLT details. Please provide the respective details for the schema you provided in the steps above.
- Bronze DLT
```
Deploy DLT-META with unity catalog enabled?
[0] False
[1] True
Enter a number between 0 and 1: 1
Provide unity catalog name: ravi_dlt_meta_uc
Deploy DLT-META with serverless?
[0] False
[1] True
Enter a number between 0 and 1: 1
Provide dlt meta layer
[0] bronze
[1] silver
Enter a number between 0 and 1: 0
Provide dlt meta onboard group: A1
Provide dlt_meta dataflowspec schema name: dlt_meta_dataflowspecs_203b9da04bdc49f78cdc6c379d1c9ead
Provide bronze dataflowspec table name (default: bronze_dataflowspec):
Provide dlt meta pipeline name (default: dlt_meta_bronze_pipeline_2aee3eb837f3439899eef61b76b80d53):
Provide dlt target schema name: dltmeta_bronze_cf5956873137432294892fbb2dc34fdb
```

![deployingDLTMeta_bronze.gif](docs/static/images/deployingDLTMeta_bronze.gif)


- Silver DLT

```commandline
databricks labs dlt-meta deploy
```

- The above command will prompt you for DLT details. Please provide the respective details for the schema you provided in the steps above.
```
Deploy DLT-META with unity catalog enabled?
[0] False
[1] True
Enter a number between 0 and 1: 1
Provide unity catalog name: ravi_dlt_meta_uc
Deploy DLT-META with serverless?
[0] False
[1] True
Enter a number between 0 and 1: 1
Provide dlt meta layer
[0] bronze
[1] silver
Enter a number between 0 and 1: 1
Provide dlt meta onboard group: A1
Provide dlt_meta dataflowspec schema name: dlt_meta_dataflowspecs_203b9da04bdc49f78cdc6c379d1c9ead
Provide silver dataflowspec table name (default: silver_dataflowspec):
Provide dlt meta pipeline name (default: dlt_meta_silver_pipeline_2147545f9b6b4a8a834f62e873fa1364):
Provide dlt target schema name: dltmeta_silver_5afa2184543342f98f87b30d92b8c76f
```

![deployingDLTMeta_silver.gif](docs/static/images/deployingDLTMeta_silver.gif)


## More questions

Refer to the [FAQ](https://databrickslabs.github.io/dlt-meta/faq)
and DLT-META [documentation](https://databrickslabs.github.io/dlt-meta/)

# Project Support

Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
(SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket
relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on the GitHub repo.
7 changes: 4 additions & 3 deletions demo/README.md
@@ -3,7 +3,8 @@
2. [Databricks Techsummit Demo](#databricks-tech-summit-fy2024-demo): automated ingestion of 100s of data sources into bronze and silver DLT pipelines.


# DAIS 2023 DEMO
## [DAIS 2023 Session Recording](https://www.youtube.com/watch?v=WYv5haxLlfA)
This demo launches Bronze and Silver DLT pipelines with the following activities:
- Customer and Transactions feeds for initial load
- Adds new Product and Stores feeds to existing Bronze and Silver DLT pipelines with metadata changes.
@@ -23,7 +24,7 @@ This Demo launches Bronze and Silver DLT pipleines with following activities:
export PYTHONPATH=<<local dlt-meta path>>
```

6. Run the command ```python demo/launch_dais_demo.py --username=<<your databricks username>> --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated_new```
6. Run the command ```python demo/launch_dais_demo.py --source=cloudfiles --uc_catalog_name=<<uc catalog name>> --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/dais-dlt-meta-demo-automated_new```
- cloud_provider_name : aws or azure or gcp
- dbr_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
@@ -61,7 +62,7 @@ This demo will launch auto generated tables(100s) inside single bronze and silve
export PYTHONPATH=<<local dlt-meta path>>
```

6. Run the command ```python demo/launch_techsummit_demo.py --[email protected] --source=cloudfiles --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated ```
6. Run the command ```python demo/launch_techsummit_demo.py --source=cloudfiles --cloud_provider_name=aws --dbr_version=13.3.x-scala2.12 --dbfs_path=dbfs:/techsummit-dlt-meta-demo-automated ```
- cloud_provider_name : aws or azure or gcp
- dbr_version : Databricks Runtime Version
- dbfs_path : Path on your Databricks workspace where demo will be copied for launching DLT-META Pipelines
7 changes: 5 additions & 2 deletions demo/launch_dais_demo.py
@@ -1,4 +1,5 @@
import uuid
import webbrowser
from databricks.sdk.service import jobs
from src.install import WorkspaceInstaller
from integration_tests.run_integration_tests import (
@@ -84,8 +85,10 @@ def launch_workflow(self, runner_conf: DLTMetaRunnerConf):
runner_conf.job_id = created_job.job_id
print(f"Job created successfully. job_id={created_job.job_id}, started run...")
print(f"Waiting for job to complete. run_id={created_job.job_id}")
run_by_id = self.ws.jobs.run_now(job_id=created_job.job_id).result()
print(f"Job run finished. run_id={run_by_id}")
run_by_id = self.ws.jobs.run_now(job_id=created_job.job_id)
url = f"{self.ws.config.host}/jobs/{runner_conf.job_id}/runs/{run_by_id}?o={self.ws.get_workspace_id()}/"
webbrowser.open(url)
print(f"Job launched with url={url}")

def create_daisdemo_workflow(self, runner_conf: DLTMetaRunnerConf):
"""
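
The change above swaps the blocking `.result()` call for a fire-and-forget `run_now` and then opens the run page in a browser. A hedged sketch of the same pattern in isolation (how the run id is read off `run_now()`'s return value varies across databricks-sdk versions, so treat that line as illustrative):

```python
# Sketch of launching a job and opening its run page, mirroring the diff above.
# Assumes databricks-sdk is installed and authenticated; the run-id attribute
# path below is illustrative and varies by SDK version.
import webbrowser
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
job_id = 123                            # hypothetical job id
waiter = w.jobs.run_now(job_id=job_id)  # returns immediately, no .result()
run_id = waiter.response.run_id         # illustrative attribute path
url = f"{w.config.host}/jobs/{job_id}/runs/{run_id}"
webbrowser.open(url)                    # open the run page in the browser
print(f"Job launched with url={url}")
```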
7 changes: 5 additions & 2 deletions demo/launch_techsummit_demo.py
Expand Up @@ -23,6 +23,7 @@
"""

import uuid
import webbrowser
from databricks.sdk.service import jobs
from databricks.sdk.service.catalog import VolumeType, SchemasAPI
from databricks.sdk.service.workspace import ImportFormat
@@ -163,8 +164,10 @@ def launch_workflow(self, runner_conf: DLTMetaRunnerConf):
runner_conf.job_id = created_job.job_id
print(f"Job created successfully. job_id={created_job.job_id}, started run...")
print(f"Waiting for job to complete. run_id={created_job.job_id}")
run_by_id = self.ws.jobs.run_now(job_id=created_job.job_id).result()
print(f"Job run finished. run_id={run_by_id}")
run_by_id = self.ws.jobs.run_now(job_id=created_job.job_id)
url = f"{self.ws.config.host}/jobs/{runner_conf.job_id}/runs/{run_by_id}?o={self.ws.get_workspace_id()}/"
webbrowser.open(url)
print(f"Job launched with url={url}")

def create_techsummit_demo_workflow(self, runner_conf: TechsummitRunnerConf):
"""
1 change: 0 additions & 1 deletion docs/config.toml
@@ -2,7 +2,6 @@ baseURL = 'https://databrickslabs.github.io/dlt-meta/'
languageCode = 'en-us'
title = 'DLT-META'
theme= "hugo-theme-learn"

pluralizeListTitles = false
canonifyURLs = true

