
Added deploy with modal. #1805

Open · wants to merge 30 commits into base: devel from docs/how-to-deploy-using-modal

Changes from 21 commits · 30 commits
3d327f2
Added deploy with modal.
dat-a-man Sep 13, 2024
8a49dce
A few minor fixes
dat-a-man Sep 13, 2024
87a1045
updated links as per comment
dat-a-man Sep 16, 2024
e1c8cbd
Updated as per the comments.
dat-a-man Sep 16, 2024
6bca836
Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-m…
burnash Sep 16, 2024
9da1b27
Updated
dat-a-man Sep 23, 2024
1ad1ee9
Merge branch 'docs/how-to-deploy-using-modal' of https://github.com/d…
dat-a-man Sep 23, 2024
48674f3
Updated as per comments
dat-a-man Sep 23, 2024
ba8cdbe
Updated
dat-a-man Sep 23, 2024
3160477
minor fix for relative link
dat-a-man Sep 23, 2024
177ac20
Merge branch 'devel' into docs/how-to-deploy-using-modal
dat-a-man Oct 4, 2024
fd225f9
Incorporated comments and new script provided.
dat-a-man Oct 7, 2024
ebbd06e
Added the snippets
dat-a-man Oct 9, 2024
2d27c3f
Updated
dat-a-man Oct 9, 2024
c783772
Updated
dat-a-man Oct 9, 2024
a87d1b5
Merge branch 'devel' into docs/how-to-deploy-using-modal
dat-a-man Oct 9, 2024
13bcf7d
updated poetry.lock
dat-a-man Oct 9, 2024
e397ff6
Merge branch 'docs/how-to-deploy-using-modal' of https://github.com/d…
dat-a-man Oct 9, 2024
cc8d5ae
Updated "poetry.lock"
dat-a-man Oct 9, 2024
71efdaa
Added "__init__.py"
dat-a-man Oct 9, 2024
cf1a092
Updated snippets.py
dat-a-man Oct 9, 2024
de31e7d
Updated path in MAKEFILE
dat-a-man Oct 9, 2024
cd04716
Added __init__.py in walkthroughs
dat-a-man Oct 10, 2024
1a2d744
Adjusted for black
dat-a-man Oct 10, 2024
d1fbc18
Modified mypy.ini added a pattern module_name_pattern = '[a-zA-Z0-9_\…
dat-a-man Oct 10, 2024
71ab82f
updated
dat-a-man Oct 10, 2024
9d0c70b
renamed deploy-a-pipeline with deploy_a_pipeline
dat-a-man Oct 10, 2024
4209656
Merge remote-tracking branch 'origin/devel' into docs/how-to-deploy-u…
dat-a-man Oct 11, 2024
1841007
Updated for errors in linting
dat-a-man Oct 11, 2024
e5d9a30
small changes
dat-a-man Oct 11, 2024
1 change: 1 addition & 0 deletions Makefile

@@ -65,6 +65,7 @@ lint-and-test-snippets:
	poetry run mypy --config-file mypy.ini docs/website docs/examples docs/tools --exclude docs/tools/lint_setup --exclude docs/website/docs_processed
	poetry run flake8 --max-line-length=200 docs/website docs/examples docs/tools
	cd docs/website/docs && poetry run pytest --ignore=node_modules
+	modal run docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py

**Contributor** suggested a change:

```diff
- modal run docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py
+ modal run docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py
```

lint-and-test-examples:
	cd docs/tools && poetry run python prepare_examples_tests.py
Empty file.
@@ -0,0 +1,57 @@
```py
import os

from tests.pipeline.utils import assert_load_info


def modal_snippet() -> None:
    # @@@DLT_SNIPPET_START modal_image
    import modal

    # Define the Modal Image
    image = modal.Image.debian_slim().pip_install(
        "dlt>=1.1.0",
        "dlt[duckdb]",  # destination
        "dlt[sql_database]",  # source
        "pymysql",  # database driver for the MySQL source
    )

    app = modal.App("example-dlt", image=image)

    # Modal Volume used to store the duckdb database file
    vol = modal.Volume.from_name("duckdb-vol", create_if_missing=True)
    # @@@DLT_SNIPPET_END modal_image

    # @@@DLT_SNIPPET_START modal_function
    @app.function(
        volumes={"/data/": vol},
        schedule=modal.Period(days=1),
        secrets=[modal.Secret.from_name("sql-secret")],
    )
    def load_tables() -> None:
        import dlt
        from dlt.sources.sql_database import sql_database

        # Define the source database credentials; in production, save them as a
        # Modal Secret, which is exposed to the function as an environment variable:
        # os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = "mysql+pymysql://[email protected]:4497/Rfam"

        # Load tables "family" and "genome"
        source = sql_database().with_resources("family", "genome")

        # Create the dlt pipeline object
        pipeline = dlt.pipeline(
            pipeline_name="sql_to_duckdb_pipeline",
            destination=dlt.destinations.duckdb(
                "/data/rfam.duckdb"  # write the duckdb database file to this location, which is mounted on the Modal Volume
            ),
            dataset_name="sql_to_duckdb_pipeline_data",
            progress="log",  # output progress of the pipeline
        )

        # Run the pipeline
        load_info = pipeline.run(source)

        # Print run statistics
        print(load_info)
        # @@@DLT_SNIPPET_END modal_function

        assert_load_info(load_info)
```
111 changes: 111 additions & 0 deletions docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md
@@ -0,0 +1,111 @@
---
title: Deploy with Modal
description: How to deploy a pipeline with Modal
keywords: [how to, deploy a pipeline, Modal]
canonical: https://modal.com/blog/analytics-stack
---

# Deploy with Modal

## Introduction to Modal

[Modal](https://modal.com/) is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.

With Modal, you can perform tasks like running generative models, large-scale batch jobs, and job queues, all while easily scaling compute resources.

### Modal features

- Serverless Compute: No infrastructure management; scales automatically from zero to thousands of CPUs/GPUs.
- Cloud Functions: Run Python code in the cloud instantly and scale horizontally.
- GPU/CPU Scaling: Easily attach GPUs for heavy tasks like AI model training with a single line of code.
- Web Endpoints: Expose any function as an HTTPS API endpoint quickly.
- Scheduled Jobs: Convert Python functions into scheduled tasks effortlessly.

To learn more, please refer to [Modal's documentation](https://modal.com/docs).

## Building data pipelines with dlt

dlt is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
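The automatic schema inference and evolution mentioned above can be pictured with a toy sketch. This is only an illustration of the idea, not dlt's actual implementation: it derives a column-to-type mapping from a batch of records and widens the schema when later records introduce new fields.

```python
def infer_schema(rows):
    """Infer a column -> type-name mapping from a list of records,
    widening the schema as new fields appear (a toy version of the
    inference dlt performs during normalization)."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            # first sighting of a column fixes its type; rows that carry
            # new columns simply evolve (widen) the schema
            schema.setdefault(column, type(value).__name__)
    return schema


rows = [
    {"id": 1, "name": "family"},
    {"id": 2, "name": "genome", "length": 4.5},  # new column -> schema evolves
]
print(infer_schema(rows))  # {'id': 'int', 'name': 'str', 'length': 'float'}
```

In real pipelines, dlt also handles nested structures, type coercion, and persists the evolved schema alongside the destination dataset.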

### How does dlt integrate with Modal for pipeline orchestration?

As an example of how to set up a pipeline in Modal, we'll use the [building a cost-effective analytics stack with Modal, dlt, and dbt](https://modal.com/blog/analytics-stack) case study.

The example demonstrates automating a workflow to load data from Postgres to Snowflake using dlt.

## How to run dlt on Modal

Here’s a dlt project setup that copies data from a MySQL database into DuckDB:

1. Run the `dlt init` CLI command to initialize the SQL database source and set up the `sql_database_pipeline.py` template.
```sh
dlt init sql_database duckdb
```
2. Open the file and define the Modal Image you want to run `dlt` in:
<!--@@@DLT_SNIPPET modal_image-->

3. Define a Modal Function. A Modal Function is a containerized environment that runs tasks.
It can be scheduled (e.g., daily or on a Cron schedule), request more CPU/memory, and scale across
multiple containers.

Here’s how to include your SQL pipeline in the Modal Function:

<!--@@@DLT_SNIPPET modal_function-->

4. You can securely store your credentials using Modal secrets. When you reference secrets within a Modal script,
the defined secret is automatically set as an environment variable. dlt natively supports environment variables,
enabling seamless integration of your credentials. For example, to declare a connection string, you can define it as follows:
```text
SOURCES__SQL_DATABASE__CREDENTIALS=mysql+pymysql://[email protected]:4497/Rfam
```
   dlt automatically picks up credentials declared this way.
   For more details, please refer to the [documentation](../../general-usage/credentials/setup#environment-variables).

5. Execute the pipeline once:
To run your pipeline a single time, use the following command:
```sh
modal run sql_pipeline.py
```

6. Deploy the pipeline:
If you want to deploy your pipeline on Modal for continuous execution or scheduling, use this command:
```sh
modal deploy sql_pipeline.py
```
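The double-underscore naming used for the connection string above is dlt's convention for mapping environment variables onto nested configuration sections (`SOURCES__SQL_DATABASE__CREDENTIALS` → `sources.sql_database.credentials`). A minimal sketch of that mapping, for illustration only (not dlt's actual resolver):

```python
import os


def dlt_config_key(env_name):
    """Translate a dlt-style environment variable name into its dotted
    config path: sections are split on double underscores and lowercased
    (illustrative sketch of the convention, not dlt's implementation)."""
    return ".".join(part.lower() for part in env_name.split("__"))


# a connection string stored the way a Modal Secret would expose it
os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = "mysql+pymysql://user@host:4497/Rfam"

print(dlt_config_key("SOURCES__SQL_DATABASE__CREDENTIALS"))
# sources.sql_database.credentials
```

Because the Modal Secret is injected as a plain environment variable, no extra wiring is needed: dlt finds the value under the config path it already expects.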

**Collaborator** commented:

Looks like the next step is missing: how does this code end up on Modal? How do you trigger runs?

**Collaborator (Author)** replied:

Added step 5.

Another reviewer commented:

This runs the pipeline once, but it might be worth adding that you need to run `modal deploy` to actually schedule the pipeline.

## Advanced configuration
### Modal Proxy

If your database is in a private VPN, you can use [Modal Proxy](https://modal.com/docs/reference/modal.Proxy) as a bastion server (available for Enterprise customers).
To connect to a production read replica, attach the proxy to the function definition and change the hostname to localhost:
```py
@app.function(
secrets=[
modal.Secret.from_name("snowflake-secret"),
modal.Secret.from_name("postgres-read-replica-prod"),
],
schedule=modal.Cron("24 6 * * *"),
proxy=modal.Proxy.from_name("prod-postgres-proxy", environment_name="main"),
timeout=3000,
)
def task_pipeline(dev: bool = False) -> None:
pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
```
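The `pg_url` assembled above simply interpolates the standard Postgres environment variables that the attached Modal Secret injects, with the host swapped to `localhost` because the Modal Proxy tunnels the connection. The same interpolation as a standalone sketch, using hypothetical values in place of the real secret:

```python
import os

# hypothetical values standing in for what the Modal Secret would inject
os.environ.update(
    {
        "PGUSER": "reader",
        "PGPASSWORD": "s3cret",
        "PGPORT": "5432",
        "PGDATABASE": "prod",
    }
)

# identical f-string to the function body above; host is localhost
# because the proxy forwards traffic to the private read replica
pg_url = (
    f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}'
    f'@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
)
print(pg_url)  # postgresql://reader:s3cret@localhost:5432/prod
```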

### Capturing deletes
To capture updates or deleted rows from your database, consider using dlt's [Postgres CDC replication feature](../../dlt-ecosystem/verified-sources/pg_replication), which is
useful for tracking changes and deletions in the data.

### Sync multiple tables in parallel
To sync multiple tables in parallel, map each table copy job to a separate container using [Modal.starmap](https://modal.com/docs/reference/modal.Function#starmap):

```py
@app.function(timeout=3000, schedule=modal.Cron("29 11 * * *"))
def main(dev: bool = False):
tables = [
("task", "enqueued_at", dev),
("worker", "launched_at", dev),
...
]
list(load_table_from_database.starmap(tables))
```
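Modal's `Function.starmap` fans each argument tuple out to its own container and unpacks it into the function's parameters. The call semantics mirror the standard library's `itertools.starmap`, sketched here sequentially with a stand-in for the (hypothetical) `load_table_from_database` function:

```python
from itertools import starmap


def load_table_from_database(table, cursor_column, dev):
    # stand-in for the Modal Function in the example above; here it
    # just reports which table copy job it would run
    return f"loaded {table} (cursor: {cursor_column}, dev={dev})"


tables = [
    ("task", "enqueued_at", False),
    ("worker", "launched_at", False),
]

# Modal would run each tuple in a separate container in parallel;
# itertools.starmap shows the same tuple-unpacking in-process
results = list(starmap(load_table_from_database, tables))
print(results)
```

Wrapping the call in `list(...)`, as the example above does, drains the generator so every table copy job actually runs to completion.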
1 change: 1 addition & 0 deletions docs/website/sidebars.js
Expand Up @@ -282,6 +282,7 @@ const sidebars = {
'walkthroughs/deploy-a-pipeline/deploy-with-kestra',
'walkthroughs/deploy-a-pipeline/deploy-with-dagster',
'walkthroughs/deploy-a-pipeline/deploy-with-prefect',
'walkthroughs/deploy-a-pipeline/deploy-with-modal',
]
},
{
Expand Down