Added deploy with Modal. #1805
base: devel
@@ -0,0 +1,57 @@
import os
from tests.pipeline.utils import assert_load_info


def modal_snippet() -> None:
    # @@@DLT_SNIPPET_START modal_image
    import modal
    import os

    # Define the Modal Image
    image = modal.Image.debian_slim().pip_install(
        "dlt>=1.1.0",
        "dlt[duckdb]",  # destination
        "dlt[sql_database]",  # source (postgres)
        "pymysql",  # database driver for MySQL source
    )

    app = modal.App("example-dlt", image=image)

    # Modal Volume used to store the duckdb database file
    vol = modal.Volume.from_name("duckdb-vol", create_if_missing=True)
    # @@@DLT_SNIPPET_END modal_image

    # @@@DLT_SNIPPET_START modal_function
    @app.function(
        volumes={"/data/": vol},
        schedule=modal.Period(days=1),
        secrets=[modal.Secret.from_name("sql-secret")],
    )
    def load_tables():
        import dlt
        from dlt.sources.sql_database import sql_database

        # Define the source database credentials; in production, save them as a Modal Secret
        # and reference them here as an environment variable:
        # os.environ['SOURCES__SQL_DATABASE__CREDENTIALS'] = "mysql+pymysql://[email protected]:4497/Rfam"

        # Load tables "family" and "genome"
        source = sql_database().with_resources("family", "genome")

        # Create the dlt pipeline object
        pipeline = dlt.pipeline(
            pipeline_name="sql_to_duckdb_pipeline",
            destination=dlt.destinations.duckdb(
                "/data/rfam.duckdb"
            ),  # write the duckdb database file to this location, which gets mounted to the Modal Volume
            dataset_name="sql_to_duckdb_pipeline_data",
            progress="log",  # output progress of the pipeline
        )

        # Run the pipeline
        load_info = pipeline.run(source)

        # Print run statistics
        print(load_info)
        # @@@DLT_SNIPPET_END modal_function

        assert_load_info(load_info)
@@ -0,0 +1,111 @@
---
title: Deploy with Modal
description: How to deploy a pipeline with Modal
keywords: [how to, deploy a pipeline, Modal]
canonical: https://modal.com/blog/analytics-stack
---

# Deploy with Modal

## Introduction to Modal

[Modal](https://modal.com/) is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.

With Modal, you can perform tasks like running generative models, large-scale batch jobs, and job queues, all while easily scaling compute resources.

### Modal features

- Serverless compute: no infrastructure management; scales automatically from zero to thousands of CPUs/GPUs.
- Cloud functions: run Python code in the cloud instantly and scale it horizontally.
- GPU/CPU scaling: attach GPUs for heavy tasks like AI model training with a single line of code.
- Web endpoints: expose any function as an HTTPS API endpoint quickly.
- Scheduled jobs: convert Python functions into scheduled tasks effortlessly.

To learn more, please refer to [Modal's documentation](https://modal.com/docs).
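
To make these features concrete, here is a minimal, illustrative sketch of a Modal app (the app and function names are arbitrary) that runs a Python function in the cloud on a schedule:

```py
import modal

app = modal.App("feature-demo")


@app.function(schedule=modal.Period(hours=1))
def say_hello() -> None:
    # Runs in a fresh Modal container every hour once the app is deployed.
    print("Hello from Modal!")
```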

## Building data pipelines with dlt

dlt is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
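
For a quick sense of the dlt API, here is a minimal sketch (pipeline, dataset, and table names are arbitrary) that loads a list of Python dicts into DuckDB, with the schema inferred automatically:

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="quick_start",
    destination="duckdb",
    dataset_name="example_data",
)

# Column names and types are inferred from the data on run().
load_info = pipeline.run(
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    table_name="users",
)
print(load_info)
```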

### How does dlt integrate with Modal for pipeline orchestration?

As an example of how to set up a pipeline on Modal, we'll use the [Building a cost-effective analytics stack with Modal, dlt, and dbt](https://modal.com/blog/analytics-stack) case study.

The example demonstrates automating a workflow to load data from Postgres to Snowflake using dlt.

## How to run dlt on Modal

Here's a dlt project set up to copy data from a MySQL database into DuckDB:

1. Run the `dlt init` CLI command to initialize the SQL database source and set up the `sql_database_pipeline.py` template.
   ```sh
   dlt init sql_database duckdb
   ```

2. Open the file and define the Modal Image you want to run `dlt` in:

   <!--@@@DLT_SNIPPET modal_image-->

3. Define a Modal Function. A Modal Function is a containerized environment that runs tasks.
   It can be scheduled (e.g., daily or on a cron schedule), request more CPU/memory, and scale across
   multiple containers.

   Here's how to include your SQL pipeline in the Modal Function:

   <!--@@@DLT_SNIPPET modal_function-->

4. You can securely store your credentials using Modal Secrets. When you reference a secret within a Modal script,
   it is automatically set as an environment variable. dlt natively supports environment variables,
   enabling seamless integration of your credentials. For example, to declare a connection string, you can define it as follows:
   ```text
   SOURCES__SQL_DATABASE__CREDENTIALS=mysql+pymysql://[email protected]:4497/Rfam
   ```
   Credentials declared this way are automatically picked up by dlt.
   For more details, please refer to the [documentation](../../general-usage/credentials/setup#environment-variables).
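   To create the `sql-secret` referenced in the function above, a minimal sketch using the Modal CLI (the connection-string values here are placeholders):
   ```sh
   modal secret create sql-secret \
       SOURCES__SQL_DATABASE__CREDENTIALS="mysql+pymysql://user:password@host:4497/Rfam"
   ```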

5. Execute the pipeline once.
   To run your pipeline a single time, use the following command:
   ```sh
   modal run sql_pipeline.py
   ```

6. Deploy the pipeline.
   If you want to deploy your pipeline on Modal for continuous execution or scheduling, use this command:
   ```sh
   modal deploy sql_pipeline.py
   ```
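
   Once deployed, the `schedule` defined on the Modal Function (daily, in the snippet above) triggers runs automatically. As a rough sketch (assuming a recent `modal` client and the app/function names `example-dlt` / `load_tables` from the snippet above), you can also trigger an ad-hoc run of the deployed function from Python:
   ```py
   import modal

   # Look up the deployed function and invoke it remotely (it runs on Modal, not locally).
   load_tables = modal.Function.from_name("example-dlt", "load_tables")
   load_tables.remote()
   ```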

## Advanced configuration

### Modal Proxy

If your database is only reachable through a private VPN, you can use [Modal Proxy](https://modal.com/docs/reference/modal.Proxy) as a bastion server (available to Enterprise customers).
To connect to a production read replica, attach the proxy to the function definition and change the hostname to localhost:
```py
@app.function(
    secrets=[
        modal.Secret.from_name("snowflake-secret"),
        modal.Secret.from_name("postgres-read-replica-prod"),
    ],
    schedule=modal.Cron("24 6 * * *"),
    proxy=modal.Proxy.from_name("prod-postgres-proxy", environment_name="main"),
    timeout=3000,
)
def task_pipeline(dev: bool = False) -> None:
    pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
```

### Capturing deletes

To capture updates or deleted rows from your database, consider using dlt's [Postgres CDC replication feature](../../dlt-ecosystem/verified-sources/pg_replication), which is useful for tracking changes and deletions in the data.

### Sync multiple tables in parallel

To sync multiple tables in parallel, map each table copy job to a separate container using [Modal.starmap](https://modal.com/docs/reference/modal.Function#starmap):

```py
@app.function(timeout=3000, schedule=modal.Cron("29 11 * * *"))
def main(dev: bool = False):
    tables = [
        ("task", "enqueued_at", dev),
        ("worker", "launched_at", dev),
        ...
    ]
    list(load_table_from_database.starmap(tables))
```
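
The `load_table_from_database` function is not shown in the case study excerpt above. Purely as an illustration, a hypothetical per-table Modal Function built with dlt might look like this (the secret name, cursor-column handling, and DuckDB destination are assumptions carried over from the earlier snippet, not the case study's actual implementation):

```py
@app.function(timeout=3000, secrets=[modal.Secret.from_name("sql-secret")])
def load_table_from_database(table: str, cursor_column: str, dev: bool = False) -> None:
    # Hypothetical per-table load job; starmap runs one container per table tuple.
    # `dev` is carried through from the tuple layout above but unused in this sketch.
    import dlt
    from dlt.sources.sql_database import sql_database

    source = sql_database().with_resources(table)
    # Use the cursor column for incremental loading (an assumption based on the tuples above).
    source.resources[table].apply_hints(incremental=dlt.sources.incremental(cursor_column))

    pipeline = dlt.pipeline(
        pipeline_name=f"sql_to_duckdb_{table}",
        destination=dlt.destinations.duckdb("/data/rfam.duckdb"),
        dataset_name="sql_to_duckdb_pipeline_data",
    )
    print(pipeline.run(source))
```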