Added deploy with modal. #1805

Open

dat-a-man wants to merge 30 commits into devel from docs/how-to-deploy-using-modal

Changes from 3 commits

Commits (30)
3d327f2
Added deploy with modal.
dat-a-man Sep 13, 2024
8a49dce
A few minor fixes
dat-a-man Sep 13, 2024
87a1045
updated links as per comment
dat-a-man Sep 16, 2024
e1c8cbd
Updated as per the comments.
dat-a-man Sep 16, 2024
6bca836
Update docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-m…
burnash Sep 16, 2024
9da1b27
Updated
dat-a-man Sep 23, 2024
1ad1ee9
Merge branch 'docs/how-to-deploy-using-modal' of https://github.com/d…
dat-a-man Sep 23, 2024
48674f3
Updated as per comments
dat-a-man Sep 23, 2024
ba8cdbe
Updated
dat-a-man Sep 23, 2024
3160477
minor fix for relative link
dat-a-man Sep 23, 2024
177ac20
Merge branch 'devel' into docs/how-to-deploy-using-modal
dat-a-man Oct 4, 2024
fd225f9
Incorporated comments and new script provided.
dat-a-man Oct 7, 2024
ebbd06e
Added the snippets
dat-a-man Oct 9, 2024
2d27c3f
Updated
dat-a-man Oct 9, 2024
c783772
Updated
dat-a-man Oct 9, 2024
a87d1b5
Merge branch 'devel' into docs/how-to-deploy-using-modal
dat-a-man Oct 9, 2024
13bcf7d
updated poetry.lock
dat-a-man Oct 9, 2024
e397ff6
Merge branch 'docs/how-to-deploy-using-modal' of https://github.com/d…
dat-a-man Oct 9, 2024
cc8d5ae
Updated "poetry.lock"
dat-a-man Oct 9, 2024
71efdaa
Added "__init__.py"
dat-a-man Oct 9, 2024
cf1a092
Updated snippets.py
dat-a-man Oct 9, 2024
de31e7d
Updated path in MAKEFILE
dat-a-man Oct 9, 2024
cd04716
Added __init__.py in walkthroughs
dat-a-man Oct 10, 2024
1a2d744
Adjusted for black
dat-a-man Oct 10, 2024
d1fbc18
Modified mypy.ini added a pattern module_name_pattern = '[a-zA-Z0-9_\…
dat-a-man Oct 10, 2024
71ab82f
updated
dat-a-man Oct 10, 2024
9d0c70b
renamed deploy-a-pipeline with deploy_a_pipeline
dat-a-man Oct 10, 2024
4209656
Merge remote-tracking branch 'origin/devel' into docs/how-to-deploy-u…
dat-a-man Oct 11, 2024
1841007
Updated for errors in linting
dat-a-man Oct 11, 2024
e5d9a30
small changes
dat-a-man Oct 11, 2024
154 changes: 154 additions & 0 deletions docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md
@@ -0,0 +1,154 @@
---
title: Deploy with Modal
description: How to deploy a pipeline with Modal
keywords: [how to, deploy a pipeline, Modal]
canonical: https://modal.com/blog/analytics-stack
---

# Deploy with Modal

## Introduction to Modal

[Modal](https://modal.com/blog/analytics-stack) is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.
Collaborator
I think the link from Modal should go to https://modal.com/. I can see that the blog post is already linked from another section below.

Collaborator Author
@dat-a-man dat-a-man Sep 16, 2024
Yes, thanks! Corrected.


With Modal, you can perform tasks like running generative models, large-scale batch jobs, and job queues, all while easily scaling compute resources.

### Modal features

- Serverless Compute: No infrastructure management; scales automatically from zero to thousands of CPUs/GPUs.
- Cloud Functions: Run Python code in the cloud instantly and scale horizontally.
- GPU/CPU Scaling: Easily attach GPUs for heavy tasks like AI model training with a single line of code.
- Web Endpoints: Expose any function as an HTTPS API endpoint quickly.
- Scheduled Jobs: Convert Python functions into scheduled tasks effortlessly.

To learn more, please refer to [Modal's documentation](https://modal.com/docs).
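
For illustration, a minimal Modal app with one scheduled function might look like the sketch below (the app name, function, and schedule are hypothetical and are not part of the example that follows):

```py
import modal

# Hypothetical app used only to illustrate Modal's function and scheduling primitives
app = modal.App("hello-modal")


@app.function(schedule=modal.Cron("0 6 * * *"))  # run every day at 06:00 UTC
def say_hello() -> None:
    print("Hello from Modal!")
```

Deploying a file containing such an app with `modal deploy` registers the cron schedule with Modal.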

## Building Data Pipelines with `dlt`
Collaborator
Suggested change
## Building Data Pipelines with `dlt`
## Building data pipelines with dlt
Throughout the docs, please use sentence-case capitalization.

Collaborator Author
Noted, thanks!


**`dlt`** is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
Collaborator
Suggested change
**`dlt`** is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
dlt is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
Let's tone down the formatting here.


### How does `dlt` integrate with Modal for pipeline orchestration?
Collaborator
Suggested change
### How does `dlt` integrate with Modal for pipeline orchestration?
### How does dlt integrate with Modal for pipeline orchestration?
Throughout the docs, please use plain "dlt" (no backticks) when referring to dlt as a project. Use backticks only when referring to dlt as code (e.g. the dlt Python module in a script, or the dlt command in the context of the command line).

Collaborator Author
Done, thanks!


To illustrate setting up a pipeline in Modal, we’ll be using the following example: [Building a cost-effective analytics stack with Modal, dlt, and dbt.](https://modal.com/blog/analytics-stack)
Collaborator
Suggested change
To illustrate setting up a pipeline in Modal, we’ll be using the following example: [Building a cost-effective analytics stack with Modal, dlt, and dbt.](https://modal.com/blog/analytics-stack)
As an example of how to set up a pipeline in Modal, we'll use the [building a cost-effective analytics stack with Modal, dlt, and dbt](https://modal.com/blog/analytics-stack) case study.

Collaborator Author
Done, thanks!


The example demonstrates automating a workflow to load data from Postgres to Snowflake using `dlt`.
Collaborator
Suggested change
The example demonstrates automating a workflow to load data from Postgres to Snowflake using `dlt`.
The example demonstrates automating a workflow to load data from Postgres to Snowflake using dlt.

Collaborator Author
Done.


## How to run `dlt` on Modal
burnash marked this conversation as resolved.

Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:
Collaborator
Suggested change
Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:
Here’s a dlt project set up to copy data from our Postgres read replica into Snowflake:

Collaborator Author
Updated.


1. Run the `dlt` SQL database setup to initialize their `sql_database_pipeline.py` template:
Collaborator
@burnash burnash Sep 18, 2024
Suggested change
1. Run the `dlt` SQL database setup to initialize their `sql_database_pipeline.py` template:
1. Run the `dlt init` CLI command to initialize the SQL database source and set up the `sql_database_pipeline.py` template:

Collaborator Author
Updated.

Collaborator
Thank you! However, I don't see these changes on GitHub. Is there a chance you haven't pushed the updates to GitHub?

Collaborator
It's also not clear what we do with sql_database_pipeline.py. Are we discarding it, or are we adding the code below to sql_database_pipeline.py?

Collaborator Author
Updated.

```sh
dlt init sql_database snowflake
```
2. Open the file and define the Modal Image you want to run `dlt` in:
```py
import dlt
import pendulum
from sql_database import sql_database, ConnectionStringCredentials, sql_table

import modal
import os

image = (
    modal.Image.debian_slim()
    .apt_install(["libpq-dev"])  # system requirement for the Postgres driver
    .pip_install(
        "sqlalchemy>=1.4",  # how `dlt` establishes connections
        "dlt[snowflake]>=0.4.11",
        "psycopg2-binary",  # Postgres driver
        "dlt[parquet]",
        "psutil==6.0.0",  # for `dlt` logging
        "connectorx",  # creates arrow tables from the database for fast data extraction
    )
)

app = modal.App("dlt-postgres-pipeline", image=image)
```

This should probably be updated for 1.1.0, right? I.e. `from dlt.sources.sql_database import sql_database` (or `from dlt.sources.sql_database import *`).

3. Wrap the provided `load_table_from_database` with the Modal Function decorator, Modal Secrets containing your database credentials, and a daily cron schedule
Collaborator
If we take load_table_from_database from sql_database_pipeline.py, we should note that; otherwise it may be unclear.

Collaborator Author
Added the context.

```py
# Function to load the table from the database, scheduled to run daily
@app.function(
    secrets=[
        modal.Secret.from_name("snowflake-secret"),
        modal.Secret.from_name("postgres-read-replica-prod"),
    ],
    # run this pipeline daily at 6:24 AM
    schedule=modal.Cron("24 6 * * *"),
    timeout=3000,
)
def load_table_from_database(
    table: str,
    incremental_col: str,
    dev: bool = False,
) -> None:
    # Placeholder for future logic
    pass
```
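
For context, the named Secrets referenced above are created in Modal ahead of time. One way to do that from the CLI might look like the following sketch (the key names mirror the environment variables used in the next step and are assumptions, not taken from the case study):

```sh
# Hypothetical key/value pairs; replace the placeholders with real credentials
modal secret create postgres-read-replica-prod \
    PGUSER=... PGPASSWORD=... PGPORT=5432 PGDATABASE=...
modal secret create snowflake-secret \
    SNOWFLAKE_USER=... SNOWFLAKE_PASSWORD=... SNOWFLAKE_ACCOUNT=... SNOWFLAKE_DATABASE=...
```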

4. Write your `dlt` pipeline:
Collaborator
Where should the user put the code from this section? Does it still go to sql_database_pipeline.py?

Collaborator Author
It goes to sql_database_pipeline.py.

```py
# Modal Secrets are loaded as environment variables, which are used here to create the SQLAlchemy connection strings
pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
snowflake_url = f'snowflake://{os.environ["SNOWFLAKE_USER"]}:{os.environ["SNOWFLAKE_PASSWORD"]}@{os.environ["SNOWFLAKE_ACCOUNT"]}/{os.environ["SNOWFLAKE_DATABASE"]}'

# Create a pipeline
schema = "POSTGRES_DLT_DEV" if dev else "POSTGRES_DLT"
pipeline = dlt.pipeline(
    pipeline_name="task",
    destination=dlt.destinations.snowflake(snowflake_url),
    dataset_name=schema,
    progress="log",
)
credentials = ConnectionStringCredentials(pg_url)

# defines the postgres table to sync (in this case, the "task" table)
source_1 = sql_database(credentials, backend="connectorx").with_resources("task")

# defines which column to reference for incremental loading (i.e. only load newer rows)
source_1.task.apply_hints(
    incremental=dlt.sources.incremental(
        "enqueued_at",
        initial_value=pendulum.datetime(2024, 7, 24, 0, 0, 0, tz="UTC"),
    )
)

# if there are duplicates, merge the latest values
info = pipeline.run(source_1, write_disposition="merge")
print(info)
```

Collaborator
Use the dlt-native way to configure the connection with environment variables: https://dlthub.com/docs/devel/general-usage/credentials/setup#environment-variables. That should eliminate the need for manual connection string construction and the use of ConnectionStringCredentials.

Collaborator Author
I added a note about this in step 3; I tested it, too, and it worked for source creds.

@kning
Hey, original author here :). Are you saying it's better practice to define the SQL connection string as a single env variable and then reassign the env variable in the pipeline? E.g.:
  1. Set a Modal Secret like POSTGRES_CREDENTIAL_STRING = 'postgresql://sdfsd:sdlfkj' (this gets mounted as an env variable).
  2. In the pipeline, call os.environ["TASK_SOURCES__SQL_DATABASE__CREDENTIALS"] = os.environ["POSTGRES_CREDENTIAL_STRING"]?

Contributor
@AstrakhantsevaAA AstrakhantsevaAA Oct 3, 2024
Hey @kning! I would say it's a matter of taste: if you prefer a connection string, use it; if not, don't; dlt supports both. In this example, I think Anton wants to reduce the amount of code and unnecessary manipulation. For example, in this case you can avoid `credentials = ConnectionStringCredentials(pg_url)` and `destination=dlt.destinations.snowflake(snowflake_url),`.
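
Following that discussion, a minimal sketch of the environment-variable-based configuration could look like this (the POSTGRES_CREDENTIAL_STRING / SNOWFLAKE_CREDENTIAL_STRING secret keys and the newer import path are assumptions, not part of the case study code):

```py
import os

import dlt
from dlt.sources.sql_database import sql_database  # import path for dlt >= 1.0 (assumption)

# Map the connection strings mounted by the Modal Secrets onto the config keys dlt looks up,
# so no manual ConnectionStringCredentials construction is needed.
os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = os.environ["POSTGRES_CREDENTIAL_STRING"]
os.environ["DESTINATION__SNOWFLAKE__CREDENTIALS"] = os.environ["SNOWFLAKE_CREDENTIAL_STRING"]

pipeline = dlt.pipeline(
    pipeline_name="task",
    destination="snowflake",
    dataset_name="POSTGRES_DLT",
    progress="log",
)
source = sql_database(backend="connectorx").with_resources("task")
print(pipeline.run(source, write_disposition="merge"))
```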
Collaborator
It looks like the next step is missing: how does this code end up on Modal? How do you trigger runs?

Collaborator Author
Added step 5.

This runs the pipeline once, but it might be worth adding that you need to run modal deploy to actually schedule the pipeline.
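
For reference, a sketch of those commands, assuming the code lives in sql_database_pipeline.py:

```sh
# One-off run from your machine (if the app defines several functions, target one with ::function_name)
modal run sql_database_pipeline.py

# Deploy the app so the modal.Cron schedule keeps triggering it
modal deploy sql_database_pipeline.py
```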

## Advanced configuration
### Modal Proxy
If your database is in a private VPN, you can use [Modal Proxy](https://modal.com/docs/reference/modal.Proxy) as a bastion server (only available to Enterprise customers). We use Modal Proxy to connect to our production read replica by attaching it to the Function definition and changing the hostname to localhost:
dat-a-man marked this conversation as resolved.
```py
@app.function(
    secrets=[
        modal.Secret.from_name("snowflake-secret"),
        modal.Secret.from_name("postgres-read-replica-prod"),
    ],
    schedule=modal.Cron("24 6 * * *"),
    proxy=modal.Proxy.from_name("prod-postgres-proxy", environment_name="main"),
    timeout=3000,
)
def task_pipeline(dev: bool = False) -> None:
    pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
```
### Capturing deletes
One limitation of our simple approach above is that it does not capture updates or deletions of data. This isn’t a hard requirement yet for our use cases, but it appears that `dlt` does have a [Postgres CDC replication feature](../../dlt-ecosystem/verified-sources/pg_replication) that we are considering.
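For reference, that verified source would be initialized the same way as any other dlt source; a sketch (check the linked pg_replication docs for the current source name and options):
```sh
dlt init pg_replication snowflake
```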
### Scaling out
The example above syncs one table from our Postgres data source. In practice, we are syncing multiple tables and mapping each table copy job to a single container using [Modal.starmap](https://modal.com/docs/reference/modal.Function#starmap):
```py
@app.function(timeout=3000, schedule=modal.Cron("29 11 * * *"))
def main(dev: bool = False):
    tables = [
        ("task", "enqueued_at", dev),
        ("worker", "launched_at", dev),
        ...
    ]
    list(load_table_from_database.starmap(tables))
```
1 change: 1 addition & 0 deletions docs/website/sidebars.js
@@ -283,6 +283,7 @@ const sidebars = {
'walkthroughs/deploy-a-pipeline/deploy-with-kestra',
'walkthroughs/deploy-a-pipeline/deploy-with-dagster',
'walkthroughs/deploy-a-pipeline/deploy-with-prefect',
'walkthroughs/deploy-a-pipeline/deploy-with-modal',
]
},
{