Added deploy with modal. #1805
base: devel
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
### Capturing deletes
One limitation of our simple approach above is that it does not capture updates or deletions of data. This isn’t a hard requirement yet for our use cases, but it appears that `dlt` does have a [Postgres CDC replication feature](https://dlthub.com/docs/dlt-ecosystem/verified-sources/pg_replication) that we are considering.
Please use relative links for the pages in the docs. E.g. ./dlt-ecosystem/...
Thanks @burnash. Updated the link. One thing, though: the doc is not showing in the GitHub deploy preview here, but when serving locally with npm it shows fine.
Force-pushed from 0333c54 to 8a49dce.
Very good content, @dat-a-man. I've added some suggestions to improve the style.
## Introduction to Modal
[Modal](https://modal.com/blog/analytics-stack) is a serverless platform designed for developers. It allows you to run and deploy code in the cloud without managing infrastructure.
I think the link from Modal should go to https://modal.com/. I can see that the blog post is already linked from another section below.
Yes, thanks! Corrected.
## Building Data Pipelines with `dlt`
**`dlt`** is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
Suggested change:
dlt is an open-source Python library that allows you to declaratively load data sources into well-structured tables or datasets. It does this through automatic schema inference and evolution. The library simplifies building data pipelines by providing functionality to support the entire extract and load process.
Let's tone down the formatting here.
### How does `dlt` integrate with Modal for pipeline orchestration? |
Suggested change:
### How does dlt integrate with Modal for pipeline orchestration?
Throughout the docs, please use plain "dlt" (no backticks) when referring to dlt as a project. Use backticks only when referring to `dlt` as code (e.g. the `dlt` Python module in a script, or the `dlt` command in a command-line context).
Done, thanks!
To know more, please refer to [Modal's documentation](https://modal.com/docs).
## Building Data Pipelines with `dlt` |
Suggested change:
## Building data pipelines with dlt
Throughout the docs, please use sentence-case capitalization.
Noted, thanks!
### How does `dlt` integrate with Modal for pipeline orchestration?
To illustrate setting up a pipeline in Modal, we’ll be using the following example: [Building a cost-effective analytics stack with Modal, dlt, and dbt](https://modal.com/blog/analytics-stack).
Suggested change:
As an example of how to set up a pipeline in Modal, we'll use the [Building a cost-effective analytics stack with Modal, dlt, and dbt](https://modal.com/blog/analytics-stack) case study.
Done, thanks!
docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal.md
Force-pushed from 2ee3eab to e48f641.
Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:
1. Run the `dlt` SQL database setup to initialize their `sql_database_pipeline.py` template:
Suggested change:
1. Run the `dlt init` CLI command to initialize the SQL database source and set up the `sql_database_pipeline.py` template:
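For reference, the suggested `dlt init` step corresponds to a command like the following. This is a sketch, assuming the dlt CLI is installed and that the source/destination names match dlt's verified-sources docs:

```shell
# Install dlt with the Snowflake extra, then scaffold the SQL database source.
# This creates sql_database_pipeline.py plus a .dlt/ folder for config and secrets.
pip install "dlt[snowflake]"
dlt init sql_database snowflake
```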
updated
Thank you! However, I don't see these changes on GitHub. Is there a chance you haven't pushed the updates to GitHub?
## How to run dlt on Modal
Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:
Suggested change:
Here’s a dlt project set up to copy data from our Postgres read replica into Snowflake:
updated
As an example of how to set up a pipeline in Modal, we'll use the [Building a cost-effective analytics stack with Modal, dlt, and dbt](https://modal.com/blog/analytics-stack) case study.
The example demonstrates automating a workflow to load data from Postgres to Snowflake using `dlt`.
Suggested change:
The example demonstrates automating a workflow to load data from Postgres to Snowflake using dlt.
done.
Hi @dat-a-man, thanks for the updates. Please see my review comments.
Here’s our `dlt` setup copying data from our Postgres read replica into Snowflake:
1. Run the `dlt` SQL database setup to initialize their `sql_database_pipeline.py` template:
It's also not clear what we do with `sql_database_pipeline.py`. Are we discarding it, or are we adding the code below to it?
updated
```py
app = modal.App("dlt-postgres-pipeline", image=image)
```
3. Wrap the provided `load_table_from_database` with the Modal Function decorator, Modal Secrets containing your database credentials, and a daily cron schedule.
If we take `load_table_from_database` from `sql_database_pipeline.py`, we should note that. Otherwise it may be unclear.
added the context
```py
pass
```
4. Write your `dlt` pipeline:
Where should the user put the code from this section? Does it still go to `sql_database_pipeline.py`?
It goes to `sql_database_pipeline.py`.
4. Write your `dlt` pipeline:
```py
# Modal Secrets are loaded as environment variables which are used here to create the SQLAlchemy connection string
pg_url = f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}@localhost:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}'
```
Use the dlt-native way to configure the connection with environment variables: https://dlthub.com/docs/devel/general-usage/credentials/setup#environment-variables. That should eliminate the need for manual connection-string construction and the usage of `ConnectionStringCredentials`.
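A stdlib-only sketch of what that looks like in practice. The variable name follows dlt's `SOURCES__<SOURCE_NAME>__CREDENTIALS` convention from the linked page, and the connection string is a placeholder:

```python
import os

# In Modal, this value would come from a Modal Secret mounted as an env var.
# dlt reads SOURCES__SQL_DATABASE__CREDENTIALS automatically when the
# sql_database source is resolved, so no ConnectionStringCredentials and no
# manual f-string construction are needed.
os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = (
    "postgresql://user:password@localhost:5432/mydb"  # placeholder credentials
)

# dlt's env-var config provider will now pick up the credentials on its own.
print(os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"])
```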
I added a note about this in step 3; I tested it, too, and it worked for source creds.
hey original author here :). are you saying it's better practice to define the sql connection string as a single env variable and then reassign the env variable in the pipeline? e.g.
- Set a Modal secret like `POSTGRES_CREDENTIAL_STRING = 'postgresql://sdfsd:sdlfkj'` (this gets mounted as an env variable)
- In the pipeline, call `os.environ["TASK_SOURCES__SQL_DATABASE__CREDENTIALS"] = os.environ["POSTGRES_CREDENTIAL_STRING"]`?
hey @kning! I would say it's a matter of taste: if you prefer a connection string, use it; if not, don't; dlt supports both. In this example, I think, Anton wants to reduce the amount of code and unnecessary manipulations. For example, in this case you can avoid `credentials = ConnectionStringCredentials(pg_url)` and `destination=dlt.destinations.snowflake(snowflake_url)`.
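The two options from this thread can be put side by side in a small stdlib-only sketch. The env var names and connection-string values here are placeholders taken from the comments above, not code from the PR:

```python
import os

# Option A: a Modal Secret mounts one env var with the full connection string,
# which is then re-exported under the name dlt's config resolution looks for.
os.environ["POSTGRES_CREDENTIAL_STRING"] = "postgresql://user:pw@host:5432/db"  # placeholder
os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"] = os.environ["POSTGRES_CREDENTIAL_STRING"]

# Option B: mount the individual PG* variables and build the string manually,
# as in the original pipeline code.
os.environ.update({"PGUSER": "user", "PGPASSWORD": "pw", "PGHOST": "host",
                   "PGPORT": "5432", "PGDATABASE": "db"})  # placeholders
pg_url = (f'postgresql://{os.environ["PGUSER"]}:{os.environ["PGPASSWORD"]}'
          f'@{os.environ["PGHOST"]}:{os.environ["PGPORT"]}/{os.environ["PGDATABASE"]}')

# Both styles produce the same connection string; dlt supports either.
assert pg_url == os.environ["SOURCES__SQL_DATABASE__CREDENTIALS"]
```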
```py
info = pipeline.run(source_1, write_disposition="merge")
print(info)
```
Looks like the next step is missing: how does this code end up on Modal? How do you trigger runs?
added step 5
This runs the pipeline once, but it might be worth adding that you need to run `modal deploy` to actually schedule the pipeline.
the more i think about it actually, maybe it makes sense to write a really pared-down example for this space that is runnable end-to-end for the user (e.g. using duckdb), and to link out to our blog post for a "real-world example". happy to help contribute a pared-down example.
here's a simpler gist that should just work if you run it. i think this section will have better engagement if the user can simply copy-paste a script and it works immediately; we can adapt this to your docs style and perhaps just link out to the original blog post as a more detailed, real-world example of dlt (i also need to update that one to be compatible with 1.1.0). lmk what you think! also happy to chat, i know i've shared a lot of info here haha. https://gist.github.com/kning/6a2af9e08ebaad0e486968f98c1939be
@kning hey! Thanks for your thoughts here; your idea with testing is great! We actually practice this: you can find here some getting-started snippets that we test on every CI/CD run. We can also add your gist to our testing process; we just need to understand what we call a successful run and which command to run.
also checked out the snippets; are those ever surfaced in the docs? i guess i'd expect them to be synced with the snippets on this page, but it looks different.
It shouldn't be a problem.
We changed our docs significantly recently; the getting started page was removed and replaced with an intro. You can find a relevant example here: doc and snippets.
i see. how do you think we should move forward then with the modal snippet? ideally i'd like to see a "deploy with modal" page that explains how to create a modal account, plus the runnable code snippet (which should also be run regularly somehow to ensure that it's correct), and finally a link to the blog post for a "real-world" example. but i guess from what i understand, the code in the docs page and the CI/CD snippets are managed separately?
@dat-a-man will do that: he will create a snippet file with the example and use tags to ingest this snippet into the doc page; there we will run the modal command to test it.
amazing, looks way cleaner now, thanks!
Makefile
```diff
@@ -65,6 +65,7 @@ lint-and-test-snippets:
 	poetry run mypy --config-file mypy.ini docs/website docs/examples docs/tools --exclude docs/tools/lint_setup --exclude docs/website/docs_processed
 	poetry run flake8 --max-line-length=200 docs/website docs/examples docs/tools
 	cd docs/website/docs && poetry run pytest --ignore=node_modules
+	modal run docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py
```
Suggested change:
modal run docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-modal-snippets.py
Description
Added deploy with modal.