Commit

Merge branch 'current' into sl-exports
mirnawong1 authored Jan 24, 2024
2 parents f65f6ec + cb91db7 commit ba5f0e8
Showing 194 changed files with 2,218 additions and 839 deletions.
17 changes: 7 additions & 10 deletions .github/pull_request_template.md
@@ -1,26 +1,23 @@
## What are you changing in this pull request and why?
<!---
Describe your changes and why you're making them. If linked to an open
Describe your changes and why you're making them. If related to an open
issue or a pull request on dbt Core, then link to them here!
To learn more about the writing conventions used in the dbt Labs docs, see the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md).
-->

## Checklist
<!--
Uncomment if you're publishing docs for a prerelease version of dbt (delete if not applicable):
Uncomment when publishing docs for a prerelease version of dbt:
- [ ] Add versioning components, as described in [Versioning Docs](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-entire-pages)
- [ ] Add a note to the prerelease version [Migration Guide](https://github.com/dbt-labs/docs.getdbt.com/tree/current/website/docs/docs/dbt-versions/core-upgrade)
-->
- [ ] Review the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) so my content adheres to these guidelines.
- [ ] For [docs versioning](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#about-versioning), review how to [version a whole page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version) and [version a block of content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content).
- [ ] Add a checklist item for anything that needs to happen before this PR is merged, such as "needs technical review" or "change base branch."

Adding new pages (delete if not applicable):
- [ ] Add page to `website/sidebars.js`
- [ ] Provide a unique filename for the new page

Removing or renaming existing pages (delete if not applicable):
- [ ] Remove page from `website/sidebars.js`
- [ ] Add an entry `website/static/_redirects`
- [ ] Run link testing locally with `npm run build` to update the links that point to the deleted page
Adding or removing pages (delete if not applicable):
- [ ] Add/remove page in `website/sidebars.js`
- [ ] Provide a unique filename for new pages
- [ ] Add an entry for deleted pages in `website/static/_redirects`
- [ ] Run link testing locally with `npm run build` to update the links that point to deleted pages
10 changes: 10 additions & 0 deletions contributing/content-style-guide.md
@@ -519,6 +519,7 @@ enter (in the command line) | type (in the command line)
email | e-mail
on dbt | on a remote server
person, human | client, customer
plan(s), account | organization, customer
press (a key) | hit, tap
recommended limit | soft limit
sign in | log in, login
@@ -529,6 +530,15 @@ dbt Cloud CLI | CLI, dbt CLI
dbt Core | CLI, dbt CLI
</div></b>

Note: Address readers directly in the second person to keep them close to the content and documentation.

For example, to explain that a feature is available on a particular dbt Cloud plan, you can use:
- "XYZ is available on Enterprise plans."
- "If you're on an Enterprise plan, you can access XYZ."
- "Enterprise plans can access XYZ."

This will signal users to check their plan or account status independently.

## Links

Links embedded in the documentation are about trust. Users trust that we will lead them to sites or pages related to their reading content. In order to maintain that trust, it's important that links are transparent, up-to-date, and lead to legitimate resources.
2 changes: 1 addition & 1 deletion contributing/developer-blog.md
@@ -6,7 +6,7 @@

The dbt Developer Blog is a place where analytics practitioners can go to share their knowledge with the community. Analytics Engineering is a discipline we’re all building together. The developer blog exists to cultivate the collective knowledge that exists on how to build and scale effective data teams.

We currently have editorial capacity for 10 Community contributed developer blogs per quarter - if we are oversubscribed we suggest you post on another platform or hold off until the editorial team is ready to take on more posts.
We currently have editorial capacity for a few Community contributed developer blogs per quarter - if we are oversubscribed we suggest you post on another platform or hold off until the editorial team is ready to take on more posts.

### What makes a good developer blog post?

@@ -185,7 +185,7 @@ Now that you’ve set up the dbt project, database, and have taken a peek at the

Identifying the business process is done in collaboration with the business user. The business user has context around the business objectives and business processes, and can provide you with that information.

<Lightbox src="/img/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt/conversation.png" width="65%" title="Conversation between business user and analytics engineer"/>
<Lightbox src="/img/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt/conversation.png" title="Conversation between business user and analytics engineer"/>

Upon speaking with the CEO of AdventureWorks, you learn the following information:

8 changes: 4 additions & 4 deletions website/blog/2023-08-01-announcing-materialized-views.md
@@ -21,7 +21,7 @@ and updates on how to test MVs.

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. An enterprise customer I was working with, Jetblue, asked me for help running their dbt models every 2 minutes to meet a 5-minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created [lambda views](https://discourse.getdbt.com/t/how-to-create-near-real-time-models-with-just-dbt-sql/1457%20?) to meet the need.
After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created [lambda views](https://discourse.getdbt.com/t/how-to-create-near-real-time-models-with-just-dbt-sql/1457) to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry the cats have multiplied as well). Jetblue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

@@ -32,8 +32,8 @@ Today we are announcing that we now support Materialized Views in dbt. So, what
Materialized views are now an out-of-the-box materialization in your dbt project once you upgrade to the latest version of dbt v1.6 on the following adapters:

- [dbt-postgres](/reference/resource-configs/postgres-configs#materialized-views)
- [dbt-redshift](reference/resource-configs/redshift-configs#materialized-views)
- [dbt-snowflake](reference/resource-configs/snowflake-configs#dynamic-tables)
- [dbt-redshift](/reference/resource-configs/redshift-configs#materialized-views)
- [dbt-snowflake](/reference/resource-configs/snowflake-configs#dynamic-tables)
- [dbt-databricks](/reference/resource-configs/databricks-configs#materialized-views-and-streaming-tables)
- [dbt-materialize*](/reference/resource-configs/materialize-configs#incremental-models-materialized-views)
- [dbt-trino*](/reference/resource-configs/trino-configs#materialized-view)
@@ -227,4 +227,4 @@ Depending on how you orchestrate your materialized views, you can either run the

## Conclusion

Well, I’m excited for everyone to remove the lines in your packages.yml that installed your experimental package (at least if you’re using it for MVs) and start to get your hands dirty. We are still new in our journey and I look forward to hearing all the things you are creating and how we can better our best practices in this.
Well, I’m excited for everyone to remove the lines in your packages.yml that installed your experimental package (at least if you’re using it for MVs) and start to get your hands dirty. We are still new in our journey and I look forward to hearing all the things you are creating and how we can better our best practices in this.
@@ -0,0 +1,160 @@
---
title: Serverless, free-tier data stack with dlt + dbt core.
description: "In this article, Euan shares his personal project to fetch property price data during his and his partner's house-hunting process, and how he created a serverless free-tier data stack by using Google Cloud Functions to run data ingestion tool dlt alongside dbt for transformation."
slug: serverless-dlt-dbt-stack

authors: [euan_johnston]

hide_table_of_contents: false

date: 2024-01-15
is_featured: false
---



## The problem, the builder and tooling

**The problem**: My partner and I are considering buying a property in Portugal. There is no reference data for the real estate market here - how many houses are being sold, for what price? Nobody knows except the property office and maybe the banks, and they don’t readily divulge this information. The only data source we have is Idealista, which is a portal where real estate agencies post ads.

Unfortunately, there are significantly fewer properties than ads - it seems many real estate companies re-post the same ad that others do, with intentionally different data and often misleading bits of info. The real estate agencies do this so the interested parties reach out to them for clarification, and from there they can start a sales process. At the same time, the website with the ads is incentivised to allow this to continue as they get paid per ad, not per property.

**The builder:** I’m a data freelancer who deploys end-to-end solutions, so when I have a data problem, I cannot just let it go.

**The tools:** I want to be able to run my project on [Google Cloud Functions](https://cloud.google.com/functions) due to the generous free tier. [dlt](https://dlthub.com/) is a new Python library for declarative data ingestion which I have wanted to test for some time. Finally, I will use dbt Core for transformation.

## The starting point

If I want to have reliable information on the state of the market I will need to:

- Grab the messy data from Idealista and historize it.
- Deduplicate existing listings.
- Try to infer what listings sold for how much.

Once I have deduplicated listings with some online history, I can get an idea of:

- How expensive each property is.
- How fast properties get sold, hopefully a signal of whether they are “worth it” or not.
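
As a rough illustration of the "historize, then infer" idea, the sketch below treats each ingestion run as a dated snapshot and flags listings that have dropped out of the latest snapshot as possibly sold. This is a minimal sketch under stated assumptions, not the logic used in the project: the snapshot structure, the function name `infer_delisted`, and the rule "missing from the latest snapshot means delisted" are all hypothetical.

```python
from datetime import date


def infer_delisted(snapshots: dict[date, dict[str, float]]) -> dict[str, dict]:
    """Return listings that are missing from the most recent snapshot.

    `snapshots` maps a snapshot date to {listing_id: asking_price}.
    Both the structure and the delisting rule are illustrative assumptions.
    """
    days = sorted(snapshots)
    latest = days[-1]

    # Build a small history per listing: first seen, last seen, last price.
    history: dict[str, dict] = {}
    for day in days:
        for listing_id, price in snapshots[day].items():
            entry = history.setdefault(listing_id, {"first_seen": day})
            entry["last_seen"] = day
            entry["last_price"] = price

    # A listing not present in the latest snapshot is treated as delisted.
    delisted = {}
    for listing_id, entry in history.items():
        if entry["last_seen"] < latest:
            delisted[listing_id] = {
                "last_price": entry["last_price"],
                "days_listed": (entry["last_seen"] - entry["first_seen"]).days,
            }
    return delisted


snapshots = {
    date(2024, 1, 1): {"a": 250_000.0, "b": 310_000.0},
    date(2024, 1, 8): {"a": 245_000.0, "b": 310_000.0},
    date(2024, 1, 15): {"b": 305_000.0},
}
result = infer_delisted(snapshots)
print(result)
```

A disappearing ad does not prove a sale (it may have been withdrawn or re-posted), so the last asking price is only a proxy for a sale price.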

## Towards a solution

The solution has pretty standard components:

- An EtL pipeline. The little t stands for normalisation, such as transforming strings to dates or unpacking nested structures. This is handled by dlt functions written in Python.
- A transformation layer taking the source data loaded by my dlt functions and creating the tables necessary, handled by dbt.
- Due to the complexity of deduplication, I needed to add a human element to confirm the deduplication in Google Sheets.

These elements are reflected in the diagram below and further clarified in greater detail later in the article:

<Lightbox src="/img/blog/serverless-free-tier-data-stack-with-dlt-and-dbt-core/architecture_diagram.png" width="70%" title="Project architecture" />

### Ingesting the data

For ingestion, I use a couple of sources:

First, I ingest home listings from the Idealista API, accessed through [API Dojo's freemium wrapper](https://rapidapi.com/apidojo/api/idealista2). The dlt pipeline I created for ingestion is in [this repo](https://github.com/euanjohnston-dev/Idealista_pipeline).

After an initial round of transformation (described in the next section), the deduplicated data is loaded into BigQuery where I can query it from the Google Sheets client and manually review the deduplication.

When I'm happy with the results, I use the [ready-made dlt Sheets source connector](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets) to pull the data back into BigQuery, [as defined here](https://github.com/euanjohnston-dev/gsheets_check_pipeline).

### Transforming the data

For transforming I use my favorite solution, dbt Core. For running and orchestrating dbt on Cloud Functions, I am using dlt’s dbt Core runner. The benefit of the runner in this context is that I can re-use the same credential setup, instead of creating a separate profiles.yml file.

This is the package I created: <https://github.com/euanjohnston-dev/idealista_dbt_pipeline>

### Production-readying the pipeline

To make the pipeline more “production ready”, I made some improvements:

- Using a credential store instead of hard-coding passwords, in this case Google Secret Manager.
- Getting notified when the pipeline runs and what the outcome is. For this I send data to Slack via a decorator built on dlt's Slack helper, which posts the error on failure and the load metadata on success.

```python
from dlt.common.runtime.slack import send_slack_message

def notify_on_completion(hook):
    """Post the outcome of the decorated function to the Slack webhook `hook`."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            try:
                load_info = func(*args, **kwargs)
                message = f"Function {func.__name__} completed successfully. Load info: {load_info}"
                send_slack_message(hook, message)
                return load_info
            except Exception as e:
                message = f"Function {func.__name__} failed. Error: {str(e)}"
                send_slack_message(hook, message)
                # re-raise so the failure still surfaces to the caller
                raise
        return wrapper
    return decorator
```
```
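
To try the notification pattern without Slack credentials, the sender can be passed in as a parameter. The variation below is a sketch, not the project's code: `send` is a hypothetical stand-in for any callable that accepts a message string, and the two decorated functions are made-up examples.

```python
from functools import wraps


def notify_on_completion(send):
    """Report success or failure of the decorated function through `send`.

    `send` is any callable that accepts a message string (an assumption
    made here so the pattern can run without a Slack webhook).
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception as e:
                send(f"Function {func.__name__} failed. Error: {e}")
                raise
            send(f"Function {func.__name__} completed successfully. Load info: {result}")
            return result
        return wrapper
    return decorator


messages = []  # collects notifications instead of posting them


@notify_on_completion(messages.append)
def load_listings():
    return {"rows_loaded": 3}


@notify_on_completion(messages.append)
def broken_load():
    raise ValueError("API quota exceeded")


load_listings()
try:
    broken_load()
except ValueError:
    pass  # the decorator re-raises after notifying

print(messages)
```

With Slack, the same decorator could presumably be given something like `lambda msg: send_slack_message(hook, msg)` as the sender.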

## The outcome

The outcome was first and foremost a visualisation highlighting the unique properties available in my specific area of search. The map shown on the left of the page gives a live overview of location, number of duplicates (bubble size) and price (bubble colour). These and other features can be filtered using the sliders on the right. The result is a much less cluttered view from which to observe the actual inventory available.

<Lightbox src="/img/blog/serverless-free-tier-data-stack-with-dlt-and-dbt-core/map_screenshot.png" width="70%" title="Dashboard mapping overview" />

Further charts highlight additional metrics that can be measured accurately now that deduplication is complete. The most important are the development over time of “average price/square metre” and the properties that have been inferred to have been sold.
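
For readers who want to see the arithmetic behind an “average price/square metre” trend, here is a minimal sketch. The field names (`month`, `price`, `area_sqm`) and the sample figures are invented for illustration and do not come from the project's data.

```python
from collections import defaultdict
from statistics import mean


def avg_price_per_sqm(listings):
    """Average the per-listing price per square metre, grouped by month."""
    by_month = defaultdict(list)
    for item in listings:
        if item["area_sqm"] > 0:  # skip listings with a missing or zero area
            by_month[item["month"]].append(item["price"] / item["area_sqm"])
    return {month: round(mean(values), 2) for month, values in sorted(by_month.items())}


# Hypothetical deduplicated listings
listings = [
    {"month": "2023-11", "price": 200_000, "area_sqm": 100},
    {"month": "2023-11", "price": 300_000, "area_sqm": 120},
    {"month": "2023-12", "price": 210_000, "area_sqm": 100},
]
trend = avg_price_per_sqm(listings)
print(trend)
```

This averages the ratio per listing, so every property counts equally. Dividing total price by total area would instead weight larger properties more heavily; which one is preferable depends on the question being asked.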

### Next steps

This version was very much about getting a base from which to analyze the properties for my own personal use case.

As for further development, people have expressed interest in running the solution on their own specific target areas.

For this to work at scale I would need a more robust method to deal with duplicate attribution, which is a difficult problem as real estate agencies intentionally change details like number of rooms or surface area.

Perhaps this is a problem ML or GPT could solve as well as a human, given the limited options available.
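
Whoever makes the final call, a tolerant pre-filter could shortlist candidate pairs first. The sketch below is one possible approach rather than the method used in the project: it treats two ads as candidate duplicates when they are geographically close and their price and surface area differ only slightly. The field names and all three thresholds are assumptions chosen for illustration.

```python
from math import asin, cos, radians, sin, sqrt


def distance_m(a, b):
    """Great-circle (haversine) distance in metres between two listings."""
    lat1, lon1, lat2, lon2 = map(radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def is_candidate(a, b, max_m=150, price_tol=0.05, area_tol=0.10):
    """True if two listings are close enough to be worth a manual review.

    The thresholds are illustrative guesses, not tuned values.
    """
    close = distance_m(a, b) <= max_m
    price_ok = abs(a["price"] - b["price"]) <= price_tol * max(a["price"], b["price"])
    area_ok = abs(a["area"] - b["area"]) <= area_tol * max(a["area"], b["area"])
    return close and price_ok and area_ok


# Hypothetical ads: the first two describe what is probably the same flat
ad_1 = {"lat": 38.7223, "lon": -9.1393, "price": 250_000, "area": 100}
ad_2 = {"lat": 38.7225, "lon": -9.1394, "price": 255_000, "area": 95}
ad_3 = {"lat": 38.7600, "lon": -9.1500, "price": 250_000, "area": 100}

print(is_candidate(ad_1, ad_2), is_candidate(ad_1, ad_3))
```

Tolerances are used instead of exact matches because, as noted above, agencies deliberately vary details such as surface area. Looser thresholds catch more duplicates at the cost of more pairs to review.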

## Learnings and conclusion

The data problem itself was an eye-opener into the real-estate market. It’s a messy market full of unknowns and noise, which adds a significant purchase risk for first-time buyers.

Tooling-wise, it was surprising how quick it was to set everything up. dlt integrates well with dbt and enables fast and simple data ingestion, making this project simpler than I thought it would be.

### dlt

Good:

- As a big fan of dbt I love how seamlessly the two solutions complement one another. dlt handles the data cleaning and normalisation automatically so I can focus on curating and modelling it in dbt. While the automatic unpacking leaves some small adjustments for the analytics engineer, it’s much better than cleaning and typing JSON in the database or in custom Python code.
- When creating my first dummy pipeline I used DuckDB. It felt like a great introduction into how simple it is to get started and provided a solid starting block before developing something for the cloud.

Bad:

- I did have a small hiccup with the Google Sheets connector assuming OAuth authentication over my desired SDK, but this was relatively easy to rectify by explicitly stating GcpServiceAccountCredentials in the init.py file for the source.
- Using both a verified source in the gsheets connector and building my own from Rapid API endpoints seemed equally intuitive. However, I would have wanted more documentation on how to run these two pipelines in the same script with the dbt pipeline.

### dbt

No surprises there. I developed the project locally, and to deploy to cloud functions I injected credentials to dbt via the dlt runner. This meant I could re-use the setup I did for the other dlt pipelines.

```python
import dlt


def dbt_run():
    # make an authenticated connection with dlt to the dwh
    pipeline = dlt.pipeline(
        pipeline_name='dbt_pipeline',
        destination='bigquery',  # credentials read from env
        dataset_name='dbt'
    )
    # make a venv in case we have lib conflicts between dlt and current env
    venv = dlt.dbt.get_venv(pipeline)
    # package the pipeline, dbt package and env
    dbt = dlt.dbt.package(pipeline, "dbt/property_analytics", venv=venv)
    # and run it
    models = dbt.run_all()
    # show outcome
    for m in models:
        print(f"Model {m.model_name} materialized in {m.time} with status {m.status} and message {m.message}")
```

### Cloud functions

While I had used cloud functions before, I had never previously set them up for dbt and I was able to easily follow dlt’s docs to run the pipelines there. Cloud functions is a great solution to cheaply run small scale pipelines and my running cost of the project is a few cents a month. If the insights drawn from the project help us save even 1% of a house price, the project will have been a success.

### To sum up

dlt feels like the perfect solution for anyone who has scratched the surface of Python development. To be able to have schemas ready for transformation in such a short space of time is truly… transformational. As a freelancer, being able to accelerate the development of pipelines is a huge benefit in companies that are often frustrated with the amount of time it takes to start ‘showing value’.

I’d welcome the chance to discuss what’s been built to date or collaborate on any potential further development in the comments below.
2 changes: 1 addition & 1 deletion website/blog/2023-12-20-partner-integration-guide.md
@@ -20,7 +20,7 @@ This guide doesn't include how to integrate with dbt Core. If you’re intereste
Instead, we're going to focus on integrating with dbt Cloud. Integrating with dbt Cloud is a key requirement to become a dbt Labs technology partner, opening the door to a variety of collaborative commercial opportunities.

Here I'll cover how to get started, potential use cases you want to solve for, and points of integration to do so.

<!-- truncate -->
## New to dbt Cloud?

If you're new to dbt and dbt Cloud, we recommend you and your software developers try our [Getting Started Quickstarts](https://docs.getdbt.com/guides) after reading [What is dbt](https://docs.getdbt.com/docs/introduction). The documentation will help you familiarize yourself with how our users interact with dbt. By going through this, you will also create a sample dbt project to test your integration.