This week, we’ll dive into Analytics Engineering.
We’ll cover:
- Basics of analytics engineering
- dbt (data build tool)
- Testing and documenting
- Deployment to the cloud and locally
- Visualizing the data with Google Data Studio and Metabase
Goal: Transform the data loaded in the DWH into analytical views by developing a dbt project.
See week_4_analytics_engineering on GitHub and slides
Youtube videos:
- DE Zoomcamp 4.1.1 - Analytics Engineering Basics
- DE Zoomcamp 4.1.2 - What is dbt
- DE Zoomcamp 4.2.1 - Start Your dbt Project: BigQuery and dbt Cloud (Alternative A)
- DE Zoomcamp 4.2.2 - Start Your dbt Project: Postgres and dbt Core Locally (Alternative B)
- DE Zoomcamp 4.3.1 - Build the First dbt Models (✦ see note below)
- DE Zoomcamp 4.3.2 - Testing and Documenting the Project (✦ see note below)
- DE Zoomcamp 4.4.1 - Deployment Using dbt Cloud (Alternative A)
- DE Zoomcamp 4.4.2 - Deployment Using dbt Locally (Alternative B)
- DE Zoomcamp 4.5.1 - Visualising the data with Google Data Studio (Alternative A)
- DE Zoomcamp 4.5.2 - Visualising the data with Metabase (Alternative B)
✦ Note: These videos were recorded entirely in the dbt Cloud IDE, but the same steps can be followed locally in the IDE of your choice.
Prerequisites:
- A running warehouse (BigQuery or Postgres)
- A set of running pipelines ingesting the project dataset (week 3 completed): Taxi Rides NY dataset
- Yellow taxi data - Years 2019 and 2020
- Green taxi data - Years 2019 and 2020
- fhv data - Year 2019
- Data can be found here: https://github.com/DataTalksClub/nyc-tlc-data
You will need to create a dbt cloud account using this link and connect to your warehouse following these instructions. More detailed instructions in dbt_cloud_setup.md.
As an alternative to the cloud, which requires a cloud database, you can run the project by installing dbt locally. You can follow the official dbt documentation or use a Docker image from the official dbt repo. You will need to install the latest version (1.0) with the Postgres adapter (dbt-postgres). After the local installation, you will have to set up the connection to Postgres in the profiles.yml; you can find the templates here.
Note: For the remainder of these notes, I have chosen to continue with BigQuery, i.e. Alternative A.
Start Prefect Orion with the following commands.
cd ~/github/de-zoomcamp/week3
conda activate zoom
prefect orion start
## Open a new terminal window.
conda activate zoom
prefect profile use default
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
Check out the dashboard at http://127.0.0.1:4200.
Create a Python script to load our data from the GitHub repo to Google Cloud Storage (GCS).
File web_to_gcs.py
from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket


@task(retries=3, log_prints=True)
def fetch(dataset_url: str) -> pd.DataFrame:
    """Read taxi data from the web into a pandas DataFrame"""
    print(dataset_url)
    df = pd.read_csv(dataset_url, compression='gzip')
    return df

@task(log_prints=True)
def clean(color: str, df: pd.DataFrame) -> pd.DataFrame:
    """
yellow 2019
VendorID int64
tpep_pickup_datetime object (green=lpep_pickup_datetime)
tpep_dropoff_datetime object (green=lpep_dropoff_datetime)
passenger_count int64
trip_distance float64
RatecodeID int64
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type int64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
congestion_surcharge float64
yellow 2020
VendorID float64 (2019=int64)
tpep_pickup_datetime object (green=lpep_pickup_datetime)
tpep_dropoff_datetime object (green=lpep_dropoff_datetime)
passenger_count float64 (2019=int64)
trip_distance float64
RatecodeID float64 (2019=int64)
store_and_fwd_flag object
PULocationID int64
DOLocationID int64
payment_type float64 (2019=int64)
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
improvement_surcharge float64
total_amount float64
congestion_surcharge float64
green 2019
VendorID int64
lpep_pickup_datetime object
lpep_dropoff_datetime object
store_and_fwd_flag object
RatecodeID int64
PULocationID int64
DOLocationID int64
passenger_count int64
trip_distance float64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
ehail_fee float64
improvement_surcharge float64
total_amount float64
payment_type int64
trip_type int64
congestion_surcharge float64
green 2020
VendorID float64 (2019=int64)
lpep_pickup_datetime object
lpep_dropoff_datetime object
store_and_fwd_flag object
RatecodeID float64 (2019=int64)
PULocationID int64
DOLocationID int64
passenger_count float64 (2019=int64)
trip_distance float64
fare_amount float64
extra float64
mta_tax float64
tip_amount float64
tolls_amount float64
ehail_fee float64
improvement_surcharge float64
total_amount float64
payment_type float64 (2019=int64)
trip_type float64 (2019=int64)
congestion_surcharge float64
fhv 2019
dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
fhv 2020
dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag,Affiliated_base_number
"""
if color == "yellow":
"""Fix dtype issues"""
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
if color == "green":
"""Fix dtype issues"""
df["lpep_pickup_datetime"] = pd.to_datetime(df["lpep_pickup_datetime"])
df["lpep_dropoff_datetime"] = pd.to_datetime(df["lpep_dropoff_datetime"])
df["trip_type"] = df["trip_type"].astype('Int64')
if color == "yellow" or color == "green":
df["VendorID"] = df["VendorID"].astype('Int64')
df["RatecodeID"] = df["RatecodeID"].astype('Int64')
df["PULocationID"] = df["PULocationID"].astype('Int64')
df["DOLocationID"] = df["DOLocationID"].astype('Int64')
df["passenger_count"] = df["passenger_count"].astype('Int64')
df["payment_type"] = df["payment_type"].astype('Int64')
if color == "fhv":
"""Rename columns"""
df.rename({'dropoff_datetime':'dropOff_datetime'}, axis='columns', inplace=True)
df.rename({'PULocationID':'PUlocationID'}, axis='columns', inplace=True)
df.rename({'DOLocationID':'DOlocationID'}, axis='columns', inplace=True)
"""Fix dtype issues"""
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["dropOff_datetime"] = pd.to_datetime(df["dropOff_datetime"])
# See https://pandas.pydata.org/docs/user_guide/integer_na.html
df["PUlocationID"] = df["PUlocationID"].astype('Int64')
df["DOlocationID"] = df["DOlocationID"].astype('Int64')
print(df.head(2))
print(f"columns: {df.dtypes}")
print(f"rows: {len(df)}")
return df

@task()
def write_local(color: str, df: pd.DataFrame, dataset_file: str) -> Path:
    """Write DataFrame out locally as a parquet file"""
    Path(f"data/{color}").mkdir(parents=True, exist_ok=True)
    path = Path(f"data/{color}/{dataset_file}.parquet")
    df.to_parquet(path, compression="gzip")
    return path


@task()
def write_gcs(path: Path) -> None:
    """Upload local parquet file to GCS"""
    gcs_block = GcsBucket.load("zoom-gcs")
    gcs_block.upload_from_path(from_path=path, to_path=path)

@flow()
def web_to_gcs() -> None:
    # Set color ("fhv", "green" or "yellow") and year (2019 or 2020),
    # then rerun the script for each combination
    color = "yellow"
    year = 2019
    for month in range(1, 13):
        dataset_file = f"{color}_tripdata_{year}-{month:02}"
        dataset_url = f"https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{color}/{dataset_file}.csv.gz"
        df = fetch(dataset_url)
        df_clean = clean(color, df)
        path = write_local(color, df_clean, dataset_file)
        write_gcs(path)


if __name__ == "__main__":
    web_to_gcs()
Afterwards, I ran this script successively for color = fhv, yellow, green and for year = 2019, 2020.
python web_to_gcs.py
You should see the flow runs in the Prefect Orion UI.
You should see the files in your bucket on Google Cloud.
Then, in BigQuery, I created the tables fhv_2019, yellow_tripdata and green_tripdata like this.
I now see the three tables under trips_data_all.
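This step was shown with screenshots in the original notes. Below is a minimal SQL sketch of the idea, assuming the parquet files live under gs://<your-bucket>/data/ (the bucket name and the external table name are placeholders, not from the course):
-- External table over the parquet files in GCS
CREATE OR REPLACE EXTERNAL TABLE `hopeful-summer-375416.trips_data_all.external_yellow_tripdata`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://<your-bucket>/data/yellow/yellow_tripdata_20*.parquet']
);

-- Materialize it as a native table
CREATE OR REPLACE TABLE `hopeful-summer-375416.trips_data_all.yellow_tripdata` AS
SELECT * FROM `hopeful-summer-375416.trips_data_all.external_yellow_tripdata`;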
To check if everything is correct, I counted the number of rows of each of the tables.
SELECT COUNT(*) FROM `hopeful-summer-375416.trips_data_all.fhv_2019`;
-- 43,244,696
SELECT COUNT(*) FROM `hopeful-summer-375416.trips_data_all.yellow_tripdata`;
-- 109,047,518
SELECT COUNT(*) FROM `hopeful-summer-375416.trips_data_all.green_tripdata`;
-- 7,778,101
On February 15, 2023, I created a free dbt account with BigQuery. Then I followed the instructions in this file (dbt_cloud_setup.md).
In order to connect, we need the service account JSON file generated from BigQuery:
Step 1: Open the BigQuery credential wizard to create a service account in your taxi project
- Select an API: BigQuery API
- What data will you be accessing?: Application data
- Are you planning to use this API with Compute Engine…? No, I'm not using it
- Service account details: dbt-service-account
- Service account ID: dbt-service-account (Email address: dbt-service-account@hopeful-summer-375416.iam.gserviceaccount.com)
- Service account description: Service account for dbt cloud
- Role: BigQuery Data Editor
- Role: BigQuery Job User
- Role: BigQuery User
- Role: BigQuery Admin
Click on DONE button.
I reuse the JSON file (hopeful-summer-375416-c150de675a7d.json) created in the previous weeks.
00:00/7:14 (4.1.1) Analytics Engineering Basics
1:02/7:14 (4.1.1) Roles in a data team
Roles in a data team:
- Data Engineer: Prepares and maintains the infrastructure the data team needs.
- Analytics Engineer: Introduces good software engineering practices to the efforts of data analysts and data scientists.
- Data Analyst: Uses data to answer questions and solve problems.
2:20/7:14 (4.1.1) Tooling
Tooling:
- Data loading
- Data storing (cloud data warehouses like Snowflake, BigQuery, Redshift)
- Data modelling (tools like dbt or Dataform)
- Data presentation (BI tools like Google Data Studio, Looker, Mode or Tableau)
3:06/7:14 (4.1.1) Data modeling concepts
In the ELT approach, we transform the data once it is already in the data warehouse.
ETL vs ELT
- ETL
  - Slightly more stable and compliant data analysis
  - Higher storage and compute costs
- ELT
  - Faster and more flexible data analysis
  - Lower cost and lower maintenance
4:17/7:14 (4.1.1) Kimball’s Dimensional Modeling
- Objective
  - Deliver data understandable to the business users
  - Deliver fast query performance
- Approach
  - Prioritise user understandability and query performance over non-redundant data (3NF)
- Other approaches
  - Bill Inmon
  - Data vault
5:05/7:14 (4.1.1) Elements of Dimensional Modeling
- Fact tables
  - Measurements, metrics or facts
  - Correspond to a business process
  - "verbs"
- Dimension tables
  - Correspond to a business entity
  - Provide context to a business process
  - "nouns"
5:50/7:14 (4.1.1) Architecture of Dimensional Modeling
- Staging area
  - Contains the raw data
  - Not meant to be exposed to everyone
- Processing area
  - From raw data to data models
  - Focuses on efficiency
  - Ensuring standards
- Presentation area
  - Final presentation of the data
  - Exposure to business stakeholders
0:00/4:59 (4.1.2) What is dbt?
dbt is a transformation tool that allows anyone that knows SQL to deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.
1:03/4:59 (4.1.2) How does dbt work?
How does dbt work?
- Each model is:
  - A *.sql file
  - A SELECT statement; no DDL (Data Definition Language) or DML (Data Manipulation Language)
  - A file that dbt will compile and run in our DWH (data warehouse)
2:04/4:59 (4.1.2) How to use dbt?
- dbt Core: Open-source project that enables data transformation.
  - Builds and runs a dbt project (.sql and .yml files)
  - Includes SQL compilation logic, macros and database adapters
  - Includes a CLI (Command Line Interface) to run dbt commands locally
  - Open source and free to use
- dbt Cloud: SaaS (Software as a Service) application to develop and manage dbt projects.
  - Web-based IDE (Integrated Development Environment) to develop, run and test a dbt project
  - Jobs orchestration
  - Logging and alerting
  - Integrated documentation
  - Free for individuals (one developer seat)
3:30/4:59 (4.1.2) How are we going to use dbt?
- BigQuery (Alternative A):
  - Development using the cloud IDE
  - No local installation of dbt core
- Postgres (Alternative B):
  - Development using a local IDE of your choice
  - Local installation of dbt core connecting to the Postgres database
  - Running dbt models through the CLI
At the end, our project will look like this.
How are we going to use dbt?
00:00/5:32 (4.2.1) Create a new dbt project presentation (cloud or terminal)
dbt provides a starter project with all the basic folders and files.
Starter project structure
taxi_rides_ny/
  analysis/
  data/
  macros/
  models/example/
  snapshots/
  tests/
  .gitignore
  README.md
  dbt_project.yml
Example of dbt_project.yml
name: 'taxi_rides_ny'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
# profile: 'pg-dbt-workshop' # Using Postgres + dbt core locally (Alternative B)
profile: 'default' # Using BigQuery + dbt cloud (Alternative A)

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

# Configuring models
# Full documentation: https://docs.getdbt.com/reference/model-configs
# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  taxi_rides_ny:
    # Applies to all files under models/.../
    staging:
      materialized: view
    core:
      materialized: table

vars:
  payment_type_values: [1, 2, 3, 4, 5, 6]
See About dbt projects for more.
1:50/5:32 (4.2.1) With the CLI
After having installed dbt locally and setup the profiles.yml
, run dbt
init in the path we want to start the project to clone the starter
project.
But as I wrote before, I have chosen to continue with dbt cloud and BigQuery, i.e. alternative A.
2:11/5:32 (4.2.1) With dbt cloud
After having set up the dbt cloud credentials (repo and dwh) we can start the project from the web-based IDE.
02:25/5:32 (4.2.1) BigQuery dataset and schemas
On Google Cloud, under the BigQuery section, we should have these datasets. Under trips_data_all, we should see green_tripdata and yellow_tripdata.
03:10/5:32 (4.2.1) Repository structure
At the end, our github repository structure should look like the one from week 4 of the Data Engineering Zoomcamp.
03:24/5:32 (4.2.1) Start the project in dbt cloud
On dbt cloud interface, we need to click on Initialize your project button. All of the folders are created.
04:05/5:32 (4.2.1) dbt_project.yml in dbt cloud
To be able to modify the files, you must first create a new branch and switch to it.
See Checkout a new git branch.
We should change the dbt_project.yml file like this.
File dbt_project.yml
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'taxi_rides_ny'
version: '1.0.0'
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: 'default'
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  taxi_rides_ny:
    # Applies to all files under models/example/
    example:
      materialized: view
Two names to set in this file:
- Name of the project: taxi_rides_ny
- Name of the model: taxi_rides_ny
0:00/44:38 (4.3.1) Anatomy of a dbt model
Anatomy of a dbt model
In dbt, you can combine SQL with Jinja, a templating language.
Using Jinja turns our dbt project into a programming environment for SQL, giving you the ability to do things that aren't normally possible in SQL (e.g. evaluate an expression, variable or function call, or print the result into the template).
See Jinja and macros for more.
Materializations are strategies for persisting dbt models in a warehouse. There are four types of materializations built into dbt:
- table (most common)
- view (most common)
- incremental (useful for data that doesn't really change every day)
- ephemeral (not built in the database; dbt interpolates the model's code as a CTE into the models that depend on it)
By default, dbt models are materialized as views. Models can be configured with a different materialization by supplying the materialized configuration parameter. Materializations can be configured by default in dbt_project.yml or directly inside the model .sql files.
{{ config(materialized='table') }}
select *
from ...
See Materializations for more.
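The notes do not show an incremental model; here is a minimal sketch of what one could look like. This is my own assumption, not code from the course (the is_incremental() macro and {{ this }} are standard dbt):
{{ config(materialized='incremental') }}

select *
from {{ source('staging', 'green_tripdata') }}

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already in the table
  where lpep_pickup_datetime > (select max(lpep_pickup_datetime) from {{ this }})
{% endif %}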
03:15/44:38 (4.3.1) The FROM clause: Sources and seeds
Sources make it possible to name and describe the data loaded into your warehouse by your Extract and Load tools.
- The data loaded to our dwh that we use as sources for our models
- Configuration defined in the yml files in the models folder
- Used with the source macro that will resolve the name to the right schema, plus build the dependencies automatically
- Source freshness can be defined and tested
See Sources for more.
Example from file models/staging/schema.yml
sources:
  - name: staging
    database: taxi-rides-ny-339813 # For BigQuery
    schema: trips_data_all
    # loaded_at_field: record_loaded_at
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
        freshness:
          error_after: {count: 6, period: hour}
The code below keeps only the first row of duplicates with the condition where rn = 1. See 4 Ways to Check for Duplicate Rows in SQL Server.
Example from file models/staging/stg_green_tripdata.sql
with tripdata as
(
select *,
row_number() over(partition by vendorid, lpep_pickup_datetime) as rn
from {{ source('staging','green_tripdata') }}
where vendorid is not null
)
05:22/44:38 (4.3.1) The FROM clause: Seeds (CSV files)
Seeds are CSV files in your dbt project (typically in your seeds directory) that dbt can load into your data warehouse using the dbt seed command.
- CSV files stored in our repository under the seed folder
- Benefits of version controlling
- Equivalent to a copy command
- Recommended for data that doesn’t change frequently
- Runs with dbt seed -s file_name
See Seeds for more.
Example taxi_zone_lookup.csv
"locationid","borough","zone","service_zone"
...
Example models/core/dim_zones.sql
select
locationid,
borough,
zone,
replace(service_zone,'Boro','Green') as service_zone
from {{ ref('taxi_zone_lookup') }}
06:40/44:38 (4.3.1) The FROM clause: ref macro
The most important function in dbt is ref()
; it’s impossible to build even moderately complex models without it.
ref()
is how you reference one model within another. This is a very common behavior, as typically models are built to
be "stacked" on top of one another.
- Macro to reference the underlying tables and views that we are building in the data warehouse
- Run the same code in any environment, it will resolve the correct schema for you
- Dependencies are built automatically
See ref for more.
dbt model (models/core/fact_trips.sql)
with green_data as (
select *,
'Green' as service_type
from {{ ref('stg_green_tripdata') }}
),
This model will be translated into the following compiled code.
Compiled code
with green_data as (
select *,
'Green' as service_type
from "production"."dbt_aboisvert"."stg_green_tripdata"
),
08:38/44:38 (4.3.1) Define a source and develop the first model (stg_green_tripdata)
Create these folders and files under the dbt project taxi_rides_ny:
models/core/
models/staging/
  schema.yml
  stg_green_tripdata.sql
File models/staging/schema.yml
version: 2

sources:
  - name: staging
    # database: taxi-rides-ny-339813
    database: hopeful-summer-375416
    schema: trips_data_all
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
File models/staging/stg_green_tripdata.sql
{{ config(materialized="view") }}
select * from {{ source('staging', 'green_tripdata') }}
limit 100
Now, we can run this model with one of these following commands:
dbt run # Builds models in your target database.
dbt run --select stg_green_tripdata # Builds a specific model.
dbt run --select stg_green_tripdata+ # Builds a specific model and its children.
dbt run --select +stg_green_tripdata # Builds a specific model and its ancestors.
dbt run --select +stg_green_tripdata+ # Builds a specific model and its children and ancestors.
This is what appears in dbt Cloud after running the command dbt run.
16:06/44:38 (4.3.1)
Then we modify the query models/staging/stg_green_tripdata.sql by listing the columns explicitly.
File models/staging/stg_green_tripdata.sql
{{ config(materialized="view") }}
select
-- identifiers
cast(vendorid as integer) as vendorid,
cast(ratecodeid as integer) as ratecodeid,
cast(pulocationid as integer) as pickup_locationid,
cast(dolocationid as integer) as dropoff_locationid,
-- timestamps
cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
-- trip info
store_and_fwd_flag,
cast(passenger_count as integer) as passenger_count,
cast(trip_distance as numeric) as trip_distance,
cast(trip_type as integer) as trip_type,
-- payment info
cast(fare_amount as numeric) as fare_amount,
cast(extra as numeric) as extra,
cast(mta_tax as numeric) as mta_tax,
cast(tip_amount as numeric) as tip_amount,
cast(tolls_amount as numeric) as tolls_amount,
cast(ehail_fee as numeric) as ehail_fee,
cast(improvement_surcharge as numeric) as improvement_surcharge,
cast(total_amount as numeric) as total_amount,
cast(payment_type as integer) as payment_type,
cast(congestion_surcharge as numeric) as congestion_surcharge
from {{ source('staging', 'green_tripdata') }}
limit 100
I run this command.
dbt run --select stg_green_tripdata
Then we go to BigQuery and we see that the view stg_green_tripdata is created.
17:24/44:38 (4.3.1) Definition and usage of macros
Macros in Jinja are pieces of code that can be reused multiple times; they are analogous to "functions" in other programming languages, and are extremely useful if you find yourself repeating code across multiple models. Macros are defined in .sql files, typically in your macros directory (docs).
- Use control structures (e.g. if statements and for loops) in SQL
- Use environment variables in your dbt project for production deployments
- Operate on the results of one query to generate another query
- Abstract snippets of SQL into reusable macros — these are analogous to functions in most programming languages.
Create these folders and files under taxi_rides_ny:
macros/
  get_payment_type_description.sql
Definition of the macro (macros/get_payment_type_description.sql)
{#
This macro returns the description of the payment_type
#}
{% macro get_payment_type_description(payment_type) -%}
case {{ payment_type }}
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end
{%- endmacro %}
Usage of the macro (models/staging/stg_green_tripdata.sql)
select ...
cast(payment_type as integer) as payment_type,
{{ get_payment_type_description('payment_type') }} as payment_type_description,
cast(congestion_surcharge as numeric) as congestion_surcharge
from {{ source('staging', 'green_tripdata') }}
limit 100
Compiled code of the macro
create or replace view production.dbt_aboisvert.stg_green_tripdata as
select ...
case payment_type
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end as payment_type_description,
cast(congestion_surcharge as numeric) as congestion_surcharge
from "production"."staging"."green_tripdata"
limit 100
Then we run this again.
dbt run --select stg_green_tripdata
Here are the logs.
File console_output.txt
16:35:46 Began running node model.taxi_rides_ny.stg_green_tripdata
16:35:46 1 of 1 START sql view model dbt_aboisvert.stg_green_tripdata ................... [RUN]
16:35:46 Acquiring new bigquery connection 'model.taxi_rides_ny.stg_green_tripdata'
16:35:46 Began compiling node model.taxi_rides_ny.stg_green_tripdata
16:35:46 Writing injected SQL for node "model.taxi_rides_ny.stg_green_tripdata"
16:35:46 Timing info for model.taxi_rides_ny.stg_green_tripdata (compile): 2023-02-16 16:35:46.674102 => 2023-02-16 16:35:46.696478
16:35:46 Began executing node model.taxi_rides_ny.stg_green_tripdata
16:35:46 Writing runtime sql for node "model.taxi_rides_ny.stg_green_tripdata"
16:35:46 Opening a new connection, currently in state closed
16:35:46 On model.taxi_rides_ny.stg_green_tripdata: /* {"app": "dbt", "dbt_version": "1.4.1", "profile_name": "user", "target_name": "default", "node_id": "model.taxi_rides_ny.stg_green_tripdata"} */
create or replace view `hopeful-summer-375416`.`dbt_aboisvert`.`stg_green_tripdata`
OPTIONS()
as
select
-- identifiers
cast(vendorid as integer) as vendorid,
cast(ratecodeid as integer) as ratecodeid,
cast(pulocationid as integer) as pickup_locationid,
cast(dolocationid as integer) as dropoff_locationid,
-- timestamps
cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
-- trip info
store_and_fwd_flag,
cast(passenger_count as integer) as passenger_count,
cast(trip_distance as numeric) as trip_distance,
cast(trip_type as integer) as trip_type,
-- payment info
cast(fare_amount as numeric) as fare_amount,
cast(extra as numeric) as extra,
cast(mta_tax as numeric) as mta_tax,
cast(tip_amount as numeric) as tip_amount,
cast(tolls_amount as numeric) as tolls_amount,
cast(ehail_fee as numeric) as ehail_fee,
cast(improvement_surcharge as numeric) as improvement_surcharge,
cast(total_amount as numeric) as total_amount,
cast(payment_type as integer) as payment_type,
case payment_type
when 1 then 'Credit card'
when 2 then 'Cash'
when 3 then 'No charge'
when 4 then 'Dispute'
when 5 then 'Unknown'
when 6 then 'Voided trip'
end as payment_type_description,
cast(congestion_surcharge as numeric) as congestion_surcharge
from `hopeful-summer-375416`.`trips_data_all`.`green_tripdata`
limit 100;
16:35:47 BigQuery adapter: https://console.cloud.google.com/bigquery?project=hopeful-summer-375416&j=bq:northamerica-northeast1:381afe21-4234-411c-8b26-fb40d05ccfeb&page=queryresults
16:35:47 Timing info for model.taxi_rides_ny.stg_green_tripdata (execute): 2023-02-16 16:35:46.697099 => 2023-02-16 16:35:47.326019
16:35:47 Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '0b79f185-7d06-49cf-ab1e-1b6cacff236b', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f42700e9fa0>]}
16:35:47 1 of 1 OK created sql view model dbt_aboisvert.stg_green_tripdata .............. CREATE VIEW (0 processed) in 0.65s
16:35:47 Finished running node model.taxi_rides_ny.stg_green_tripdata
You can also see the compiled code under target > compiled > taxi_rides_ny > models > staging > stg_green_tripdata.sql.
24:10/44:38 (4.3.1) Importing and using dbt packages
dbt packages are in fact standalone dbt projects, with models and macros that tackle a specific problem area. As a dbt user, by adding a package to your project, the package's models and macros will become part of your own project.
- Like libraries in other programming languages
- Standalone dbt projects, with models and macros that tackle a specific problem area.
- By adding a package to your project, the package’s models and macros will become part of your own project.
- Imported in the packages.yml file and installed by running dbt deps
- A list of useful packages can be found in the dbt package hub.
See Packages for more.
Specifications of the packages to import in the project (packages.yml)
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0
To install this package, run this command:
dbt deps
We should see these logs and a lot of folders and files created under dbt_packages/dbt_utils.
Let’s go back to the model.
Usage of a macro from a package (models/staging/stg_green_tripdata.sql).
{{ config(materialized='view') }}
select
-- identifiers
{{ dbt_utils.surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
cast(vendorid as integer) as vendorid,
cast(ratecodeid as integer) as ratecodeid,
...
Now, let’s run this with this command.
dbt run --select stg_green_tripdata
Here is the compiled code.
File target/compiled/taxi_rides_ny/models/staging/stg_green_tripdata.sql
select
    -- identifiers
    to_hex(md5(cast(coalesce(cast(vendorid as string), '') || '-' || coalesce(cast(lpep_pickup_datetime as string), '') as string))) as tripid,
    cast(vendorid as integer) as vendorid,
    cast(ratecodeid as integer) as ratecodeid,
    ...
29:18/44:38 (4.3.1) Definition of variables and setting a variable from the cli
dbt provides a mechanism, variables, to provide data to models for compilation. Variables can be used to configure timezones, avoid hardcoding table names or otherwise provide data to models to configure how they are compiled.
- Variables are useful for defining values that should be used across the project
- With a macro, dbt allows us to provide data to models for compilation
- To use a variable, we use the {{ var('…') }} function
- Variables can be defined in two ways:
  - In the dbt_project.yml file
  - On the command line
See Project variables for more.
Global variable we define under dbt_project.yml.
vars:
  payment_type_values: [1, 2, 3, 4, 5, 6]
Variable whose value we can change via the CLI (models/staging/stg_green_tripdata.sql).
from {{ source('staging', 'green_tripdata') }}
where vendorid is not null
-- dbt build --m <model.sql> --var 'is_test_run: false'
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
Then, we can run the model.
dbt build --m stg_green_tripdata --var 'is_test_run: false'
33:33/44:38 (4.3.1) Add second model (stg_yellow_tripdata)
We make the same changes for models/staging/stg_yellow_tripdata.sql.
File models/staging/stg_yellow_tripdata.sql
{{ config(materialized='view') }}
select
-- identifiers
{{ dbt_utils.surrogate_key(['vendorid', 'tpep_pickup_datetime']) }} as tripid,
cast(vendorid as integer) as vendorid,
cast(ratecodeid as integer) as ratecodeid,
cast(pulocationid as integer) as pickup_locationid,
cast(dolocationid as integer) as dropoff_locationid,
-- timestamps
cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,
-- trip info
store_and_fwd_flag,
cast(passenger_count as integer) as passenger_count,
cast(trip_distance as numeric) as trip_distance,
-- yellow cabs are always street-hail
1 as trip_type,
-- payment info
cast(fare_amount as numeric) as fare_amount,
cast(extra as numeric) as extra,
cast(mta_tax as numeric) as mta_tax,
cast(tip_amount as numeric) as tip_amount,
cast(tolls_amount as numeric) as tolls_amount,
cast(0 as numeric) as ehail_fee,
cast(improvement_surcharge as numeric) as improvement_surcharge,
cast(total_amount as numeric) as total_amount,
cast(payment_type as integer) as payment_type,
{{ get_payment_type_description('payment_type') }} as payment_type_description,
cast(congestion_surcharge as numeric) as congestion_surcharge
from {{ source('staging', 'yellow_tripdata') }}
where vendorid is not null
-- dbt build --m <model.sql> --var 'is_test_run: false'
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
Now we use dbt run because we have two models.
dbt run --var 'is_test_run: false'
We should see this log and stg_yellow_tripdata created in BigQuery (we need to refresh the page).
35:16/44:38 (4.3.1) Creating and using dbt seed (taxi_zones_lookup and dim_zone)
The dbt seed command will load csv files located in the seed-paths directory of your dbt project into your data warehouse.
In our dbt Cloud project, create the seeds folder, create the file seeds/taxi_zone_lookup.csv and paste into it the content of that csv file.
Afterwards, run the command dbt seed to create the table taxi_zone_lookup in BigQuery. We should have 265 rows.
We need to specify the data types of the csv file in dbt_project.yml.
File dbt_project.yml
seeds:
  taxi_rides_ny:
    taxi_zone_lookup:
      +column_types:
        locationid: numeric
If we slightly modify the data in the csv file (for example, change 1,"EWR","Newark Airport","EWR" to 1,"NEWR","Newark Airport","EWR"), we can run the following command:
dbt seed --full-refresh
We can see the change in the BigQuery table taxi_zone_lookup.
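For example, a quick look at the modified row (a sketch; I assume the seed landed in my dev dataset dbt_aboisvert, the target shown in the run logs above):
SELECT * FROM `hopeful-summer-375416.dbt_aboisvert.taxi_zone_lookup` WHERE locationid = 1;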
Then, create the file models/core/dim_zones.sql.
File models/core/dim_zones.sql
{{ config(materialized='table') }}
select
locationid,
borough,
zone,
replace(service_zone,'Boro','Green') as service_zone
from {{ ref('taxi_zone_lookup') }}
Ideally, we want everything in the core directory to be materialized as tables so that queries are efficient.
41:20/44:38 (4.3.1) Unioning our models in fact_trips and understanding dependencies
Now, create the model models/core/fact_trips.sql.
File models/core/fact_trips.sql
{{ config(materialized='table') }}
with green_data as (
select *,
'Green' as service_type
from {{ ref('stg_green_tripdata') }}
),
yellow_data as (
select *,
'Yellow' as service_type
from {{ ref('stg_yellow_tripdata') }}
),
trips_unioned as (
select * from green_data
union all
select * from yellow_data
),
dim_zones as (
select * from {{ ref('dim_zones') }}
where borough != 'Unknown'
)
select
trips_unioned.tripid,
trips_unioned.vendorid,
trips_unioned.service_type,
trips_unioned.ratecodeid,
trips_unioned.pickup_locationid,
pickup_zone.borough as pickup_borough,
pickup_zone.zone as pickup_zone,
trips_unioned.dropoff_locationid,
dropoff_zone.borough as dropoff_borough,
dropoff_zone.zone as dropoff_zone,
trips_unioned.pickup_datetime,
trips_unioned.dropoff_datetime,
trips_unioned.store_and_fwd_flag,
trips_unioned.passenger_count,
trips_unioned.trip_distance,
trips_unioned.trip_type,
trips_unioned.fare_amount,
trips_unioned.extra,
trips_unioned.mta_tax,
trips_unioned.tip_amount,
trips_unioned.tolls_amount,
trips_unioned.ehail_fee,
trips_unioned.improvement_surcharge,
trips_unioned.total_amount,
trips_unioned.payment_type,
trips_unioned.payment_type_description,
trips_unioned.congestion_surcharge
from trips_unioned
inner join dim_zones as pickup_zone
on trips_unioned.pickup_locationid = pickup_zone.locationid
inner join dim_zones as dropoff_zone
on trips_unioned.dropoff_locationid = dropoff_zone.locationid
We have this.
The dbt run command will create everything except the seeds. If we also want to run the seeds, we use the dbt build command.
The command dbt build --select fact_trips will only run the model fact_trips.
The command dbt build --select +fact_trips will run everything that fact_trips needs. dbt already knows the dependencies.
0:00/13:51 (4.3.2) Testing
Tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds and snapshots). When you run dbt test, dbt will tell you if each test in your project passes or fails.
- Assumptions that we make about our data
- Tests in dbt are essentially a select SQL query
- These assumptions get compiled to SQL that returns the number of failing records
- Tests are defined on a column in the .yml file
- dbt provides basic tests to check if the column values are:
- Unique
- Not null
- Accepted values
- A foreign key to another table
- You can create your custom tests as queries
See Tests for more.
Compiled code of the not_null test.
select *
from "production"."dbt_aboisvert"."stg_yellow_tripdata"
where tripid is null
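Beyond the built-in tests, you can write your own. A minimal sketch of a singular test (my own example, not from the course): any select placed in the tests/ folder is treated as a test that fails if it returns rows. The file name and column choice are hypothetical.
-- tests/assert_total_amount_non_negative.sql
-- Fails if any trip has a negative total_amount
select *
from {{ ref('stg_green_tripdata') }}
where total_amount < 0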
Definition of basic tests in the .yml files (models/staging/schema.yml)
- name: Payment_type
  description: >
    A numeric code signifying how the passenger paid for the trip.
  tests:
    - accepted_values:
        values: "{{ var('payment_type_values') }}" # [1, 2, 3, 4, 5, 6]
        severity: warn
        quote: false
[...]
- name: Pickup_locationid
  description: locationid where the meter was engaged.
  tests:
    - relationships:
        to: ref('taxi_zone_lookup')
        field: locationid
        severity: warn
[...]
columns:
  - name: tripid
    description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
    tests:
      - unique:
          severity: warn
      - not_null:
          severity: warn
Warnings in the CLI from running dbt test.
3:17/13:51 (4.3.2) Documentation
- dbt provides a way to generate documentation for your dbt project and render it as a website.
- The documentation for your project includes:
  - Information about your project: model code (both from the .sql file and compiled), model dependencies, sources, auto-generated DAG from the ref and source macros, descriptions (from the .yml files) and tests
  - Information about your data warehouse (information_schema): column names and data types, table stats like size and rows
- dbt docs can also be hosted in dbt Cloud
See About documentation for more.
Example from models/core/schema.yml
version: 2

models:
  - name: dim_zones
    description: >
      List of unique zones identified by locationid.
      Includes the service zone they correspond to (Green or Yellow).
  - name: fact_trips
    description: >
      Taxi trips corresponding to both service zones (Green and Yellow).
      The table contains records where both pickup and dropoff locations are valid and known zones.
      Each record corresponds to a trip uniquely identified by tripid.
  - name: dm_monthly_zone_revenue
    description: >
      Aggregated table of all taxi trips corresponding to both service zones (Green and Yellow) per pickup zone, month and service.
      The table contains monthly sums of the fare elements used to calculate the monthly revenue.
      The table also contains monthly indicators like number of trips and average trip distance.
    columns:
      - name: revenue_monthly_total_amount
        description: Monthly sum of the total_amount of the fare charged for the trip per pickup zone, month and service.
        tests:
          - not_null:
              severity: error
4:28/13:51 (4.3.2) Deployment
We define a model in the file models/core/dm_monthly_zone_revenue.sql.
File models/core/dm_monthly_zone_revenue.sql
{{ config(materialized='table') }}
with trips_data as (
select * from {{ ref('fact_trips') }}
)
select
-- Revenue grouping
pickup_zone as revenue_zone,
-- date_trunc('month', pickup_datetime) as revenue_month,
-- Note: For BQ use instead: date_trunc(pickup_datetime, month) as revenue_month,
date_trunc(pickup_datetime, month) as revenue_month,
service_type,
-- Revenue calculation
sum(fare_amount) as revenue_monthly_fare,
sum(extra) as revenue_monthly_extra,
sum(mta_tax) as revenue_monthly_mta_tax,
sum(tip_amount) as revenue_monthly_tip_amount,
sum(tolls_amount) as revenue_monthly_tolls_amount,
sum(ehail_fee) as revenue_monthly_ehail_fee,
sum(improvement_surcharge) as revenue_monthly_improvement_surcharge,
sum(total_amount) as revenue_monthly_total_amount,
sum(congestion_surcharge) as revenue_monthly_congestion_surcharge,
-- Additional calculations
count(tripid) as total_monthly_trips,
avg(passenger_count) as avg_montly_passenger_count,
avg(trip_distance) as avg_montly_trip_distance
from trips_data
group by 1,2,3
Then, we add another section to define the model inside models/staging/schema.yml. This section is used in particular to document the model and to add tests.
File models/staging/schema.yml
models:
  - name: stg_green_tripdata
    description: >
      Trips made by green taxis, also known as boro taxis and street-hail liveries.
      Green taxis may respond to street hails, but only in the areas indicated in green on the
      map (i.e. above W 110 St/E 96th St in Manhattan and in the boroughs).
      The records were collected and provided to the NYC Taxi and Limousine Commission (TLC) by
      technology service providers.
    columns:
      - name: tripid
        description: Primary key for this table, generated with a concatenation of vendorid+pickup_datetime
        tests:
          - unique:
              severity: warn
          - not_null:
              severity: warn
      - name: VendorID
        description: >
          A code indicating the TPEP provider that provided the record.
          1= Creative Mobile Technologies, LLC;
          2= VeriFone Inc.
      - name: pickup_datetime
        description: The date and time when the meter was engaged.
      - name: dropoff_datetime
        description: The date and time when the meter was disengaged.
      - name: Passenger_count
        description: The number of passengers in the vehicle. This is a driver-entered value.
      - name: Trip_distance
        description: The elapsed trip distance in miles reported by the taximeter.
      - name: Pickup_locationid
        description: locationid where the meter was engaged.
        tests:
          - relationships:
              to: ref('taxi_zone_lookup')
              field: locationid
              severity: warn
      - name: dropoff_locationid
        description: locationid where the meter was disengaged.
        tests:
          - relationships:
              to: ref('taxi_zone_lookup')
              field: locationid
      - name: RateCodeID
        description: >
          The final rate code in effect at the end of the trip.
          1= Standard rate
          2= JFK
          3= Newark
          4= Nassau or Westchester
          5= Negotiated fare
          6= Group ride
      - name: Store_and_fwd_flag
        description: >
          This flag indicates whether the trip record was held in vehicle
          memory before sending to the vendor, aka “store and forward,”
          because the vehicle did not have a connection to the server.
          Y= store and forward trip
          N= not a store and forward trip
      - name: Dropoff_longitude
        description: Longitude where the meter was disengaged.
      - name: Dropoff_latitude
        description: Latitude where the meter was disengaged.
      - name: Payment_type
        description: >
          A numeric code signifying how the passenger paid for the trip.
        tests:
          - accepted_values:
              values: "{{ var('payment_type_values') }}"
              severity: warn
              quote: false
      - name: payment_type_description
        description: Description of the payment_type code
      - name: Fare_amount
        description: >
          The time-and-distance fare calculated by the meter.
      - name: Extra
        description: >
          Miscellaneous extras and surcharges. Currently, this only includes
          the $0.50 and $1 rush hour and overnight charges.
      - name: MTA_tax
        description: >
          $0.50 MTA tax that is automatically triggered based on the metered
          rate in use.
      - name: Improvement_surcharge
        description: >
          $0.30 improvement surcharge assessed trips at the flag drop. The
          improvement surcharge began being levied in 2015.
      - name: Tip_amount
        description: >
          Tip amount. This field is automatically populated for credit card
          tips. Cash tips are not included.
      - name: Tolls_amount
        description: Total amount of all tolls paid in trip.
      - name: Total_amount
        description: The total amount charged to passengers. Does not include cash tips.
Next, we define the variable payment_type_values in the dbt_project.yml.
File dbt_project.yml
vars:
  payment_type_values: [1, 2, 3, 4, 5, 6]
9:04/13:51 (4.3.2) Run dbt test
Then, run dbt test or one of the following commands:
dbt test # Run tests on data in deployed models.
dbt test --select stg_green_tripdata # Run tests on data in specified model.
dbt test --select stg_green_tripdata+ # Run tests on data in specified model and its children.
dbt test --select +stg_green_tripdata # Run tests on data in specified model and its ancestors.
dbt build # Run the seeds, run the tests and run the models.
We see that the key tripid of the stg_green_tripdata table is not unique, contrary to what we thought.
See tests and Configuring test severity for more information about tests.
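To see the duplicates for ourselves, a query along these lines works (a sketch; the dev dataset name dbt_aboisvert is taken from the run logs above):
SELECT tripid, COUNT(*) AS n
FROM `hopeful-summer-375416.dbt_aboisvert.stg_green_tripdata`
GROUP BY tripid
HAVING COUNT(*) > 1
ORDER BY n DESC
LIMIT 10;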
Finally, we add a part at the beginning of the sql files and change the source.
The code below keeps only the first row of duplicates with the condition where rn = 1. See 4 Ways to Check for Duplicate Rows in SQL Server.
File models/staging/stg_green_tripdata.sql
with tripdata as
(
select *,
row_number() over(partition by vendorid, lpep_pickup_datetime) as rn
from {{ source('staging','green_tripdata') }}
where vendorid is not null
)
select
...
from tripdata
where rn = 1
File models/staging/stg_yellow_tripdata.sql
with tripdata as
(
select *,
row_number() over(partition by vendorid, tpep_pickup_datetime) as rn
from {{ source('staging','yellow_tripdata') }}
where vendorid is not null
)
select
...
from tripdata
where rn = 1
Run the dbt build command and everything should be green.
Finally, create models/core/schema.yml to complete the project.
File models/core/schema.yml
version: 2

models:
  - name: dim_zones
    description: >
      List of unique zones identified by locationid.
      Includes the service zone they correspond to (Green or Yellow).
  - name: fact_trips
    description: >
      Taxi trips corresponding to both service zones (Green and Yellow).
      The table contains records where both pickup and dropoff locations are valid and known zones.
      Each record corresponds to a trip uniquely identified by tripid.
  - name: dm_monthly_zone_revenue
    description: >
      Aggregated table of all taxi trips corresponding to both service zones (Green and Yellow) per pickup zone, month and service.
      The table contains monthly sums of the fare elements used to calculate the monthly revenue.
      The table also contains monthly indicators like number of trips and average trip distance.
    columns:
      - name: revenue_monthly_total_amount
        description: Monthly sum of the total_amount of the fare charged for the trip per pickup zone, month and service.
        tests:
          - not_null:
              severity: error
0:00/13:56 (4.4.1) Theory
Running dbt in production means setting up a system to run a dbt job on a schedule, rather than running dbt commands manually from the command line.
In addition to setting up a schedule, there are other considerations when setting up dbt to run in production.
- Process of running the models we created in our development environment in a production environment
- Development and later deployment allows us to continue building models and testing them without affecting our production environment
- A deployment environment will normally have a different schema in our data warehouse and ideally a different user
- A development-to-deployment workflow will be something like:
  - Develop in a user branch
  - Open a PR to merge into the main branch
  - Merge the branch to the main branch
  - Run the new models in the production environment using the main branch
  - Schedule the models
See About deployments for more.
1:47/13:56 (4.4.1) Running a dbt project in production
- dbt Cloud includes a scheduler to create jobs that run in production
- A single job can run multiple commands
- Jobs can be triggered manually or on schedule
- Each job will keep a log of the runs over time
- Each run will have the logs for each command
- A job could also generate documentation, that could be viewed under the run information
- If dbt source freshness was run, the results can also be viewed at the end of a job
3:35/13:56 (4.4.1) What is Continuous Integration (CI)?
- CI is the practice of regularly merging development branches into a central repository, after which automated builds and tests are run.
- The goal is to reduce adding bugs to the production code and maintain a more stable project.
- dbt allows us to enable CI on pull requests (PR)
  - Enabled via webhooks from GitHub or GitLab
  - When a PR is ready to be merged, a webhook is received in dbt Cloud that will enqueue a new run of the specified job
  - The run of the CI job will be against a temporary schema
  - No PR will be able to be merged unless the run has completed successfully
5:10/13:56 (4.4.1) How to do this with dbt cloud (Alternative A)
In dbt cloud, commit my-new-branch and pull request. Next, in my GitHub repository taxi_rides_ny, I merge this pull request to my main branch.
6:02/13:56 (4.4.1) Environments
Now go to Environments (Deploy > Environments) and click on Create Environment button.
Create the new environment with this information:
- Name: Production
- dbt Version: 1.4 (latest)
- Dataset: production
Click on Save button.
7:05/13:56 (4.4.1) Jobs
Now go to Jobs (Deploy > Jobs) and click on Create New Job button.
Create the new job with this information:
- Job Name: dbt build
- Environment: Production
- dbt Version: Inherited from Production (1.4 (latest))
- Target Name: default
- Threads: 4
- Run Timeout: 0
- Generate docs on run: Selected
- Commands:
  dbt seed
  dbt run --var 'is_test_run: false'
  dbt test --var 'is_test_run: false'
- Triggers:
  - Run on schedule: Selected
  - Every day
  - Every 6 hours (starting at midnight UTC)
Under the Continuous Integration (CI) tab, we could select Run on Pull Request if we want to.
Click on Save button.
8:57/13:56 (4.4.1) Let’s run this.
Let’s run this by clicking on Run now button.
Run Overview
We should see:
- Run Timing tab showing run steps
- Model Timing tab (unfortunately, the diagram does not appear)
- Artifacts tab showing all generated files
- View Documentation link showing the documentation in a beautiful website
Under the Run Timing tab, below is the log for dbt run --var 'is_test_run: false'.
✅ Invoke dbt run --var 'is_test_run: false'
14:53:46 Running with dbt=1.4.1
14:53:46 Found 5 models, 11 tests, 0 snapshots, 0 analyses, 522 macros, 0 operations, 1 seed file, 2 sources, 0 exposures, 0 metrics
14:53:46
14:53:47 Concurrency: 4 threads (target='default')
14:53:47
14:53:47 1 of 5 START sql table model production.dim_zones .............................. [RUN]
14:53:47 2 of 5 START sql view model production.stg_green_tripdata ...................... [RUN]
14:53:47 3 of 5 START sql view model production.stg_yellow_tripdata ..................... [RUN]
14:53:47 3 of 5 OK created sql view model production.stg_yellow_tripdata ................ [CREATE VIEW (0 processed) in 0.60s]
14:53:48 2 of 5 OK created sql view model production.stg_green_tripdata ................. [CREATE VIEW (0 processed) in 0.85s]
14:53:49 1 of 5 OK created sql table model production.dim_zones ......................... [CREATE TABLE (265.0 rows, 14.2 KB processed) in 1.83s]
14:53:49 4 of 5 START sql table model production.fact_trips ............................. [RUN]
14:54:04 4 of 5 OK created sql table model production.fact_trips ........................ [CREATE TABLE (61.6m rows, 15.1 GB processed) in 15.00s]
14:54:04 5 of 5 START sql table model production.dm_monthly_zone_revenue ................ [RUN]
14:54:07 5 of 5 OK created sql table model production.dm_monthly_zone_revenue ........... [CREATE TABLE (12.0k rows, 13.4 GB processed) in 3.07s]
14:54:07
14:54:07 Finished running 2 view models, 3 table models in 0 hours 0 minutes and 20.26 seconds (20.26s).
14:54:07
14:54:07 Completed successfully
14:54:07
14:54:07 Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Under the Run Timing tab, below is the log for dbt test --var 'is_test_run: false'.
✅ Invoke dbt test --var 'is_test_run: false'
15:59:43 Running with dbt=1.4.1
15:59:43 Found 5 models, 11 tests, 0 snapshots, 0 analyses, 522 macros, 0 operations, 1 seed file, 2 sources, 0 exposures, 0 metrics
15:59:43
15:59:44 Concurrency: 4 threads (target='default')
15:59:44
15:59:44 1 of 11 START test accepted_values_stg_green_tripdata_Payment_type__False___var_payment_type_values_ [RUN]
15:59:44 2 of 11 START test accepted_values_stg_yellow_tripdata_Payment_type__False___var_payment_type_values_ [RUN]
15:59:44 3 of 11 START test not_null_dm_monthly_zone_revenue_revenue_monthly_total_amount [RUN]
15:59:44 4 of 11 START test not_null_stg_green_tripdata_tripid .......................... [RUN]
15:59:44 3 of 11 PASS not_null_dm_monthly_zone_revenue_revenue_monthly_total_amount ..... [PASS in 0.80s]
15:59:44 5 of 11 START test not_null_stg_yellow_tripdata_tripid ......................... [RUN]
15:59:46 2 of 11 PASS accepted_values_stg_yellow_tripdata_Payment_type__False___var_payment_type_values_ [PASS in 2.59s]
15:59:46 6 of 11 START test relationships_stg_green_tripdata_Pickup_locationid__locationid__ref_taxi_zone_lookup_ [RUN]
15:59:47 5 of 11 PASS not_null_stg_yellow_tripdata_tripid ............................... [PASS in 2.26s]
15:59:47 7 of 11 START test relationships_stg_green_tripdata_dropoff_locationid__locationid__ref_taxi_zone_lookup_ [RUN]
15:59:47 1 of 11 PASS accepted_values_stg_green_tripdata_Payment_type__False___var_payment_type_values_ [PASS in 3.09s]
15:59:47 8 of 11 START test relationships_stg_yellow_tripdata_Pickup_locationid__locationid__ref_taxi_zone_lookup_ [RUN]
15:59:49 6 of 11 PASS relationships_stg_green_tripdata_Pickup_locationid__locationid__ref_taxi_zone_lookup_ [PASS in 2.83s]
15:59:49 9 of 11 START test relationships_stg_yellow_tripdata_dropoff_locationid__locationid__ref_taxi_zone_lookup_ [RUN]
15:59:49 8 of 11 PASS relationships_stg_yellow_tripdata_Pickup_locationid__locationid__ref_taxi_zone_lookup_ [PASS in 2.82s]
15:59:49 10 of 11 START test unique_stg_green_tripdata_tripid ........................... [RUN]
15:59:50 7 of 11 PASS relationships_stg_green_tripdata_dropoff_locationid__locationid__ref_taxi_zone_lookup_ [PASS in 3.04s]
15:59:50 11 of 11 START test unique_stg_yellow_tripdata_tripid .......................... [RUN]
15:59:50 4 of 11 PASS not_null_stg_green_tripdata_tripid ................................ [PASS in 6.46s]
15:59:52 9 of 11 PASS relationships_stg_yellow_tripdata_dropoff_locationid__locationid__ref_taxi_zone_lookup_ [PASS in 2.82s]
15:59:53 10 of 11 PASS unique_stg_green_tripdata_tripid ................................. [PASS in 3.06s]
15:59:54 11 of 11 PASS unique_stg_yellow_tripdata_tripid ................................ [PASS in 4.16s]
15:59:54
15:59:54 Finished running 11 tests in 0 hours 0 minutes and 10.41 seconds (10.41s).
15:59:54
15:59:54 Completed successfully
15:59:54
15:59:54 Done. PASS=11 WARN=0 ERROR=0 SKIP=0 TOTAL=11
Documentation
12:23/13:56 (4.4.1) Account Settings
Go back to dbt Cloud and go to Account Settings. Select the project Analytics. Under Artifacts, select the previous job.
Click on Save button. Now, I have the documentation of the project directly in dbt cloud.
0:00/20:00 (4.5.1) Google Data Studio
Looker Studio, formerly Google Data Studio, is an online tool for converting data into customizable informative reports and dashboards introduced by Google on March 15, 2016 as part of the enterprise Google Analytics 360 suite. In May 2016, Google announced a free version of Data Studio for individuals and small teams. See https://en.wikipedia.org/wiki/Looker_Studio.
Go to Looker Studio and follow these steps:
- Make sure you are in the correct Google account.
- Create a data source.
- Select BigQuery.
- Authorize Looker Studio to connect to your BigQuery projects.
- Select the project id dtc-dez.
- Select the dataset production.
- Select the table fact_trips.
- Click on the CONNECT button.
1:50/20:00 (4.5.1) Changes default aggregations
Change the default aggregations: select Sum only for passenger_count.
3:18/20:00 (4.5.1) Creates report
We create 5 charts:
- Between 4:23 and 8:24, we create a time series chart for Amount of trips per day and service type.
- Change the default date range from 1 January 2019 to 31 December 2020.
- Between 8:29 and 9:53, we create a scorecard for Total trips recorded
- Between 9:58 and 11:14, we create a pie chart for Service type distribution.
- Between 11:33 and 12:58, we create a table for Trips per pickup zone.
- Between 13:04 and 18:06, we create a stacked column bar for Trips per month and year.
We should see something like this.
In BigQuery, I run this.
SELECT COUNT(*) FROM `hopeful-summer-375416.production.fact_trips`;
-- 61,616,556
Warning: I may have an error. I should check, since I get 61,616,556 trips recorded while the instructor gets 61,636,714 (see video 4.5.1 at 10:56).
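One way to start investigating the difference (a sketch I would run, not from the course) is to compare counts per service type against the instructor's dashboard:
SELECT service_type, COUNT(*) AS trips
FROM `hopeful-summer-375416.production.fact_trips`
GROUP BY service_type;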
To learn how to use PipeRider together with dbt for detecting changes in model and data, sign up for a workshop here.