- Introduction
- Objective
- Technologies
- Data Pipeline
- Dashboard
- Future Improvements
- Reproducing the Project
The Billboard 200 is a weekly music chart that ranks the top-selling (and most-streamed) albums in the United States. Placing at the top of the chart has historically been a marker of a release's success and can often be an important factor in how an artist defines their legacy.
While the Billboard 200 has been around since the 1940s, how we listen to and connect with music has evolved over the years. This project builds an end-to-end data pipeline that ingests existing Billboard 200 data from 1963 to 2019 in order to analyze and visualize trends in charting artists over the years.
To create this pipeline, I used a variety of technologies to extract, transform, and load the data to the cloud, where it can be easily accessed for analysis. These technologies include:
- Infrastructure as Code (IaC): Terraform
- Containerization: Docker
- Cloud: Google Cloud Platform (GCP)
- Data Lake: Google Cloud Storage
- Data Warehouse: Google BigQuery
- Batch Processing: Apache Spark (PySpark)
- Workflow Orchestration: Mage
- Visualization: Google Looker Studio
The data set originates from Components.one, a research platform that makes its data sets publicly available for use. The data from this site comes in the form of a `.db` file, which is most commonly used with SQLite. Because this is not a file type I am used to working with, I opted to convert the `.db` file to a `.csv.gz` for the sake of working more comfortably.
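For reference, the one-time conversion boils down to something like this (a rough sketch; the file names and the `albums` table name are assumptions, so inspect the downloaded `.db` file to confirm what it actually contains):

```python
import sqlite3

import pandas as pd

# Read the albums table out of the SQLite database.
# NOTE: 'billboard-200.db' and the table name 'albums' are placeholders.
with sqlite3.connect('billboard-200.db') as conn:
    albums = pd.read_sql_query('SELECT * FROM albums', conn)

# Write the table back out as a gzip-compressed CSV.
albums.to_csv('bb200_albums.csv.gz', index=False, compression='gzip')
```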
Using Mage, we then orchestrate the ingestion and export of the data to our Data Lake in Google Cloud Storage (GCS). During this step, we also convert the file from `.csv.gz` to `.parquet` so that it takes up less space in GCS.
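Conceptually, that conversion step amounts to something like the following (a minimal sketch; the bucket path is a placeholder, and writing to a `gs://` path with Pandas assumes `pyarrow` and `gcsfs` are installed):

```python
import pandas as pd

# Read the compressed CSV (Pandas infers gzip compression from the extension)
df = pd.read_csv('bb200_albums.csv.gz')

# Write it to the Data Lake as Parquet, a smaller, column-oriented format
df.to_parquet('gs://YOUR-NAME-bb200/bb200_albums.parquet', index=False)
```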
Now that the data is available in GCS, we'll set up a separate pipeline to take this data from GCS, process and transform it using Apache Spark, and export it to BigQuery.
To run Spark within Mage, we have shell commands in place to make sure our Mage instance is connected to Dataproc in Google Cloud. This allows us to run Spark without issue and move some of the processing load to the cloud.
# Read the albums Parquet file from GCS into a Spark DataFrame
df = spark.read.parquet('gs://bb200/bb200_albums.parquet')
Data transformations are performed using Python and SQL within Spark. These transformations include:
- Manage column data types: ensure data types are correctly defined
from pyspark.sql.types import DateType, IntegerType

# Cast the chart date, rank, and track count columns to their proper types
df = df \
    .withColumn('date', df['date'].cast(DateType())) \
    .withColumn('rank', df['rank'].cast(IntegerType())) \
    .withColumn('length', df['length'].cast(IntegerType()))
- Rename columns: the column named 'length' seemed too ambiguous, so I renamed it to 'number_of_tracks'
df = df.withColumnRenamed('length', 'number_of_tracks')
- Skip null values in first row: the first row of data always returns null values. To avoid issues, I skip this row entirely.
# Keep only rows after the first (id > 1), since the first row contains only nulls
df = df.filter(df.id > 1)
- Handle other null values: some columns return null values depending on the type of album that is charting. For example, many re-released albums show null for `number_of_tracks` or `track_length`. To avoid null value errors, I opted to set these to 0 by default.
# Replace nulls in the affected columns with 0
df = df.na.fill(value=0, subset=['number_of_tracks'])
df = df.na.fill(value=0, subset=['track_length'])
After this, the data is automatically loaded to BigQuery where it is then pulled into Google Looker Studio for visualization.
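The load itself goes through the Spark-BigQuery connector (the jar passed to the Dataproc job in the reproduction steps below). Here is a minimal sketch of the write, assuming the temporary GCS bucket has already been configured as shown later:

```python
# Write the transformed DataFrame to BigQuery via the Spark-BigQuery connector.
# 'bb200_data.bb200_albums' matches the dataset and table used throughout this project.
df.write \
    .format('bigquery') \
    .option('table', 'bb200_data.bb200_albums') \
    .mode('overwrite') \
    .save()
```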
The dashboard below pulls from BigQuery and creates charts and tables for further analysis.
The tiles I've selected for this dashboard include:
- The artist with the most chart placements at number 1 (see the sketch after this list)
- The percentage of number 1s each of the top-charting artists has
- The number of artists that chart year over year
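For context, the number 1 placements tile boils down to a simple aggregation. If you wanted to compute it directly in PySpark rather than in Looker Studio, it would look roughly like this (a sketch that assumes the artist column is named `artist`):

```python
from pyspark.sql import functions as F

# Count how many weeks each artist spent at number 1 on the chart
number_ones = (
    df.filter(F.col('rank') == 1)
      .groupBy('artist')
      .agg(F.count('*').alias('weeks_at_number_one'))
      .orderBy(F.col('weeks_at_number_one').desc())
)
number_ones.show(10)
```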
You can also adjust the `YEAR` filter in the corner to see which artists topped the charts in that particular year.
While the current iteration of the pipeline is functional, there are some additional steps and precautions that could be taken to improve overall performance.
- Additional Calculations: There are likely some additional calculated columns that could be incorporated.
- Partition Data in GCS: There is room to partition the data for better organization.
- Implement CI/CD: I haven't implemented any formal CI/CD process for pushing changes to the project, but this is something I will likely explore in the future for a more realistic production environment.
The instructions below will help you set up the project environment and connect to cloud tools to reproduce the pipeline on your local machine.
- Create Google Account: Visit cloud.google.com and create a new Google account if you don't have one already
- Create a new Google Cloud Project called `billboard-200-project`
- Create a Service Account w/ Owner permissions
- Enable the Google BigQuery API
- Enable the Google Cloud Dataproc API
- Enable the Google Compute Engine API
- Create a Dataproc Cluster called `bb200-cluster` with default parameters
- Download the Service Account Key (JSON)
- Install Terraform: Follow instructions here to install Terraform
First, we'll set up our Google Cloud Platform resources using Terraform. This makes it easier to ensure that the necessary storage resources are created.
Start by creating a new folder and copying `main.tf` and `variables.tf` from the Terraform folder. Within `variables.tf`, replace the path to your Service Account Key file and the Google Cloud Storage Bucket name on your machine (as shown below).
# (Line 1)
variable "credentials" {
  description = "My Credentials"
  default     = "./PATH/TO/GCS_KEY.json"
}

...

# (Line 25)
variable "gcs_bucket_name" {
  description = "My Storage Bucket Name"
  default     = "YOUR-NAME-bb200"
}
After adding the path to your Service Account Key, open a terminal, navigate to the Terraform folder on your machine, and run the commands below.
terraform init
terraform plan
terraform apply
The Terraform files should have created a few resources within your Google Cloud Platform account.
Confirm the following resources were created with their respective names.
- Google Cloud Platform Project = billboard-200-project
- Google Cloud Storage Bucket = YOUR-NAME-bb200
- Google BigQuery Data Set = bb200_data
If any of these were not created or were created incorrectly, fix them manually.
After everything on Google Cloud is set up correctly, we'll configure the `spark_dataproc.py` file in `python/spark_dataproc`.
Replace the variables below with the values that match your project.
#(Line 30) Replace temp_gcs_bucket with the name of your temporary bucket in GCS
temp_gcs_bucket = 'name_of_temp_gcs_bucket'
spark.conf.set('temporaryGcsBucket', temp_gcs_bucket)
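For reference, the overall shape of `spark_dataproc.py` is roughly the following (a simplified sketch rather than the actual script; the app name is arbitrary, and the argument names match the Dataproc job submission shown later in this README):

```python
import argparse

from pyspark.sql import SparkSession

# Arguments passed in by the Dataproc job submission
parser = argparse.ArgumentParser()
parser.add_argument('--input_albums', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName('bb200-spark').getOrCreate()

# Temporary bucket used when writing to BigQuery (see Line 30 above)
temp_gcs_bucket = 'name_of_temp_gcs_bucket'
spark.conf.set('temporaryGcsBucket', temp_gcs_bucket)

# Read from GCS, apply the transformations described earlier, and write to BigQuery
df = spark.read.parquet(args.input_albums)
# ... data type casts, renames, and null handling go here ...
df.write.format('bigquery').option('table', args.output).mode('overwrite').save()
```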
After updating these variables, create a new folder in GCS called `code` and copy `spark_dataproc.py` into it.
Next, we'll be setting up Mage within Docker using the Quickstart Guide provided by Mage.
Using your terminal, `cd ..` until you're in your home directory and create a new directory called `mage`. Next, we'll run the command below to clone the quickstart repo into it.
git clone https://github.com/mage-ai/compose-quickstart.git mage \
&& cd mage
In the `mage` directory, we'll update a couple of files before starting up our Mage container:
- Copy the contents of `Dockerfile` and `docker-compose.yml` from this repo and paste them into your Mage files
- In `dev.env`, rename the Mage project to `bb200_project`
- In `requirements.txt`, add:
pyspark==3.5.1
Next, run the following commands:
cp dev.env .env && rm dev.env
docker compose up
Your Mage instance should now be live on `localhost:6789`.
Before moving on, we'll configure Mage to make sure it can connect to our Google Cloud Platform.
- In the Mage UI, click on `Files` in the side menu.
- Right-click the project folder on the left, select `Upload files`, and drag-and-drop your Service Account Key into the Mage window.
- After the upload is complete, open `io_config.yml`, scroll down to `GOOGLE_SERVICE_ACC_KEY_FILEPATH`, and enter the path to your key.
- Remove all of the other Google variables so your file looks like the image below.
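To sanity-check the connection before building any pipelines, you can run a quick test from a Python block or the Mage terminal (a sketch that assumes the `google-cloud-storage` client library is available in the image; the key path is a placeholder for the path you entered in `io_config.yml`):

```python
from google.cloud import storage

# Authenticate with the uploaded Service Account Key and list the buckets it can see
client = storage.Client.from_service_account_json('/home/src/GCS_KEY.json')
print([bucket.name for bucket in client.list_buckets()])
```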
Our first pipeline will take the `billboard200_albums_data` found here and upload it to our Data Lake in Google Cloud Storage. For this, you can copy and paste the `load_bb200_csv.py` and `export_bb200_gcs.py` files in `bb200_to_gcs`.
Your pipeline should look like this:
This moves the data to Google Cloud Storage and converts the `.csv.gz` to `.parquet` using Pandas.
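Roughly speaking, the two blocks boil down to the following (a condensed sketch, not the exact contents of `load_bb200_csv.py` and `export_bb200_gcs.py`; the source path and bucket name are placeholders, and the decorator imports that Mage normally generates are omitted):

```python
import pandas as pd

@data_loader
def load_data(*args, **kwargs):
    # Pull the compressed CSV into a DataFrame (placeholder path/URL)
    return pd.read_csv('PATH/TO/billboard200_albums.csv.gz')

@data_exporter
def export_data(df, *args, **kwargs):
    # Convert to Parquet and write to the GCS Data Lake (requires pyarrow + gcsfs)
    df.to_parquet('gs://YOUR-NAME-bb200/bb200_albums.parquet', index=False)
```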
Within the Mage UI, click on the `Terminal` button in the side menu as shown below. Our goal is to install the Google Cloud CLI so that we can run `gcloud` commands within Mage.
First, run the commands below in the Mage terminal to download, extract, and install the Google Cloud CLI.
#Download Google Cloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-471.0.0-linux-x86_64.tar.gz
#Extract Google Cloud CLI
tar -xf google-cloud-cli-471.0.0-linux-x86_64.tar.gz
#Install Google Cloud CLI
./google-cloud-sdk/install.sh
Next, run the following commands to authorize Mage to access your Google Cloud account and tell the Google Cloud CLI which project to use.
#Authorize Google Cloud Login
./google-cloud-sdk/bin/gcloud auth login
#Set Google Cloud Project
./google-cloud-sdk/bin/gcloud config set project INSERT_PROJECT_ID
Note: You may have to edit the script above depending on your folder structure.
Now that our `.parquet` files are available in GCS, we will process and transform this data and move it over to our Data Warehouse in Google BigQuery.
In the `bb200_to_bq` folder, copy and paste the code to assemble your own pipeline. For the data loader, copy the code in `load_bb200_gcs.py` and add the path to your Service Account Key. For the data exporters, copy the code in `export_bb200_bq.py`.
Your pipeline should look like this:
Before running the pipeline, make sure to replace `INSERT_PROJECT_NAME` in the `--project` flag with the name of your Google Cloud project.
import os

@data_exporter
def export_data(data, *args, **kwargs):
    # Submit the PySpark job to Dataproc; the --jars flag adds the Spark-BigQuery connector.
    # Adjust the gcloud path below if you installed the Cloud SDK in a different folder.
    os.system("""
        ./gsutil/google-cloud-sdk/bin/gcloud dataproc jobs submit pyspark \
            --project=INSERT_PROJECT_NAME \
            --cluster=bb200-cluster \
            --region=us-central1 \
            --jars=gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
            gs://bb200/code/spark_dataproc.py \
            -- \
            --input_albums=gs://bb200/bb200_albums.parquet \
            --output=bb200_data.bb200_albums
    """)
Running this pipeline loads data from GCS and performs the transformations using Apache Spark. The transformed data is then moved to BigQuery, where we can connect to it with Google Looker Studio for visualizations.
Open Google Looker Studio and connect to BigQuery as your data source.
After this is added, you will be able to create tables, charts, and more to visualize the Billboard 200 data.
To avoid accruing costs on Google Cloud Platform, you can run the following command in your system terminal (not the Mage one) to tear down the Terraform resources.
terraform destroy