Commit: upload images
YoItsYong committed Apr 9, 2024
1 parent 66cb366 commit 54b2fa4
Showing 9 changed files with 27 additions and 25 deletions.
38 changes: 24 additions & 14 deletions README.md
@@ -143,9 +143,6 @@ After everything on Google Cloud is set up correctly, we'll configure the `spark
Replace the variables below with the values that match your project.

```
# (Line 13) Replace with path to your Service Account Key
gcs_key = 'path/to/service_account_key.json'

# (Line 30) Replace temp_gcs_bucket with the name of your temporary bucket in GCS
temp_gcs_bucket = 'name_of_temp_gcs_bucket'
spark.conf.set('temporaryGcsBucket', temp_gcs_bucket)
```
@@ -167,7 +164,6 @@ In the `mage` directory, we'll update a couple files before starting up our Mage
- In `requirements.txt`, add:
```
pyspark==3.5.1
pyarrow==15.0.2
```

Next, run the following commands:
@@ -176,18 +172,18 @@
```
cp dev.env .env && rm dev.env
docker compose up
```

Your Mage instance should now be live on `localhost:6789`.

Before moving on, we'll configure Mage to make sure it can connect to our Google Cloud project.
1. In the Mage UI, click on `Files` in the side menu.

< IMAGE >

2. Right-click the project folder on the left, select `Upload files`, and drag and drop your Service Account Key into the Mage window.
3. After the upload is complete, open `io_config.yaml`, scroll down to `GOOGLE_SERVICE_ACC_KEY_FILEPATH`, and enter the path to your key.
4. Remove all of the other Google variables so your file looks like the image below.

< IMAGE >
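
For reference, here's a minimal sketch of what the relevant section of `io_config.yaml` might look like (the path is illustrative, assuming the key was uploaded to the project root):

```
version: 0.1.1
default:
  # Path to the Service Account Key uploaded in step 2 (illustrative)
  GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/home/src/bb200-project/service_account_key.json"
```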

### Create Pipeline to Google Cloud Storage
Our first pipeline will take the `billboard200_albums_data` found [here](https://github.com/YoItsYong/billboard-200-pipeline/raw/main/data/billboard200_albums.csv.gz) and upload it to our Data Lake in Google Cloud Storage.
@@ -198,8 +194,20 @@ Your pipeline should look like this:

< insert image >

This moves the data to Google Cloud Storage and converts the `.csv.gz` to `parquet` using Pandas.
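
As a rough sketch (not the exact Mage block), the conversion step boils down to reading the gzipped CSV with Pandas and writing it back out as Parquet, which is why `pyarrow` is in `requirements.txt`:

```
import pandas as pd

# Read the gzipped CSV directly from the repository
url = 'https://github.com/YoItsYong/billboard-200-pipeline/raw/main/data/billboard200_albums.csv.gz'
df = pd.read_csv(url, compression='gzip')

# Write it back out as Parquet for the Data Lake (uses pyarrow under the hood)
df.to_parquet('bb200_albums.parquet', index=False)
```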


### Install Google Cloud CLI
Within the Mage UI, click on the `Terminal` button on the side menu as shown below.

< insert image >

Our goal is to install the Google Cloud CLI so that we can run `gcloud` commands within Mage.

First, run the commands below in the Mage Terminal to download, extract, and install the CLI.


```
#Download Google Cloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-471.0.0-linux-x86_64.tar.gz
#Extract Google Cloud CLI
tar -xf google-cloud-cli-471.0.0-linux-x86_64.tar.gz
#Install Google Cloud CLI
./google-cloud-sdk/install.sh
```

Next, we'll run the following commands to authorize Mage to make changes to your Google Cloud account and to tell the Google Cloud CLI which project to work with.
```
#Authorize Google Cloud Login
./google-cloud-sdk/bin/gcloud auth login
#Set Google Cloud Project
./google-cloud-sdk/bin/gcloud config set project INSERT_PROJECT_ID
```

_Note: You may have to adjust the paths above depending on your folder structure._


### Create Pipeline to BigQuery
Now that our `.parquet` files are available in GCS, we'll process and transform this data and move it to our Data Warehouse in Google BigQuery.
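
As a sketch of the export step, Mage's standard BigQuery exporter template looks roughly like the following; the table ID below is a placeholder you'd replace with your own project and dataset:

```
from os import path

from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df, **kwargs):
    # Placeholder table ID: replace with your project, dataset, and table
    table_id = 'your-project.bb200.billboard200_albums'
    config_path = path.join(get_repo_path(), 'io_config.yaml')

    BigQuery.with_config(ConfigFileLoader(config_path, 'default')).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table on each run
    )
```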

Binary file added images/bb200_to_bq.png
Binary file added images/bb200_to_gcs.png
Binary file added images/mage_io_config.png
Binary file added images/mage_ui_files.png
Binary file added images/mage_ui_terminal.png
3 changes: 0 additions & 3 deletions mage/bb200_to_bq/data_loaders/load_bb200_gcs.py
@@ -9,9 +9,6 @@
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

# Replace with path to your Google Service Account Key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './PATH/TO/SERVICE_ACCOUNT_KEY.json'

@data_loader
def load_from_google_cloud_storage(*args, **kwargs):
    config_path = path.join(get_repo_path(), 'io_config.yaml')
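    # The Service Account Key path is read from io_config.yaml
    # (GOOGLE_SERVICE_ACC_KEY_FILEPATH), so no environment variable is needed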
2 changes: 1 addition & 1 deletion python/spark_dataproc/spark_dataproc.py
@@ -25,7 +25,7 @@
    .getOrCreate()

#Replace temp_gcs_bucket with the name of your temporary bucket in GCS
temp_gcs_bucket = 'TEMP_GCS_BUCKET_NAME'
spark.conf.set('temporaryGcsBucket', temp_gcs_bucket)

df = spark.read.parquet('gs://bb200/bb200_albums.parquet')
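
# With temporaryGcsBucket set above, a write to BigQuery via the
# spark-bigquery connector (table name illustrative) would look like:
# df.write.format('bigquery').option('table', 'bb200.albums').mode('overwrite').save()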
9 changes: 2 additions & 7 deletions terraform/variables.tf
@@ -1,10 +1,10 @@
variable "credentials" {
description = "My Credentials"
default = "PATH/TO/SERVICE_ACC_KEY.json"
default = "PATH_TO_SERVICE_ACC_KEY.json"
}
variable "project" {
description = "Project"
default = "billboard-200-project"
default = "billboard-200-project-2"
}

variable "region" {
@@ -31,8 +31,3 @@ variable "gcs_storage_class" {
description = "Bucket Storage Class"
default = "STANDARD"
}

variable "dp_cluster_name" {
description = "Dataproc Cluster Name"
default = "music-chart-cluster"
}
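
As a usage sketch, these variables would typically feed the provider block in `main.tf` (assuming a standard Google provider setup):

```
provider "google" {
  credentials = file(var.credentials)
  project     = var.project
  region      = var.region
}
```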
