diff --git a/README.md b/README.md
index a197d55..fda8b60 100644
--- a/README.md
+++ b/README.md
@@ -143,9 +143,6 @@ After everything on Google Cloud is set up correctly, we'll configure the `spark
 Replace the variables below with what matches your project.
 ```
-# (Line 13) Replace with path to your Serivce Account Key
-gcs_key = 'path/to/service_account_key.json'
-
 #(Line 30) Replace temp_gcs_bucket with the name of your temporary bucket in GCS
 temp_gcs_bucket = 'name_of_temp_gcs_bucket'
 spark.conf.set('temporaryGcsBucket', temp_gcs_bucket)
@@ -167,7 +164,6 @@ In the `mage` directory, we'll update a couple files before starting up our Mage
 - In `requirements.txt`, add:
 ```
 pyspark==3.5.1
-pyarrow==15.0.2
 ```
 
 Next, run the following commands:
@@ -176,18 +172,18 @@
 cp dev.env .env && rm dev.env
 docker compose up
 ```
 
-Your Mage instance should now be live on `localhost:6789`. Before moving on, using your IDE, copy the Service Account Key JSON file into `mage/bb200-project` and replace the path found in the `io_config.yaml` file within Mage.
+Your Mage instance should now be live on `localhost:6789`.
 
-### Install Google Cloud CLI
-Within the Mage UI, click on the `Terminal` button on the side menu as shown below.
+Before moving on, we'll configure Mage so it can connect to our Google Cloud project.
+1. In the Mage UI, click on `Files` in the side menu.
 
-< insert image >
+   < insert image >
 
-Our goal is to run the Google Cloud CLI to be able to use `gcloud` scripts within Mage.
+2. Right-click the project folder on the left, select `Upload files`, and drag and drop your Service Account Key into the Mage window.
+3. After the upload is complete, open `io_config.yaml`, scroll down to `GOOGLE_SERVICE_ACC_KEY_FILEPATH`, and enter the path to your key.
+4. Remove all of the other `GOOGLE_*` variables so your file looks like the image below.
 
-We'll start with installing the Google Cloud CLI. First, run the scripts below to download, extract, and install the files in the Mage Terminal.
-
-```
+< insert image >
 
 ### Create Pipeline to Google Cloud Storage
 Our first pipeline will take the `billboard200_albums_data` found [here](https://github.com/YoItsYong/billboard-200-pipeline/raw/main/data/billboard200_albums.csv.gz) and upload it to our Data Lake in Google Cloud Storage.
@@ -198,8 +194,20 @@ Your pipeline should look like this:
 < insert image >
 
-This moves the data to Google Cloud Storage and converts the `.csv.gz` to `parquet` using `PyArrow`.
+This moves the data to Google Cloud Storage and converts the `.csv.gz` file to `.parquet` using Pandas.
+
+
+### Install Google Cloud CLI
+Within the Mage UI, click on the `Terminal` button on the side menu as shown below.
+
+< insert image >
+
+Our goal is to set up the Google Cloud CLI so we can use `gcloud` scripts within Mage.
+We'll start by installing it. Run the scripts below in the Mage Terminal to download, extract, and install the files.
+
+
+```
 #Download Google Cloud CLI
 curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-471.0.0-linux-x86_64.tar.gz
@@ -209,17 +217,19 @@
 tar -xf google-cloud-cli-471.0.0-linux-x86_64.tar.gz
 
 #Install Google Cloud CLI
 ./google-cloud-sdk/install.sh
 ```
+
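+Before running the authorization commands, it's worth a quick check that the install succeeded. A minimal sketch, assuming the SDK was extracted into your current working directory as in the steps above:
+
+```
+#Print the installed Google Cloud CLI version to confirm the install (optional)
+./google-cloud-sdk/bin/gcloud --version
+```
+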
-Next, we'll run the following script to authorize Mage to make changes to your Google Cloud account and make sure the Google Cloud CLI knows which project to make changes to.
+Next, we'll run the following commands to authorize Mage to make changes to your Google Cloud account and to point the Google Cloud CLI at the correct project.
 ```
 #Authorize Google Cloud Login
 ./google-cloud-sdk/bin/gcloud auth login
 
 #Set Google Cloud Project
-./google-cloud-sdk/bin/gcloud config set project billboard-200-project-2
+./google-cloud-sdk/bin/gcloud config set project INSERT_PROJECT_ID
 ```
 
 _Note: You may have to edit the script above depending on your folder structure._
 
+
 ### Create Pipeline to BigQuery
-Now that our `.parquet` files are available in GCS, we will now process and transform this data, move it over to our Data Warehouse in Google BigQuery.
+Now that our `.parquet` files are available in GCS, we'll process and transform the data and move it into our Data Warehouse in Google BigQuery.
diff --git a/images/bb200_to_bq.png b/images/bb200_to_bq.png
new file mode 100644
index 0000000..6b9f03e
Binary files /dev/null and b/images/bb200_to_bq.png differ
diff --git a/images/bb200_to_gcs.png b/images/bb200_to_gcs.png
new file mode 100644
index 0000000..cacd886
Binary files /dev/null and b/images/bb200_to_gcs.png differ
diff --git a/images/mage_io_config.png b/images/mage_io_config.png
new file mode 100644
index 0000000..761c6bc
Binary files /dev/null and b/images/mage_io_config.png differ
diff --git a/images/mage_ui_files.png b/images/mage_ui_files.png
new file mode 100644
index 0000000..976c08b
Binary files /dev/null and b/images/mage_ui_files.png differ
diff --git a/images/mage_ui_terminal.png b/images/mage_ui_terminal.png
new file mode 100644
index 0000000..ebed08a
Binary files /dev/null and b/images/mage_ui_terminal.png differ
diff --git a/mage/bb200_to_bq/data_loaders/load_bb200_gcs.py b/mage/bb200_to_bq/data_loaders/load_bb200_gcs.py
index 898c231..69d1a8f 100644
--- a/mage/bb200_to_bq/data_loaders/load_bb200_gcs.py
+++ b/mage/bb200_to_bq/data_loaders/load_bb200_gcs.py
@@ -9,9 +9,6 @@
 if 'test' not in globals():
     from mage_ai.data_preparation.decorators import test
 
-# Replace with path to your Google Service Account Key
-os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './PATH/TO/SERVICE_ACCOUNT_KEY.json'
-
 @data_loader
 def load_from_google_cloud_storage(*args, **kwargs):
     config_path = path.join(get_repo_path(), 'io_config.yaml')
diff --git a/python/spark_dataproc/spark_dataproc.py b/python/spark_dataproc/spark_dataproc.py
index 755a063..83a149b 100644
--- a/python/spark_dataproc/spark_dataproc.py
+++ b/python/spark_dataproc/spark_dataproc.py
@@ -25,7 +25,7 @@
     .getOrCreate()
 
 #Replace temp_gcs_bucket with the name of your temporary bucket in GCS
-temp_gcs_bucket = 'dataproc-temp-us-central1-692387679074-vn8xcn04'
+temp_gcs_bucket = 'TEMP_GCS_BUCKET_NAME'
 spark.conf.set('temporaryGcsBucket', temp_gcs_bucket)
 
 df = spark.read.parquet('gs://bb200/bb200_albums.parquet')
diff --git a/terraform/variables.tf b/terraform/variables.tf
index 13e6595..d750a52 100644
--- a/terraform/variables.tf
+++ b/terraform/variables.tf
@@ -1,10 +1,10 @@
 variable "credentials" {
   description = "My Credentials"
-  default = "PATH/TO/SERVICE_ACC_KEY.json"
+  default = "PATH_TO_SERVICE_ACC_KEY.json"
 }
 
 variable "project" {
   description = "Project"
-  default = "billboard-200-project"
+  default = "billboard-200-project-2"
 }
 
 variable "region" {
@@ -31,8 +31,3 @@
 variable "gcs_storage_class" {
   description = "Bucket Storage Class"
   default = "STANDARD"
 }
-
-variable "dp_cluster_name" {
-  description = "Dataproc Cluster Name"
-  default = "music-chart-cluster"
-}
\ No newline at end of file
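
A usage note on the `variables.tf` placeholders above: rather than editing the `default` values in place, the same settings can be passed on the command line. A minimal sketch, assuming the commands are run from the repository root and that the key path and project ID are replaced with your own:

```
#Supply the placeholder variables at plan/apply time instead of editing defaults
cd terraform
terraform init
terraform plan -var="credentials=path/to/service_account_key.json" -var="project=billboard-200-project-2"
terraform apply -var="credentials=path/to/service_account_key.json" -var="project=billboard-200-project-2"
```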