
Multi-cloud ETL Pipeline


Objective

  • To run the same ETL code on whichever cloud service you prefer, thus saving time.
  • To develop ETL scripts that work across different environments and clouds.

Note

  • This repository currently supports Azure Databricks + AWS Glue.
  • Azure Databricks can't be configured locally; we can only connect a local IDE to a running cluster in Databricks. It works by pushing code to a GitHub repository and then adding a workflow in Databricks with the URL of the repo & file.
  • For AWS Glue, we set up a local environment using the Glue Docker image or a shell script, then deploy to AWS Glue using GitHub Actions.
  • The "tasks.txt" file contains the details of the transformations done in the main file.

Pre-requisite

  1. Python 3.7 with pip
  2. AWS CLI configured locally
  3. Install Java 8.
    # Make sure to export JAVA_HOME like this:
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home

Quick Start

  1. Clone this repo (for Windows use WSL).

  2. To set up the required libraries and packages locally, run:

    # If your default shell is zsh, use
    make setup-glue-local SOURCE_FILE_PATH=~/.zshrc

    # If your default shell is bash, use
    make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
  3. Source your shell profile using:
    # For zsh
    source ~/.zshrc

    # For bash
    source ~/.bashrc
  4. Install dependencies:
    make install

Change Your Paths

  1. Enter your S3 & ADLS paths in the app/.custom_env file; this file is used by Databricks.

  2. Similarly, we'll create a .env file in the root folder for local Glue. To create the required file, run:

    make glue-demo-env

This command will copy your paths from app/.custom_env to the .env file.

  3. (Optional) If you want to extract data from Kaggle, enter KAGGLE_KEY & KAGGLE_USERNAME in the .env file only. Note: Don't put any sensitive keys in the app/.custom_env file.

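A rough sketch of what the two files might contain; the variable names below are illustrative assumptions, not necessarily the keys this repo's jobs read (check the job scripts for the exact names):

    # app/.custom_env  (Databricks; non-sensitive paths only — names are assumed)
    ADLS_PATH=abfss://container@yourstorageaccount.dfs.core.windows.net/data
    S3_PATH=s3://your-bucket/data

    # .env  (local Glue; created by `make glue-demo-env`, secrets go here only)
    ADLS_PATH=abfss://container@yourstorageaccount.dfs.core.windows.net/data
    S3_PATH=s3://your-bucket/data
    KAGGLE_USERNAME=your-kaggle-username
    KAGGLE_KEY=your-kaggle-key
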
Setup Check

Finally, check if everything is working correctly by running:

    gluesparksubmit jobs/demo.py

Ensure "Execution Complete" is printed.

Make New Jobs

Write your jobs in the jobs folder. Refer to the demo.py file; another example is the jobs/main.py file.

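If it helps as a starting point, here is a minimal sketch of a new job, assuming the same pattern as the demo (a GlueContext wrapping a SparkContext); the file name, environment variable, and final message are assumptions rather than this repo's exact conventions:

    # jobs/my_job.py — minimal Glue job sketch (names and paths are placeholders)
    import os

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext


    def run():
        # Create (or reuse) the Spark context and wrap it in a GlueContext.
        glue_context = GlueContext(SparkContext.getOrCreate())
        spark = glue_context.spark_session

        # Read a CSV from a path configured in .env (placeholder variable name).
        df = spark.read.csv(os.environ["S3_PATH"], header=True)
        df.show(5)

        # Mirrors the message the setup check looks for.
        print("Execution Complete")


    if __name__ == "__main__":
        run()

Run it locally the same way as the demo: gluesparksubmit jobs/my_job.py
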
Deployment

  1. Set up a GitHub Action for AWS Glue. Make sure to set the following secrets in your repository:
    AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY
    S3_BUCKET_NAME
    S3_SCRIPTS_PATH
    AWS_REGION
    AWS_GLUE_ROLE

For the rest of the key-value pairs entered in the .env file, make sure to pass them using the automation/deploy_glue_jobs.sh file (a rough illustration follows).

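As an illustration of that idea only (not the script itself), forwarding .env key-value pairs to a Glue job could look like the boto3 call below; the job name, role ARN, script location, and the assumption that extra values become Glue default arguments are all placeholders to adapt to your setup:

    # Illustrative only — the repo's actual deployment is automation/deploy_glue_jobs.sh.
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # assumed region
    glue.create_job(
        Name="demo",  # assumed job name
        Role="arn:aws:iam::123456789012:role/your-glue-role",  # assumed role ARN
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://your-bucket/scripts/demo.py",  # assumed path
            "PythonVersion": "3",
        },
        # Assumption: extra .env key-value pairs are forwarded as default arguments.
        DefaultArguments={"--KAGGLE_USERNAME": "your-kaggle-username"},
    )
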
  2. For Azure Databricks, create a workflow with the link to your repo & main file. Pass the following parameters with their correct values:
    kaggle_username
    kaggle_token
    storage_account_name
    datalake_access_key

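Inside the Databricks job these parameters are read back at runtime; below is a sketch assuming a notebook-style task that uses widgets (this repo's main file may read them differently):

    # Read the workflow parameters (assumes a Databricks notebook task with widgets).
    kaggle_username = dbutils.widgets.get("kaggle_username")
    kaggle_token = dbutils.widgets.get("kaggle_token")
    storage_account_name = dbutils.widgets.get("storage_account_name")
    datalake_access_key = dbutils.widgets.get("datalake_access_key")

    # A common pattern: register the ADLS access key on the Spark config.
    spark.conf.set(
        f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
        datalake_access_key,
    )
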
Run Tests & Coverage Report

To run the tests and generate a coverage report, run the following commands in the project's root folder:

    make test

    # To see the coverage report
    make coverage-report

References

Glue Programming libraries

Common Errors

'sparkDriver' failed after 16 retries