
Multi-cloud ETL Pipeline


Objective

  • To run the same ETL code on whichever cloud service you prefer, thus saving time.
  • To develop ETL scripts that work across different environments and clouds.

Note

  • This repository currently supports Azure Databricks + AWS Glue.
  • Azure Databricks can't be configured locally; we can only connect a local IDE to a running cluster in Databricks. It works by pushing code to a GitHub repository and then adding a workflow in Databricks with the URL of the repo & file.
  • For AWS Glue, we set up a local environment using the Glue Docker image or a shell script, then deploy to AWS Glue using GitHub Actions.
  • The "tasks.txt" file contains the details of the transformations done in the main file.

Pre-requisite

  1. Python 3.7 with pip
  2. AWS CLI configured locally
  3. Install Java 8.
    # Make sure to export JAVA_HOME like this:
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home

Quick Start

  1. Clone this repo (for Windows use WSL).

  2. To set up the required libraries and packages locally, run:

    # If your default shell is zsh, use
    make setup-glue-local SOURCE_FILE_PATH=~/.zshrc

    # If your default shell is bash, use
    make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
  3. Source your shell profile using:
    # For zsh
    source ~/.zshrc

    # For bash
    source ~/.bashrc
  4. Install dependencies:
    make install

Change Your Paths

  1. Enter your S3 & ADLS paths in the app/.custom_env file; this file is used by Databricks.

  2. Similarly, we'll create a .env file in the root folder for local Glue. To create the required file, run:

    make glue-demo-env

This command will copy your paths from app/.custom_env to the .env file.

  3. (Optional) If you want to extract data from Kaggle, enter KAGGLE_KEY & KAGGLE_USERNAME in the .env file only. Note: Don't put any sensitive keys in the app/.custom_env file.

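A rough sketch of what the two files might contain; the variable names below are illustrative assumptions, not necessarily the keys this repo's jobs read (check the job scripts for the exact names):

    # app/.custom_env  (Databricks; non-sensitive paths only — names are assumed)
    ADLS_PATH=abfss://container@yourstorageaccount.dfs.core.windows.net/data
    S3_PATH=s3://your-bucket/data

    # .env  (local Glue; created by `make glue-demo-env`, secrets go here only)
    ADLS_PATH=abfss://container@yourstorageaccount.dfs.core.windows.net/data
    S3_PATH=s3://your-bucket/data
    KAGGLE_USERNAME=your-kaggle-username
    KAGGLE_KEY=your-kaggle-key
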
Setup Check

Finally, check if everything is working correctly by running:

    gluesparksubmit jobs/demo.py

Ensure "Execution Complete" is printed.

Make New Jobs

Write your jobs in the jobs folder. Refer to the demo.py file; another example is the jobs/main.py file.

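If it helps as a starting point, here is a minimal sketch of a new job, assuming the same pattern as the demo (a GlueContext wrapping a SparkContext); the file name, environment variable, and final message are assumptions rather than this repo's exact conventions:

    # jobs/my_job.py — minimal Glue job sketch (names and paths are placeholders)
    import os

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext


    def run():
        # Create (or reuse) the Spark context and wrap it in a GlueContext.
        glue_context = GlueContext(SparkContext.getOrCreate())
        spark = glue_context.spark_session

        # Read a CSV from a path configured in .env (placeholder variable name).
        df = spark.read.csv(os.environ["S3_PATH"], header=True)
        df.show(5)

        # Mirrors the message the setup check looks for.
        print("Execution Complete")


    if __name__ == "__main__":
        run()

Run it locally the same way as the demo: gluesparksubmit jobs/my_job.py
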
Deployment

  1. Set up a GitHub Action for AWS Glue. Make sure to set the following secrets in your repository:
    AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY
    S3_BUCKET_NAME
    S3_SCRIPTS_PATH
    AWS_REGION
    AWS_GLUE_ROLE

For the rest of the key-value pairs entered in the .env file, make sure to pass them using the automation/deploy_glue_jobs.sh file (a rough illustration follows).

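As an illustration of that idea only (not the script itself), forwarding .env key-value pairs to a Glue job could look like the boto3 call below; the job name, role ARN, script location, and the assumption that extra values become Glue default arguments are all placeholders to adapt to your setup:

    # Illustrative only — the repo's actual deployment is automation/deploy_glue_jobs.sh.
    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # assumed region
    glue.create_job(
        Name="demo",  # assumed job name
        Role="arn:aws:iam::123456789012:role/your-glue-role",  # assumed role ARN
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://your-bucket/scripts/demo.py",  # assumed path
            "PythonVersion": "3",
        },
        # Assumption: extra .env key-value pairs are forwarded as default arguments.
        DefaultArguments={"--KAGGLE_USERNAME": "your-kaggle-username"},
    )
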
  2. For Azure Databricks, create a workflow with the link to your repo & main file. Pass the following parameters with their correct values:
    kaggle_username
    kaggle_token
    storage_account_name
    datalake_access_key

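Inside the Databricks job these parameters are read back at runtime; below is a sketch assuming a notebook-style task that uses widgets (this repo's main file may read them differently):

    # Read the workflow parameters (assumes a Databricks notebook task with widgets).
    kaggle_username = dbutils.widgets.get("kaggle_username")
    kaggle_token = dbutils.widgets.get("kaggle_token")
    storage_account_name = dbutils.widgets.get("storage_account_name")
    datalake_access_key = dbutils.widgets.get("datalake_access_key")

    # A common pattern: register the ADLS access key on the Spark config.
    spark.conf.set(
        f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
        datalake_access_key,
    )
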
Run Tests & Coverage Report

To run the tests and generate a coverage report, run the following commands in the project's root folder:

    make test

    # To see the coverage report
    make coverage-report

References

Glue Programming libraries

Common Errors

'sparkDriver' failed after 16 retries