Add docs for GCP Dataproc deployment #4393
Description
This PR adds docs for the deployment of Kedro projects to GCP Dataproc (Serverless).
What does this guide include? ✅
What does this guide NOT include? ❌
(WIP) Checklist:
Please note that the current docs are very much a WIP and aren't detailed enough for developers unfamiliar with GCP. I will refine them soon!
Review guidance needed
In addition to a review of the overall approach, please provide guidance on the following:
Q1: Kedro entrypoint script arguments
The recommended entrypoint script invokes Kedro's built-in CLI main entrypoint, as shown in the sketch below, either:
With the Kedro package wheel installed, or
Without the Kedro package wheel installed.
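For illustration, a minimal sketch of the two invocation styles, assuming a packaged project named spaceflights (a placeholder); the exact entrypoint signature can vary between Kedro versions:

```python
# entrypoint_sketch.py -- illustrative only, not the exact scripts from this PR.
import sys

# With the project wheel installed (built via `kedro package`), the packaged
# project exposes a `main` entrypoint that forwards CLI args to `kedro run`.
# "spaceflights" is a placeholder for the real package name.
from spaceflights.__main__ import main

main(sys.argv[1:])

# Without the wheel install, Kedro's framework CLI can be invoked instead,
# provided the working directory is the Kedro project root:
# from kedro.framework.cli import main
# main()
```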
However, the implementation in this PR relies on passing arbitrary Kedro args from one Python script, i.e. deployment/dataproc/serverless/submit_batches.py, to the main entrypoint script deployment/dataproc/entrypoint.py. As I was unable to implement parsing of arbitrary args with dashes (--), I implemented it as a single --kedro-run-args named arg (see the sketch below). Requesting a review to enable a better implementation here.
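As a sketch of the single-named-arg approach described above (the --kedro-run-args name follows the description; the package name and everything else is a placeholder, not the PR's actual code), the entrypoint could split the string back into individual tokens before handing them to Kedro:

```python
# Sketch of how entrypoint.py could accept a single --kedro-run-args string.
import argparse
import shlex


def parse_kedro_run_args() -> list[str]:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--kedro-run-args",
        default="",
        help="All `kedro run` arguments passed as one quoted string, "
             "e.g. --kedro-run-args='--pipeline data_science --env prod'",
    )
    args = parser.parse_args()
    # Split the single quoted string back into individual CLI tokens.
    return shlex.split(args.kedro_run_args)


if __name__ == "__main__":
    kedro_run_args = parse_kedro_run_args()
    # Hand the tokens to the packaged project's entrypoint
    # ("spaceflights" is a placeholder for the real package name).
    from spaceflights.__main__ import main
    main(kedro_run_args)
```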
Q2: Incorporating Spark configs while submitting jobs
Spark configs can be divided into two parts:
- Configs tied to the SparkContext, e.g. spark.driver.memory, spark.executor.instances => these can't be set / overridden in a SparkSession by a Kedro hook (if implemented).
- Configs that can be set on the SparkContext and overridden for any new SparkSession.
Since the proposed implementation does NOT read in the project's spark.yml config when submitting the job to Dataproc, this requires duplicating some of the configs in the submission script (outside Kedro). How do we enable passing of these Spark configs at job/batches submission time? (One possible approach is sketched below.)
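As a sketch of passing such Spark properties at batch submission time with the google-cloud-dataproc client (the project ID, region, bucket paths and property values are placeholders, and this is not the PR's actual submit_batches.py):

```python
# Sketch: submit a Dataproc Serverless batch with Spark properties set at
# submission time, duplicating values that would otherwise live in spark.yml.
from google.cloud import dataproc_v1

project_id = "my-gcp-project"   # placeholder
region = "europe-west1"         # placeholder

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/dataproc/entrypoint.py",  # placeholder
        args=["--kedro-run-args=--pipeline __default__"],
    ),
    # SparkContext-level settings (driver memory, executor instances, ...)
    # must be supplied here, since the entrypoint cannot change them once
    # the session has started.
    runtime_config=dataproc_v1.RuntimeConfig(
        properties={
            "spark.driver.memory": "4g",
            "spark.executor.instances": "2",
        }
    ),
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="kedro-batch-example",  # placeholder
)
result = operation.result()  # blocks until the batch finishes
```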
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- Updated the documentation to reflect the code changes
- (NA) Added a description of this change in the RELEASE.md file
- (NA) Added tests to cover my changes
- (NA) Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team