Add docs for GCP Dataproc deployment #4393

Open · wants to merge 5 commits into main
Conversation

@abhi8893 commented Dec 30, 2024

Description

This PR adds docs for the deployment of Kedro projects to GCP Dataproc (Serverless).

What does this guide include? ✅

  • Dataproc Serverless deployment
  • A base design pattern for both dev and prod workflows, intended to let developers design their own deployment workflow
  • A basic guide to GCP resource provisioning

What does this guide NOT include? ❌

  • A full-fledged Dataproc pipeline deployment guide
  • CI/CD workflow guidance
  • GCP best practices, including IAM and networking
  • A Spark performance tuning guide

(WIP) Checklist:

Please note that the current docs are very much a WIP and aren't detailed enough for developers unfamiliar with GCP. I will refine them soon!

  • Add an overall context section
  • Add descriptions for substeps
  • Refine entrypoint kedro run args implementation
  • Add GCP resource links
  • Add FAQs

Review guidance needed

In addition to a review of the overall approach, please provide guidance on the following:

Q1: Kedro entrypoint script arguments

The recommended entrypoint script invokes Kedro's built-in CLI main entrypoint as follows:

With the project wheel (built via kedro package) installed:

import sys
from <PACKAGE_NAME>.__main__ import main

main(sys.argv[1:])

Without the project wheel installed:

import sys

from kedro.framework import cli

cli.main(sys.argv[1:])

However, the implementation in this PR relies on passing arbitrary kedro args from one Python script, deployment/dataproc/serverless/submit_batches.py, to the main entrypoint script, deployment/dataproc/entrypoint.py.
Since I was unable to parse arbitrary args containing dashes (--), I implemented this as a single --kedro-run-args named argument.

I'm requesting a review to find a better implementation here.
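To make the question concrete, here is a minimal sketch of the single-string approach: the whole kedro run argument list travels as one quoted value and is split back into tokens with shlex before being handed to the entrypoint. The function name is hypothetical and only illustrates the pattern, not the PR's exact code.

```python
import argparse
import shlex


def parse_kedro_run_args(argv):
    """Extract kedro run args forwarded as a single --kedro-run-args string."""
    parser = argparse.ArgumentParser()
    # All kedro args arrive as one quoted string, e.g.
    #   --kedro-run-args "--pipeline my_pipe --env prod"
    parser.add_argument("--kedro-run-args", default="")
    args = parser.parse_args(argv)
    # shlex.split respects shell-style quoting, so values with spaces survive
    return shlex.split(args.kedro_run_args)


tokens = parse_kedro_run_args(["--kedro-run-args", "--pipeline my_pipe --env prod"])
print(tokens)  # ['--pipeline', 'my_pipe', '--env', 'prod']
```

An alternative that avoids the quoting entirely is argparse's parse_known_args: its second return value collects every unrecognized token, dashes included, which can then be forwarded to the Kedro entrypoint verbatim.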

Q2: Incorporating spark configs while submitting jobs

Spark configs can be divided into two categories:

  1. Configs that must be set at creation of the SparkContext => these can't be set / overridden in a SparkSession by a Kedro hook (if implemented)
  • Examples: spark.driver.memory, spark.executor.instances
  2. Configs that can be set both at creation of the SparkContext and overridden in any new SparkSession
  • Examples: most Spark SQL configs

Since the proposed implementation does NOT read the project's spark.yml config when submitting the job to Dataproc, some of these configs have to be duplicated in the submission script (outside Kedro).

How do we enable passing these Spark configs at job/batch submission time?
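One possible direction, sketched below under assumptions: load the parsed contents of the project's spark.yml in the submission script and forward them as Dataproc Serverless batch properties, so creation-time configs are applied when the SparkContext is built rather than duplicated by hand. The dict literal stands in for the parsed YAML, and the google.cloud.dataproc_v1 wiring is only indicated in comments; neither is part of this PR.

```python
def spark_conf_to_batch_properties(spark_conf):
    """Flatten spark.yml entries into the string->string mapping that
    Dataproc batch runtime properties expect."""
    return {key: str(value) for key, value in spark_conf.items()}


# Stand-in for the parsed conf/base/spark.yml (e.g. via yaml.safe_load)
spark_conf = {
    "spark.driver.memory": "4g",          # creation-time only
    "spark.executor.instances": 2,        # creation-time only
    "spark.sql.shuffle.partitions": 64,   # session-overridable
}

properties = spark_conf_to_batch_properties(spark_conf)
# These would then be attached to the batch at submission time, roughly:
# batch = dataproc_v1.Batch(
#     runtime_config=dataproc_v1.RuntimeConfig(properties=properties),
#     ...
# )
print(properties["spark.executor.instances"])  # prints "2"
```

This keeps spark.yml as the single source of truth; session-overridable configs could additionally still be applied by a Kedro hook at SparkSession creation.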

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
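For reference, a sign-off is just a trailer in the commit message: git commit -s appends it, and git rebase --signoff adds it retroactively to existing commits (which is what the "Rebase the branch" instructions do). A minimal sketch in a throwaway repo, with a placeholder identity:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Jane Dev"
git config user.email "jane@example.com"
echo hello > file.txt
git add file.txt
# -s appends "Signed-off-by: Jane Dev <jane@example.com>" to the message
git commit -q -s -m "Add file"
git log -1 --format=%B
```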

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes (NA)
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes (NA)
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team (NA)

@abhi8893 abhi8893 marked this pull request as ready for review January 5, 2025 11:20