Add docs for GCP Dataproc deployment #4393
Description
This PR adds docs for the deployment of Kedro projects to GCP Dataproc (Serverless).
What does this guide include? ✅
What does this guide NOT include? ❌
(WIP) Checklist:
Please note that the current docs are very much a WIP and aren't detailed enough for developers unfamiliar with GCP. I will refine them soon!
Review guidance needed
In addition to a review of the overall approach, please provide guidance on the following:
Q1: Kedro entrypoint script arguments
The recommended entrypoint script invokes Kedro's built-in CLI main entrypoint, as shown in the sketch below, either:
With the Kedro package wheel installed, or
Without the Kedro package wheel installed.
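For illustration, a minimal sketch of the two invocation styles, assuming a packaged project named spaceflights (a placeholder); the exact entrypoint signature can vary between Kedro versions:

```python
# entrypoint_sketch.py -- illustrative only, not the exact scripts from this PR.
import sys

# With the project wheel installed (built via `kedro package`), the packaged
# project exposes a `main` entrypoint that forwards CLI args to `kedro run`.
# "spaceflights" is a placeholder for the real package name.
from spaceflights.__main__ import main

main(sys.argv[1:])

# Without the wheel install, Kedro's framework CLI can be invoked instead,
# provided the working directory is the Kedro project root:
# from kedro.framework.cli import main
# main()
```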
However, the implementation in this PR relies on passing arbitrary Kedro args from one Python script, i.e. deployment/dataproc/serverless/submit_batches.py, to the main entrypoint script deployment/dataproc/entrypoint.py. As I was unable to implement parsing of arbitrary args with dashes (--), I implemented it as a single --kedro-run-args named arg (see the sketch below). Requesting a review to enable a better implementation here.
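As a sketch of the single-named-arg approach described above (the --kedro-run-args name follows the description; the package name and everything else is a placeholder, not the PR's actual code), the entrypoint could split the string back into individual tokens before handing them to Kedro:

```python
# Sketch of how entrypoint.py could accept a single --kedro-run-args string.
import argparse
import shlex


def parse_kedro_run_args() -> list[str]:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--kedro-run-args",
        default="",
        help="All `kedro run` arguments passed as one quoted string, "
             "e.g. --kedro-run-args='--pipeline data_science --env prod'",
    )
    args = parser.parse_args()
    # Split the single quoted string back into individual CLI tokens.
    return shlex.split(args.kedro_run_args)


if __name__ == "__main__":
    kedro_run_args = parse_kedro_run_args()
    # Hand the tokens to the packaged project's entrypoint
    # ("spaceflights" is a placeholder for the real package name).
    from spaceflights.__main__ import main
    main(kedro_run_args)
```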
Q2: Incorporating Spark configs while submitting jobs
Spark configs can be divided into two parts:
- Configs tied to the SparkContext, e.g. spark.driver.memory, spark.executor.instances => these can't be set / overridden in a SparkSession by a Kedro hook (if implemented).
- Configs that can be set on the SparkContext and overridden for any new SparkSession.
Since the proposed implementation does NOT read in the project's spark.yml config when submitting the job to Dataproc, this requires duplicating some of the configs in the submission script (outside Kedro). How do we enable passing of these Spark configs at job/batches submission time? (One possible approach is sketched below.)
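As a sketch of passing such Spark properties at batch submission time with the google-cloud-dataproc client (the project ID, region, bucket paths and property values are placeholders, and this is not the PR's actual submit_batches.py):

```python
# Sketch: submit a Dataproc Serverless batch with Spark properties set at
# submission time, duplicating values that would otherwise live in spark.yml.
from google.cloud import dataproc_v1

project_id = "my-gcp-project"   # placeholder
region = "europe-west1"         # placeholder

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/dataproc/entrypoint.py",  # placeholder
        args=["--kedro-run-args=--pipeline __default__"],
    ),
    # SparkContext-level settings (driver memory, executor instances, ...)
    # must be supplied here, since the entrypoint cannot change them once
    # the session has started.
    runtime_config=dataproc_v1.RuntimeConfig(
        properties={
            "spark.driver.memory": "4g",
            "spark.executor.instances": "2",
        }
    ),
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="kedro-batch-example",  # placeholder
)
result = operation.result()  # blocks until the batch finishes
```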
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- Updated the documentation to reflect the code changes
- (NA) Added a description of this change in the RELEASE.md file
- (NA) Added tests to cover my changes
- (NA) Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team