
[WIP] Describe Spark submit in relation with client-mode (+ hadoop and dependencies) #25

Open · wants to merge 3 commits into master
Conversation

echarles
Member

DISCLAIMER: The state of this document is pre-alpha and [WIP] (Work in Progress). It contains some [TBV] (To Be Validated) and [TBC] (To Be Checked) items and is not intended to be merged into the Spark documentation as is.

The goal of this document is to review the current submission process in cluster mode and introduce a way to also fully support client mode, which is mandatory for exploratory projects (notebooks...). We want to ensure that the lifecycle of the Spark Driver and Executors is correctly understood, so that we can make sound decisions as the architecture evolves. The Shuffle Service and Resource Staging Server are not impacted by these considerations, so we do not cover them.

@foxish
Member

foxish commented Jan 16, 2018

Thanks Eric. cc/ @apache-spark-on-k8s/contributors


@squito left a comment


I'm very new to the k8s integration and still getting up to speed, so these are just some comments from a first read; perhaps not relevant for the target audience here.

```scala
// Via cluster mode, we do not have full config as the hadoop conf dir is not mounted by configmap propagation but only available in the Spark Context.
```

is this supposed to say "client mode"?
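Whichever mode the sentence ends up naming, the point being made is that the Hadoop configuration may only be reachable through the Spark context rather than through a mounted conf directory. A minimal sketch of that access path (the key looked up is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: reading the Hadoop configuration through the SparkContext
// when no HADOOP_CONF_DIR is mounted on the pod filesystem.
val spark = SparkSession.builder().getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Illustrative lookup of a standard Hadoop key.
println(hadoopConf.get("fs.defaultFS"))
```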


It is important to acknowledge this because, during the creation of the manager and schedulers described in the earlier point, some definitions are created and some communications with the Kubernetes cluster (the REST API) are instantiated.

In the `InCluster` case, as pre-requisite to the next steps, we ask the user to define a property `spark.kubernetes.driver.pod.name` with the value being the exact name of the Pod where he is.

"where they are" -- we can be gender neutral


### Submit in client mode from a client with restricted network access

This is not possible. When the executors are started, they need to connect back to the driver running on the restricted client, and those connections cannot be established.

this is a little vague -- can you explain what restrictions don't work? Obviously, if there is no communication out from the client, nothing is going to work. But one might expect that you only need some very limited communication from the client, e.g. only client to driver pod, but not client to executors.
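For context on which connections matter: in client mode the executors dial back to the driver JVM on the client, at endpoints controlled by standard Spark properties. A minimal sketch (host and port values are illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: endpoints that executor pods must be able to reach when the
// driver runs on the client machine (values are illustrative).
val conf = new SparkConf()
  .set("spark.driver.host", "10.0.0.5")          // address executors connect to
  .set("spark.driver.port", "7078")              // driver RPC endpoint
  .set("spark.driver.blockManager.port", "7079") // block manager transfers
```

If the network only allows client-to-driver-pod traffic but blocks the reverse path from the executors, these connections cannot be established.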


# Dependencies

For `cluster-mode`, we have the configuration orchestrator and its Steps, which ensure that when `--jars` or `-Dspark.jars` are defined (the same reasoning applies for `--files`):

this comment really goes for all cluster managers with Spark; it just came to my mind as I was reading this and is something you might want to consider.

one thing which has tripped up a ton of users is understanding when jars become available and how you override various things. As an extreme example, you can't use `--jars` to replace the code of Spark's Executor, as the executor itself has already started by that point. (Nobody would want to do that, but there are plenty of cases where you might want to replace some dependency, and it's tricky because sometimes dependencies are loaded before you get to downloading and loading `--jars`.)
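To make the `--jars` / `spark.jars` equivalence in the quoted text concrete, a minimal sketch (the URIs are illustrative):

```scala
import org.apache.spark.SparkConf

// Sketch: `--jars` on spark-submit and the spark.jars property are two
// spellings of the same setting; likewise `--files` and spark.files.
val conf = new SparkConf()
  .set("spark.jars", "local:///opt/app/lib/dep.jar,https://repo.example.com/extra.jar")
  .set("spark.files", "file:///etc/app/app.conf")
```

As the comment above notes, anything shipped this way is only fetched after the executor JVM starts, so it cannot replace classes loaded during startup.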
