Move docker image management and test entrypoint to Maven #31
Conversation
Also requires the test commands in Jenkins to change.
TODO and not handled by this patch:
@@ -0,0 +1,158 @@
#!/usr/bin/env bash
Why do we need this? We'll need to ensure that it stays in sync with our repo? Or maybe we can get rid of it once we merge this upstream?
We can get rid of this when we merge into upstream. But even aside from that, this file has hardly been changing upstream, and it's theoretically completely isolated from what we use to build Spark (since the shell forks a second Maven process, it's completely independent).
We also have to add it here now because, unlike before, mvn is now the entrypoint into the whole system; only a substep of the build reactor clones the Spark repository down. Aside from that, we also allow a Spark TGZ to be provided, which removes the use of the Spark source code entirely. So no matter what, we're going to need our own Maven here.
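For illustration, a rough sketch of the two ways that entrypoint can be driven (the integration-test phase and the sparkTgz property come from elsewhere in this PR; the path is a placeholder):

```bash
# Default: the reactor clones Spark and builds a distribution itself.
build/mvn clean integration-test

# Alternative: supply a pre-built Spark TGZ so the Spark sources are never cloned.
build/mvn clean integration-test \
  -Dspark.kubernetes.test.sparkTgz=/path/to/spark-distribution.tgz
```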
Instead of copying the content here, could you have a much shorter script that just downloads from https://raw.githubusercontent.com/apache/spark/master/build/mvn and execs it? Something like build/mvn-getter, whose content is:
#!/bin/sh
# Fetch Apache Spark's build/mvn wrapper and run it, forwarding any arguments.
curl -s https://raw.githubusercontent.com/apache/spark/master/build/mvn | bash -s -- "$@"
Hm, I think that works. Will incorporate that.
Done
scripts/parse-arguments.sh
Outdated
#
set -ex
TEST_ROOT_DIR=$(git rev-parse --show-toplevel)
A lot of these defaults are repeated again in pom.xml - the idea is that maintainers of the framework can run the scripts in isolation without using Maven. This can be done both with the top-level setup-integration-test-env.sh script and with any of the sub-components, e.g. clone-spark.sh.
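As a rough sketch of what that looks like (the default values are the ones quoted later in this review; the option loop is simplified, and any flag beyond --deploy-mode and --spark-tgz is an assumption):

```bash
#!/usr/bin/env bash
# Defaults let the script run with no arguments at all; Maven passes the
# same flags explicitly. Simplified sketch, not the full parse-arguments.sh.
DEPLOY_MODE=minikube
IMAGE_REPO="docker.io/kubespark"
SKIP_BUILDING_IMAGES=false
SPARK_TGZ="N/A"

while (( "$#" )); do
  case $1 in
    --deploy-mode) DEPLOY_MODE="$2"; shift ;;
    --image-repo)  IMAGE_REPO="$2";  shift ;;
    --spark-tgz)   SPARK_TGZ="$2";   shift ;;
  esac
  shift
done
```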
Hmm... it seems error-prone that this needs to stay in sync with the args in pom.xml. Is there a better dispatch mechanism than this?
The pom.xml can assume that these are the defaults and that they are correct, but I like having them explicitly enumerated in the pom.xml because the behavior can then be known right from the pom. I also like being able to run the script independently without any arguments. All of this is just for convenience, though, so for the sake of developer sanity we can make one of these ideas less convenient.
I don't have a strong opinion on the best way to move forward.
scripts/parse-arguments.sh
Outdated
DEPLOY_MODE=minikube
IMAGE_REPO="docker.io/kubespark"
SKIP_BUILDING_IMAGES=false
SPARK_TGZ="N/A"
This is less than ideal. It would be more elegant to have SPARK_TGZ= here and then to check in other places with if [[ -z $SPARK_TGZ ]]. The problem is that Maven's exec plugin has to pass in a fixed set of command line arguments, so Maven will always pass in some spark-tgz argument. But we can't pass in an empty value, because we would then end up with commands like setup-integration-test-env --spark-tgz --deploy-mode cloud, which screws up the argument shifting.
An alternative we can employ in general is to allow these variables to be set by environment variables as well and to pass them in as environment variables from Maven. Ideas are welcome here.
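To make that alternative concrete, a minimal sketch of the environment-variable approach (the sentinel and variable name are from this PR; the echo messages are placeholders):

```bash
# Sketch of the env-var alternative: Maven exports SPARK_TGZ, and the script
# falls back to the "N/A" sentinel when the variable is unset or empty.
SPARK_TGZ="${SPARK_TGZ:-N/A}"

if [[ $SPARK_TGZ == "N/A" ]]; then
  echo "No Spark TGZ provided; the sources will be cloned and built instead."
else
  echo "Using pre-built Spark distribution at $SPARK_TGZ."
fi
```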
You could use the Maven exec plugin like this:
<executable>/bin/sh</executable>
<arguments>
  <argument>-c</argument>
  <argument>arbitrary length here ; another command</argument>
</arguments>
The problem is that the arguments themselves have to be fixed. I can't sometimes pass in the --spark-tgz flag and sometimes not pass it.
This gets simpler with the idea that this code always takes --spark-tgz, I think.
scripts/write-docker-tag.sh
Outdated
fi

rm -f $IMAGE_TAG_OUTPUT_FILE
touch $IMAGE_TAG_OUTPUT_FILE
Less than ideal. The problem is that Maven wants to pass the image tag into KubernetesSuite, but Maven doesn't know the tag offhand until it's resolved by the test environment script. We use a UUID for the default image tag because there can be multiple builds running concurrently against the same Spark git hash. But the UUID can't be generated by Maven, meaning the script has to make it and pass that information along somehow, and a temporary file was the best bridge I could think of. Other more elegant ideas would be appreciated here.
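A minimal sketch of that bridge, assuming the target/imageTag.txt location mentioned in the README changes and a uuidgen-based default tag (both are assumptions about the exact script contents):

```bash
#!/usr/bin/env bash
# Generate a unique tag and write it to a file that Maven (and KubernetesSuite)
# can read back after the environment script has run.
IMAGE_TAG_OUTPUT_FILE="${IMAGE_TAG_OUTPUT_FILE:-target/imageTag.txt}"
IMAGE_TAG="${IMAGE_TAG:-$(uuidgen)}"

mkdir -p "$(dirname "$IMAGE_TAG_OUTPUT_FILE")"
rm -f "$IMAGE_TAG_OUTPUT_FILE"
echo -n "$IMAGE_TAG" > "$IMAGE_TAG_OUTPUT_FILE"
```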
Concurrent runs against the same git sha - I'm not quite sure about that. If a new PR is submitted, it would apply that commit on the base and then run the tests, which would give each one a unique sha. If they are the same sha, then invariably they are identical and the image could be reused.
I think there are two git hashes to consider, and neither will work:
- If we're talking about the integration test repository git hash, multiple builds should certainly be able to run against the same commit. Multiple test runs on the same node testing two different commits of Spark but using the same integration test version come to mind.
- If we're talking about the Spark repository git hash, the inverse applies - multiple versions of the integration test running against the same Spark version. Reusing the images is difficult because we need to know when it's safe to GC an image, which is only when all tests holding that given tag are finished. Then we can run into race conditions, etc.
It seems a lot easier to bias towards isolation for correctness and optimize later.
Would it be simpler to just GC any images that come from this but are over a day old?
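For reference, a hedged sketch of what such a GC could look like, assuming the test images carried an identifying label (which nothing in this PR currently adds):

```bash
# Hypothetical cleanup: prune integration-test images older than a day.
# Assumes images are built with a label like "spark-integration-test=true".
docker image prune --all --force \
  --filter "label=spark-integration-test=true" \
  --filter "until=24h"
```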
There's still a potential race with who gets to build the image with the given hash though. Two builds trying to use the same hash need to agree on who builds the image. I don't want to have to deal with that locking at all, so just use completely unique tags everywhere.
Ok!
@@ -1,39 +0,0 @@
#!/bin/bash
This is specific to GKE test suites; I think we still need this setup script, even if the runner goes away.
Ah, I didn't notice the apt-get calls. I think we want these to be assumed by the framework. Installing components on demand in a test framework is scary-ish - not to mention that installation probably depends on the user's OS; e.g. if the user is running on Mac OS X this won't work. We should have separate scripts for these that aren't invoked in the build lifecycle.
Yes, I did not intend for these to be part of the test suite. These are external "setup" steps and the entrypoint for the k8s CI system.
Agreed. It's separate, and the only expected consumer is k8s test infrastructure.
The Kubernetes integration tests now always expect an image to be pre-built, so we no longer build images with Scala code. Maven's pre-integration-test phase invokes a single script to bootstrap the environment with the built images, etc. In the transition we try to keep as much of the same semantics as possible.
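In effect, that phase reduces to a single script invocation along these lines (a sketch; the exact flag list Maven passes is an assumption based on the options discussed in this review):

```bash
# Roughly what pre-integration-test runs: one bootstrap script that builds
# the Docker images and prepares the Spark distribution for the tests.
scripts/setup-integration-test-env.sh \
  --deploy-mode minikube \
  --image-repo docker.io/kubespark \
  --spark-tgz N/A
```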
Force-pushed from ed69645 to 016505d (compare).
README.md
Outdated
## Re-using Docker Images

By default, the test framework will build new Docker images on every test execution. A unique image tag is generated,
and it is written to file at `target/imageTag.txt`. To reuse the images built in a previous run, set:
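The property the README goes on to name is cut off in this excerpt; a hedged guess at the reuse invocation, based on the spark.kubernetes.test.imageDockerTag constant that appears later in this PR, would be something like:

```bash
# Hypothetical reuse of a previous run's images: both property names here are
# assumptions (imageDockerTag from the config object in this PR, the
# skip-building property inferred from the SKIP_BUILDING_IMAGES variable).
build/mvn integration-test \
  -Dspark.kubernetes.test.imageDockerTag=$(cat target/imageTag.txt) \
  -Dspark.kubernetes.test.skipBuildingImages=true
```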
I liked before that it was using the git sha - it made it easy to correlate to the distribution that was just tested.
Discussed in #31 (comment)
README.md
Outdated
By default, the test framework will test the master branch of Spark from [here](https://github.com/apache/spark). You
can specify the following options to test against different source versions of Spark:

* `spark.kubernetes.test.sparkRepo` to the git or http URI of the Spark git repository to clone
gitRepo and gitBranch may be better names for these.
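For context, a usage sketch with the property names as they currently stand in this PR (the invocation itself is assumed; the property values are taken from the README diff quoted below):

```bash
# Test against a fork and branch instead of apache/spark master.
build/mvn integration-test \
  -Dspark.kubernetes.test.sparkRepo=https://github.com/apache-spark-on-k8s/spark \
  -Dspark.kubernetes.test.sparkBranch=new-feature
```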
scripts/build-spark.sh
Outdated
mkdir -p $UNPACKED_SPARK_TGZ
if [[ $SPARK_TGZ == "N/A" ]];
then
./dev/make-distribution.sh --tgz -Phadoop-2.7 -Pkubernetes -DskipTests;
If we omit --tgz, we should just get the unpacked distribution under dist/.
It makes it easier to reuse the distribution with the spark.kubernetes.test.sparkTgz option. That way the configuration can pull double duty, covering both custom Spark distributions and re-use of dynamically built ones.
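To illustrate the double duty, a sketch of the branch in build-spark.sh (the unpack commands are assumptions layered on the quoted lines above, not the exact script contents):

```bash
# Either build a distribution TGZ from the cloned sources, or unpack the one
# supplied via spark.kubernetes.test.sparkTgz. Both end up in the same place.
mkdir -p "$UNPACKED_SPARK_TGZ"
if [[ $SPARK_TGZ == "N/A" ]]; then
  ./dev/make-distribution.sh --tgz -Phadoop-2.7 -Pkubernetes -DskipTests
  tar -xzf spark-*.tgz -C "$UNPACKED_SPARK_TGZ" --strip-components=1
else
  tar -xzf "$SPARK_TGZ" -C "$UNPACKED_SPARK_TGZ" --strip-components=1
fi
```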
I think for Jenkins we will want to use the pre-built distributions exclusively, because we don't want to keep minikube locked during the building of the distribution. If the Spark distribution builds "inside" of the spark-integration test, then it will be harder to express this partial locking. As it currently stands, one Jenkins job creates the Spark distribution and another Jenkins job does the integration test, which makes the locking easy.
package object config {
  val KUBERNETES_TEST_DOCKER_TAG_SYSTEM_PROPERTY = "spark.kubernetes.test.imageDockerTag"
  val DRIVER_DOCKER_IMAGE = "spark.kubernetes.driver.container.image"
These need to be cleaned up I think.
Changed the base to
if [[ $SPARK_TGZ == "N/A" ]];
then
$SCRIPTS_DIR/clone-spark.sh "$@";
Instead of passing all the args around, we should parse the args here, then set the right env vars/args and call the other scripts, maybe? I think it's an antipattern for each script to be calling parse-args and seeing unrelated args.
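A minimal sketch of that alternative, assuming the variable names used elsewhere in these scripts (parse once at the top level, export the results, and let sub-scripts read the environment):

```bash
# Parse arguments a single time, then dispatch without forwarding "$@".
source "$SCRIPTS_DIR/parse-arguments.sh" "$@"
export DEPLOY_MODE IMAGE_REPO SKIP_BUILDING_IMAGES SPARK_TGZ

if [[ $SPARK_TGZ == "N/A" ]]; then
  "$SCRIPTS_DIR/clone-spark.sh"   # no longer needs the raw argument list
fi
```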
README.md
Outdated
-Dspark.kubernetes.test.sparkRepo=https://github.com/apache-spark-on-k8s/spark \
-Dspark.kubernetes.test.sparkBranch=new-feature

Additionally, you can use a pre-built Spark distribution. In this case, the repository is not cloned at all, and no
I don't understand the motivation of why we should have the integration test code ever try to clone and build spark. Why not just always depend on a pre-built spark distribution? (Sorry if I missed the reason for this)
I've found it a lot easier locally to be able to run a single command that both fetches and builds the Spark distribution, and then has the integration test use that distribution. We provide the optionality for the local development scenario, but we can still provide the TGZ specifically in Jenkins via spark.kubernetes.test.sparkTgz.
@ssuchter @foxish approach has changed - the
The reason we have to do this was discussed in sig-big-data - Jenkins will only use the Maven build without building the Spark distribution, so that the Maven build doesn't lock the Minikube resource while building Spark, which takes a long time.
rerun integration tests please
1 similar comment
rerun integration tests please
@kimoonkim the Couple of questions:
@ssuchter let's just have quotes everywhere.
@kimoonkim the
Sorry, forgot to mention what the specific command should be in Jenkins without using the dev script:
rerun integration tests please
@kimoonkim integration tests aren't running when I comment; can you trigger the build?
I think this build 98 actually ran upon your comment, given the timing of the build start. I believe there is a quiet period of a dozen seconds, so maybe it did not trigger right away.
Right. It gets stopped and deleted at the end. The Pepperdata Jenkins cluster has only one node, and it still has jobs running old integration tests from our fork. If we keep minikube running, that will interfere with the old integration tests. Stopping minikube is the best compromise for this setup, IMO.
Hm, but the error is this:
But the tests work locally for me. I wonder if Java's
Jenkins can switch to this command, but what is the advised way of building Docker images now that the Scala code does not build them?
@kimoonkim this PR puts us in an in-between state where Maven is building the Docker images, but we expect the external Jenkins job to build the Spark distribution. If we build using Maven, the reactor will still build the images for us.
I see. I'll try adding the minikube location to the PATH used by Maven.
rerun integration tests please
We're working on that same issue (missing docker) in slack sig-big-data
rerun integration tests please
1 similar comment
rerun integration tests please
Closes #30.