Commit

Merge branch 'master' into improve-debugging
julienrf authored Aug 27, 2024
2 parents 94d369c + 04aa85c commit f8ec5d6
Showing 24 changed files with 150 additions and 119 deletions.
9 changes: 8 additions & 1 deletion CONTRIBUTING.md
@@ -79,7 +79,14 @@ Follow the procedure documented [here](https://stackoverflow.com/a/15505308/5617

## Publishing

Create a new [GitHub release](https://github.com/scylladb/scylla-migrator/releases), give it a tag name (please see below), a title, and a description, and then click Publish. A workflow will be triggered and will build the application fat-jar and upload it as a release asset.
Create a new [GitHub release](https://github.com/scylladb/scylla-migrator/releases), give it a tag name (please see below), a title, and a description, and then click Publish. A workflow will be automatically triggered and will build the application fat-jar and upload it as a release asset. Finally, _fast-forward-merge_ the branch `master` into the current stable branch:

~~~ sh
git checkout branch-1.0.x
git pull origin branch-1.0.x
git merge --ff-only master
git push origin branch-1.0.x
~~~

Rules for the release tag name:
- Make sure to use tag names like `v1.2.3`, starting with `v` and followed by a [semantic version number](https://semver.org/).
2 changes: 1 addition & 1 deletion ansible/files/start-slave.sh
@@ -1,2 +1,2 @@
source spark-env
$SPARK_HOME/sbin/start-slave.sh spark://$SPARK_MASTER_HOST:7077 $SLAVESIZE
$SPARK_HOME/sbin/start-worker.sh spark://$SPARK_MASTER_HOST:7077 $SLAVESIZE
4 changes: 1 addition & 3 deletions ansible/files/start-spark.sh
@@ -13,6 +13,4 @@ cd $SPARK_HOME/sbin

./start-history-server.sh

./start-slave.sh spark://$SPARK_MASTER_HOST:7077 $SLAVESIZE

./start-shuffle-service.sh
./start-mesos-shuffle-service.sh
2 changes: 1 addition & 1 deletion ansible/files/stop-slave.sh
@@ -1,2 +1,2 @@
source spark-env
$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-worker.sh
3 changes: 1 addition & 2 deletions ansible/files/stop-spark.sh
@@ -7,8 +7,7 @@ source spark-env

cd $SPARK_HOME/sbin

./stop-shuffle-service.sh
./stop-slave.sh spark://$SPARK_MASTER_HOST:7077 $SLAVESIZE
./stop-mesos-shuffle-service.sh

./stop-history-server.sh

6 changes: 3 additions & 3 deletions ansible/scylla-migrator.yml
@@ -43,7 +43,7 @@

- name: Unarchive the installer.
unarchive:
src: awscliv2.zip
src: "{{ home_dir }}/awscliv2.zip"
dest: "{{ home_dir }}"
remote_src: yes

@@ -71,7 +71,7 @@

- name: Extract spark
ansible.builtin.unarchive:
src: "spark-3.5.1-bin-hadoop3-scala2.13.tgz"
src: "{{ home_dir }}/spark-3.5.1-bin-hadoop3-scala2.13.tgz"
dest: "{{ home_dir }}"
remote_src: yes

@@ -83,7 +83,7 @@

- name: Move spark to opt/spark
become: true
command: "sudo mv spark-3.5.1-bin-hadoop3-scala2.13 {{ spark_home }}"
command: "sudo mv {{ home_dir }}/spark-3.5.1-bin-hadoop3-scala2.13 {{ spark_home }}"

- name: Set JAVA_HOME
ansible.builtin.lineinfile:
45 changes: 19 additions & 26 deletions ansible/templates/spark-env-master-sample
@@ -2,38 +2,31 @@
# Master node requires a lot of memory. We allocate 64G for the driver to avoid OOM when
# there are a lot of tasks.
#
# For this example, we're not using any workers on master.
# so CORES = 0 and SPARK_WORKER_INSTANCES = 0.
# EXECUTOR_MEMORY and EXECUTOR_CORES are used in the spark-submit job and controls the memory
# and CPUs per executor. Here are our recommendations to set the number of cores per executor:
# - it should be between 5 and 10
# - the total number of cores per worker node should be a multiple of the number of cores
# per executor
# - it can not be higher than the number of cores on a worker node
# For example, if there are 16 cores on each worker node, we could set EXECUTOR_CORES=8,
# or if there are 2 cores on each worker node, we could set EXECUTOR_CORES=2.
#
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocity of the migration.
# By using multiple worker nodes, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
# We can provision 3 x i4i.2xlarge (8vCPU, 64G)
# We can provision 2 x m7g.2xlarge (8vCPU, 32G)
#
# SPARK_WORKER_INSTANCES = 1
# CORES = 4
# MEMORY = 8G (4 cores per worker * 2G)
# EXECUTOR_CORES=8
# EXECUTOR_MEMORY=16G # (8 cores per worker * 2G)
#
# Start 1 worker per node, or one at a time.
# Start the Spark master node and the Spark worker node, and run the migration.
# - Monitor the pressure on the source system
# - Monitor the velocity of the migration in the spark jobs monito
#
# You can increase the velocity by updating spark-env on worker nodes, increase the
# SPARK_WORKER_INSTANCES = 2, run ./start-worker.sh.
# So long as there are tasks available, the new workers would pick up tasks and increase
# the velocity of the migration.
#
# You should be mindful of over-provisioning the number of workers on an instance.
# Eg. if the node has 8 CPUs, then number of workers on an instance * cores <= 8...
# - Monitor the velocity of the migration in the spark jobs monitoring UI
# - If necessary, provision more worker nodes
##################################################################################################################

export CORES=0 # 0 cores for slaves on master node
export MEMORY=8G # cores per worker x 2G
export SPARK_WORKER_INSTANCES=0 # no workers on master node
export SPARK_MASTER_HOST={{ hostvars.spark_master.ansible_facts.default_ipv4.address }}

export SLAVESIZE="-c $CORES "
export SPARK_MASTER_HOST={{ hostvars.spark_master.ansible_default_ipv4.address }}
export EXECUTOR_CORES=4
# By default, allocate 2GB of memory per core
export EXECUTOR_MEMORY="$((EXECUTOR_CORES * 2))G"
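As a worked illustration of the sizing guidance in the comments above (a sketch only; the 16-core worker node is an assumption, not part of this change):

~~~ sh
# Hypothetical worker nodes with 16 cores each: 8 cores per executor keeps the
# value between 5 and 10 and divides the worker's core count evenly, so each
# worker node runs two executors.
export EXECUTOR_CORES=8
# 2 GB of memory per core => 16G per executor
export EXECUTOR_MEMORY="$((EXECUTOR_CORES * 2))G"
~~~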
45 changes: 10 additions & 35 deletions ansible/templates/spark-env-worker-sample
@@ -1,38 +1,13 @@
##################################################################################################################
# Master node requires a lot of memory. We allocate 64G for the driver to avoid OOM when
# there are a lot of tasks.
#
# For this example, we're not using any workers on master.
# so CORES = 0 and SPARK_WORKER_INSTANCES = 0.
#
# MEMORY is used in the spark-submit job and allocates the memory per executor.
# You can have one or more executors per worker.
#
# By using multiple workers on an instance, we can control the velocity of the migration.
#
# Eg.
# Target system is 3 x i4i.4xlarge (16 vCPU, 128G)
# We can provision 3 x i4i.2xlarge (8vCPU, 64G)
#
# SPARK_WORKER_INSTANCES = 1
# CORES = 4
# MEMORY = 8G (4 cores per worker * 2G)
#
# Start 1 worker per node, or one at a time.
# - Monitor the pressure on the source system
# - Monitor the velocity of the migration in the spark jobs monito
#
# You can increase the velocity by updating spark-env on worker nodes, increase the
# SPARK_WORKER_INSTANCES = 2, run ./start-worker.sh.
# So long as there are tasks available, the new workers would pick up tasks and increase
# the velocity of the migration.
#
# You should be mindful of over-provisioning the number of workers on an instance.
# Eg. if the node has 8 CPUs, then number of workers on an instance * cores <= 8...
# SLAVESIZE is used by the start-slave.sh script. By default, a Spark worker uses all the
# resources of the machine it is started on. If you want to decrease the migration velocity,
# you can set the maximum number of cores and memory that can be used by the worker.
##################################################################################################################

export SPARK_MASTER_HOST={{ hostvars.spark_master.ansible_facts.default_ipv4.address }}
export CORES=4 # number of cores per worker
export SPARK_WORKER_INSTANCES=4 # this is how many workers will be
# started/stopped on the node.
export SLAVESIZE="-c $CORES"
export SPARK_MASTER_HOST={{ hostvars.spark_master.ansible_default_ipv4.address }}

export SLAVESIZE=""
# Optionally, limit the resources available to the Spark worker by uncommenting the following lines
#export MAX_CORES=2 # max number of cores to use on the machine
#export MAX_MEMORY=4G # max amount of memory to use on the machine
#export SLAVESIZE="--cores $MAX_CORES --memory $MAX_MEMORY"
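If you do want to throttle a worker, uncommenting those lines yields a configuration like the following (a sketch; the limits shown are arbitrary assumptions), which the start script then passes on to Spark's `start-worker.sh`:

~~~ sh
# Cap the Spark worker at 2 cores and 4 GB of RAM on this machine.
export MAX_CORES=2
export MAX_MEMORY=4G
export SLAVESIZE="--cores $MAX_CORES --memory $MAX_MEMORY"

# The worker is then started with these limits (see ansible/files/start-slave.sh):
# $SPARK_HOME/sbin/start-worker.sh spark://$SPARK_MASTER_HOST:7077 $SLAVESIZE
~~~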
9 changes: 4 additions & 5 deletions ansible/templates/submit-alternator-job.sh
@@ -6,12 +6,11 @@ source spark-env

mkdir /tmp/savepoints

#

time spark-submit --class com.scylladb.migrator.Migrator \
--master spark://{{ hostvars.spark_master.ansible_facts.default_ipv4.address }}:7077 \
--master spark://{{ hostvars.spark_master.ansible_default_ipv4.address }}:7077 \
--conf spark.eventLog.enabled=true \
--conf spark.scylla.config=/home/ubuntu/scylla-migrator/config.dynamodb.yml \
--conf spark.executor.memory=$MEMORY \
--conf spark.driver.memory=64G \
--executor-memory $EXECUTOR_MEMORY \
--executor-cores $EXECUTOR_CORES \
--driver-memory 4G \
/home/ubuntu/scylla-migrator/scylla-migrator-assembly.jar
6 changes: 3 additions & 3 deletions ansible/templates/submit-cql-job-validator.sh
@@ -7,12 +7,12 @@ source spark-env
mkdir /tmp/savepoints

time spark-submit --class com.scylladb.migrator.Validator \
--master spark://{{ hostvars.spark_master.ansible_facts.default_ipv4.address }}:7077 \
--master spark://{{ hostvars.spark_master.ansible_default_ipv4.address }}:7077 \
--conf spark.eventLog.enabled=true \
--conf spark.scylla.config=config.yaml \
--conf spark.cassandra.input.consistency.level=LOCAL_QUORUM \
--conf spark.cassandra.output.consistency.level=LOCAL_QUORUM \
--num-executors $SPARK_WORKER_INSTANCES \
--executor-memory $MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--executor-cores $EXECUTOR_CORES \
--conf spark.cassandra.connection.localConnectionsPerExecutor=4 \
/home/ubuntu/scylla-migrator/scylla-migrator-assembly.jar
14 changes: 7 additions & 7 deletions ansible/templates/submit-cql-job.sh
@@ -7,29 +7,29 @@ source spark-env
mkdir /tmp/savepoints

time spark-submit --class com.scylladb.migrator.Migrator \
--master spark://{{ hostvars.spark_master.ansible_facts.default_ipv4.address }}:7077 \
--master spark://{{ hostvars.spark_master.ansible_default_ipv4.address }}:7077 \
--conf spark.eventLog.enabled=true \
--conf spark.scylla.config=config.yaml \
--conf spark.cassandra.input.consistency.level=LOCAL_QUORUM \
--conf spark.cassandra.output.consistency.level=LOCAL_QUORUM \
--num-executors $SPARK_WORKER_INSTANCES \
--executor-memory $MEMORY \
--executor-memory $EXECUTOR_MEMORY \
--executor-cores $EXECUTOR_CORES \
--conf spark.cassandra.connection.localConnectionsPerExecutor=4 \
/home/ubuntu/scylla-migrator/scylla-migrator-assembly.jar

#sometimes you will need a tuning for driver memory size
#add this config to above to tune it:
# --conf spark.driver.memory=4G \
# --driver-memory 4G \

# debug example
#$SPARK_HOME/spark-submit --class com.scylladb.migrator.Migrator \
# --driver-java-options -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=65005 \
# --master spark://{{ hostvars.spark_master.ansible_facts.default_ipv4.address }}:7077 \
# --master spark://{{ hostvars.spark_master.ansible_default_ipv4.address }}:7077 \
# --conf spark.scylla.config=config.yaml \
# --conf spark.cassandra.input.consistency.level=LOCAL_QUORUM \
# --conf spark.cassandra.output.consistency.level=LOCAL_QUORUM \
# --num-executors 1 \
# --executor-memory $MEMORY \
# --executor-memory $EXECUTOR_MEMORY \
# --executor-cores $EXECUTOR_CORES \
# --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=64000 -XX:+HeapDumpOnOutOfMemoryError" \
# --conf spark.cassandra.connection.localConnectionsPerExecutor=4 \
# /home/ubuntu/scylla-migrator/scylla-migrator-assembly.jar
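With the JDWP options above, a debugger can then be attached from a separate shell, for instance with `jdb` (a sketch — the host placeholders are assumptions, and the ports match the `address=` values in the example above):

~~~ sh
# Attach to the driver JVM, which the debug example suspends on port 65005 of the master node
jdb -connect com.sun.jdi.SocketAttach:hostname=<spark-master-host>,port=65005

# Attach to an executor JVM, suspended on port 64000 of its worker node
jdb -connect com.sun.jdi.SocketAttach:hostname=<spark-worker-host>,port=64000
~~~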
4 changes: 2 additions & 2 deletions docker-compose-tests.yml
@@ -69,8 +69,8 @@ services:
build: dockerfiles/spark
command: worker
environment:
SPARK_WORKER_CORES: 3
SPARK_WORKER_MEMORY: 1024m
SPARK_WORKER_CORES: 2
SPARK_WORKER_MEMORY: 4G
SPARK_WORKER_WEBUI_PORT: 8081
SPARK_PUBLIC_DNS: localhost
expose:
4 changes: 2 additions & 2 deletions docker-compose.yaml
@@ -17,8 +17,8 @@ services:
build: dockerfiles/spark
command: worker
environment:
SPARK_WORKER_CORES: 3
SPARK_WORKER_MEMORY: 1024m
SPARK_WORKER_CORES: 2
SPARK_WORKER_MEMORY: 4G
SPARK_WORKER_WEBUI_PORT: 8081
SPARK_PUBLIC_DNS: localhost
ports:
5 changes: 3 additions & 2 deletions docs/source/conf.py
@@ -14,12 +14,13 @@
TAGS = []
BRANCHES = [
"master",
"branch-1.0.x"
]
# Sets the latest version.
LATEST_VERSION = "master"
LATEST_VERSION = "branch-1.0.x"
# Set which versions are not released yet.
# Todo: List master when there is more than one version.
UNSTABLE_VERSIONS = []
UNSTABLE_VERSIONS = ["master"]
# Set which versions are deprecated
DEPRECATED_VERSIONS = [""]
# Sets custom build.
2 changes: 1 addition & 1 deletion docs/source/getting-started/ansible.rst
@@ -4,7 +4,7 @@ Set Up a Spark Cluster with Ansible

An `Ansible <https://www.ansible.com/>`_ playbook is provided in the `ansible folder <https://github.com/scylladb/scylla-migrator/tree/master/ansible>`_ of our Git repository. The Ansible playbook will install the pre-requisites, Spark, on the master and workers added to the ``ansible/inventory/hosts`` file. Scylla-migrator will be installed on the spark master node.

The Ansible playbook expects to be run in an Ubuntu environment where the directory ``/home/ubuntu`` already exists.
The Ansible playbook expects to be run in an Ubuntu environment, by a user named ``ubuntu`` (like you get in AWS EC2 Ubuntu-based images).

1. Clone the Migrator Git repository:

5 changes: 3 additions & 2 deletions docs/source/getting-started/aws-emr.rst
@@ -65,9 +65,10 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c

.. code-block:: text
spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar
spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml <... other arguments> /mnt1/scylla-migrator-assembly.jar
See a complete description of the expected arguments to ``spark-submit`` in page :doc:`Run the Migration </run-the-migration>`, and replace “<... other arguments>” above with the appropriate arguments.

See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.

- Add a Bootstrap action to download the Migrator and the migration configuration:

3 changes: 2 additions & 1 deletion docs/source/getting-started/docker.rst
@@ -43,11 +43,12 @@ This page describes how to set up a Spark cluster locally on your machine by usi
--master spark://spark-master:7077 \
--conf spark.driver.host=spark-master \
--conf spark.scylla.config=/app/config.yaml \
<... other arguments> \
/jars/scylla-migrator-assembly.jar
The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` dir on ``/jars`` and the repository root on ``/app``.

See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.
See a complete description of the expected arguments to ``spark-submit`` in page :doc:`Run the Migration </run-the-migration>`, and replace “<... other arguments>” above with the appropriate arguments.

7. You can monitor progress by observing the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``.

20 changes: 3 additions & 17 deletions docs/source/getting-started/index.rst
@@ -8,9 +8,9 @@ Since the Migrator is packaged as a Spark application, you first have to set up
Set Up a Spark Cluster
----------------------

A Spark cluster is made of several *nodes*, which can contain several *workers* (although there is usually just one worker per node). When you start the Migrator, the Spark *driver* looks at the job content and splits it into tasks. It then spawns *executors* on the cluster workers and feed them with the tasks to compute.
A Spark cluster is made of several *nodes*, which can contain several *workers* (although there is usually just one worker per node). When you start the Migrator, the Spark *driver* looks at the job content and splits it into tasks. It then spawns *executors* on the cluster workers and feeds them with the tasks to compute. Since the tasks are processed in parallel, you can increase the possible throughput of the migration by increasing the number of worker nodes. Note that the migration throughput is also limited by the read throughput of the source database and the write throughput of the target database.

We recommend provisioning at least 2 GB of memory per CPU on each node. For instance, a cluster node with 4 CPUs should have at least 8 GB of memory.
We suggest starting with a small cluster containing a single worker node with 5 to 10 CPUs, and increasing the number of worker nodes (or the number of CPUs per node) if necessary, as long as the source and target database are not saturated. We recommend provisioning at least 2 GB of memory per CPU on each node. For instance, a cluster node with 8 CPUs should have at least 16 GB of memory.

.. caution::

@@ -36,21 +36,7 @@ Once you have a Spark cluster ready to run the ``scylla-migrator-assembly.jar``,
Run the Migration
-----------------

The way to start the Migrator depends on how the Spark cluster was installed. Please refer to the page that describes your Spark cluster setup to see how to invoke the ``spark-submit`` command. The remainder of this section describes general options you can use to fine-tune the Migration job.

We recommend using between 5 to 10 CPUs per Spark executor. For instance, if your Spark worker node has 16 CPUs, you could use 8 CPUs per executor (the Spark driver would then allocate two executors on the worker to fully utilize its resources). You can control the number of CPUs per executors with the argument ``--executor-cores`` passed to the ``spark-submit`` command:

.. code-block:: bash
--executor-cores 8
We also recommend using 2 GB of memory per CPU. So, if you provide 8 CPU per executor, you should require 16 GB of memory on the executor. You can control the amount of memory per executor with the argument ``--executor-memory`` passed to the ``spark-submit`` command:

.. code-block:: bash
--executor-memory 16G
As long as your source and target databases are not saturated during the migration, you can increase the migration throughput by adding more worker nodes to your Spark cluster.
Start the migration by invoking the ``spark-submit`` command with the appropriate arguments, as explained in the page :doc:`/run-the-migration`.

--------------
Extra Features
3 changes: 2 additions & 1 deletion docs/source/getting-started/spark-standalone.rst
@@ -30,8 +30,9 @@ This page describes how to set up a Spark cluster on your infrastructure and to
spark-submit --class com.scylladb.migrator.Migrator \
--master spark://<spark-master-hostname>:7077 \
--conf spark.scylla.config=<path to config.yaml> \
<... other arguments> \
<path to scylla-migrator-assembly.jar>
See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.
See a complete description of the expected arguments to ``spark-submit`` in page :doc:`Run the Migration </run-the-migration>`, and replace “<spark-master-hostname>”, “<... other arguments>”, and “<path to scylla-migrator-assembly.jar>” above with appropriate values.

6. You can monitor progress from the `Spark web UI <https://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging>`_.
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -21,6 +21,7 @@ The following table summarizes the required version of Spark and Scala for each
Migrator Spark Scala
======== ===== ======
0.9.x 3.5.x 2.13.x
1.x 3.5.x 2.13.x
======== ===== ======

.. toctree::
@@ -29,6 +30,7 @@ Migrator Spark Scala
getting-started/index
migrate-from-cassandra-or-parquet
migrate-from-dynamodb
run-the-migration
stream-changes
rename-columns
validate