Merge pull request #194 from julienrf/add-tutorial

Add tutorial showing how to migrate from DynamoDB
scylladb · Aug 14, 2024 · b9be9fb · b9be9fb
2 parents 0d9d818 + 24eb472
commit b9be9fb
Show file tree

Hide file tree

Showing 19 changed files with 341 additions and 8 deletions.
diff --git a/.github/workflows/tutorial-dynamodb.yaml b/.github/workflows/tutorial-dynamodb.yaml
@@ -0,0 +1,52 @@
+name: "Tests / Tutorials"
+on:
+  push:
+    branches:
+      - master
+  pull_request:
+
+env:
+  TUTORIAL_DIR: docs/source/tutorials/dynamodb-to-scylladb-alternator
+
+jobs:
+  test:
+    name: DynamoDB migration
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Cache Docker images
+        uses: ScribeMD/[email protected]
+        with:
+          key: docker-${{ runner.os }}-${{ hashFiles('docker-compose-tests.yml') }}
+      - uses: actions/setup-java@v4
+        with:
+          distribution: temurin
+          java-version: 8
+          cache: sbt
+      - name: Build migrator
+        run: |
+          ./build.sh
+          mv migrator/target/scala-2.13/scylla-migrator-assembly.jar "$TUTORIAL_DIR/spark-data"
+      - name: Set up services
+        run: |
+          cd $TUTORIAL_DIR
+          docker compose up -d
+      - name: Wait for the services to be up
+        run: |
+          .github/wait-for-port.sh 8000 # DynamoDB
+          .github/wait-for-port.sh 8001 # ScyllaDB Alternator
+          .github/wait-for-port.sh 8080 # Spark master
+          .github/wait-for-port.sh 8081 # Spark worker
+      - name: Run tutorial
+        run: |
+          cd $TUTORIAL_DIR
+          aws configure set region us-west-1
+          aws configure set aws_access_key_id dummy
+          aws configure set aws_secret_access_key dummy
+          sed -i 's/seq 1 40000/seq 1 40/g' ./create-data.sh
+          ./create-data.sh
+          . ./run-migrator.sh
+      - name: Stop services
+        run: |
+          cd $TUTORIAL_DIR
+          docker compose down
diff --git a/docs/source/getting-started/ansible.rst b/docs/source/getting-started/ansible.rst
@@ -35,7 +35,7 @@ The Ansible playbook expects to be run in an Ubuntu environment where the direct
    - Ensure networking is configured to allow you access spark master node via TCP ports 8080 and 4040
    - visit ``http://<spark-master-hostname>:8080``
 
-9. `Review and modify config.yaml <../#configure-the-migration>`_ based whether you're performing a migration to CQL or Alternator
+9. `Review and modify config.yaml <./#configure-the-migration>`_ based whether you're performing a migration to CQL or Alternator
 
    - If you're migrating to ScyllaDB CQL interface (from Apache Cassandra, ScyllaDB, or other CQL source), make a copy review the comments in ``config.yaml.example``, and edit as directed.
    - If you're migrating to Alternator (from DynamoDB or other ScyllaDB Alternator), make a copy, review the comments in ``config.dynamodb.yml``, and edit as directed.

diff --git a/docs/source/getting-started/aws-emr.rst b/docs/source/getting-started/aws-emr.rst
@@ -12,7 +12,7 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c
        --output-document=config.yaml
 
 
-2. `Configure the migration <../#configure-the-migration>`_ according to your needs.
+2. `Configure the migration <./#configure-the-migration>`_ according to your needs.
 
 3. Download the latest release of the Migrator.
 
@@ -67,7 +67,7 @@ This page describes how to use the Migrator in `Amazon EMR <https://aws.amazon.c
 
          spark-submit --deploy-mode cluster --class com.scylladb.migrator.Migrator --conf spark.scylla.config=/mnt1/config.yaml /mnt1/scylla-migrator-assembly.jar
 
-       See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
+       See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.
 
    - Add a Bootstrap action to download the Migrator and the migration configuration:
 

diff --git a/docs/source/getting-started/docker.rst b/docs/source/getting-started/docker.rst
@@ -33,7 +33,7 @@ This page describes how to set up a Spark cluster locally on your machine by usi
 
    http://localhost:8080
 
-5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure <../#configure-the-migration>`_ it according to your needs.
+5. Rename the file ``config.yaml.example`` to ``config.yaml``, and `configure <./#configure-the-migration>`_ it according to your needs.
 
 6. Finally, run the migration.
 
@@ -47,7 +47,7 @@ This page describes how to set up a Spark cluster locally on your machine by usi
 
    The ``spark-master`` container mounts the ``./migrator/target/scala-2.13`` dir on ``/jars`` and the repository root on ``/app``.
 
-   See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
+   See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.
 
 7. You can monitor progress by observing the Spark web console you opened in step 4. Additionally, after the job has started, you can track progress via ``http://localhost:4040``.
 

diff --git a/docs/source/getting-started/index.rst b/docs/source/getting-started/index.rst
@@ -12,6 +12,10 @@ A Spark cluster is made of several *nodes*, which can contain several *workers*
 
 We recommend provisioning at least 2 GB of memory per CPU on each node. For instance, a cluster node with 4 CPUs should have at least 8 GB of memory.
 
+.. caution::
+
+  Make sure the Spark version, the Scala version, and the Migrator version you use are `compatible together <../#compatibility-matrix>`_.
+
 The following pages describe various alternative ways to set up a Spark cluster:
 
 * :doc:`on your infrastructure, using Ansible </getting-started/ansible>`,

diff --git a/docs/source/getting-started/spark-standalone.rst b/docs/source/getting-started/spark-standalone.rst
@@ -21,7 +21,7 @@ This page describes how to set up a Spark cluster on your infrastructure and to
      wget https://github.com/scylladb/scylla-migrator/raw/master/config.yaml.example \
        --output-document=config.yaml
 
-4. `Configure the migration <../#configure-the-migration>`_ according to your needs.
+4. `Configure the migration <./#configure-the-migration>`_ according to your needs.
 
 5. Finally, run the migration as follows from the Spark master node.
 
@@ -32,6 +32,6 @@ This page describes how to set up a Spark cluster on your infrastructure and to
        --conf spark.scylla.config=<path to config.yaml> \
        <path to scylla-migrator-assembly.jar>
 
-   See also our `general recommendations to tune the Spark job <../#run-the-migration>`_.
+   See also our `general recommendations to tune the Spark job <./#run-the-migration>`_.
 
 6. You can monitor progress from the `Spark web UI <https://spark.apache.org/docs/latest/spark-standalone.html#monitoring-and-logging>`_.
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -9,7 +9,7 @@ The ScyllaDB Migrator is a Spark application that migrates data to ScyllaDB. Its
 * It can rename columns along the way.
 * When migrating from DynamoDB it can transfer a snapshot of the source data, or continuously migrate new data as they come.
 
-Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster for a migration.
+Read over the :doc:`Getting Started </getting-started/index>` page to set up a Spark cluster and to configure your migration. Alternatively, follow our :doc:`step-by-step tutorial to perform a migration between fake databases using Docker </tutorials/dynamodb-to-scylladb-alternator/index>`.
 
 --------------------
 Compatibility Matrix
@@ -33,3 +33,4 @@ Migrator  Spark  Scala
   rename-columns
   validate
   configuration
+  tutorials/index
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/architecture.png b/docs/source/tutorials/dynamodb-to-scylladb-alternator/architecture.png
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/create-25-items.sh b/docs/source/tutorials/dynamodb-to-scylladb-alternator/create-25-items.sh
@@ -0,0 +1,27 @@
+#!/usr/bin/env sh
+
+generate_25_items() {
+  local items=""
+  for i in `seq 1 25`; do
+    items="${items}"'{
+      "PutRequest": {
+        "Item": {
+          "id": { "S": "'"$(uuidgen)"'" },
+          "col1": { "S": "'"$(uuidgen)"'" },
+          "col2": { "S": "'"$(uuidgen)"'" },
+          "col3": { "S": "'"$(uuidgen)"'" },
+          "col4": { "S": "'"$(uuidgen)"'" },
+          "col5": { "S": "'"$(uuidgen)"'" }
+        }
+      }
+    },'
+  done
+  echo "${items%,}" # remove trailing comma
+}
+
+aws \
+  --endpoint-url http://localhost:8000 \
+  dynamodb batch-write-item \
+    --request-items '{
+      "Example": ['"$(generate_25_items)"']
+    }' > /dev/null
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/create-data.sh b/docs/source/tutorials/dynamodb-to-scylladb-alternator/create-data.sh
@@ -0,0 +1,14 @@
+#!/usr/bin/env sh
+
+# Create table
+aws \
+  --endpoint-url http://localhost:8000 \
+  dynamodb create-table \
+    --table-name Example \
+    --attribute-definitions AttributeName=id,AttributeType=S \
+    --key-schema AttributeName=id,KeyType=HASH \
+    --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100
+
+# Add items in parallel
+# Change 40000 into 400 below for a faster demo (10,000 items instead of 1,000,000)
+seq 1 40000 | xargs --max-procs=8 --max-args=1 ./create-25-items.sh
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/docker-compose.yaml b/docs/source/tutorials/dynamodb-to-scylladb-alternator/docker-compose.yaml
@@ -0,0 +1,40 @@
+services:
+
+  dynamodb:
+    command: "-jar DynamoDBLocal.jar -sharedDb -inMemory"
+    image: "amazon/dynamodb-local:2.5.2"
+    ports:
+      - "8000:8000"
+    working_dir: /home/dynamodblocal
+
+  spark-master:
+    build: dockerfiles/spark
+    command: master
+    environment:
+      SPARK_PUBLIC_DNS: localhost
+    ports:
+      - 4040:4040
+      - 8080:8080
+    volumes:
+      - ./spark-data:/app
+
+  spark-worker:
+    build: dockerfiles/spark
+    command: worker
+    environment:
+      SPARK_WORKER_CORES: 2
+      SPARK_WORKER_MEMORY: 4G
+      SPARK_WORKER_WEBUI_PORT: 8081
+      SPARK_PUBLIC_DNS: localhost
+    ports:
+      - 8081:8081
+    depends_on:
+      - spark-master
+
+  scylla:
+    image: scylladb/scylla:6.0.1
+    expose:
+      - 8001
+    ports:
+      - "8001:8001"
+    command: "--smp 1 --memory 2048M --alternator-port 8001 --alternator-write-isolation only_rmw_uses_lwt"
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/dockerfiles b/docs/source/tutorials/dynamodb-to-scylladb-alternator/dockerfiles
@@ -0,0 +1 @@
+../../../../dockerfiles
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/index.rst b/docs/source/tutorials/dynamodb-to-scylladb-alternator/index.rst
@@ -0,0 +1,151 @@
+=========================================================
+Migrate from DynamoDB to ScyllaDB Alternator Using Docker
+=========================================================
+
+In this tutorial, you will replicate 1,000,000 items from a DynamoDB table to ScyllaDB Alternator.
+
+All the scripts and configuration files shown on the tutorial can be found in our `GitHub repository <https://github.com/scylladb/scylla-migrator/tree/master/docs/source/tutorials/dynamodb-to-scylladb-alternator>`_.
+
+The whole system is composed of the DynamoDB service, a Spark cluster with a single worker node, and a ScyllaDB cluster with a single node, as illustrated below:
+
+.. image:: architecture.png
+  :alt: Architecture
+  :width: 600
+
+To follow this tutorial, you need to install `Docker <https://docker.com>`_ and the `AWS CLI <https://aws.amazon.com/cli/>`_.
+
+----------------------------------------------------
+Set Up the Services and Populate the Source Database
+----------------------------------------------------
+
+In an empty directory, create the following ``docker-compose.yaml`` file to define all the services:
+
+.. literalinclude:: docker-compose.yaml
+  :language: YAML
+
+Let’s break down this Docker Compose file.
+
+1. We define the DynamoDB service by reusing the official image ``amazon/dynamodb-local``. We use the TCP port 8000 for communicating with DynamoDB.
+2. We define the Spark master and Spark worker services by using a custom image (see below). Indeed, the official Docker images for Spark 3.5.1 only support Scala 2.12 for now, but we need Scala 2.13. We mount the local directory ``./spark-data`` to the Spark master container path ``/app`` so that we can supply the Migrator jar and configuration to the Spark master node. We expose the ports 8080 and 4040 of the master node to access the Spark UIs from our host environment. We allocate 2 cores and 4 GB of memory to the Spark worker node. As a general rule, we recommend allocating 2 GB of memory per core on each worker.
+3. We define the ScyllaDB service by reusing the official image ``scylladb/scylla``. We use the TCP port 8001 for communicating with ScyllaDB Alternator.
+
+Create the ``Dockerfile`` required by the Spark services at path ``./dockerfiles/spark/Dockerfile`` and write the following content:
+
+.. literalinclude:: dockerfiles/spark/Dockerfile
+  :language: Dockerfile
+
+This ``Dockerfile`` installs Java and a Spark distribution. It uses a custom shell script as entry point. Create the file ``./dockerfiles/spark/entrypoint.sh``, and write the following content:
+
+.. literalinclude:: dockerfiles/spark/entrypoint.sh
+  :language: sh
+
+The entry point takes an argument that can be either ``master`` or ``worker`` to control whether to start a master node or a worker node.
+
+Prepare your system for building the Spark Docker image with the following commands, which create the ``spark-data`` directory and make the entry point executable:
+
+.. code-block:: sh
+
+  mkdir spark-data
+  chmod +x entrypoint.sh
+
+Finally, start all the services with the following command:
+
+.. code-block:: sh
+
+  docker compose up
+
+Your system's Docker daemon will download the DynamoDB and ScyllaDB images and build your Spark Docker image.
+
+Check that you can access the Spark cluster UI by opening http://localhost:8080 in your browser. You should see your worker node in the workers list.
+
+.. image:: spark-cluster.png
+  :alt: Spark UI listing the worker node
+  :width: 883
+
+Once all the services are up, you can access your local DynamoDB instance and your local ScyllaDB instance by using the standard AWS CLI. Make sure to configure the AWS CLI as follows before running the ``dynamodb`` commands:
+
+.. code-block:: sh
+
+  # Set dummy region and credentials
+  aws configure set region us-west-1
+  aws configure set aws_access_key_id dummy
+  aws configure set aws_secret_access_key dummy
+  # Access DynamoDB
+  aws --endpoint-url http://localhost:8000 dynamodb list-tables
+  # Access ScyllaDB Alternator
+  aws --endpoint-url http://localhost:8001 dynamodb list-tables
+
+The last preparatory step consists of creating a table in DynamoDB and filling it with random data. Create a file named ``create-data.sh``, make it executable, and write the following content into it:
+
+.. literalinclude:: create-data.sh
+  :language: sh
+
+This script creates a table named ``Example`` and adds 1 million items to it. It does so by invoking another script, ``create-25-items.sh``, which uses the ``batch-write-item`` command to insert 25 items in a single call:
+
+.. literalinclude:: create-25-items.sh
+  :language: sh
+
+Every added item contains an id and five columns, all filled with random data.
+Run the script ``./create-data.sh`` and wait for a couple of hours until all the data is inserted (or change the last line of ``create-data.sh`` to insert fewer items).
+
+---------------------
+Perform the Migration
+---------------------
+
+Once you have set up the services and populated the source database, you are ready to perform the migration.
+
+Download the latest stable release of the Migrator in the ``spark-data`` directory:
+
+.. code-block::
+
+  wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
+    --directory-prefix=./spark-data
+
+Create a configuration file in ``./spark-data/config.yaml`` and write the following content:
+
+.. literalinclude:: spark-data/config.yaml
+  :language: YAML
+
+This configuration tells the Migrator to read the items from the table ``Example`` in the ``dynamodb`` service, and to write them to the table of the same name in the ``scylla`` service.
+
+Finally, start the migration with the following command:
+
+.. literalinclude:: run-migrator.sh
+  :language: sh
+
+This command calls ``spark-submit`` in the ``spark-master`` service with the file ``scylla-migrator-assembly.jar``, which bundles the Migrator and all its dependencies.
+
+In the ``spark-submit`` command invocation, we explicitly tell Spark to use 4 GB of memory; otherwise, it would default to 1 GB only. We also explicitly tell Spark to use 2 cores. This is not really necessary as the default behavior is to use all the available cores, but we set it for the sake of illustration. If the Spark worker node had 20 cores, it would be better to use only 10 cores per executor to optimize the throughput (big executors require more memory management operations, which decrease the overall application performance). We would achieve this by passing ``--executor-cores 10``, and the Spark engine would allocate two executors for our application to fully utilize the resources of the worker node.
+
+The migration process inspects the source table, replicates its schema to the target database if it does not exist, and then migrates the data. The data migration uses the Hadoop framework under the hood to leverage the Spark cluster resources. The migration process breaks down the data to transfer chunks of about 128 MB each, and processes all the partitions in parallel. Since the source is a DynamoDB table in our example, each partition translates into a `scan segment <https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan>`_ to maximize the parallelism level when reading the data. Here is a diagram that illustrates the migration process:
+
+.. image:: process.png
+  :alt: Migration process
+  :width: 700
+
+During the execution of the command, a lot of logs are printed, mostly related to Spark scheduling. Still, you should be able to spot the following relevant lines:
+
+.. code-block:: text
+
+  24/07/22 15:46:13 INFO migrator: ScyllaDB Migrator 0.9.2
+  …
+  24/07/22 15:46:20 INFO alternator: We need to transfer: 2 partitions in total
+  24/07/22 15:46:20 INFO alternator: Starting write…
+  24/07/22 15:46:20 INFO DynamoUtils: Checking for table existence at destination
+
+And when the migration ends, you will see the following line printed:
+
+.. code-block:: text
+
+  24/07/22 15:46:24 INFO alternator: Done transferring table snapshot
+
+During the migration, it is possible to monitor the underlying Spark job by opening the Spark UI available at http://localhost:4040.
+
+.. image:: stages.png
+  :alt: Spark stages
+  :width: 900
+
+`Example of a migration broken down in 6 tasks. The Spark UI allows us to follow the overall progress, and it can also show specific metrics such as the memory consumption of an executor`.
+
+In our example the size of the source table is ~200 MB. In practice, it is common to migrate tables containing several terabytes of data. If necessary, and as long as your DynamoDB source supports a higher read throughput level, you can increase the migration throughput by adding more Spark worker nodes. The Spark engine will automatically spread the workload between all the worker nodes.
+
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/process.png b/docs/source/tutorials/dynamodb-to-scylladb-alternator/process.png
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/run-migrator.sh b/docs/source/tutorials/dynamodb-to-scylladb-alternator/run-migrator.sh
@@ -0,0 +1,9 @@
+docker compose exec spark-master \
+  /spark/bin/spark-submit \
+    --executor-memory 4G \
+    --executor-cores 2 \
+    --class com.scylladb.migrator.Migrator \
+    --master spark://spark-master:7077 \
+    --conf spark.driver.host=spark-master \
+    --conf spark.scylla.config=/app/config.yaml \
+    /app/scylla-migrator-assembly.jar
diff --git a/docs/source/tutorials/dynamodb-to-scylladb-alternator/spark-cluster.png b/docs/source/tutorials/dynamodb-to-scylladb-alternator/spark-cluster.png