
Use GraphStorm on SageMaker

GraphStorm can run on Amazon SageMaker to leverage SageMaker's ML DevOps capabilities.

Prerequisites

In order to use GraphStorm on Amazon SageMaker, users need access to the following AWS services: Amazon SageMaker, to run training and inference jobs; Amazon ECR, to host the GraphStorm Docker image; and Amazon S3, to store graph data and model artifacts.

Setup GraphStorm SageMaker Docker Image

GraphStorm uses SageMaker’s BYOC (Bring Your Own Container) mode. Therefore, before launching GraphStorm on SageMaker, two steps are required to set up a GraphStorm SageMaker Docker image.

Step 1: Build a SageMaker-compatible Docker image

Note

  • Please make sure your account has an access key (AK) and secret access key (SK) configured to authenticate access to AWS services.

  • For more details of Amazon ECR operations via the CLI, users can refer to the Using Amazon ECR with the AWS CLI document.

First, on a Linux machine, configure a Docker environment by following the Docker documentation.

In order to use the SageMaker base Docker image, users need to run the following command to authenticate with Amazon ECR and pull SageMaker images.

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

Then, clone the GraphStorm source code and build a SageMaker-compatible GraphStorm Docker image from source with the following commands:

git clone https://github.com/awslabs/graphstorm.git

cd /path-to-graphstorm/docker/

bash /path-to-graphstorm/docker/build_docker_sagemaker.sh /path-to-graphstorm/ <DOCKER_TYPE> <DOCKER_NAME> <DOCKER_TAG>

The build_docker_sagemaker.sh script takes four arguments:

  1. path-to-graphstorm (required) is the absolute path of the graphstorm folder where you cloned the GraphStorm source code. For example, the path could be /code/graphstorm.

  2. DOCKER_TYPE (optional) is the device type of the Docker image to be built. There are two options: cpu for building CPU-compatible images, and gpu for building NVIDIA GPU-compatible images. Default is gpu.

  3. DOCKER_NAME (optional) is the name assigned to the Docker image to be built. Default is graphstorm.

Warning

In order to upload the GraphStorm SageMaker Docker image to Amazon ECR, users need to define the <DOCKER_NAME> to include the ECR URI string, <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/, e.g., 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm.

  4. DOCKER_TAG (optional) is the tag assigned to the Docker image to be built. Default is sm.

Once build_docker_sagemaker.sh completes successfully, a Docker image named <DOCKER_NAME>:<DOCKER_TAG>, e.g., 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm, will exist in the local repository. It can be listed by running:

docker image ls

Step 2: Upload Docker Images to Amazon ECR Repository

Because SageMaker relies on Amazon ECR to access customers’ own Docker images, users need to upload the Docker image built in Step 1 to their own ECR repository.

The following command authenticates your account to your own ECR repository via the AWS CLI.

aws ecr get-login-password --region <REGION> | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com

Please replace <REGION> and <AWS_ACCOUNT_ID> with your own account information, consistent with the values used in Step 1.

In addition, users need to create an ECR repository in the specified <REGION> whose name is <DOCKER_NAME> WITHOUT the ECR URI prefix, e.g., graphstorm.
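For example, the repository can be created with the AWS CLI; the repository name and region below are illustrative and should match the values used in Step 1.

# Create an ECR repository named "graphstorm" in us-east-1
aws ecr create-repository --repository-name graphstorm --region us-east-1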

Then use the below command to push the built GraphStorm Docker image to your own ECR repository.

docker push <DOCKER_NAME>:<DOCKER_TAG>

Please replace the <DOCKER_NAME> and <DOCKER_TAG> with the actual Docker image name and tag, e.g., 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm.
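If the image was built under a plain local name rather than the full ECR URI, it can be retagged before pushing; a minimal sketch, assuming the local image is graphstorm:sm:

# Retag the local image with the full ECR URI so that docker push targets your ECR repository
docker tag graphstorm:sm <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/graphstorm:sm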

Run GraphStorm on SageMaker

There are two ways to run GraphStorm on SageMaker.

  • Run with the Amazon SageMaker service. In this mode, users use GraphStorm’s tools to submit SageMaker API calls that request the SageMaker service to start new training or inference instances running GraphStorm code. These API calls can be submitted from a properly configured machine without GPUs (e.g., a c5.xlarge instance). This is the standard way to run GraphStorm experiments on large graphs and to deploy GraphStorm on SageMaker in a production environment.

  • Run with Docker Compose in a local environment. In this mode, users do not call the SageMaker service; instead, they use Docker Compose to emulate the SageMaker execution environment locally on a Linux instance with GPUs. This is mainly for model developers and testers to simulate running GraphStorm on SageMaker.

Run GraphStorm with Amazon SageMaker service

To run GraphStorm with the Amazon SageMaker service, users should set up an instance with the SageMaker library installed and GraphStorm’s SageMaker tools copied.

  1. Use the below command to install SageMaker.

pip install sagemaker

  2. Copy GraphStorm SageMaker tools. Users can clone the GraphStorm repository using the following command, or copy the sagemaker folder to the instance.

git clone https://github.com/awslabs/graphstorm.git

Prepare graph data

Unlike GraphStorm’s Standalone mode and Distributed mode, which rely on a local disk or a shared file system to store the partitioned graph, SageMaker uses Amazon S3 as the shared data storage for distributing partitioned graphs and the configuration YAML file.

This tutorial uses the same three-partition OGB-MAG graph and Link Prediction task as introduced in the Partition a Graph section of the Use GraphStorm in a Distributed Cluster tutorial. After generating the partitioned OGB-MAG graph, use the following commands to upload it and the configuration YAML file to an S3 bucket.

aws s3 cp --recursive /data/ogbn_mag_lp_3p s3://<PATH_TO_DATA>/ogbn_mag_lp_3p
aws s3 cp /graphstorm/training_scripts/gsgnn_lp/mag_lp.yaml s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml

Please replace <PATH_TO_DATA> and <PATH_TO_TRAINING_CONFIG> with your own S3 bucket URIs.
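To confirm the uploads before launching any jobs, the same locations can be listed back; a minimal check:

# Verify that the partitioned graph and the YAML file are in S3
aws s3 ls --recursive s3://<PATH_TO_DATA>/ogbn_mag_lp_3p/
aws s3 ls s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml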

Launch training

Launching GraphStorm training on SageMaker is similar to launching it in the Standalone mode and the Distributed mode, except for three differences:

  • The launch commands are located in the graphstorm/sagemaker folder.

  • Users need to provide AWS service-related information in the command.

  • All paths for saving models, embeddings, and prediction results should be specified as S3 locations using the S3 related arguments.

Users can use the following commands to launch a GraphStorm Link Prediction training job with the OGB-MAG graph.

cd /path-to-graphstorm/sagemaker/

python3 launch/launch_train.py \
        --image-url <AMAZON_ECR_IMAGE_URI> \
        --region <REGION> \
        --entry-point run/train_entry.py \
        --role <ROLE_ARN> \
        --instance-count 3 \
        --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
        --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
        --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
        --graph-name ogbn-mag \
        --task-type link_prediction \
        --lp-decoder-type dot_product \
        --num-layers 1 \
        --fanout 10 \
        --hidden-size 128 \
        --backend gloo \
        --batch-size 128

Please replace <AMAZON_ECR_IMAGE_URI> with the <DOCKER_NAME>:<DOCKER_TAG> image uploaded in Step 2, e.g., 888888888888.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sm; replace <REGION> with the region where the ECR image repository is located, e.g., us-east-1; and replace <ROLE_ARN> with the ARN of an IAM role that has the SageMaker execution role, e.g., "arn:aws:iam::<ACCOUNT_ID>:role/service-role/AmazonSageMaker-ExecutionRole-20220627T143571".

Because we are using a three-partition OGB-MAG graph, we need to set --instance-count to 3 in this command.

The trained model artifact will be stored in the S3 location provided through the --model-artifact-s3 argument. You can use the following command to check the model artifacts after the training completes.

aws s3 ls s3://<PATH_TO_SAVE_TRAINED_MODEL>/

Launch inference

Users can use the following command to launch a GraphStorm Link Prediction inference job on the OGB-MAG graph.

python3 launch/launch_infer.py \
        --image-url <AMAZON_ECR_IMAGE_URI> \
        --region <REGION> \
        --entry-point run/infer_entry.py \
        --role <ROLE_ARN> \
        --instance-count 3 \
        --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
        --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
        --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
        --output-emb-s3 s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/ \
        --output-prediction-s3 s3://<PATH_TO_SAVE_PREDICTION_RESULTS> \
        --graph-name ogbn-mag \
        --task-type link_prediction \
        --num-layers 1 \
        --fanout 10 \
        --hidden-size 128 \
        --backend gloo \
        --batch-size 128

Note

Different from the training command, in the inference command the value of the --model-artifact-s3 argument needs to be the path to a saved model. By default, trained model artifacts are saved under an S3 path that includes the training epoch, or the epoch plus an iteration number, e.g., s3://models/epoch-0-iter-999.
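For example, one can list the training output to locate a checkpoint prefix and pass it to the inference launcher; the epoch-0-iter-999 prefix below is illustrative and depends on your training run.

# Find the saved checkpoint prefixes under the training output location
aws s3 ls s3://<PATH_TO_SAVE_TRAINED_MODEL>/

# Then pass the concrete checkpoint path, e.g.:
#   --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/epoch-0-iter-999/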

As the outcome of the inference command, the generated node embeddings will be uploaded to s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/. For node classification/regression or edge classification/regression tasks, users can use --output-prediction-s3 to specify where the prediction results are saved.

Users can use the following commands to check the corresponding outputs:

aws s3 ls s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/
aws s3 ls s3://<PATH_TO_SAVE_PREDICTION_RESULTS>/

Run GraphStorm SageMaker with Docker Compose

This section describes how to launch Docker Compose jobs that emulate a SageMaker training execution environment. This can be used to develop and test GraphStorm model training and inference on SageMaker locally.

If users have never worked with Docker Compose before, the official description provides a great intro:

Hint

Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

We will use this capability to launch multiple worker instances locally that are configured to “look like” SageMaker training instances and communicate over a virtual network created by Docker Compose. This way, our test environment will be as close to a real SageMaker distributed job as we can get, without needing to launch SageMaker jobs, or launch and configure multiple EC2 instances, when developing features.

Get Started

To run GraphStorm SageMaker with Docker Compose, we need to set up a local Linux instance as follows.

  1. Use the below command to install SageMaker.

pip install sagemaker

  2. Clone the GraphStorm code.

git clone https://github.com/awslabs/graphstorm.git

  3. Add GraphStorm to the PYTHONPATH variable.

export PYTHONPATH=/PATH_TO_GRAPHSTORM/python:$PYTHONPATH

  4. Build a SageMaker-compatible Docker image following Step 1.

  5. Follow the Docker Compose documentation to install Docker Compose.

Generate a Docker Compose file

A Docker Compose file is a YAML file that tells Docker which containers to spin up and how to configure them. To launch the services defined in a Docker Compose file, we can use docker compose -f docker-compose.yaml up. This will launch the containers and execute their entry points.

To emulate a SageMaker distributed execution environment based on the previously built Docker image (suppose the image is named graphstorm:sm), you would need a Docker Compose file that resembles the following:

version: '3.7'

networks:
  gsf:
    name: gsf-network

services:
  algo-1:
    image: graphstorm:sm
    container_name: algo-1
    hostname: algo-1
    networks:
      - gsf
    command: 'xxx'
    environment:
      SM_TRAINING_ENV: '{"hosts": ["algo-1", "algo-2", "algo-3", "algo-4"], "current_host": "algo-1"}'
      WORLD_SIZE: 4
      MASTER_ADDR: 'algo-1'
      AWS_REGION: 'us-west-2'
    ports:
      - 22
    working_dir: '/opt/ml/code/'

  algo-2: [...]

Some explanations of the above elements (see the official Compose docs for more details):

  • image: Specifies the Docker image that will be used for launching the container. In this case, the image is graphstorm:sm, which should correspond to the previously built Docker image.

  • environment: Sets the environment variables for the container.

  • command: Specifies the entry point, i.e., the command that will be executed when the container launches. In the example above, 'xxx' is a placeholder for the actual command, e.g., a path to an entry point script.

To help users generate the Docker Compose file automatically, GraphStorm provides a Python script, generate_sagemaker_docker_compose.py, that builds the Compose file for users.

Note

The script uses the PyYAML library. Please use the below command to install it.

pip install pyyaml

This Python script takes four required arguments that determine the Docker Compose file that will be generated:

  • --aws-access-key-id: The AWS access key ID for accessing S3 data within Docker.

  • --aws-secret-access-key: The AWS secret access key for accessing S3 data within Docker.

  • --aws-session-token: The AWS session token used for accessing S3 data within Docker.

  • --num-instances: The number of instances we want to launch. This determines the number of algo-x service entries the generated Compose file ends up with.

The rest of the arguments are passed on to sagemaker_train.py or sagemaker_infer.py:

  • --task-type: Task type.

  • --graph-data-s3: S3 location of the input graph.

  • --graph-name: Name of the input graph.

  • --yaml-s3: S3 location of the YAML file for training and inference.

  • --custom-script: Custom training script provided by customers to run custom training logic. This should be a path to the Python script within the Docker image.

  • --output-emb-s3: S3 location to store GraphStorm-generated node embeddings. This is an inference-only argument.

  • --output-prediction-s3: S3 location to store prediction results. This is an inference-only argument.

Run GraphStorm on Docker Compose for Training

First, use the following command to generate a Compose YAML file for Link Prediction training on the OGB-MAG graph.

python3 generate_sagemaker_docker_compose.py \
        --aws-access-key-id <AWS_ACCESS_KEY> \
        --aws-secret-access-key <AWS_SECRET_ACCESS_KEY> \
        --aws-session-token <AWS_SESSION_TOKEN> \
        --num-instances 3 \
        --image <GRAPHSTORM_DOCKER_IMAGE> \
        --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
        --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
        --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL> \
        --graph-name ogbn-mag \
        --task-type link_prediction \
        --num-layers 1 \
        --fanout 10 \
        --hidden-size 128 \
        --backend gloo \
        --batch-size 128

The above command will create a Docker Compose file named docker-compose-<task-type>-<num-instances>-train.yaml, which we can then use to launch the job.

Because our Docker Compose file uses a Docker network named gsf-network for inter-container communication, users need to run the following command to create the network before launching Docker Compose.

docker network create "gsf-network"

Then, use the following command to run the Link Prediction training on the OGB-MAG graph.

docker compose -f docker-compose-link_prediction-3-train.yaml up

Running the above command will launch 3 instances of the image, configured with the command and environment variables that emulate a SageMaker execution environment, and run the sagemaker_train.py script.
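To verify the emulated environment, one can inspect a running container from a second terminal; a sketch, assuming the generated Compose file name from above and the algo-1 service it defines:

# Print the SageMaker-style environment variables inside the algo-1 container
docker compose -f docker-compose-link_prediction-3-train.yaml exec algo-1 env | grep -E 'SM_|MASTER_ADDR|WORLD_SIZE'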

Note

The containers actually interact with S3, so the provided AWS access key, secret access key, and session token should be valid for accessing the S3 bucket.
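If you only have long-term credentials, one way to obtain a session token is through AWS STS; a minimal sketch, whose output fields (AccessKeyId, SecretAccessKey, SessionToken) map to the generator's --aws-* arguments:

# Request temporary credentials valid for one hour
aws sts get-session-token --duration-seconds 3600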

Run GraphStorm on Docker Compose for Inference

The generate_sagemaker_docker_compose.py script can build a Compose file for the inference task using the same arguments as for training, plus one additional argument, --inference. The below command creates the Compose file for Link Prediction inference on the OGB-MAG graph.

python3 generate_sagemaker_docker_compose.py \
        --aws-access-key-id <AWS_ACCESS_KEY> \
        --aws-secret-access-key <AWS_SECRET_ACCESS_KEY> \
        --aws-session-token <AWS_SESSION_TOKEN> \
        --num-instances 3 \
        --image <GRAPHSTORM_DOCKER_IMAGE> \
        --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
        --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
        --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL> \
        --graph-name ogbn-mag \
        --task-type link_prediction \
        --num-layers 1 \
        --fanout 10 \
        --hidden-size 128 \
        --backend gloo \
        --batch-size 128 \
        --inference

The command will create a Docker Compose file named docker-compose-<task-type>-<num-instances>-infer.yaml. We can then use the same kind of command to spin up the inference job.

docker compose -f docker-compose-link_prediction-3-infer.yaml up

Clean Up

To save computing resources, users can run the below command to clean up the Docker Compose environment.

docker compose -f <DOCKER_COMPOSE_FILE> down
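Since the gsf-network Docker network was created manually earlier, it can also be removed once all containers are down:

# Remove the manually created Docker network
docker network rm gsf-network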