ACM SIGCOMM 2021 TUTORIAL: Network-Accelerated Distributed Deep Learning

Run Your First Distributed Training Job

  1. Pick an untaken pair of instances from this sheet and put your name in the Taken By column.
  2. ssh into the master instance (Primary IP) with the private key net-accel-dl-tutorial.pem available in the Slack channel:
ssh -i /path/to/net-accel-dl-tutorial.pem ubuntu@$PRIMARY_IP

If you see WARNING: UNPROTECTED PRIVATE KEY FILE!, please do chmod 400 /path/to/net-accel-dl-tutorial.pem if you are on a Unix-like system. Refer to this guide if you use Windows.

  3. Run the script:
~/scripts/1_tf_mnist.sh
# the first run may take up to 3 minutes
# subsequent runs should take about 40 seconds

The script calls horovodrun; you can find detailed instructions on how to run it, as well as how to integrate Horovod into existing training scripts, in the Horovod documentation.
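If you have not used Horovod before, the integration typically boils down to a handful of calls. Below is a minimal, hedged sketch using the Keras-style Horovod API, with a placeholder model and optimizer; the actual tensorflow_mnist.py may differ in its details:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # 1. initialize Horovod

# 2. pin each process to a single GPU (skipped on CPU-only machines)
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Flatten(), tf.keras.layers.Dense(10)])  # placeholder model
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())  # 3. scale the learning rate by the number of workers
opt = hvd.DistributedOptimizer(opt)               # 4. wrap the optimizer so gradients are averaged

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

# 5. broadcast initial variables from rank 0 so all workers start identical
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(train_dataset, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)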

  4. Inspect the training script and identify Horovod code:
vim ~/scripts/tensorflow_mnist.py

This is Python code that trains an MNIST classifier using Horovod. Can you identify where the Horovod API is being used?

SwitchML Hands-on

This hands-on will give you a taste of what it's like to configure and run SwitchML. It will then walk you through using SwitchML's API in code and compiling your own SwitchML applications. Finally, as an additional exercise, you will write your own prepostprocessor to control how data is loaded from user buffers into messages/packets and unloaded back.

For simplicity, and since not all of our attendees have a testbed with multiple nodes and a Tofino switch connecting them, we will be using a "dummy" backend that sleeps to simulate communication and is simple enough to run on a single node on your personal laptop.

Without further ado, let's get started!

Prerequisites

Tutorial time is very limited, so to get the most out of it and avoid delays due to a slow internet connection or a slow machine, we encourage you to download the needed materials and prepare your setup in advance by following these two simple steps:

  1. Download and install docker on your machine.
  2. Pull the switchml sigcomm image from docker hub by running: docker pull omaralama/switchml:sigcomm21_exercise in your machine's command line.

We also encourage you to browse SwitchML's documentation and source code a bit to get a general feel for the project, although this is not required to benefit from the tutorial or to complete the exercises.

Ex.1 Compile SwitchML and Microbenchmark (4 min)

1.1 Start the docker container

Alright, the first thing to do, if you haven't already, is to start a container from the image we pulled earlier. To do that, run:

docker run -it --name switchml omaralama/switchml:sigcomm21_exercise 

This starts a container from the image you pulled earlier and drops you into a new shell inside that container.

1.2 Compile the SwitchML client library

Next, navigate to where we have pre-downloaded SwitchML for you, then go to the dev_root directory, where all of the source code is located (we will list all paths starting from this directory).

cd /home/switchml/dev_root

Next, let us compile the SwitchML client library, which is required to build all examples, benchmarks, and framework integrations. To do that, go to the client_lib directory and run make.

cd client_lib
make

This builds the SwitchML client library; since we passed the Makefile neither DPDK=1 nor RDMA=1, it will only include the dummy backend. Verify that the library has been generated by listing the contents of dev_root/build/lib.

ls ../build/lib/

It should include libswitchml-client.a.
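For this tutorial the dummy backend is all we need. On a real testbed you would instead build the client library with one of the hardware backends by passing the corresponding flag to make, for example:

make RDMA=1   # or DPDK=1; requires the matching drivers and a properly configured switch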

1.3 Compile the SwitchML allreduce benchmark

Now that we have switchml's client library compiled, we can go ahead and compile the microbenchmark. Navigate to dev_root/benchmarks and run make.

cd ../benchmarks
make

This should generate an allreduce_benchmark binary in dev_root/build/bin.

To verify that everything is working navigate to dev_root/build/bin and print the benchmark's help message.

cd ../build/bin
./allreduce_benchmark --help

If you do not encounter any errors and you see the benchmark's help message, then you are all set.

Ex.2: Configure and Run Microbenchmark (10 min)

For this exercise we want to run the microbenchmark with different configurations and arguments and see how that affects performance.

2.1. Copy configuration file to working directory

First, for any SwitchML application to run, we need a configuration file in our working directory. A default configuration file, containing only the options applicable to that build, is generated with every SwitchML build. You can find it at /home/switchml/dev_root/build/switchml.cfg.

Go ahead and copy it to our bin folder: cp ../switchml.cfg .

2.2. Edit configuration

Now let us open up the copied config and perform the following edits:

  1. Set num_worker_threads=1 to start out with 1 worker thread.
  2. Set packet_numel=131072 to start out with a big packet, which reduces processing overhead on the dummy backend. (Remember, nothing is actually sent with the dummy backend, so we can make this arbitrarily large.)

The container has vim and nano at hand to perform these edits, but feel free to quickly install your favorite editor if you like.
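After the edits, the relevant lines in your copied switchml.cfg should look roughly like this (excerpt only; leave the surrounding options as generated):

num_worker_threads=1
packet_numel=131072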

2.3. Run Benchmark

Now that we have our configuration ready, we can run the benchmark with different arguments.

As with the first exercise, feel free to print the help message to read through all available arguments.

For now, we will simply change the number of elements in each tensor that we want to aggregate and observe the effect on performance, using floating-point tensors as an example.

Run:

./allreduce_benchmark --tensor-type float --tensor-numel 10000

At this point you should have some throughput statistics printed out. If you don't, please contact one of the organizers.

Make a note of the mean throughput, then run the benchmark 3 more times with different numbers of elements and see how that affects your performance.

Record all your results in a table.
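For example, a simple skeleton like the following (fill in your own measurements):

| --tensor-numel | num_worker_threads | Mean throughput |
|----------------|--------------------|-----------------|
| 10000          | 1                  |                 |
| ...            | ...                |                 |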

2.4. Try out different configurations and arguments

Go back to step 2.2 and edit your configuration, increasing the number of worker threads from 1 to 4 and then to 8. Test each configuration with different numbers of elements. How does increasing the number of threads affect performance? Any trends? Compare with your previously recorded results.

That's it!

You have successfully run SwitchML. The RDMA and DPDK backends are just as easy to run, assuming you have set up the switch side correctly. Each backend has its own simple configuration.

If time permits feel free to play around with other configurations or arguments, record and plot your results, and deduce relationships and correlations.

Ex.3: Write your own SwitchML application (11 min)

In this exercise, we will create our own small SwitchML benchmark from scratch. The objective is to perform the following:

  1. Allocate a local buffer.
  2. Start the SwitchML context and get its handle.
  3. Submit a synchronous allreduce job to reduce the local buffer's data.
  4. Report how much time that took.
  5. Submit an asynchronous allreduce job to reduce the same local buffer's data.
  6. Report how much time that took.
  7. Wait for the job to actually finish.
  8. Report how much time that took.
  9. Stop the context.

3.1 Writing the application

To start with, let us create a new directory in dev_root/examples with your name.

mkdir your_name
cd your_name
pwd

pwd should print /home/switchml/dev_root/examples/your_name

Now we will create our main.cc file using vim main.cc and start writing our application. Copy the following template into your new main.cc file as a starting point. (You may need to copy the code into a plain-text editor first to get rid of any formatting.)

// TODO(1): Include switchml's context header


#include <stdio.h>
#include <chrono>

typedef std::chrono::steady_clock clk;

int main() {
    std::chrono::time_point<clk> begin, end;
    
    // TODO(2): Get a reference to the context
    // (First time this is called the context will be created)


    printf("Hello from %s, this is %s's SwitchML benchmark.\n",
           "<Your country>", "<Your name>");

    // TODO(3): Start the context
    // This loads configuration and starts all worker threads 


    printf("I am allocating a tensor\n");
    float* our_data = new float[1<<24]; // 64 MB tensor. Feel free to change this.

    begin = clk::now();

    // TODO(4) Submit synchronous allreduce job.


    end = clk::now();

    printf("Synchronous allreduce call took %ld ns.\n",
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());

    begin = clk::now();

    // TODO(5) Submit asynchronous allreduce job and store its job handle.


    end = clk::now();

    printf("Asynchronous allreduce call took %ld ns.\n",
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());

    begin = clk::now();

    // TODO(6) Wait for the job you submitted to finish.


    end = clk::now();

    printf("Asynchronous allreduce waiting for the job took %ld ns.\n",
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count());

    // TODO(7) Stop the switchml context

    delete [] our_data;
}

To continue the exercise, consult SwitchML's Context API documentation and the hello_world example located in dev_root/examples/hello_world.

3.2 Compile the application

Once you are ready to test your application, it is time for compilation. SwitchML has multiple libraries that it needs to link to at the final compilation stage. Thus, we have to use the examples makefile and edit it slightly to include our new application.

  1. Open up dev_root/examples/Makefile

  2. Locate the following lines at the end of the file:

    hello_world: $(BINDIR)/hello_world
    
    $(BINDIR)/hello_world: $(BINDIR)
        $(CXX) $(CXXFLAGS) $(INC) $(SRCDIR)/hello_world/main.cc $(LDFLAGS) -o $(BINDIR)/hello_world 
    
  3. Duplicate these lines, replacing all hello_world strings with the name of your application.

    your_name: $(BINDIR)/your_name
    
    $(BINDIR)/your_name: $(BINDIR)
        $(CXX) $(CXXFLAGS) $(INC) $(SRCDIR)/your_name/main.cc $(LDFLAGS) -o $(BINDIR)/your_name
    

That's it. Save your file and quit. You can now compile your application by running make your_name in the dev_root/examples directory.

Assuming your application was written correctly the first time, the compilation should finish without any errors. However, we all know that this is a far-fetched assumption :D. So when an error pops up, go back, fix it, and try again.

Once the compilation completes with no errors, make sure your application's binary has been generated in dev_root/build/bin.

3.3 Run your application

Just like in the second exercise, copy the SwitchML configuration over if it's not already there, and run your application. If you survive segmentation faults on your first run, then congratulations. If not, go back, debug, remove the old binary, recompile, and test again.

Once you start seeing meaningful results, record them and compare synchronous calls with asynchronous calls and waits.

If you are ahead, then you can go back to your source code and add more jobs, change the size of the buffer, or add verification code. Then compile again and test your changes.

Kudos!

Congratulations! You have successfully compiled, configured, and run SwitchML. You have also written, compiled, and tested your own small benchmark.

We hope that the tutorial was useful to you and that it showed you how easy it is to use and build on top of SwitchML.

Don't hesitate to contact the SwitchML team if you have any questions.

OmniReduce Hands-on

Prerequisites and Setup

1. Software setup

You need recent versions of Vagrant (and a supported VM provider such as VirtualBox) installed on your local computer.

2. VM creation

This part of the tutorial requires running a pair of VMs. To simplify the process, we ship a pre-imaged VM box using Vagrant.

First, you will need to clone our git repository.

$ git clone https://github.com/sands-lab/sigcomm21_omnireduce_tutorial.git

To set up the vagrant box, simply cd into sigcomm21_omnireduce_tutorial and run the following command on your host:

# The first `vagrant up` invocation fetches the vagrant box.
# It is likely that this takes some time, so launch this command ASAP!
$ vagrant up

Once done, two virtual machines, node1 and node2, are created and started.

3. Add RDMA device

We use Soft-RoCE to support RDMA, and we have already pre-configured it within the VMs. However, you still need to add an RDMA link of the rxe type for the network device on both node1 and node2. First, open two terminals on your host and run the following commands to SSH into node1 and node2:

# In terminal 1, SSH node1
$ vagrant ssh node1
# In terminal 2, SSH node2
$ vagrant ssh node2

From now on, we assume that, unless otherwise stated, all commands are run inside the vagrant boxes.

After that, run the following commands inside both node1 and node2:

# Check the network interface name, it should be eth1
$ ifconfig
# Add an rdma link for rxe to eth1
$ sudo rdma link add roce type rxe netdev eth1
# Check the status of RDMA configuration, make sure that device was added under RDEV (rxe device)
# If there is no error, you will see the following outputs.
$ ibv_devices
#    device                 node GUID
#    ------              ----------------
#    rocep0s8            0a0027fffedd1943

4. Other setups

We have configured passwordless SSH login between node1 and node2 for mpirun. Verify it with the following commands:

# On node1
$ ssh 192.168.59.102
# On node2
$ ssh 192.168.59.101

By default, Vagrant shares your project directory (the directory containing the Vagrantfile) at /vagrant. Check whether node1 and node2 can access this folder. If it is accessible, copy the omnireduce project folder into /vagrant with the following command:

# On node1
$ cp -r /home/vagrant/omnireduce /vagrant

After the tutorial, you may clean up the files that the above copy leaves behind on your local disk.

Ex.1: Run OmniReduce

We run OmniReduce on node1 and node2 in colocated mode, which means we run one aggregator and one worker process on each VM. To reduce hardware requirements, we only run AllReduce on CPU tensors in this tutorial. If your machine has GPUs, you can also run it on GPU tensors.

The steps are as follows.

1.1 Open three terminals on the host

cd into the cloned git repository on your host and start three terminals: two to SSH into node1 and one to SSH into node2.

# In terminal 1, SSH node1
$ vagrant ssh node1
# In terminal 2, SSH node1
$ vagrant ssh node1
# In terminal 3, SSH node2
$ vagrant ssh node2

1.2 Check the configuration

Inspect omnireduce.cfg in /vagrant/omnireduce/omnireduce-RDMA/example.

In most cases, you do not need to change anything in this file. One thing you do need to confirm is that ib_hca matches the device name reported by the ibv_devices command.
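For instance, with the ibv_devices output shown earlier, the entry should name that device, something along these lines (exact key and format as in the shipped file):

ib_hca = rocep0s8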

1.3 Run the aggregator program

Let's start two aggregator processes, one per node, in terminals 2 and 3.

# In terminal 2, inside node1
$ cd /vagrant/omnireduce/omnireduce-RDMA/example
$ ./aggregator
# In terminal 3, inside node2
$ cd /vagrant/omnireduce/omnireduce-RDMA/example
$ ./aggregator

1.4 Run the worker program

We use mpirun to execute a simple microbenchmark based on OmniReduce. This command runs a copy of the worker program on each node:

# In terminal 1, inside node1
$ cd /vagrant/omnireduce/omnireduce-RDMA/example
$ mpirun -n 2 -hosts 192.168.59.101,192.168.59.102 ./worker

The benchmark code we run is worker_test.cpp. In the default setting, the float tensor size is 262144 and the block density is 1.0. If you still have time, you can modify these values in worker_test.cpp (tensor_size on line 16 and density on line 25), then run make in this folder. Run the experiment again and observe the differences. Note that you need to restart the aggregators before each run.

GRACE Hands-on

Prerequisites

Use the same instructions as with the initial part of this tutorial to acquire a pair of VMs.

Ex.1: Migrate to GRACE from existing training scripts

GRACE overrides Horovod's DistributedOptimizer and DistributedGradientTape APIs. Users simply need to construct a grace object and pass it to the above APIs via the grace parameter:

# TensorFlow minimal example
import horovod.tensorflow as hvd  # assumed: hvd.init() has already been called
from grace_dl.tensorflow.communicator.allgather import Allgather
from grace_dl.tensorflow.compressor.topk import TopKCompressor
from grace_dl.tensorflow.memory.residual import ResidualMemory

grc = Allgather(TopKCompressor(0.3), ResidualMemory(), hvd.size())

optimizer = hvd.DistributedOptimizer(optimizer, grace=grc)
## or:
# tape = hvd.DistributedGradientTape(tape, grace=grc)

GRACE also provides a handy helper that parses a dict input and returns the grace object:

from grace_dl.tensorflow.helper import grace_from_params
# users can make params a command line input with argparse
params = {'compressor': 'topk', 'memory': 'residual', 'communicator': 'allgather', 'compress_ratio': 0.01}
grc = grace_from_params(params)

1.1 Hands-on

Modify ~/scripts/tensorflow_mnist_grace.py to use GRACE.

You can then run:

~/scripts/2_tf_mnist_grace.sh

to check correctness. If done correctly, you should see output like:

==Debug== grace is called

rather than

==Debug== grace not called

Refer to ~/scripts/exercise_solutions/tensorflow_mnist_grace.py for a sample solution.

Ex.2: Implement Custom Compressors

The main components of the GRACE framework are the Communicator, Compressor, and Memory abstract classes. Taking the PyTorch API as an example, the entry point for GRACE is:

class Communicator(ABC):
    @abstractmethod
    def send_receive(self, tensors, name, ctx):
        # 1. communicate
        # 2. decompress
        # 3. aggregation
        raise NotImplementedError("send was not implemented.")

    def __init__(self, compressor, memory, world_size):
        self.compressor = compressor
        self.memory = memory
        self.world_size = world_size

    def step(self, tensor, name):
        tensor = self.memory.compensate(tensor, name)
        tensors_compressed, ctx = self.compressor.compress(tensor, name)
        self.memory.update(tensor, name, self.compressor, tensors_compressed, ctx)
        return self.send_receive(tensors_compressed, name, ctx)

For custom Compressors, users need to implement the compress and decompress methods. Take the PyTorch TopKCompressor for example:

class TopKCompressor(Compressor):
    def __init__(self, compress_ratio):
        super().__init__()
        # compressor specific params should go here
        self.compress_ratio = compress_ratio

    def compress(self, tensor, name):
        # you are given a gradient tensor and its unique name as input
        tensor_flat = tensor.flatten()
        k = max(1, int(tensor_flat.numel() * self.compress_ratio))
        _, indices = torch.topk(tensor_flat.abs(), k, sorted=False,)
        values = torch.gather(tensor_flat, 0, indices)
        tensors = values, indices
        ctx = tensor.numel(), tensor.size()
        # ctx is anything you need to save and use when decompressing
        return tensors, ctx

    def decompress(self, tensors, ctx):
        """Decompress by filling empty slots with zeros and reshape back using the original shape"""
        numel, shape = ctx
        values, indices = tensors
        tensor_decompressed = torch.zeros(numel, dtype=values.dtype, layout=values.layout, device=values.device)
        tensor_decompressed.scatter_(0, indices, values)
        # you need to return a tensor with the same shape as the "original" gradient
        return tensor_decompressed.view(shape)

2.1 Hands-on

We will implement a variant of Top-K and Threshold in which we use a threshold to filter the gradient elements, but ensure that we send at least K% of them. This prevents a bad threshold selection from zeroing out the whole gradient. We will name it guarded_threshold.

Open ~/scripts/guarded_threshold.py and implement the compress method; a template is already provided for you. You may refer to the TopK and Threshold implementations available at ~/src/grace-dl/grace_dl/tensorflow/compressor/topk.py and ~/src/grace-dl/grace_dl/tensorflow/compressor/threshold.py.
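If you get stuck, here is one possible sketch of the idea. It is written against the PyTorch-style Compressor API shown above purely for illustration; the exercise template itself uses the TensorFlow API, and this is not the reference solution:

import torch

class GuardedThresholdCompressor(Compressor):  # Compressor base class as in GRACE
    def __init__(self, threshold, compress_ratio):
        super().__init__()
        self.threshold = threshold            # magnitude threshold for selecting elements
        self.compress_ratio = compress_ratio  # minimum fraction of elements to keep

    def compress(self, tensor, name):
        tensor_flat = tensor.flatten()
        k = max(1, int(tensor_flat.numel() * self.compress_ratio))
        # indices of elements whose magnitude exceeds the threshold
        indices, = torch.nonzero(tensor_flat.abs() > self.threshold, as_tuple=True)
        if indices.numel() < k:
            # guard: too few elements passed the threshold, fall back to top-k
            _, indices = torch.topk(tensor_flat.abs(), k, sorted=False)
        values = torch.gather(tensor_flat, 0, indices)
        ctx = tensor.numel(), tensor.size()
        return (values, indices), ctx

    # decompress can be identical to TopKCompressor.decompress above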

A simple test script is provided for your convenience; invoke

~/scripts/3_test_custom_compressor.sh

and you will see Test passed! if your implementation works. If you messed up the code and want to reset, you can run:

cd ~/scripts
git checkout -- guarded_threshold.py

Refer to ~/scripts/exercise_solutions/guarded_threshold.py for a sample solution.

Finally, you can run the end-to-end experiment with:

~/scripts/4_tf_mnist_custom_compressor.sh

Check out how we use the GuardedThresholdCompressor in tensorflow_mnist_custom_compressor.py, lines 132-136.