This is the artifact for the paper "DistMind: Efficient Resource Disaggregation for Deep Learning Workloads". This guide walks you through reproducing the paper's main results.
Here is a high-level overview of the whole process:
- Environment setup: Create GPU and memory instances on AWS (or machines with RDMA).
- Kick-the-tires: Run an example to verify that DistMind is working.
- Run: Run the experiments.
Note that all logs for tests will be stored under ./tmp and figures will be stored under ./AE/{testname}.
Originally, the experiments were run on AWS EC2 p3dn.24xlarge and c5n.18xlarge instances. However, Amazon has changed its rules and these two instance types no longer support RDMA. Under the current rules, we recommend g6.12xlarge for GPU servers and c6in.32xlarge for memory servers. We also provide a verbs version for machines that use verbs-API RDMA.
If you use AWS instances, follow the AWS User Guide to set up EFA and NCCL, and run the code from the efa branch.
If you use verbs-API RDMA, follow the libfabric install guide to install libfabric to /opt/libfabric, and install NCCL if your machine supports it. Run the code from the main branch.
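As a rough sketch of the verbs-path setup (assuming you build libfabric from a release tarball; the version below is a placeholder, take the actual tarball from the libfabric releases page):
git checkout main                        # or: git checkout efa, on AWS/EFA machines
tar xf libfabric-<version>.tar.bz2       # release tarball downloaded from the libfabric project
cd libfabric-<version>
./configure --prefix=/opt/libfabric
make -j 8
sudo make install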
- make sure you have finished one of the previous sections
- CUDA 12.4 / cuDNN 9.5.1
- Anaconda
- pybind11:
git submodule update --recursive --init
- spdlog:
sudo apt install libspdlog-dev
- libtorch: download it from the PyTorch website and unzip it to {proj_path}/libtorch (a download sketch follows this list)
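For example, the download might look like the following (a sketch only; the exact URL and file name depend on the libtorch/CUDA build you pick on the PyTorch website, so copy the link from there):
cd {proj_path}
wget https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu124.zip
unzip libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu124.zip   # unpacks to {proj_path}/libtorch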
conda create -n distmind python=3.10 matplotlib
Then add the environment activation to ~/.bashrc (see the one-line example after the install commands below), and install the Python dependencies:
pip install transformers==4.49
pip install "ray[serve]==1.13"
pip install nvgpu
pip install posix_ipc
pip install parallel-ssh
pip install pydantic==1.10.8
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
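To add the activation (a minimal sketch; distmind is the environment name created above, and this assumes conda has already been initialized for your shell):
echo 'conda activate distmind' >> ~/.bashrc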
In the project directory, run the following commands:
mkdir build
cd build
cmake ..
make -j 8
cd ../source/server/cpp_extension/torch
python setup.py install
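If the cmake step above cannot locate libtorch on its own (this depends on the project's CMakeLists.txt, so treat this as a fallback rather than a required step), pointing CMake at the unpacked libtorch directory usually helps:
cmake -DCMAKE_PREFIX_PATH={proj_path}/libtorch ..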
When the build finishes successfully, add the following to ~/.bashrc:
# python path
proj_path="~/DistMindAE" # replace path with absolute path
export PYTHONPATH="$proj_path/build/lib/python:$proj_path/build/lib:$proj_path:$PYTHONPATH"
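A quick sanity check after the build (a minimal sketch; it only confirms that PyTorch sees the GPU and that the build outputs referenced by PYTHONPATH exist):
source ~/.bashrc
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
ls "$proj_path/build/lib" "$proj_path/build/lib/python"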
If you are using AWS instances for testing, save this machine as an AMI so you can create multiple machines from it.
Follow the steps below to run an example test.
- Prepare one GPU server; the example test runs on a single machine.
- Check that all IPs in settings/config.sh are 127.0.0.1 and that MODE=local, and modify GPU_LIST and WORLD_SIZE based on your machine's resources. Make sure settings/storage_list.txt looks like:
storage_address, storage_port
127.0.0.1, 7777
127.0.0.1, 7778
- In the project path, run the following commands:
mkdir -p tmp/test1/distmind_remote
./AE/1_Meeting_latency_SLOs/run_distmind_test1.sh
- When the script ends, check tmp/test1/distmind_remote/log_client.txt (for example, with the commands below). If it ends with "All threads finished.", you have started DistMind successfully.
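For example:
tail -n 20 tmp/test1/distmind_remote/log_client.txt
grep "All threads finished." tmp/test1/distmind_remote/log_client.txt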
Before running any test, modify the files under settings correctly. Ideally, prepare 4 memory servers and 4 GPU servers, but you can reduce the count if your resources are limited. Choose one memory server as the local machine and modify the settings as instructed below:
- settings/serverhost_list.txt: add each GPU server's IP on its own line in the format "[ip] slots=[gpu_num]" (see the example after this list)
- settings/storage_list.txt: replace the IP of the first entry with the local IP, keeping its port as 7777. Then add each remaining memory server in the format "[ip], 7778"
- settings/controller.json & settings/mps_controller.json & settings/ray_controller.json: fill inference_workload_s with server_number * gpu_per_server
- settings/config.sh: replace all IPs with the local IP and set MODE=remote
- On each GPU server: in settings/config.sh, replace LOCAL_IP with that server's IP and modify GPU_LIST and WORLD_SIZE based on the machine's resources
- Enable password-free SSH connections to all servers, and replace the username in settings/username.txt with your own
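As an illustration with placeholder addresses (4 GPU servers at 10.0.0.1-10.0.0.4 with 4 GPUs each, the local memory server at 10.0.1.1, and the remaining memory servers at 10.0.1.2-10.0.1.4; substitute your own IPs and GPU counts), settings/serverhost_list.txt might look like:
10.0.0.1 slots=4
10.0.0.2 slots=4
10.0.0.3 slots=4
10.0.0.4 slots=4
and settings/storage_list.txt like:
storage_address, storage_port
10.0.1.1, 7777
10.0.1.2, 7778
10.0.1.3, 7778
10.0.1.4, 7778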
In the local machine's terminal, run ./AE/1_Meeting_latency_SLOs/run_test1.sh. When the script finishes (without errors), run python ./AE/1_Meeting_latency_SLOs/drawplot.py. The plot will be saved to ./AE/1_Meeting_latency_SLOs/fig6.png.
In the local machine's terminal, run ./AE/2_End-to-end_performance/run_test2.sh. When the script finishes (without errors), run python ./AE/2_End-to-end_performance/drawplot.py. The plots will be saved to ./AE/2_End-to-end_performance/fig7.png and ./AE/2_End-to-end_performance/fig8.png.
In this test, you should modify the inference_workload_s list in the three controller JSON files. For example, if you have 4 GPU servers with 4 GPUs per server, set it to [ 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16 ]. Adjust it to fit your resources; note that the changing period must be a multiple of the number of GPUs per machine.
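For reference, the corresponding line in each of the three controller JSON files would then look roughly like this (only an illustration of the list format; leave the surrounding fields as they are):
"inference_workload_s": [4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16]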
- In the local machine's terminal, run
./AE/3_Sharing_inference_and_training/run_test3.sh
- Change the three controller JSON files' inference_workload_s to [0, 0], then run
./AE/3_Sharing_inference_and_training/run_test3_bound.sh
- Change settings/controller.json's inference_workload_s to [max_gpu_count], then run
./AE/3_Sharing_inference_and_training/run_test3_gpu_bound.sh
- When all the scripts have finished (without errors), run
python ./AE/3_Sharing_inference_and_training/gather_result.py
to get the throughput results, and then run
python ./AE/3_Sharing_inference_and_training/drawplot.py
to plot. The plots will be saved to
./AE/3_Sharing_inference_and_training/{system_type}_utilization.png
This is only a simulation test.
In the local machine's terminal, run python ./AE/4_Reducing_memory_usage/drawplot.py. The plot will be saved to ./AE/4_Reducing_memory_usage/fig10.png.