This is the ASTRA-sim distributed Deep Learning Training simulator, developed in collaboration between Georgia Tech, Facebook and Intel.
An overview is presented here:
The full description of the tool and its strengths can be found in the paper below:
Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna, "ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms," in Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2020.
ASTRA-SIM tutorials can be found here: https://astra-sim.github.io/
Bibtex
@inproceedings{astrasim,
author = {Saeed Rashidi and
Srinivas Sridharan and
Sudarshan Srinivasan and
Tushar Krishna},
title = {{ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms}},
booktitle = {{IEEE} International Symposium on Performance Analysis of Systems
and Software, {ISPASS} 2020, Boston, MA, USA, August 22-26, 2020},
publisher = {{IEEE}},
year = {2020},
}
# Clone the repository
$ git clone https://github.com/astra-sim/astra-sim.git
# Clone the submodules
$ cd astra-sim
$ git submodule init
$ git submodule update
- Run ./build/astra_garnet/build.sh -c to compile and integrate astra-sim with gem5 (the -l flag will clean the compilation). This will create a binary file where garnet is integrated with astra-sim. The garnet backend (gem5) is hosted at https://github.com/georgia-tech-synergy-lab/gem5_astra .
- Run an example inside the examples/ directory with garnet as the backend. Example: examples/run_allreduce.sh -n garnet . This command will run a single all-reduce collective on a Torus topology.
- The results of the example script runs will be dumped inside the examples/results/ path.
- Run ./build/astra_analytical/build.sh -c to compile and integrate astra-sim with the analytical backend (the -l flag will clean the compilation). This will create a binary file where the analytical backend is integrated with astra-sim. Please refer to this page for more details about compilation. The analytical backend is hosted at https://github.com/astra-sim/analytical .
- Run an example inside the examples/ directory with the analytical model as the backend. Example: examples/run_allreduce.sh -n analytical . This command will run a single all-reduce collective on a Torus topology.
- The results of the example script runs will be dumped inside the examples/results/ path.
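The build and run steps for the two backends differ only in the backend name. The following is a minimal shell sketch under that assumption. `astra_steps` is a hypothetical helper (not part of the repository) that only prints the commands described above instead of running them, so they can be reviewed before being executed from the repository root.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of astra-sim): prints the build and example-run
# commands described above for a given backend, without executing them.
astra_steps() {
  local backend="$1"
  case "$backend" in
    garnet|analytical) ;;
    *)
      echo "unknown backend: $backend (expected garnet or analytical)" >&2
      return 1
      ;;
  esac
  echo "./build/astra_${backend}/build.sh -c"
  echo "examples/run_allreduce.sh -n ${backend}"
}

astra_steps analytical
```

Piping the output to `sh` from the repository root would run the two steps in order. That usage is a suggestion, not something the example scripts require.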
Coming Soon!
NOTE: The delays reported on screen at the end of the simulation (regardless of the backend) are in cycles (by default, each cycle is 1 nanosecond), while the delays in the csv files are in microseconds.
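Because the on-screen delays and the csv delays use different units, comparing them requires a conversion. Below is a minimal sketch, assuming the default of 1 nanosecond per cycle. `cycles_to_us` is a hypothetical helper name, and its optional second argument (nanoseconds per cycle) is an illustrative parameter, not an ASTRA-sim option.

```shell
# Hypothetical helper: convert a delay in cycles to microseconds.
# $1 = delay in cycles
# $2 = nanoseconds per cycle (defaults to 1, the default noted above)
cycles_to_us() {
  awk -v cycles="$1" -v ns_per_cycle="${2:-1}" \
    'BEGIN { printf "%.3f\n", cycles * ns_per_cycle / 1000 }'
}

cycles_to_us 2500      # 2500 cycles at 1 ns/cycle -> prints 2.500
cycles_to_us 2500 2    # 2500 cycles at 2 ns/cycle -> prints 5.000
```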
When running the binary file (no matter what backend is used), the following options may be passed to the binary file (see example scripts):
--network-configuration (required): The network input file dir.
--system-configuration (required): The system input file dir.
--workload-configuration (required): The workload input file dir.
--path (required): The path to dump the results.
--run-name (required): Name of the current run.
--num-passes (required): Number of training passes to simulate.
--total-stat-rows (required): Total number of runs that write to the same csv file (please see run_multi.sh inside the "examples/" directory). This is useful when multiple runs want to write to the same csv file. This value should be 1 if only 1 run is executed.
--stat-row (required): The position at which the run writes its stats in the csv stat files (please see run_multi.sh inside the "examples/" directory). This is useful when multiple runs want to write to the same csv file. This value should be 0 if only 1 run is executed.
--compute-scale (optional): Scales all compute times (reported in the workload input file) by this factor. The default value is 1.
--comm-scale (optional): Scales all communication sizes (reported in the workload input file) by this factor. The default value is 1.
NOTE: The garnet+astra-sim binary also allows all of the network input options to be overridden by command-line options.
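The options above can be assembled as in the following sketch. It prints one command per run and does not execute anything. The binary path (`ASTRA_BIN`), the three input paths, the run names, and the `--option=value` form are all assumptions made for illustration. Numbering the rows from 0 to the run count minus one is inferred from the single-run values above. The scripts in examples/ (for instance run_multi.sh) remain the reference for actual usage.

```shell
#!/usr/bin/env bash
# Sketch only: builds the option list described above and prints the command.
# ASTRA_BIN and the input paths are placeholders (assumptions), and the
# --option=value form is assumed rather than confirmed by this README.
ASTRA_BIN="${ASTRA_BIN:-./path/to/astra-sim-binary}"   # placeholder path
TOTAL_RUNS=3                                           # runs sharing one csv file

astra_cmd() {
  local run_name="$1" stat_row="$2"
  printf '%s' "$ASTRA_BIN"
  # printf repeats the ' %s' format once per argument.
  printf ' %s' \
    "--network-configuration=${NETWORK_CFG:-network_input}" \
    "--system-configuration=${SYSTEM_CFG:-system_input}" \
    "--workload-configuration=${WORKLOAD_CFG:-workload_input}" \
    "--path=examples/results/" \
    "--run-name=${run_name}" \
    "--num-passes=2" \
    "--total-stat-rows=${TOTAL_RUNS}" \
    "--stat-row=${stat_row}" \
    "--compute-scale=1" \
    "--comm-scale=1"
  printf '\n'
}

# Assumed row numbering: each run gets its own row, 0 .. TOTAL_RUNS-1.
row=0
while [ "$row" -lt "$TOTAL_RUNS" ]; do
  astra_cmd "run_${row}" "$row"
  row=$((row + 1))
done
```

For a single run, the same helper applies with `TOTAL_RUNS=1` and a stat row of 0, matching the values stated in the option descriptions.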
- Workload: inputs/workload/
  - see inputs/workload/README.md
  - see scripts/workload_generator/README.md for instructions on how to use an automated script to generate workload input files.
- System: inputs/system/
  - see inputs/system/README.md
- Network:
  - inputs/network/garnet (for garnet backend inputs)
    - see inputs/network/garnet/README.md
  - inputs/network/analytical (for analytical backend inputs)
    - see inputs/network/analytical/README.md
Please email Saeed Rashidi ([email protected]) or Srinivas Sridharan ([email protected]) or Tushar Krishna ([email protected]) if you have any questions.
- Saeed Rashidi (Georgia Tech)
- Srinivas Sridharan (Facebook)
- Jiayi Huang (University of California, Santa Barbara)
- Apurve Chawde (Georgia Tech)
- Santosh Kumar Elangoven (Georgia Tech)
- William Won (Georgia Tech)
- Tushar Krishna (Georgia Tech)
- Greg Steinbrecher (Facebook)