Merge pull request #3 from OSU-Nowlab/readme-update
Update Readme
RadhaGulhane13 authored May 1, 2024
2 parents 84b4780 + 691c1c0 commit ef3ae99
Showing 6 changed files with 223 additions and 392 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -106,7 +106,7 @@ mpirun_rsh --export-all -np $total_np\
--enable-evaluation &>> $OUTFILE 2>&1
```

Refer to [Spatial Parallelism](benchmarks/spatial_parallelism), [Layer Parallelism](benchmarks/layer_parallelism), and [with Single GPU]() for more benchmarks.
Refer to [Spatial Parallelism](benchmarks/spatial_parallelism), [Layer Parallelism](benchmarks/layer_parallelism), and [Single GPU](benchmarks/single_gpu_inference) for more benchmarks.

## References
1. MPI4DL : https://github.com/OSU-Nowlab/MPI4DL
78 changes: 55 additions & 23 deletions benchmarks/gems_master_model/README.md
@@ -1,33 +1,55 @@
# GEMS: <u>G</u>PU-<u>E</u>nabled <u>M</u>emory-Aware Model-Parallelism <u>S</u>ystem for Distributed DNN Training
Model Parallelism is necessary for training out-of-core models; however, it can lead to underutilization of resources. To address this limitation, Pipeline Parallelism is employed, where the batch size is set to a value greater than 1. However, when dealing with very high-resolution images, certain state-of-the-art models can only work with a unit batch size. GEMS is a memory-efficient design for model parallelism that enables training models with any batch size while utilizing the same resources. For more details, please refer to the original paper: [GEMS: <u>G</u>PU-<u>E</u>nabled <u>M</u>emory-Aware Model-Parallelism <u>S</u>ystem for Distributed DNN Training](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9355254).
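
As a quick illustration of the replication factor described above, the sketch below shows how the effective batch size grows with the number of model replicas while the set of GPUs stays the same. This is an explanatory sketch, not MPI4DL code; the function name is made up for illustration.

```python
# Illustrative sketch only (not part of MPI4DL): effective batch size under
# GEMS-style model replication. A model split across `split_size` GPUs is
# replicated eta times on the same resources, so the effective batch size
# grows without adding GPUs.

def effective_batch_size(batch_size: int, replication_factor: int) -> int:
    """EBS = eta * BS, as used in the examples below."""
    return replication_factor * batch_size

if __name__ == "__main__":
    # Matches the first example below: batch size 1 per replica, eta = 2.
    print(effective_batch_size(batch_size=1, replication_factor=2))  # -> 2
```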

## Run GEMS-MASTER:
## Run GEMS-MASTER Inference:

#### Generic command:
```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python ${gems_model_script} --split-size ${split_size} --image-size ${image_size} --batch-size ${batch_size} --times ${times}
mpirun_rsh --export-all -np $np\
--hostfile ${hostfile} \
python gems_master_model/benchmark_resnet_gems_master.py \
--batch-size ${batch_size} \
--split-size ${split_size} \
--parts ${parts} \
--times ${times} \
--image-size ${image_size} \
--backend ${backend} \
--precision ${precision} \
--checkpoint ${checkpoint_path} \
--datapath ${datapath} \
--enable-evaluation &>> $OUTFILE 2>&1
```
#### Examples

- Example to run the AmoebaNet MASTER model for a 1024 × 1024 image size with a model split size of 4 (i.e., the number of partitions for MP), a model replication factor of η = 2, and a batch size of 1 per model replica (i.e., effective batch size (EBS) = η × BS = 2).
- Example to run ResNet MASTER model inference for a 1024 × 1024 image size with a model split size of 4 (i.e., the number of partitions for MP), a model replication factor of η = 2, and a batch size of 1 per model replica (i.e., effective batch size (EBS) = η × BS = 2).

```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/gems_master_model/benchmark_amoebanet_gems_master.py --split-size 4 --image-size 1024 --batch-size 1 --times 2
$MV2_HOME/bin/mpirun_rsh --export-all -np $np\
--hostfile ${hostfile} \
MV2_USE_CUDA=1 \
MV2_HYBRID_BINDING_POLICY=spread \
MV2_CPU_BINDING_POLICY=hybrid \
MV2_USE_GDRCOPY=0 \
PYTHONNOUSERSITE=true \
LD_PRELOAD=$MV2_HOME/lib/libmpi.so \
python ../benchmarks/gems_master_model/benchmark_resnet_gems_master.py \
--batch-size 1 \
--split-size 4 \
--parts 1 \
--times 2 \
--image-size 1024 \
--backend 'nccl' \
--precision 'fp_16' \
--enable-evaluation &>> $OUTFILE 2>&1

```
- Similarly, we can run the benchmark for the ResNet MASTER model.
Below is an example to run the ResNet MASTER model for a 2048 × 2048 image size with a model split size of 4 (i.e., the number of partitions for MP), a model replication factor of η = 4, and a batch size of 1 per model replica (i.e., effective batch size (EBS) = η × BS = 4).
```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/gems_master_model/benchmark_resnet_gems_master.py --split-size 4 --image-size 2048 --batch-size 1 --times 4 &>> $OUTFILE 2>&1


```
Below are the available configuration options:
<pre>
usage: benchmark_amoebanet_sp.py [-h] [-v] [--batch-size BATCH_SIZE] [--parts PARTS] [--split-size SPLIT_SIZE] [--num-spatial-parts NUM_SPATIAL_PARTS]
[--spatial-size SPATIAL_SIZE] [--times TIMES] [--image-size IMAGE_SIZE] [--num-epochs NUM_EPOCHS] [--num-layers NUM_LAYERS]
[--num-filters NUM_FILTERS] [--balance BALANCE] [--halo-D2] [--fused-layers FUSED_LAYERS] [--local-DP LOCAL_DP] [--slice-method SLICE_METHOD]
[--app APP] [--datapath DATAPATH]
SP-MP-DP Configuration Script
@@ -46,22 +68,32 @@ optional arguments:
--times TIMES Number of times to repeat MASTER (1: 2 replications, 2: 4 replications) (default: 1)
--image-size IMAGE_SIZE
Image size for synthetic benchmark (default: 32)
--num-epochs NUM_EPOCHS
Number of epochs (default: 1)
--num-layers NUM_LAYERS
Number of layers in amoebanet (default: 18)
--num-filters NUM_FILTERS
Number of filters in amoebanet (default: 416)
--num-classes NUM_CLASSES
Number of classes (default: 10)
--balance BALANCE Length of the list equals the number of partitions, and its sum should equal the number of layers (default: None)
--halo-D2 Enable design2 (do halo exchange on a few convs) for spatial conv. (default: False)
--fused-layers FUSED_LAYERS
When the D2 design is enabled for halo exchange, number of blocks to fuse in the ResNet model (default: 1)
--local-DP LOCAL_DP LBANN integration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two GPUs for a given split (uses DP)
(default: 1)
--local-DP LOCAL_DP LBANN integration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two GPUs for a given split (uses DP) (default: 1)
--slice-method SLICE_METHOD
Slice method (square, vertical, and horizontal) in Spatial parallelism (default: square)
--app APP Application type (1.medical, 2.cifar, and 3.synthetic) in Spatial parallelism (default: 3)
--datapath DATAPATH Local dataset path (default: ./train)
</pre>

*Note: "--times" is a GEMS-specific parameter, and certain parameters such as "--num-spatial-parts", "--slice-method", and "--halo-D2" are not required by GEMS.*
--enable-master-comm-opt
Enable communication optimization for MASTER in Spatial (default: False)
--enable-evaluation Enable evaluation mode in GEMS to perform inference (default: False)
--backend BACKEND Backend for evaluation [Note: not tested on training] (default: mpi)
--precision PRECISION
Precision for evaluation [Note: not tested on training] (default: fp32)
--num-workers NUM_WORKERS
Number of workers for data loading (default: 0)
--optimizer OPTIMIZER
Optimizer (default: adam)
--learning-rate LEARNING_RATE
Learning Rate (default: 0.001)
--weight-decay WEIGHT_DECAY
Weight Decay (default: 0.0001)
--learning-rate-decay LEARNING_RATE_DECAY
Learning Rate Decay (default: 0.85)
--checkpoint CHECKPOINT
Checkpoint path (default: None)
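
For reference, here is a minimal argparse sketch that mirrors a subset of the options above. It is an illustrative stand-in, not the benchmark's actual argument parser; the flag names and listed defaults come from the table above, while the batch-size default is a placeholder.

```python
import argparse

# Illustrative sketch: a subset of the configuration options listed above,
# wired into argparse. This is not the benchmark's actual parser.
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="SP-MP-DP Configuration Script (sketch)")
    p.add_argument("--batch-size", type=int, default=1, help="input batch size (default here is a placeholder)")
    p.add_argument("--parts", type=int, default=1, help="number of parts for MP")
    p.add_argument("--split-size", type=int, default=2, help="number of processes for MP")
    p.add_argument("--times", type=int, default=1, help="GEMS replication parameter")
    p.add_argument("--image-size", type=int, default=32, help="image size for synthetic benchmark")
    p.add_argument("--backend", default="mpi", help="backend for evaluation (e.g. mpi, nccl)")
    p.add_argument("--precision", default="fp32", help="precision for evaluation (e.g. fp32, fp_16)")
    p.add_argument("--checkpoint", default=None, help="checkpoint path")
    p.add_argument("--datapath", default="./train", help="local dataset path")
    p.add_argument("--enable-evaluation", action="store_true", help="run inference instead of training")
    return p

if __name__ == "__main__":
    print(vars(build_parser().parse_args()))
```

Running such a script with the flags from the generic command above prints the parsed configuration, which can be handy for sanity-checking a job script before launching under mpirun_rsh.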
75 changes: 59 additions & 16 deletions benchmarks/layer_parallelism/README.md
@@ -1,28 +1,49 @@
# Layer Parallelism Benchmarks
# Run Layer Parallelism Inference

## Run Layer Parallelism:

#### Generic command:
```bash

$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile {$HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python ${lp_model_script} --image-size ${image_size} --batch-size ${batch_size} --split-size ${split_size} --parts ${parts}
mpirun_rsh --export-all -np $np\
--hostfile ${hostfile} \
python layer_parallelism/benchmark_resnet_lp.py \
--batch-size ${batch_size} \
--split-size ${split_size} \
--parts ${parts} \
--image-size ${image_size} \
--backend ${backend} \
--precision ${precision} \
--checkpoint ${checkpoint_path} \
--datapath ${datapath} \
--enable-evaluation

```
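
Conceptually, layer parallelism (LP) places consecutive groups of layers on different GPUs and passes activations between partitions. The toy PyTorch sketch below (assuming two visible CUDA devices; the class name and layer choices are made up for illustration) shows that core idea. The actual benchmark distributes ResNet stages across MPI ranks and exchanges activations via the selected backend rather than through simple device-to-device copies.

```python
import torch
import torch.nn as nn

# Illustrative only: a toy layer-parallel split of a small model across two
# GPUs. The real benchmark (benchmark_resnet_lp.py) partitions ResNet stages
# across MPI ranks; this sketch just shows consecutive layer groups placed on
# different devices with activations handed from one partition to the next.
class TwoWaySplit(nn.Module):
    def __init__(self):
        super().__init__()
        self.part0 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to("cuda:0")
        self.part1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.part0(x.to("cuda:0"))
        # In the benchmark this hop is an MPI/NCCL send-recv between ranks,
        # not a .to() copy.
        x = self.part1(x.to("cuda:1"))
        return x

if __name__ == "__main__":
    model = TwoWaySplit()
    out = model(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 10])
```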
#### Examples

- With 4 GPUs [split size: 4]

Example to run the AmoebaNet LP model with a model split size of 4 (i.e., the number of partitions for LP) for a 1024 × 1024 image size.
Example to run ResNet LP inference with a model split size of 4 (i.e., the number of partitions for LP) for a 1024 × 1024 image size.

```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${hostfile} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/layer_parallelism/benchmark_amoebanet_lp.py --batch-size 1 --image-size 1024 --split-size 4
mpirun_rsh --export-all -np $np\
--hostfile ${hostfile} \
MV2_USE_CUDA=1 \
MV2_HYBRID_BINDING_POLICY=spread \
MV2_CPU_BINDING_POLICY=hybrid \
MV2_USE_GDRCOPY=0 \
PYTHONNOUSERSITE=true \
LD_PRELOAD=$MV2_HOME/lib/libmpi.so \
python layer_parallelism/benchmark_resnet_lp.py \
--batch-size 1 \
--split-size 4 \
--parts 1 \
--image-size 1024 \
--backend "nccl" \
--precision "fp_16" \
--enable-evaluation
```

Similarly, we can run the benchmark for the ResNet LP model.

```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${hostfile} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/layer_parallelism/benchmark_resnet_lp.py --batch-size 1 --image-size 1024 --split-size 4
```

Below are the available configuration options:

@@ -35,19 +56,41 @@ optional arguments:
--parts PARTS Number of parts for MP (default: 1)
--split-size SPLIT_SIZE
Number of process for MP (default: 2)
--num-spatial-parts NUM_SPATIAL_PARTS
Number of partitions in spatial parallelism (default: 4)
--spatial-size SPATIAL_SIZE
Number splits for spatial parallelism (default: 1)
--times TIMES Number of times to repeat MASTER (1: 2 replications, 2: 4 replications) (default: 1)
--image-size IMAGE_SIZE
Image size for synthetic benchmark (default: 32)
--num-epochs NUM_EPOCHS
Number of epochs (default: 1)
--num-layers NUM_LAYERS
Number of layers in amoebanet (default: 18)
--num-filters NUM_FILTERS
Number of filters in amoebanet (default: 416)
--num-classes NUM_CLASSES
Number of classes (default: 10)
--balance BALANCE Length of the list equals the number of partitions, and its sum should equal the number of layers (default: None)
--halo-D2 Enable design2 (do halo exchange on a few convs) for spatial conv. (default: False)
--fused-layers FUSED_LAYERS
When the D2 design is enabled for halo exchange, number of blocks to fuse in the ResNet model (default: 1)
--local-DP LOCAL_DP LBANN integration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two GPUs for a given split (uses DP)
(default: 1)
--local-DP LOCAL_DP LBANN integration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two GPUs for a given split (uses DP) (default: 1)
--slice-method SLICE_METHOD
Slice method (square, vertical, and horizontal) in Spatial parallelism (default: square)
--app APP Application type (1.medical, 2.cifar, and 3.synthetic) in Spatial parallelism (default: 3)
--datapath DATAPATH Local dataset path (default: ./train)
</pre>
--enable-master-comm-opt
Enable communication optimization for MASTER in Spatial (default: False)
--enable-evaluation Enable evaluation mode in GEMS to perform inference (default: False)
--backend BACKEND Backend for evaluation [Note: not tested on training] (default: mpi)
--precision PRECISION
Precision for evaluation [Note: not tested on training] (default: fp32)
--num-workers NUM_WORKERS
Number of workers for data loading (default: 0)
--optimizer OPTIMIZER
Optimizer (default: adam)
--learning-rate LEARNING_RATE
Learning Rate (default: 0.001)
--weight-decay WEIGHT_DECAY
Weight Decay (default: 0.0001)
--learning-rate-decay LEARNING_RATE_DECAY
Learning Rate Decay (default: 0.85)
--checkpoint CHECKPOINT
Checkpoint path (default: None)
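
To make the evaluation-related flags concrete, the sketch below shows what `--enable-evaluation` with `--precision fp_16` and an optional `--checkpoint` amount to conceptually: load weights, switch the model to eval mode, and run inference under reduced precision. It is an illustrative stand-in using a stock torchvision ResNet, not the benchmark's distributed implementation.

```python
from typing import Optional

import torch
import torchvision

# Illustrative only: single-device evaluation with optional checkpoint loading
# and reduced-precision inference. The LP/GEMS benchmarks do this across model
# partitions; here a stock ResNet-18 stands in for the real model.
def evaluate(checkpoint_path: Optional[str] = None, use_reduced_precision: bool = True) -> torch.Tensor:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torchvision.models.resnet18(num_classes=10).to(device).eval()
    if checkpoint_path is not None:
        model.load_state_dict(torch.load(checkpoint_path, map_location=device))
    x = torch.randn(1, 3, 1024, 1024, device=device)  # synthetic high-resolution input
    amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16  # CPU autocast prefers bfloat16
    with torch.no_grad(), torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_reduced_precision):
        return model(x)

if __name__ == "__main__":
    print(evaluate().shape)  # torch.Size([1, 10])
```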