diff --git a/.github/nowlab_cla.md b/.github/nowlab_cla.md
new file mode 100644
index 00000000..d291de24
--- /dev/null
+++ b/.github/nowlab_cla.md
@@ -0,0 +1,66 @@
+### NOWLAB Individual Contributor License Agreement
+
+Thank you for your interest in contributing to open source software projects (“Projects”) made available by the Network-Based Computing Laboratory (NBCL) or its affiliates (“NBCL”). This Individual Contributor License Agreement (“Agreement”) sets out the terms governing any source code, object code, bug fixes, configuration changes, tools, specifications, documentation, data, materials, feedback, information or other works of authorship that you submit or have submitted, in any form and in any manner, to NBCL in respect of any of the Projects (collectively “Contributions”). If you have any questions respecting this Agreement, please contact panda@cse.ohio-state.edu.
+
+
+You agree that the following terms apply to all of your past, present and future Contributions. Except for the licenses granted in this Agreement, you retain all of your right, title and interest in and to your Contributions.
+
+
+**Copyright License.** You hereby grant, and agree to grant, to NBCL a non-exclusive, perpetual, irrevocable, worldwide, fully-paid, royalty-free, transferable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, and distribute your Contributions and such derivative works, with the right to sublicense the foregoing rights through multiple tiers of sublicensees.
+
+
+**Patent License.** You hereby grant, and agree to grant, to NBCL a non-exclusive, perpetual, irrevocable,
+worldwide, fully-paid, royalty-free, transferable patent license to make, have made, use, offer to sell, sell,
+import, and otherwise transfer your Contributions, where such license applies only to those patent claims
+licensable by you that are necessarily infringed by your Contributions alone or by combination of your
+Contributions with the Project to which such Contributions were submitted, with the right to sublicense the
+foregoing rights through multiple tiers of sublicensees.
+
+
+**Moral Rights.** To the fullest extent permitted under applicable law, you hereby waive, and agree not to
+assert, all of your “moral rights” in or relating to your Contributions for the benefit of NBCL, its assigns, and
+their respective direct and indirect sublicensees.
+
+
+**Third Party Content/Rights.** If your Contribution includes or is based on any source code, object code, bug
+fixes, configuration changes, tools, specifications, documentation, data, materials, feedback, information or
+other works of authorship that were not authored by you (“Third Party Content”) or if you are aware of any
+third party intellectual property or proprietary rights associated with your Contribution (“Third Party Rights”),
+then you agree to include with the submission of your Contribution full details respecting such Third Party
+Content and Third Party Rights, including, without limitation, identification of which aspects of your
+Contribution contain Third Party Content or are associated with Third Party Rights, the owner/author of the
+Third Party Content and Third Party Rights, where you obtained the Third Party Content, and any applicable
+third party license terms or restrictions respecting the Third Party Content and Third Party Rights. For greater
+certainty, the foregoing obligations respecting the identification of Third Party Content and Third Party Rights
+do not apply to any portion of a Project that is incorporated into your Contribution to that same Project.
+
+
+**Representations.** You represent that, other than the Third Party Content and Third Party Rights identified by
+you in accordance with this Agreement, you are the sole author of your Contributions and are legally entitled
+to grant the foregoing licenses and waivers in respect of your Contributions. If your Contributions were
+created in the course of your employment with your past or present employer(s), you represent that such
+employer(s) has authorized you to make your Contributions on behalf of such employer(s) or such employer(s)
+has waived all of their right, title or interest in or to your Contributions.
+
+
+**Disclaimer.** To the fullest extent permitted under applicable law, your Contributions are provided on an "as is"
+basis, without any warranties or conditions, express or implied, including, without limitation, any implied
+warranties or conditions of non-infringement, merchantability or fitness for a particular purpose. You are not
+required to provide support for your Contributions, except to the extent you desire to provide support.
+
+
+**No Obligation.** You acknowledge that NBCL is under no obligation to use or incorporate your Contributions
+into any of the Projects. The decision to use or incorporate your Contributions into any of the Projects will be
+made at the sole discretion of NBCL or its authorized delegates.
+
+
+**Disputes.** This Agreement shall be governed by and construed in accordance with the laws of the State of
+New York, United States of America, without giving effect to its principles or rules regarding conflicts of laws,
+other than such principles directing application of New York law. The parties hereby submit to venue in, and
+jurisdiction of the courts located in New York, New York for purposes relating to this Agreement. In the event
+that any of the provisions of this Agreement shall be held by a court or other tribunal of competent jurisdiction
+to be unenforceable, the remaining portions hereof shall remain in full force and effect.
+
+
+**Assignment.** You agree that NBCL may assign this Agreement, and all of its rights, obligations and licenses
+hereunder.
\ No newline at end of file
diff --git a/signatures/version1/cla.json b/.github/signatures/version1/cla.json
similarity index 100%
rename from signatures/version1/cla.json
rename to .github/signatures/version1/cla.json
diff --git a/.github/workflows/cla.yml b/.github/workflows/cla.yml
index dcd86e1c..85b191c1 100644
--- a/.github/workflows/cla.yml
+++ b/.github/workflows/cla.yml
@@ -25,8 +25,8 @@ jobs:
           # This token is required only if you have configured to store the signatures in a remote repository/organization
           PERSONAL_ACCESS_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
         with:
-          path-to-signatures: 'signatures/version1/cla.json'
-          path-to-document: 'https://github.com/cla-assistant/github-action/blob/master/SAPCLA.md' # e.g. a CLA or a DCO document
+          path-to-signatures: '.github/signatures/version1/cla.json'
+          path-to-document: '.github/nowlab_cla.md' # e.g. a CLA or a DCO document
           # branch should not be protected
           branch: 'main'
           allowlist: user1,bot*
@@ -40,4 +40,4 @@ jobs:
           #custom-pr-sign-comment: 'The signature to be committed in order to sign the CLA'
           #custom-allsigned-prcomment: 'pull request comment when all contributors has signed, defaults to **CLA Assistant Lite bot** All Contributors have signed the CLA.'
           #lock-pullrequest-aftermerge: false - if you don't want this bot to automatically lock the pull request after merging (default - true)
-          #use-dco-flag: true - If you are using DCO instead of CLA
+          #use-dco-flag: true - If you are using DCO instead of CLA
\ No newline at end of file
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 00000000..56533c9e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,4 @@
+src/models/__pycache__/
+src/torchgems/__pycache__/
+now_dl.egg-info/
+benchmarks/single_gpu/
\ No newline at end of file
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 07ee79c2..0056c15c 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,6 +1,6 @@
 repos:
 - repo: https://github.com/psf/black
-  rev: stable
+  rev: 23.3.0
   hooks:
   - id: black
     name: black-format-test
diff --git a/README.md b/README.md
index 122dc338..3e765e29 100644
--- a/README.md
+++ b/README.md
@@ -1,88 +1,74 @@
-# now-dl
+# MPI4DL
 
-This is a Deep Learning Parallelism framework written in PyTorch that provides distributed implementations for spatial parallelism [1] and GEMS (GPU-Enabled Memory-Aware Model-Parallelism System) [2]. Furthermore, the framework provide support for training of DNN models using hybrid parallelism, which includes data, model, spatial, pipeline, and bi-directional parallelism.
-
-The objective is to facilitate the training of out-of-core or large-scale Deep Neural Networks for high-resolution images. Additionally, it utilizes various parallelism techniques to further reduce the training time, making it faster and more efficient.
-
-## Spatial Parallelism:
-
-Spatial parallelism involves dividing the input image spatially into partitions based on spatial types such as "vertical," "horizontal," or "square." These partitions are then distributed among different GPUs to perform computations, enabling deep learning training on high-resolution images. Furthermore, spatial parallelism can be combined with other parallelism techniques, such as model, pipeline, and bi-directional parallelism, to further optimize the training process.
+Several approaches have been proposed to address some of the limitations of layer parallelism. However, most studies are performed on low-resolution images, which exhibit different characteristics. Compared to low-resolution images, high-resolution images (e.g., digital pathology images) result in higher activation memory and larger tensors, which in turn lead to a larger communication overhead.
 
 <div align="center">
- <img src="docs/assets/images/SpatialParallelism.jpg" width="600px">
+ <img src="docs/assets/images/DP_MP_SP_Vs_Memory.png" width="600px">
+ <br>
+ <figcaption>Figure 1. Capabilities of each parallelism scheme for low-resolution, high-resolution, and very-high-resolution images.
+</figcaption>
+
+<br>
 </div>
+<br>
 
-*Figure 1 illustrates the combination of spatial and model parallelism. In this approach, the CNN is divided into four partitions. Spatial parallelism is applied to the first partition, while the rest of the partitions use model parallelism. Configurations used num_spatial_parts = 4, split_size = 4, spatial_size = 1.*
+Figure 1 shows the capabilities of each parallelism scheme with respect to different image sizes. Data parallelism is limited by memory and cannot be used for out-of-core models. Layer parallelism overcomes this limitation by distributing the model across different GPUs. However, it causes GPU underutilization, as only one GPU is active at a time. Pipeline parallelism improves on layer parallelism by training the model in a pipelined fashion. However, it is only possible when the model is trainable with a batch size > 1, which is typically impossible with high-resolution images due to memory constraints. To train on high-resolution images, spatial parallelism can be used, which distributes each image across multiple GPUs. Its drawbacks are a high communication overhead and the inability to accelerate the low-resolution images that are common in the latter half of DNNs.
 
-You could refer to [Spatial Parallelism implementation](https://github.com/OSU-Nowlab/now-dl/tree/main/benchmarks/spatial) for more details.
 
-## GEMS(GPU-Enabled Memory-Aware Model-Parallelism System):
-GEMS aims to train very large DNNs for high-resolution images, commonly used in digital pathology. It achieves significant speed-up over state-of-the-art systems by utilizing memory-aware designs, overlapping computation, and data parallelism. This includes different design schemes, such as GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid.
 
-<div align="center">
- <img src="docs/assets/images/GEMS_MAST.jpg" width="600px">
-</div>
+**Our objective is to use distributed training efficiently for the very high-resolution images that appear in real-world applications. Integrating spatial and layer parallelism addresses the limitations of each approach described above: spatial parallelism makes it possible to train on high-resolution images efficiently even when the model is large, and layer parallelism accelerates the low-resolution images in the latter half of DNNs. This project is a PyTorch implementation of this technique and is based on [Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters](https://dl.acm.org/doi/abs/10.1007/978-3-031-07312-0_6).**
 
-*Figure 2 shows a memory view of the GEMS MAST design for forward and backward passes of two model replicas and the improvement made possible by it.*
+# Background
 
-The implementation will soon be made available as an open-source project.
+## Layer Parallelism: 
+Layer parallelism distributes the DNN model across separate GPUs before applying distributed forward and backward passes. These distributed forward and backward passes are implemented with simple Send and Recv operations. As a result, layer parallelism suffers from resource under-utilization and poor scalability, because only a single GPU can operate at a time.
 
-## Installation:
+## Pipeline Parallelism
+Pipelining divides the input batch into smaller batches called micro-batches, the number of which we call parts. The goal of pipeline parallelism is to reduce underutilization by overlapping micro-batches, which allows multiple GPUs to proceed with computation within the forward and backward passes.
+## Spatial Parallelism:
 
-### Prerequisite:
-- Python 3.8 or later (for Linux, Python 3.8.1+ is needed)
-- PyTorch :
-To enable MPI support, it is required to install PyTorch from source. 
+In spatial parallelism, the convolution layer is replicated across multiple GPUs, and the image is partitioned across the replicas. Convolution and pooling layers can be distributed across multiple GPUs to work on different regions of the image. Hence, unlike layer parallelism, this approach enables simultaneous computation on multiple GPUs while facilitating the training of out-of-core convolution layers. However, it requires extra communication to receive border pixels from neighboring partitions, known as halo exchange. Refer to [Halo exchange](benchmarks/communication/halo) for more information.
 
-```bash
-git clone https://github.com/pytorch/pytorch
-git checkout v1.12.1
-```
+## Spatial Parallelism + Layer Parallelism
+<div align="center">
+ <img src="docs/assets/images/Spatial_Parallelism.jpg" width="600px">
+ <br>
+ <figcaption>Figure 2. Combination of spatial and layer parallelism.</figcaption>
+    <br>
+</div>
 
-Add cuda-aware MPI support
-Modify pytorch/caffe2/mpi/mpi_ops_gpu.cc:
-```bash
-#define CAFFE2_HAS_CUDA_MPI_BASICS 1
-#define CAFFE2_HAS_CUDA_MPI_ALLREDUCE 1
-```
+The figure above shows a combination of spatial and layer parallelism. In this approach, the model is divided into four partitions, and spatial parallelism is used for the first partition to perform convolution operations on the input image. The second partition aggregates the output of the first and passes it on, while the later partitions use layer parallelism.
 
-Modify torch/csrc/distributed/c10d/ProcessGroupMPI.cpp
-```bash
-#if defined(MPIX_CUDA_AWARE_SUPPORT)
-  if (MPIX_Query_cuda_support() == 1) {
-    return true;
-  } else {
-    return true;
-  }
-#else // !defined(MPIX_CUDA_AWARE_SUPPORT)
-  return true;
-#endif // MPIX_CUDA_AWARE_SUPPORT
-}
-```
+Because of its higher communication overhead, spatial parallelism is better suited to large images, which makes it a poor fit for the latter half of CNNs, where the input usually consists of only a few pixels. Layer parallelism can be used to compute this latter half. Figure 2 shows a combination of spatial parallelism and layer parallelism for a CNN split into four partitions at layer granularity. Spatial parallelism is applied to the first model partition, and layer parallelism is applied to the other three model partitions.
 
-Refer following page to install PyTorch https://github.com/pytorch/pytorch
+Refer to [Spatial Parallelism](benchmarks/spatial_parallelism) for more details.
 
-- MVAPICH2
-To install MVAPICH2, follow instructions mentioned on MVAPICH download http://mvapich.cse.ohio-state.edu/downloads/#mv2.3.7
+## Installation:
 
-- NVIDIA CUDA 11.0 or above
-- NVIDIA GPU Compute Capability >= 7.0 (V100/RTX20 and higher)
-- Linux OS
+### Prerequisite:
+- Python 3.8 or later (for Linux, Python 3.8.1+ is needed).
+- MVAPICH2
+Refer to the [MVAPICH2 installation guide](docs/installation/MVAPICH_INSTALLATION_GUIDE.md) to install MVAPICH2.
+- PyTorch: 1.12.1 or 1.13.1
+Refer to the [PyTorch installation guide](/docs/installation/PYTORCH_INSTALLATION_GUIDE.md) to install PyTorch from source and configure MVAPICH2 support.
 
-[Note:
+*Note:
 We used the following versions during implementation and testing.
-Python=3.9.16, cuda=11.6, gcc=10.3.0, cmake=3.22.2, PyTorch=1.12.0a0+git35202d2, MVAPICH2-GDR=2.3.7]
+Python=3.9.16, cuda=11.6, gcc=10.3.0, cmake=3.22.2, PyTorch=1.12.1, MVAPICH2-GDR=2.3.7*
 
-### Install now-dl:
-- Load Required model:
+### Install mpi4dl
 ```bash
-cd torch-gems
+cd mpi4dl
 python setup.py install
 ```
-Example to run Amoebanet model with partition size for model as two, spatial partition as four and spatial size (i.e. number of model partition which will use spatial partition) as 1
+### Run model benchmark:
+Example: run the AmoebaNet model with a split size of two (the number of model partitions), four spatial parts, and a spatial size of 1 (i.e., the number of model partitions that use spatial parallelism):
 ```bash
-$MV2_HOME/bin/mpirun_rsh --export-all -np 5 --hostfile {$HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial/model/amoebanet_run.py --image-size 512 --num-spatial-parts 4 --slice-method "vertical" --split-size 2 --spatial-size 1
+$MV2_HOME/bin/mpirun_rsh --export-all -np 5 --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py --image-size 512 --num-spatial-parts 4 --slice-method "vertical" --split-size 2 --spatial-size 1
 ```
+
+Refer to [Spatial Parallelism](benchmarks/spatial_parallelism) and [Halo Exchange](benchmarks/communication/halo) for more spatial benchmarks.
+
 ## References:
 1. Arpan Jain, Ammar Ahmad Awan, Asmaa M. Aljuhani, Jahanzeb Maqbool Hashmi, Quentin G. Anthony, Hari Subramoni, Dhableswar K. Panda, Raghu Machiraju, and Anil Parwani. 2020. GEMS: <u>G</u>PU-<u>e</u>nabled <u>m</u>emory-aware model-parallelism <u>s</u>ystem for distributed DNN training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). IEEE Press, Article 45, 1–15.
 2. Arpan Jain, Aamir Shafi, Quentin Anthony, Pouya Kousha, Hari Subramoni, and Dhableswar K. Panda. 2022. Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters. In High Performance Computing: 37th International Conference, ISC High Performance 2022, Hamburg, Germany, May 29 – June 2, 2022, Proceedings. Springer-Verlag, Berlin, Heidelberg, 109–130. https://doi.org/10.1007/978-3-031-07312-0_6
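Review note on the run command in the README above: it launches 5 ranks for `--num-spatial-parts 4 --split-size 2 --spatial-size 1`. One reading that is consistent with this example, but is an assumption rather than something the README states, is that each spatially parallel model partition occupies `num_spatial_parts` ranks and each remaining model partition occupies one rank. A minimal sketch of that arithmetic follows; the helper name `required_ranks` is hypothetical and not part of mpi4dl.

```python
def required_ranks(split_size, num_spatial_parts, spatial_size=1):
    """Estimate the MPI rank count for a spatial + layer parallel run.

    Assumption (inferred from the README example, not confirmed by it):
    every model partition that uses spatial parallelism needs
    `num_spatial_parts` ranks, and every other partition needs one rank.
    """
    if split_size < 1 or num_spatial_parts < 1:
        raise ValueError("split_size and num_spatial_parts must be positive")
    if not 0 <= spatial_size <= split_size:
        raise ValueError("spatial_size must be between 0 and split_size")

    spatial_ranks = num_spatial_parts * spatial_size  # ranks sharing image regions
    layer_ranks = split_size - spatial_size           # one rank per remaining partition
    return spatial_ranks + layer_ranks


if __name__ == "__main__":
    # Settings from the README example, which is launched with -np 5.
    print(required_ranks(split_size=2, num_spatial_parts=4, spatial_size=1))
```

If this reading holds, a helper like this could guard against launching with a mismatched `-np` value. It should be checked against the benchmark scripts before anyone relies on it.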
diff --git a/benchmarks/communication/halo/README.md b/benchmarks/communication/halo/README.md
new file mode 100644
index 00000000..5b1377c5
--- /dev/null
+++ b/benchmarks/communication/halo/README.md
@@ -0,0 +1,66 @@
+# Halo exchange:
+In spatial parallelism, convolution and pooling layers can be distributed across multiple GPUs to work on different regions of the image. Spatial parallelism therefore requires a halo exchange (shown in Figure 1) at every convolution and pooling layer to compute the results for pixels on the boundaries of the image parts. The halo exchange can also be performed in parallel with convolution operations on the input pixels that are already available.
+<div align="center">
+ <img src="../../../docs/assets/images/Halo_Exchange.jpg" width="600px">
+ <br>
+  <figcaption>Figure 1. Halo exchange in spatial parallelism. The input image is partitioned into four regions, and each region is assigned to a different process. Computing the convolution at location X requires the values of nearby pixels, some of which belong to other regions.
+  </figcaption>
+    <br>
+</div>
+
+
+## Halo-exchange benchmarks:
+- *benchmark_sp_halo_exchange.py* and *benchmark_sp_halo_exchange_with_compute.py* are used to test the proper functioning of send and receive operations for halo regions.
+- *benchmark_sp_halo_exchange_with_compute_val.py* is utilized to validate the received inputs, in addition to testing the halo region send and receive operations.
+### Run halo-exchange benchmark
+
+#### Generic command:
+```bash
+
+$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python ${halo_benchmark} --image-size ${image_size} --batch-size ${batch_size} --num-spatial-parts ${num_spatial_parts} --slice-method ${partition}
+
+```
+#### Example:
+Example to run the halo exchange benchmark for four vertical partitions of a 1024 × 1024 image with a halo length of 3 and a batch size of 1:
+```bash
+$MV2_HOME/bin/mpirun_rsh --export-all -np 4 --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/communication/halo/benchmark_sp_halo_exchange.py --image-size 1024 --batch-size 1 --halo-len 3 --num-spatial-parts 4 --slice-method "vertical"
+```
+
+Expected output:
+
+Each rank prints its average time per iteration, then its validation result. Rank order and timings vary between runs.
+
+```
+Rank: 0 Time taken (ms): 0.3337113571166992
+Validation passed for rank: 0
+Rank: 3 Time taken (ms): 0.3339980697631836
+Validation passed for rank: 3
+Rank: 2 Time taken (ms): 0.33376255035400393
+Validation passed for rank: 2
+Rank: 1 Time taken (ms): 0.33356800079345705
+Validation passed for rank: 1
+```
+
+Halo exchange benchmarks can also be configured for different num-spatial-parts, slice-method, etc. Find all available options below:
+<pre>
+usage: benchmark_sp_halo_exchange.py [-h] [--fp16-allreduce] [--image-size IMAGE_SIZE] [--batch-size BATCH_SIZE] [--halo-len HALO_LEN] [--in-channels IN_CHANNELS]
+                                     [--warmup WARMUP] [--iterations ITERATIONS] [--out-channels OUT_CHANNELS]
+
+Halo exchange benchmark
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --fp16-allreduce      use fp16 compression during allreduce (default: False)
+  --image-size IMAGE_SIZE
+                        Full image size (default: 8)
+  --batch-size BATCH_SIZE
+                        input batch size (default: 1)
+  --halo-len HALO_LEN   halo length (default: 1)
+  --in-channels IN_CHANNELS
+                        Number of channels in the input (default: 1)
+  --warmup WARMUP       warmups (default: 10)
+  --iterations ITERATIONS
+                        Iterations (default: 100)
+  --out-channels OUT_CHANNELS
+                        number of output channels (default: 256)
+</pre>
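Review note on the halo README above: the exchange it describes can be illustrated without MPI or GPUs. The sketch below is a single-process model, not the benchmark's implementation. It assumes that the "vertical" slice method splits the image into column strips (an assumption about the naming) and that outer image borders are zero-padded. The function names `split_columns` and `attach_halos` are hypothetical.

```python
def split_columns(image, num_parts):
    """Split a 2-D image (list of rows) into `num_parts` equal column strips."""
    width = len(image[0])
    if width % num_parts != 0:
        raise ValueError("image width must be divisible by num_parts")
    strip_w = width // num_parts
    return [
        [row[p * strip_w:(p + 1) * strip_w] for row in image]
        for p in range(num_parts)
    ]


def attach_halos(strips, halo_len):
    """Give every strip `halo_len` border columns from each neighbour.

    Strips on the outer edge of the image receive zeros instead, which
    mimics zero padding of the full image (an assumption of this sketch).
    """
    height = len(strips[0])
    if halo_len < 1 or halo_len > len(strips[0][0]):
        raise ValueError("halo_len must be between 1 and the strip width")
    zeros = [[0] * halo_len for _ in range(height)]

    result = []
    for rank, strip in enumerate(strips):
        # "Receive" the rightmost columns of the left neighbour.
        left = [row[-halo_len:] for row in strips[rank - 1]] if rank > 0 else zeros
        # "Receive" the leftmost columns of the right neighbour.
        right = (
            [row[:halo_len] for row in strips[rank + 1]]
            if rank < len(strips) - 1
            else zeros
        )
        result.append([l + row + r for l, row, r in zip(left, strip, right)])
    return result


if __name__ == "__main__":
    # 4 x 8 image with distinct values so that halo columns are easy to spot.
    image = [[r * 8 + c + 1 for c in range(8)] for r in range(4)]
    for rank, part in enumerate(attach_halos(split_columns(image, 4), halo_len=1)):
        print(rank, part[0])
```

In the real benchmarks each strip lives on a different rank and the neighbour columns arrive through send and receive operations. The list slicing here only shows which pixels each rank needs.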
diff --git a/benchmarks/spatial/halo/spatial_halo_exchange_bench.py b/benchmarks/communication/halo/benchmark_sp_halo_exchange.py
similarity index 95%
rename from benchmarks/spatial/halo/spatial_halo_exchange_bench.py
rename to benchmarks/communication/halo/benchmark_sp_halo_exchange.py
index 5d8a8553..b533bb4e 100644
--- a/benchmarks/spatial/halo/spatial_halo_exchange_bench.py
+++ b/benchmarks/communication/halo/benchmark_sp_halo_exchange.py
@@ -277,7 +277,6 @@ def start_halo_exchange(self, halo_input):
         req = []
         for i in range(9):
             if self.neighbours[i] == 1:
-                # print("Local rank:",self.local_rank, " to:",self.local_rank + self.rank_neighbours[i], " I:",i)
                 temp = (
                     halo_input[
                         :,
@@ -394,7 +393,6 @@ def init_comm(backend="mpi"):
     dist.init_process_group(backend)
     size = dist.get_world_size()
     rank = dist.get_rank()
-    print("rank :", rank, "size: ", size)
     return size, rank
 
 
@@ -553,15 +551,13 @@ def test_output(output, expected_output, rank):
     np_out = output.to("cpu").numpy()
 
     if np.equal(np_out.astype("int"), expected_output.astype("int")).all():
-        print("Validation passed Rank:" + str(rank))
+        print(f"Validation passed for rank: {rank}")
     else:
         uneq = np.not_equal(np_out.astype("int"), expected_output.astype("int"))
         print(
-            "Rank:" + str(rank),
-            np_out.astype("int")[uneq],
-            expected_output.astype("int")[uneq],
+            f"Rank : {rank} => Received : {np_out[uneq]} Expected : {expected_output[uneq]}"
         )
-        print("Validation failed Rank:" + str(rank))
+        print(f"Validation failed for rank: {rank}")
 
 
 def run_benchmark(rank, size, hostname):
@@ -596,7 +592,7 @@ def run_benchmark(rank, size, hostname):
 
     t = start_event.elapsed_time(end_event)
 
-    print("Rank:" + str(rank) + " Time taken (ms):" + str(t / iterations))
+    print(f"Rank: {rank} Time taken (ms): {(t / iterations)}")
 
     test_output(y, expected_output, rank)
 
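Review note on the new print format in this file: the f-strings now emit `Rank: <n> Time taken (ms): <value>` and `Validation passed for rank: <n>` (or `failed`). If the output is post-processed, a small parser along the following lines could collect per-rank timings and failed ranks. This is a sketch under that assumption. The function name `summarize` and the returned dictionary layout are hypothetical and not part of the repository.

```python
import re

# Patterns follow the f-strings introduced in this diff.
TIME_RE = re.compile(r"^Rank: (\d+) Time taken \(ms\): ([-+0-9.eE]+)$")
RESULT_RE = re.compile(r"Validation (passed|failed) for rank: (\d+)$")


def summarize(log_text):
    """Collect per-rank average times and the ranks that failed validation."""
    times = {}
    failed = set()
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        time_match = TIME_RE.match(line)
        if time_match:
            times[int(time_match.group(1))] = float(time_match.group(2))
            continue
        # search() also accepts a prefix such as "CONV : " before the message.
        result_match = RESULT_RE.search(line)
        if result_match and result_match.group(1) == "failed":
            failed.add(int(result_match.group(2)))
    return {
        "times": times,
        "slowest_ms": max(times.values()) if times else None,
        "failed_ranks": sorted(failed),
    }


if __name__ == "__main__":
    sample = (
        "Rank: 0 Time taken (ms): 0.5\n"
        "Validation passed for rank: 0\n"
        "Rank: 1 Time taken (ms): 0.75\n"
        "Validation failed for rank: 1\n"
    )
    print(summarize(sample))
```

The slowest rank is reported because the ranks synchronize during the exchange, so the maximum is often the more useful figure. Whether that is the right aggregate for these benchmarks is a judgment call.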
diff --git a/benchmarks/spatial/halo/bench_spatial.py b/benchmarks/communication/halo/benchmark_sp_halo_exchange_conv.py
similarity index 89%
rename from benchmarks/spatial/halo/bench_spatial.py
rename to benchmarks/communication/halo/benchmark_sp_halo_exchange_conv.py
index 834f994e..09602863 100644
--- a/benchmarks/spatial/halo/bench_spatial.py
+++ b/benchmarks/communication/halo/benchmark_sp_halo_exchange_conv.py
@@ -448,7 +448,6 @@ def start_halo_exchange(self, halo_input):
         req = []
         for i in range(9):
             if self.neighbours[i] == 1:
-                # print("Local rank:",self.local_rank, " to:",self.local_rank + self.rank_neighbours[i], " I:",i)
                 temp = (
                     halo_input[
                         :,
@@ -701,7 +700,7 @@ def create_input_square(kernel_size, halo_len, image_size, comm_size, rank):
 
     expected_output = np.pad(np_x, pad_width=pad_width, mode="constant")
 
-    print("Overall Expected output shape", expected_output.shape)
+    print(f"Overall Expected output shape {expected_output.shape}")
 
     expected_out_width = image_width_local + 2 * halo_len_width
     expected_out_height = image_height_local + 2 * halo_len_height
@@ -758,15 +757,6 @@ def test_output_square(image_size, output, expected_output, rank, size, mode="CO
     e_top_idx = row * expected_out_height
     e_bottom_idx = (row + 1) * expected_out_height
 
-    # if(rank==0):
-    # 	expected_output = expected_output[:,:,:expected_out_width, :expected_out_height]
-    # elif(rank==1):
-    # 	expected_output = expected_output[:,:,:expected_out_width, -expected_out_height:]
-    # elif(rank==2):
-    # 	expected_output = expected_output[:,:,-expected_out_width:, :expected_out_height]
-    # elif(rank==3):
-    # 	expected_output = expected_output[:,:,-expected_out_width:, -expected_out_height:]
-
     expected_output = expected_output[
         :, :, e_top_idx:e_bottom_idx, e_left_idx:e_right_idx
     ]
@@ -775,19 +765,9 @@ def test_output_square(image_size, output, expected_output, rank, size, mode="CO
     output = output.detach().cpu().numpy()
 
     if np.equal(output, expected_output).all():
-        print(mode + " Validation passed Rank:" + str(rank))
+        print(f"{mode} : Validation passed for rank: {rank}")
     else:
-        if rank == 0:
-            # debug statements
-            # uneq = np.not_equal(output,expected_output)
-            # print("Rank:"+str(rank), output[uneq].size , expected_output[uneq].size)
-            None
-
-        print(
-            mode
-            + " Validation failed Rank.............................................................................:"
-            + str(rank)
-        )
+        print(f"{mode} : Validation failed for rank: {rank}")
 
 
 def test_output_vertical(image_size, output, expected_output, rank, size, mode="CONV"):
@@ -811,19 +791,9 @@ def test_output_vertical(image_size, output, expected_output, rank, size, mode="
     output = output.detach().cpu().numpy()
 
     if np.equal(output, expected_output).all():
-        print(mode + " Validation passed Rank:" + str(rank))
+        print(f"{mode} : Validation passed for rank: {rank}")
     else:
-        if rank == 0:
-            # debug statements
-            # uneq = np.not_equal(output,expected_output)
-            # print("Rank:"+str(rank), output[uneq].size , expected_output[uneq].size)
-            None
-
-        print(
-            mode
-            + " Validation failed Rank.............................................................................:"
-            + str(rank)
-        )
+        print(f"{mode} : Validation failed for rank: {rank}")
 
 
 def test_output_horizontal(
@@ -847,39 +817,21 @@ def test_output_horizontal(
     output = output.detach().cpu().numpy()
 
     if np.equal(output, expected_output).all():
-        print(mode + " Validation passed Rank:" + str(rank))
+        print(f"{mode} : Validation passed for rank: {rank}")
     else:
-        if rank == 0:
-            # debug statements
-            # uneq = np.not_equal(output,expected_output)
-            # print("Rank:"+str(rank), output[uneq].size , expected_output[uneq].size)
-            None
-
-        print(
-            mode
-            + " Validation failed Rank.............................................................................:"
-            + str(rank)
-        )
+        print(f"{mode} : Validation failed for rank: {rank}")
 
 
 def test_output_recv(output, expected_output, rank):
-    # only padding ==  halo_len case is supported
-
-    # np_out = output.data.cpu().numpy()
     np_out = output.to("cpu").numpy()
-
-    # time.sleep(rank*10)
-
     if np.equal(np_out, expected_output).all():
-        print("Recv Tensor Validation passed Rank:" + str(rank))
+        print(f"Validation passed for rank: {rank}")
     else:
         uneq = np.not_equal(np_out.astype("int"), expected_output.astype("int"))
         print(
-            "Recv Tensor Rank:" + str(rank),
-            np_out.astype("int")[uneq],
-            expected_output.astype("int")[uneq],
+            f"Rank : {rank} => Received : {np_out[uneq]} Expected : {expected_output[uneq]}"
         )
-        print("Recv Tensor Validation failed Rank:" + str(rank))
+        print(f"Validation failed for rank: {rank}")
 
 
 halo_len = args.halo_len
@@ -927,12 +879,9 @@ def test_output_recv(output, expected_output, rank):
     )
 
 print(
-    "Size of input:{} Size of Output:{}".format(
-        input_tensor_local.shape, expected_output_recv.shape
-    )
+    f"Size of input:{input_tensor_local.shape} Size of Output:{expected_output_recv.shape}"
 )
 
-
 b_pt2pt = halo_bench_pt2pt(
     local_rank=rank,
     comm_size=size,
@@ -979,10 +928,9 @@ def test_output_recv(output, expected_output, rank):
     torch.cuda.synchronize()
     t = start_event.elapsed_time(end_event)
 else:
-    # time module gives time in secs
     t = (time.time() - start_time) * 1000
 
-print("Rank:" + str(rank) + " Time taken (ms):" + str(t / iterations))
+print(f"Rank: {rank} Time taken (ms): {(t / iterations)}")
 
 if ENABLE_VAL_RECV_TENSORS:
     test_output_recv(recv, expected_output_recv, rank)
@@ -1047,7 +995,7 @@ def test_output_recv(output, expected_output, rank):
 else:
     t = (time.time() - start_time) * 1000
 
-print("Rank:" + str(rank) + " Time taken Seq (ms):" + str(t / iterations))
+print(f"Rank: {rank} Time taken Seq (ms): {(t / iterations)}")
 
 if ENABLE_VAL_CONV:
     if args.slice_method == "vertical":
@@ -1124,9 +1072,3 @@ def test_output_recv(output, expected_output, rank):
         test_output_square(
             image_size, output, expected_output, rank, size, mode="SMALL CONV"
         )
-
-# time.sleep(rank*10)
-
-
-# if(rank==0):
-# 	print(np_x)
diff --git a/benchmarks/spatial/halo/spatial_halo_exchange_with_compute_bench.py b/benchmarks/communication/halo/benchmark_sp_halo_exchange_with_compute.py
similarity index 93%
rename from benchmarks/spatial/halo/spatial_halo_exchange_with_compute_bench.py
rename to benchmarks/communication/halo/benchmark_sp_halo_exchange_with_compute.py
index e65a04b1..5b950b72 100644
--- a/benchmarks/spatial/halo/spatial_halo_exchange_with_compute_bench.py
+++ b/benchmarks/communication/halo/benchmark_sp_halo_exchange_with_compute.py
@@ -377,7 +377,6 @@ def run(self, tensor):
         res_final = super(halo_bench_pt2pt, self).forward(tensor)
 
         return res_final
-        print("Rank:", self.local_rank, "\n", tensor)
 
 
 def env2int(env_list, default=-1):
@@ -403,7 +402,6 @@ def init_comm(backend="mpi"):
     dist.init_process_group(backend)
     size = dist.get_world_size()
     rank = dist.get_rank()
-    print("rank :", rank, "size: ", size)
     return size, rank
 
 
@@ -562,17 +560,10 @@ def create_input(halo_len, image_size, comm_size, rank, slice_method):
 def test_output(output, expected_output, rank):
     # only padding ==  halo_len case is supported
     np_out = output.data.cpu().numpy()
-
-    # time.sleep(rank*10)
-
     if np.equal(np_out.astype("int"), expected_output.astype("int")).all():
-        print("Validation passed Rank:" + str(rank))
+        print(f"Validation passed for rank: {rank}")
     else:
-        # uneq = np.not_equal(np_out.astype('int'),expected_output.astype('int'))
-        # print("Rank:"+str(rank), np_out.astype('int')[uneq],  expected_output.astype('int')[uneq])
-        print("Validation failed Rank:" + str(rank))
-    # print(np.equal(np_out.astype('int'),expected_output.astype('int')))
-    # print("Rank:",rank,"\n", np_out.astype('int'),"\n",expected_output.astype('int'))
+        print(f"Validation failed for rank: {rank}")
 
 
 def run_benchmark(rank, size, hostname):
@@ -611,9 +602,7 @@ def run_benchmark(rank, size, hostname):
 
     t = start_event.elapsed_time(end_event)
 
-    print("Rank:" + str(rank) + " Time taken (ms):" + str(t / iterations))
-
-    # test_output(y, expected_output, rank)
+    print(f"Rank: {rank} Time taken (ms): {(t / iterations)}")
 
     """
 	Sequential processing 
@@ -656,7 +645,7 @@ def run_benchmark(rank, size, hostname):
 
     t = start_event_seq.elapsed_time(end_event_seq)
 
-    print("Rank:" + str(rank) + " Time taken Seq (ms):" + str(t / iterations))
+    print(f"Rank: {rank} Time taken Seq (ms): {(t / iterations)}")
 
 
 def init_processes(hostname, fn, backend="mpi"):
diff --git a/benchmarks/spatial/halo/spatial_halo_exchange_with_compute_val_bench.py b/benchmarks/communication/halo/benchmark_sp_halo_exchange_with_compute_val.py
similarity index 90%
rename from benchmarks/spatial/halo/spatial_halo_exchange_with_compute_val_bench.py
rename to benchmarks/communication/halo/benchmark_sp_halo_exchange_with_compute_val.py
index 7c829b0b..6afacac1 100644
--- a/benchmarks/spatial/halo/spatial_halo_exchange_with_compute_val_bench.py
+++ b/benchmarks/communication/halo/benchmark_sp_halo_exchange_with_compute_val.py
@@ -290,7 +290,6 @@ def start_halo_exchange(self, halo_input):
         req = []
         for i in range(9):
             if self.neighbours[i] == 1:
-                # print("Local rank:",self.local_rank, " to:",self.local_rank + self.rank_neighbours[i], " I:",i)
                 temp = (
                     halo_input[
                         :,
@@ -373,7 +372,6 @@ def run(self, tensor):
         res_final = super(halo_bench_pt2pt, self).forward(tensor)
 
         return tensor, res_final
-        print("Rank:", self.local_rank, "\n", tensor)
 
 
 def env2int(env_list, default=-1):
@@ -580,18 +578,9 @@ def test_output_square(image_size, output, expected_output, rank, size):
     output = output.detach().cpu().numpy()
 
     if np.equal(output, expected_output).all():
-        print(" Validation passed Rank:" + str(rank))
+        print(f"Validation passed for rank: {rank}")
     else:
-        if rank == 0:
-            # debug statements
-            # uneq = np.not_equal(output,expected_output)
-            # print("Rank:"+str(rank), output[uneq].size , expected_output[uneq].size)
-            None
-
-        print(
-            " Validation failed Rank.............................................................................:"
-            + str(rank)
-        )
+        print(f"Validation failed for rank: {rank}")
 
 
 def test_output_vertical(image_size, output, expected_output, rank, size):
@@ -615,18 +604,9 @@ def test_output_vertical(image_size, output, expected_output, rank, size):
     output = output.detach().cpu().numpy()
 
     if np.equal(output, expected_output).all():
-        print(" Validation passed Rank:" + str(rank))
+        print(f"Validation passed for rank: {rank}")
     else:
-        if rank == 0:
-            # debug statements
-            # uneq = np.not_equal(output,expected_output)
-            # print("Rank:"+str(rank), output[uneq].size , expected_output[uneq].size)
-            None
-
-        print(
-            " Validation failed Rank.............................................................................:"
-            + str(rank)
-        )
+        print(f"Validation failed for rank: {rank}")
 
 
 def test_output_horizontal(image_size, output, expected_output, rank, size):
@@ -648,18 +628,9 @@ def test_output_horizontal(image_size, output, expected_output, rank, size):
     output = output.detach().cpu().numpy()
 
     if np.equal(output, expected_output).all():
-        print(" Validation passed Rank:" + str(rank))
+        print(f"Validation passed for rank: {rank}")
     else:
-        if rank == 0:
-            # debug statements
-            # uneq = np.not_equal(output,expected_output)
-            # print("Rank:"+str(rank), output[uneq].size , expected_output[uneq].size)
-            None
-
-        print(
-            " Validation failed Rank.............................................................................:"
-            + str(rank)
-        )
+        print(f"Validation failed for rank: {rank}")
 
 
 def test_output(image_size, output, expected_output, rank, size):
@@ -677,15 +648,13 @@ def test_output_recv(output, expected_output, rank):
     np_out = output.to("cpu").numpy()
 
     if np.equal(np_out, expected_output).all():
-        print("Recv Validation passed Rank:" + str(rank))
+        print(f"Validation passed for rank: {rank}")
     else:
         uneq = np.not_equal(np_out.astype("int"), expected_output.astype("int"))
         print(
-            "Recv Rank:" + str(rank),
-            np_out.astype("int")[uneq],
-            expected_output.astype("int")[uneq],
+            f"Rank : {rank} => Received : {np_out[uneq]} Expected : {expected_output[uneq]}"
         )
-        print("Recv Validation failed Rank:" + str(rank))
+        print(f"Validation failed for rank: {rank}")
 
 
 halo_len = args.halo_len
@@ -735,7 +704,7 @@ def test_output_recv(output, expected_output, rank):
 
 t = start_event.elapsed_time(end_event)
 
-print("Rank:" + str(rank) + " Time taken (ms):" + str(t / iterations))
+print(f"Rank: {rank} Time taken (ms): {(t / iterations)}")
 
 test_output_recv(recv, expected_output_recv, rank)
 
@@ -786,7 +755,6 @@ def test_output_recv(output, expected_output, rank):
 
 t = start_event_seq.elapsed_time(end_event_seq)
 
-print("Rank:" + str(rank) + " Time taken Seq (ms):" + str(t / iterations))
-
+print(f"Rank: {rank} Time taken Seq (ms): {(t / iterations)}")
 
 test_output(image_size, output, expected_output, rank, size)
diff --git a/benchmarks/layer_parallelism/README.md b/benchmarks/layer_parallelism/README.md
new file mode 100644
index 00000000..895e9265
--- /dev/null
+++ b/benchmarks/layer_parallelism/README.md
@@ -0,0 +1,53 @@
+# Layer Parallelism Benchmarks
+
+## Run Layer parallelism:
+
+#### Generic command:
+```bash
+
+$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python ${lp_model_script} --image-size ${image_size} --batch-size ${batch_size} --split-size ${split_size} --parts ${parts}
+
+```
+#### Examples
+
+- With 4 GPUs [split size: 4]
+
+Example of running the AmoebaNet LP model with a split size of 4 (i.e., the number of partitions for LP) and an image size of 1024 * 1024:
+
+```bash
+$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${hostfile} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/layer_parallelism/benchmark_amoebanet_lp.py --batch-size 1 --image-size 1024 --split-size 4 
+```
+
+Similarly, the benchmark can be run for the ResNet LP model:
+
+```bash
+$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${hostfile} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/layer_parallelism/benchmark_resnet_lp.py --batch-size 1 --image-size 1024 --split-size 4 
+``` 
+
+Below are the available configuration options:
+
+<pre>
+optional arguments:
+  -h, --help            show this help message and exit
+  -v, --verbose         Prints performance numbers or logs (default: False)
+  --batch-size BATCH_SIZE
+                        input batch size (default: 32)
+  --parts PARTS         Number of parts for MP (default: 1)
+  --split-size SPLIT_SIZE
+                        Number of processes for MP (default: 2)
+  --image-size IMAGE_SIZE
+                        Image size for synthetic benchmark (default: 32)
+  --num-epochs NUM_EPOCHS
+                        Number of epochs (default: 1)
+  --num-layers NUM_LAYERS
+                        Number of layers in amoebanet (default: 18)
+  --num-filters NUM_FILTERS
+                        Number of filters in amoebanet (default: 416)
+  --balance BALANCE     length of list equals to number of partitions and sum should be equal to num layers (default: None)
+  --fused-layers FUSED_LAYERS
+                        When D2 design is enabled for halo exchange, number of blocks to fuse in ResNet model (default: 1)
+  --local-DP LOCAL_DP   LBANN integration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two GPUs for a given split (uses DP)
+                        (default: 1)
+  --app APP             Application type (1.medical, 2.cifar, and synthetic) in Spatial parallelism (default: 3)
+  --datapath DATAPATH   local Dataset path (default: ./train)
+  </pre>
diff --git a/benchmarks/layer_parallelism/benchmark_amoebanet_lp.py b/benchmarks/layer_parallelism/benchmark_amoebanet_lp.py
new file mode 100644
index 00000000..1c3b504a
--- /dev/null
+++ b/benchmarks/layer_parallelism/benchmark_amoebanet_lp.py
@@ -0,0 +1,226 @@
+import torch
+import torchvision.transforms as transforms
+import torchvision
+import numpy as np
+import sys
+import math
+import logging
+from torchgems import parser
+import time
+from torchgems.mp_pipeline import model_generator, train_model
+from models import amoebanet
+import torchgems.comm as gems_comm
+
+
+parser_obj = parser.get_parser()
+args = parser_obj.parse_args()
+
+if args.verbose:
+    logging.basicConfig(level=logging.DEBUG)
+
+gems_comm.initialize_cuda()
+
+
+class Unbuffered(object):
+    def __init__(self, stream):
+        self.stream = stream
+
+    def write(self, data):
+        self.stream.write(data)
+        self.stream.flush()
+
+    def writelines(self, datas):
+        self.stream.writelines(datas)
+        self.stream.flush()
+
+    def __getattr__(self, attr):
+        return getattr(self.stream, attr)
+
+
+sys.stdout = Unbuffered(sys.stdout)
+
+np.random.seed(seed=1405)
+ENABLE_ASYNC = True
+ENABLE_APP = False
+
+parts = args.parts
+batch_size = args.batch_size
+epoch = args.num_epochs
+image_size = int(args.image_size)
+num_layers = args.num_layers
+num_filters = args.num_filters
+balance = args.balance
+mp_size = args.split_size
+times = args.times
+datapath = args.datapath
+steps = 100
+
+##################### AmoebaNet model specific parameters #####################
+
+image_size_seq = 512
+num_classes = 1000
+
+###############################################################################
+
+mpi_comm = gems_comm.MPIComm(split_size=mp_size, ENABLE_MASTER=False)
+rank = mpi_comm.rank
+local_rank = rank % mp_size
+if balance is not None:
+    balance = [int(i) for i in balance.split(",")]
+
+# Initialize AmoebaNet model
+model = amoebanet.amoebanetd(
+    num_classes=num_classes, num_layers=args.num_layers, num_filters=args.num_filters
+)
+
+# Initialize parameters for Model Parallelism
+model_gen = model_generator(
+    model=model,
+    split_size=mp_size,
+    input_size=(int(batch_size / parts), 3, image_size_seq, image_size_seq),
+    balance=balance,
+)
+
+# Get the shape of model on each split rank for image_size_seq and move it to device
+# Note : we take shape w.r.t image_size_seq as model w.r.t image_size may not be
+# able to fit in memory
+
+model_gen.ready_model(split_rank=local_rank, GET_SHAPES_ON_CUDA=True)
+
+# Get the shape of model on each split rank for image_size
+image_size_times = int(image_size / image_size_seq)
+amoebanet_shapes_list = []
+for output_shape in model_gen.shape_list:
+    if isinstance(output_shape, list):
+        temp_shape = []
+        for shape_tuple in output_shape:
+            x = (
+                shape_tuple[0],
+                shape_tuple[1],
+                int(shape_tuple[2] * image_size_times),
+                int(shape_tuple[3] * image_size_times),
+            )
+            temp_shape.append(x)
+        amoebanet_shapes_list.append(temp_shape)
+    else:
+        if len(output_shape) == 2:
+            amoebanet_shapes_list.append(output_shape)
+        else:
+            x = (
+                output_shape[0],
+                output_shape[1],
+                int(output_shape[2] * image_size_times),
+                int(output_shape[3] * image_size_times),
+            )
+            amoebanet_shapes_list.append(x)
+
+model_gen.shape_list = amoebanet_shapes_list
+
+logging.info(f"Shape of model on local_rank {local_rank} : {model_gen.shape_list}")
+
+del model_gen
+del model
+torch.cuda.ipc_collect()
+
+model = amoebanet.amoebanetd(
+    num_classes=num_classes, num_layers=args.num_layers, num_filters=args.num_filters
+)
+
+model_gen = model_generator(
+    model=model,
+    split_size=mp_size,
+    input_size=(int(batch_size / parts), 3, image_size, image_size),
+    balance=balance,
+    shape_list=amoebanet_shapes_list,
+)
+
+# Move the model to its respective devices
+model_gen.ready_model(split_rank=local_rank, GET_SHAPES_ON_CUDA=True)
+
+tm = train_model(
+    model_gen,
+    local_rank,
+    batch_size,
+    epoch,
+    criterion=None,
+    optimizer=None,
+    parts=parts,
+    ASYNC=ENABLE_ASYNC,
+)
+
+############################## Dataset Definition ##############################
+
+transform = transforms.Compose(
+    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
+)
+torch.manual_seed(0)
+if ENABLE_APP:
+    trainset = torchvision.datasets.ImageFolder(
+        datapath, transform=transform, target_transform=None
+    )
+    my_dataloader = torch.utils.data.DataLoader(
+        trainset, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True
+    )
+else:
+    my_dataset = torchvision.datasets.FakeData(
+        size=10 * batch_size,
+        image_size=(3, image_size, image_size),
+        num_classes=num_classes,
+        transform=transform,
+        target_transform=None,
+        random_offset=0,
+    )
+    my_dataloader = torch.utils.data.DataLoader(
+        my_dataset,
+        batch_size=batch_size * times,
+        shuffle=False,
+        num_workers=0,
+        pin_memory=True,
+    )
+    size_dataset = 10 * batch_size
+
+################################################################################
+
+
+################################# Train Model ##################################
+
+perf = []
+
+
+def run_epoch():
+    for i_e in range(epoch):
+        loss = 0
+        t = time.time()
+        for i, data in enumerate(my_dataloader, 0):
+            start_event = torch.cuda.Event(enable_timing=True, blocking=True)
+            end_event = torch.cuda.Event(enable_timing=True, blocking=True)
+            start_event.record()
+
+            if i > math.floor(size_dataset / (times * batch_size)) - 1:
+                break
+            inputs, labels = data
+
+            temp_loss = tm.run_step(inputs, labels)
+            loss += temp_loss
+            tm.update()
+
+            end_event.record()
+            torch.cuda.synchronize()
+            t = start_event.elapsed_time(end_event) / 1000
+
+            if local_rank == mp_size - 1:
+                logging.info(f"Step :{i}, LOSS: {temp_loss}, Global loss: {loss/(i+1)}")
+
+            if local_rank == 0:
+                print(f"Epoch: {i_e} images per sec:{batch_size / t}")
+                perf.append(batch_size / t)
+
+            t = time.time()
+
+
+run_epoch()
+
+if local_rank == 0:
+    print(f"Mean {sum(perf) / len(perf)} Median {np.median(perf)}")
+
+################################################################################
diff --git a/benchmarks/layer_parallelism/benchmark_resnet_lp.py b/benchmarks/layer_parallelism/benchmark_resnet_lp.py
new file mode 100644
index 00000000..72697357
--- /dev/null
+++ b/benchmarks/layer_parallelism/benchmark_resnet_lp.py
@@ -0,0 +1,238 @@
+import torch
+import torchvision.transforms as transforms
+import torchvision
+import numpy as np
+import sys
+import math
+import logging
+from torchgems import parser
+import time
+from torchgems.mp_pipeline import model_generator, train_model
+from models import resnet_cifar_torch
+import torchgems.comm as gems_comm
+
+
+parser_obj = parser.get_parser()
+args = parser_obj.parse_args()
+
+if args.verbose:
+    logging.basicConfig(level=logging.DEBUG)
+
+gems_comm.initialize_cuda()
+
+
+class Unbuffered(object):
+    def __init__(self, stream):
+        self.stream = stream
+
+    def write(self, data):
+        self.stream.write(data)
+        self.stream.flush()
+
+    def writelines(self, datas):
+        self.stream.writelines(datas)
+        self.stream.flush()
+
+    def __getattr__(self, attr):
+        return getattr(self.stream, attr)
+
+
+sys.stdout = Unbuffered(sys.stdout)
+
+np.random.seed(seed=1405)
+ENABLE_ASYNC = True
+ENABLE_APP = False
+parts = args.parts
+batch_size = args.batch_size
+epoch = args.num_epochs
+image_size = int(args.image_size)
+balance = args.balance
+mp_size = args.split_size
+times = args.times
+datapath = args.datapath
+steps = 100
+
+################## ResNet model specific parameters/functions ##################
+
+image_size_seq = 32
+num_classes = 10
+resnet_n = 12
+
+
+def get_depth(version, n):
+    if version == 1:
+        return n * 6 + 2
+    elif version == 2:
+        return n * 9 + 2
+
+
+###############################################################################
+
+mpi_comm = gems_comm.MPIComm(split_size=mp_size, ENABLE_MASTER=False)
+rank = mpi_comm.rank
+
+local_rank = rank % mp_size
+
+if balance is not None:
+    balance = [int(i) for i in balance.split(",")]
+
+# Initialize ResNet model
+model = resnet_cifar_torch.get_resnet_v2(
+    (int(batch_size / parts), 3, image_size_seq, image_size_seq),
+    depth=get_depth(2, resnet_n),
+)
+
+mul_shape = int(image_size / image_size_seq)
+
+# Initialize parameters for Model Parallelism
+model_gen = model_generator(
+    model=model,
+    split_size=mp_size,
+    input_size=(int(batch_size / parts), 3, image_size_seq, image_size_seq),
+    balance=balance,
+)
+
+# Get the shape of model on each split rank for image_size_seq and move it to device
+# Note : we take shape w.r.t image_size_seq as model w.r.t image_size may not be
+# able to fit in memory
+model_gen.ready_model(split_rank=local_rank, GET_SHAPES_ON_CUDA=True)
+
+# Get the shape of model on each split rank for image_size
+image_size_times = int(image_size / image_size_seq)
+resnet_shapes_list = []
+for output_shape in model_gen.shape_list:
+    if isinstance(output_shape, list):
+        temp_shape = []
+        for shape_tuple in output_shape:
+            x = (
+                shape_tuple[0],
+                shape_tuple[1],
+                int(shape_tuple[2] * image_size_times),
+                int(shape_tuple[3] * image_size_times),
+            )
+            temp_shape.append(x)
+        resnet_shapes_list.append(temp_shape)
+    else:
+        if len(output_shape) == 2:
+            resnet_shapes_list.append(output_shape)
+        else:
+            x = (
+                output_shape[0],
+                output_shape[1],
+                int(output_shape[2] * image_size_times),
+                int(output_shape[3] * image_size_times),
+            )
+            resnet_shapes_list.append(x)
+
+model_gen.shape_list = resnet_shapes_list
+logging.info(f"Shape of model on local_rank {local_rank} : {model_gen.shape_list}")
+
+
+del model_gen
+del model
+torch.cuda.ipc_collect()
+
+model = resnet_cifar_torch.get_resnet_v2(
+    (int(batch_size / parts), 3, image_size, image_size), get_depth(2, resnet_n)
+)
+
+model_gen = model_generator(
+    model=model,
+    split_size=mp_size,
+    input_size=(int(batch_size / parts), 3, image_size, image_size),
+    balance=balance,
+    shape_list=resnet_shapes_list,
+)
+
+# Move the model to its respective devices
+model_gen.ready_model(split_rank=local_rank, GET_SHAPES_ON_CUDA=True)
+
+tm = train_model(
+    model_gen,
+    local_rank,
+    batch_size,
+    epoch,
+    criterion=None,
+    optimizer=None,
+    parts=parts,
+    ASYNC=ENABLE_ASYNC,
+)
+
+############################## Dataset Definition ##############################
+
+transform = transforms.Compose(
+    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
+)
+torch.manual_seed(0)
+if ENABLE_APP:
+    trainset = torchvision.datasets.ImageFolder(
+        datapath,
+        transform=transform,
+        target_transform=None,
+    )
+    my_dataloader = torch.utils.data.DataLoader(
+        trainset, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True
+    )
+else:
+    my_dataset = torchvision.datasets.FakeData(
+        size=10 * batch_size,
+        image_size=(3, image_size, image_size),
+        num_classes=num_classes,
+        transform=transform,
+        target_transform=None,
+        random_offset=0,
+    )
+    my_dataloader = torch.utils.data.DataLoader(
+        my_dataset,
+        batch_size=batch_size * times,
+        shuffle=False,
+        num_workers=0,
+        pin_memory=True,
+    )
+    size_dataset = 10 * batch_size
+
+################################################################################
+
+################################# Train Model ##################################
+
+perf = []
+
+
+def run_epoch():
+    for i_e in range(epoch):
+        loss = 0
+        t = time.time()
+        for i, data in enumerate(my_dataloader, 0):
+            start_event = torch.cuda.Event(enable_timing=True, blocking=True)
+            end_event = torch.cuda.Event(enable_timing=True, blocking=True)
+            start_event.record()
+
+            if i > math.floor(size_dataset / (times * batch_size)) - 1:
+                break
+
+            inputs, labels = data
+
+            temp_loss = tm.run_step(inputs, labels)
+            loss += temp_loss
+            tm.update()
+
+            end_event.record()
+            torch.cuda.synchronize()
+            t = start_event.elapsed_time(end_event) / 1000
+
+            if local_rank == mp_size - 1:
+                logging.info(f"Step :{i}, LOSS: {temp_loss}, Global loss: {loss/(i+1)}")
+
+            if local_rank == 0:
+                print(f"Epoch: {i_e} images per sec:{batch_size / t}")
+                perf.append(batch_size / t)
+
+            t = time.time()
+
+
+run_epoch()
+
+if local_rank == 0:
+    print(f"Mean {sum(perf) / len(perf)} Median {np.median(perf)}")
+
+################################################################################
diff --git a/benchmarks/spatial/README.md b/benchmarks/spatial/README.md
deleted file mode 100644
index ffcf78ec..00000000
--- a/benchmarks/spatial/README.md
+++ /dev/null
@@ -1,121 +0,0 @@
-# Spatial Parallelism Benchmarks
-
-Spatial parallelism benchmarks include halo exchange and model benchmarks. These benchmarks will test the working of spatial parallelism.
-
-
-##  Halo exchnage benchmark:
-- While performing convolutional operations on each partition of the image, halo exchange will be performed to receive input from neighboring partitions.
-- Halo exchange can also be performed in parallel, while convolution operations on available input are done in parallel while performing halo exchange.
-- spatial_halo_exchange_bench.py and spatial_halo_exchange_with_compute_bench.py are used to test the proper functioning of send and receive operations for halo regions.
-- spatial_halo_exchange_with_compute_val_bench.py is utilized to validate the received inputs, in addition to testing the halo region send and receive operations.
-
-
-**Run halo-exchange benchmarks:**
-
-- Load Required model:
-```bash
-cd torch-gems
-python setup.py install
-```
-
-- Example to run halo exchange benchmark for four vertical partition : 
-```bash
-cd benchmarks/spatial/model/
-$MV2_HOME/bin/mpirun_rsh --export-all -np 4 --hostfile {$HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python spatial_halo_exchange_bench.py --image-size 32 --batch-size 32 --num-spatial-parts 4 --slice-method "vertical"
-```
-
-Halo exchange benchmarks can also be configured for different num-spatial-parts, slice-method, etc. Find all available options below:
-<pre>
-usage: spatial_halo_exchange_bench.py [-h] [--fp16-allreduce] [--image-size IMAGE_SIZE] [--batch-size BATCH_SIZE] [--halo-len HALO_LEN] [--in-channels IN_CHANNELS]
-                                      [--warmup WARMUP] [--iterations ITERATIONS] [--out-channels OUT_CHANNELS]
-
-Halo exchange benchmark
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --fp16-allreduce      use fp16 compression during allreduce (default: False)
-  --image-size IMAGE_SIZE
-                        Full image size (default: 8)
-  --batch-size BATCH_SIZE
-                        input batch size (default: 1)
-  --halo-len HALO_LEN   halo length (default: 1)
-  --in-channels IN_CHANNELS
-                        Number of channels in the input (default: 1)
-  --warmup WARMUP       warmups (default: 10)
-  --iterations ITERATIONS
-                        Iterations (default: 100)
-  --out-channels OUT_CHANNELS
-                        number of output channels (default: 256)
-</pre>
-
-## Model benchmarks
-
-Model benchmarks for spatial parallelism also require performing model parallelism. To configure the number of model partitions and the number of model partitions that will use spatial parallelism, you can use the --split-size and --spatial-size arguments respectively.
-
-1. Amoebanet benchmark
-
-Run spatial parallelism for Amoebanet model:
-
-Example to run Amoebanet model with partition size for model as two, spatial partition as four and spatial size (i.e. number of model partition which will use spatial partition) as 1
-```bash
-$MV2_HOME/bin/mpirun_rsh --export-all -np 5 --hostfile {$HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python amoebanet_run.py --image-size 512 --num-spatial-parts 4 --slice-method "vertical" --split-size 2 --spatial-size 1
-```
-
-Below are the available configuration options :
-
-<pre>
-usage: amoebanet_run.py [-h] [--fp16-allreduce] [--model MODEL] [--batch-size BATCH_SIZE] [--learning-rate LEARNING_RATE] [--num-gpus-mp NUM_GPUS_MP]
-                        [--mem-per-process MEM_PER_PROCESS] [--parts PARTS] [--split-size SPLIT_SIZE] [--num-spatial-parts NUM_SPATIAL_PARTS]
-                        [--spatial-size SPATIAL_SIZE] [--times TIMES] [--image-size IMAGE_SIZE] [--dp-per-node DP_PER_NODE] [--enable-dp] [--enable-master-comm-opt]
-                        [--num-gpu-per-node NUM_GPU_PER_NODE] [--num-epochs NUM_EPOCHS] [--num-layers NUM_LAYERS] [--num-filters NUM_FILTERS] [--unet-b UNET_B]
-                        [--unet-c UNET_C] [--balance BALANCE] [--halo-D2] [--fused-layers FUSED_LAYERS] [--local-DP LOCAL_DP] [--slice-method SLICE_METHOD]
-
-MP-DP ResNet Script
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --fp16-allreduce      use fp16 compression during allreduce (default: False)
-  --model MODEL         model to benchmark (default: resnet50)
-  --batch-size BATCH_SIZE
-                        input batch size (default: 32)
-  --learning-rate LEARNING_RATE
-                        learning rate for the optimizer (default: 0.001)
-  --num-gpus-mp NUM_GPUS_MP
-                        number of GPUS per node for MP (default: 1)
-  --mem-per-process MEM_PER_PROCESS
-                        TF GPU memory per GPU (default: 1)
-  --parts PARTS         Number of parts for MP (default: 1)
-  --split-size SPLIT_SIZE
-                        Number of process for MP (default: 2)
-  --num-spatial-parts NUM_SPATIAL_PARTS
-                        Number of partitions in spatial parallelism (default: 4)
-  --spatial-size SPATIAL_SIZE
-                        Number splits for spatial parallelism (default: 1)
-  --times TIMES         Number of times to repeat MASTER 1: 2 repications, 2: 4 replications (default: 1)
-  --image-size IMAGE_SIZE
-                        Image size for synthetic benchmark (default: 32)
-  --dp-per-node DP_PER_NODE
-                        Number of DP modes per node (default: 1)
-  --enable-dp           Enable DP for pytorch scripts (default: False)
-  --enable-master-comm-opt
-                        Enable communication optimization for MASTER in Spatial (default: False)
-  --num-gpu-per-node NUM_GPU_PER_NODE
-                        Number of GPUs per node (default: 4)
-  --num-epochs NUM_EPOCHS
-                        Number of epochs (default: 1)
-  --num-layers NUM_LAYERS
-                        Number of layers in amoebanet (default: 18)
-  --num-filters NUM_FILTERS
-                        Number of layers in amoebanet (default: 416)
-  --unet-b UNET_B       B hyperparamter in unet (default: 6)
-  --unet-c UNET_C       C hyperparamter in unet (default: 72)
-  --balance BALANCE     length of list equals to number of partitions and sum should be equal to num layers (default: None)
-  --halo-D2             Enable design2 (do halo exhange on few convs) for spatial conv. (default: False)
-  --fused-layers FUSED_LAYERS
-                        When D2 design is enables for halo exchange, number of blocks to fuse in ResNet model (default: 1)
-  --local-DP LOCAL_DP   LBANN intergration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two gpus for a given split (uses DP)
-                        (default: 1)
-  --slice-method SLICE_METHOD
-                        Slice method (square, vertical, and horizontal) in Spatial parallelism (default: square)
-
-</pre>
diff --git a/benchmarks/spatial/model/master_amoebanet_run.py b/benchmarks/spatial/model/master_amoebanet_run.py
deleted file mode 100644
index dc4af405..00000000
--- a/benchmarks/spatial/model/master_amoebanet_run.py
+++ /dev/null
@@ -1,459 +0,0 @@
-import torch
-import torch.distributed as dist
-import torchvision.transforms as transforms
-import torchvision
-
-# import matplotlib.pyplot as plt
-import numpy as np
-import time
-import sys
-import math
-
-sys.path.append("/usr/WS1/jain8/project/pytorch_mp/mp/torch-gems/")
-
-
-from torchgems import parser
-from torchgems.mp_pipeline import model_generator
-from torchgems.train_spatial import get_shapes_spatial, split_input
-from torchgems.train_spatial_master import train_spatial_model_master
-import torchgems.comm as gems_comm
-
-parser_obj = parser.get_parser()
-args = parser_obj.parse_args()
-
-if args.halo_d2:
-    from models import amoebanet
-    from models import amoebanet_d2
-
-else:
-    from models import amoebanet
-
-gems_comm.initialize_cuda()
-
-
-class Unbuffered(object):
-    def __init__(self, stream):
-        self.stream = stream
-
-    def write(self, data):
-        self.stream.write(data)
-        self.stream.flush()
-
-    def writelines(self, datas):
-        self.stream.writelines(datas)
-        self.stream.flush()
-
-    def __getattr__(self, attr):
-        return getattr(self.stream, attr)
-
-
-def init_processes(backend="tcp"):
-    """Initialize the distributed environment."""
-    dist.init_process_group(backend)
-    size = dist.get_world_size()
-    rank = dist.get_rank()
-    return size, rank
-
-
-def get_depth(version, n):
-    if version == 1:
-        return n * 6 + 2
-    elif version == 2:
-        return n * 9 + 2
-
-
-sys.stdout = Unbuffered(sys.stdout)
-
-# torch.set_num_threads(1)
-np.random.seed(seed=1405)
-parts = args.parts
-batch_size = args.batch_size
-resnet_n = 12
-epoch = args.num_epochs
-ENABLE_ASYNC = True
-
-# APP
-# 1: Medical
-# 2: Cifar
-# 3: synthetic
-APP = 3
-amoebanet_test = False
-image_size = int(args.image_size)
-print("image size", image_size)
-steps = 100
-num_layers = args.num_layers
-num_filters = args.num_filters
-balance = args.balance
-split_size = args.split_size
-spatial_size = args.spatial_size
-ENABLE_MASTER_OPT = args.enable_master_comm_opt
-
-temp_num_spatial_parts = args.num_spatial_parts.split(",")
-
-if len(temp_num_spatial_parts) == 1:
-    num_spatial_parts_list = [int(temp_num_spatial_parts[0])]
-    num_spatial_parts = int(temp_num_spatial_parts[0])
-else:
-    num_spatial_parts = [int(i) for i in temp_num_spatial_parts]
-    num_spatial_parts_list = num_spatial_parts
-
-times = 1
-num_classes = 1000
-LOCAL_DP_LP = args.local_DP
-
-
-mpi_comm_first = gems_comm.MPIComm(
-    split_size=split_size,
-    ENABLE_MASTER=False,
-    ENABLE_SPATIAL=True,
-    num_spatial_parts=num_spatial_parts,
-    spatial_size=spatial_size,
-    LOCAL_DP_LP=LOCAL_DP_LP,
-)
-mpi_comm_second = gems_comm.MPIComm(
-    split_size=split_size,
-    ENABLE_MASTER=True,
-    ENABLE_SPATIAL=True,
-    num_spatial_parts=num_spatial_parts,
-    spatial_size=spatial_size,
-    LOCAL_DP_LP=LOCAL_DP_LP,
-    DISABLE_INIT=True,
-)
-
-gems_comm.sync_comms_for_master(mpi_comm_first, mpi_comm_second)
-comm_size = mpi_comm_first.size
-# rank = mpi_comm.local_rank
-# comm_size = mpi_comm.size
-# local_rank = rank
-
-# split_rank = mpi_comm.split_rank
-
-
-if args.balance != None:
-    balance = args.balance.split(",")
-    balance = [int(j) for j in balance]
-else:
-    balance = None
-
-
-image_size_seq = 512
-
-model_seq = amoebanet.amoebanetd(
-    num_layers=num_layers, num_filters=num_filters, num_classes=num_classes
-)
-print("length", len(model_seq), balance)
-model_gen_seq = model_generator(
-    model=model_seq,
-    split_size=split_size,
-    input_size=(int(batch_size / parts), 3, image_size_seq, image_size_seq),
-    balance=balance,
-)
-model_gen_seq.ready_model(
-    split_rank=mpi_comm_second.split_rank, GET_SHAPES_ON_CUDA=True
-)
-
-image_size_times = int(image_size / image_size_seq)
-
-resnet_shapes_list = get_shapes_spatial(
-    shape_list=model_gen_seq.shape_list,
-    slice_method=args.slice_method,
-    spatial_size=spatial_size,
-    num_spatial_parts_list=num_spatial_parts_list,
-    image_size_times=image_size_times,
-)
-
-print(model_gen_seq.shape_list, resnet_shapes_list)
-
-del model_seq
-del model_gen_seq
-torch.cuda.ipc_collect()
-
-
-if args.halo_d2:
-    model1 = amoebanet_d2.amoebanetd_spatial(
-        local_rank=mpi_comm_first.local_rank % mpi_comm_first.total_spatial_processes,
-        spatial_size=spatial_size,
-        num_spatial_parts=num_spatial_parts,
-        mp_size=split_size,
-        balance=balance,
-        slice_method="square",
-        num_classes=num_classes,
-        num_layers=num_layers,
-        num_filters=num_filters,
-    )
-
-    model2 = amoebanet_d2.amoebanetd_spatial(
-        local_rank=mpi_comm_second.local_rank % mpi_comm_second.total_spatial_processes,
-        spatial_size=spatial_size,
-        num_spatial_parts=num_spatial_parts,
-        mp_size=split_size,
-        balance=balance,
-        slice_method="square",
-        num_classes=num_classes,
-        num_layers=num_layers,
-        num_filters=num_filters,
-    )
-else:
-    model1 = amoebanet.amoebanetd_spatial(
-        local_rank=mpi_comm_first.local_rank % mpi_comm_first.total_spatial_processes,
-        spatial_size=spatial_size,
-        num_spatial_parts=num_spatial_parts,
-        mp_size=split_size,
-        balance=balance,
-        slice_method="square",
-        num_classes=num_classes,
-        num_layers=num_layers,
-        num_filters=num_filters,
-    )
-
-    model2 = amoebanet.amoebanetd_spatial(
-        local_rank=mpi_comm_second.local_rank % mpi_comm_second.total_spatial_processes,
-        spatial_size=spatial_size,
-        num_spatial_parts=num_spatial_parts,
-        mp_size=split_size,
-        balance=balance,
-        slice_method="square",
-        num_classes=num_classes,
-        num_layers=num_layers,
-        num_filters=num_filters,
-    )
-
-
-model_gen1 = model_generator(
-    model=model1,
-    split_size=split_size,
-    input_size=(int(batch_size / parts), 3, image_size, image_size),
-    balance=balance,
-    shape_list=resnet_shapes_list,
-)
-model_gen1.ready_model(split_rank=mpi_comm_first.split_rank)
-# model_gen1.DDP_model(mpi_comm_first, num_spatial_parts, spatial_size, bucket_size=25, local_rank = mpi_comm_first.local_rank)
-
-
-model_gen2 = model_generator(
-    model=model2,
-    split_size=split_size,
-    input_size=(int(batch_size / parts), 3, image_size, image_size),
-    balance=balance,
-    shape_list=resnet_shapes_list,
-)
-model_gen2.ready_model(split_rank=mpi_comm_second.split_rank)
-# model_gen2.DDP_model(mpi_comm_second, num_spatial_parts, spatial_size, bucket_size=25, local_rank = mpi_comm_second.local_rank)
-
-
-# model_gen.mp_size = 5
-print("Shape list", resnet_shapes_list)
-
-
-# t_s1 = train_model_spatial(model_gen1, mpi_comm_first.local_rank,batch_size,epochs=1, spatial_size=spatial_size, num_spatial_parts=num_spatial_parts ,criterion=None,optimizer=None,parts=parts,ASYNC=True,GEMS_INVERSE=False, slice_method = args.slice_method,
-# 							LOCAL_DP_LP=LOCAL_DP_LP,
-# 							mpi_comm = mpi_comm_first)
-
-
-# t_s2 = train_model_spatial(model_gen2, mpi_comm_second.local_rank,batch_size,epochs=1, spatial_size=spatial_size, num_spatial_parts=num_spatial_parts ,criterion=None,optimizer=None,parts=parts,ASYNC=True,GEMS_INVERSE=True, slice_method = args.slice_method,
-# 							LOCAL_DP_LP=LOCAL_DP_LP,
-# 							mpi_comm = mpi_comm_second)
-
-t_s_master = train_spatial_model_master(
-    model_gen1,
-    model_gen2,
-    batch_size,
-    spatial_size,
-    num_spatial_parts,
-    args.slice_method,
-    mpi_comm_first,
-    mpi_comm_second,
-    LOCAL_DP_LP=LOCAL_DP_LP,
-    criterion=None,
-    optimizer=None,
-    parts=parts,
-    ASYNC=True,
-    replications=int(args.times / 2),
-)
-
-x = torch.zeros(
-    (batch_size, 3, int(image_size / 2), int(image_size / 2)), device="cuda"
-)
-y = torch.zeros((batch_size,), dtype=torch.long, device="cuda")
-
-
-transform = transforms.Compose(
-    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
-)
-torch.manual_seed(0)
-
-if APP == 1:
-    trainset = torchvision.datasets.ImageFolder(
-        "/usr/workspace/jain8/project/cancer/1024_1024_5/train",
-        transform=transform,
-        target_transform=None,
-    )
-    my_dataloader = torch.utils.data.DataLoader(
-        trainset,
-        batch_size=times * batch_size,
-        shuffle=True,
-        num_workers=0,
-        pin_memory=True,
-    )
-    size_dataset = 1030
-elif APP == 2:
-    trainset = torchvision.datasets.CIFAR10(
-        root="./data", train=True, download=True, transform=transform
-    )
-    my_dataloader = torch.utils.data.DataLoader(
-        trainset,
-        batch_size=times * batch_size,
-        shuffle=False,
-        num_workers=0,
-        pin_memory=True,
-    )
-    size_dataset = 50000
-else:
-    my_dataset = torchvision.datasets.FakeData(
-        size=10 * batch_size * args.times,
-        image_size=(3, image_size, image_size),
-        num_classes=num_classes,
-        transform=transform,
-        target_transform=None,
-        random_offset=0,
-    )
-    my_dataloader = torch.utils.data.DataLoader(
-        my_dataset,
-        batch_size=batch_size * args.times,
-        shuffle=False,
-        num_workers=0,
-        pin_memory=True,
-    )
-    size_dataset = 10 * batch_size
-
-
-# sync_allreduce.sync_model_spatial(model_gen)
-perf = []
-
-sync_comm = gems_comm.SyncAllreduce(mpi_comm_first)
-
-
-MASTER = args.times
-
-print("ENABLE_MASTER_OPT", ENABLE_MASTER_OPT)
-
-
-def run_epoch():
-    for i_e in range(epoch):
-        loss = 0
-        correct = 0
-        t = time.time()
-        for i, data in enumerate(my_dataloader, 0):
-            start_event = torch.cuda.Event(enable_timing=True, blocking=True)
-            end_event = torch.cuda.Event(enable_timing=True, blocking=True)
-            start_event.record()
-            if i > math.floor(size_dataset / (times * batch_size)) - 1:
-                break
-            # inputs=data_x
-            # labels = data_y
-            inputs, labels = data
-
-            # inputs = inputs.to(device)
-            # labels = labels.to(device)
-
-            # t= time.time()
-            if mpi_comm_first.local_rank < num_spatial_parts_list[0]:
-                x = split_input(
-                    inputs=inputs,
-                    image_size=image_size,
-                    slice_method=args.slice_method,
-                    local_rank=mpi_comm_first.local_rank,
-                    num_spatial_parts_list=num_spatial_parts_list,
-                )
-            elif mpi_comm_second.local_rank < num_spatial_parts_list[0]:
-                x = split_input(
-                    inputs=inputs,
-                    image_size=image_size,
-                    slice_method=args.slice_method,
-                    local_rank=mpi_comm_second.local_rank,
-                    num_spatial_parts_list=num_spatial_parts_list,
-                )
-            else:
-                x = inputs
-
-            # for j in range(MASTER):
-
-            # 	temp_loss,temp_correct = t_s1.run_step(x,labels)
-            # 	temp_loss,temp_correct = t_s2.run_step(x,labels)
-
-            if ENABLE_MASTER_OPT:
-                temp_loss, temp_correct = t_s_master.run_step_allreduce(
-                    x, labels, i % 2 == 1
-                )
-            else:
-                temp_loss, temp_correct = t_s_master.run_step(x, labels)
-
-            loss += temp_loss
-            correct += temp_correct
-
-            start_event_allreduce = torch.cuda.Event(enable_timing=True, blocking=True)
-            end_event_allreduce = torch.cuda.Event(enable_timing=True, blocking=True)
-            start_event_allreduce.record()
-            t_allreduce_temp = time.time()
-
-            if ENABLE_MASTER_OPT == False:
-                sync_comm.apply_allreduce_master_master(
-                    model_gen1, model_gen2, mpi_comm_first, mpi_comm_second
-                )
-
-            """
-			if(local_rank < spatial_size * num_spatial_parts):
-				None
-				#No need for this as, DDP is now used 
-				# sync_allreduce.apply_allreduce(model_gen,mpi_comm.spatial_allreduce_grp)
-			"""
-            torch.cuda.synchronize()
-
-            if ENABLE_MASTER_OPT:
-                if i % 2 == 1:
-                    t_s_master.train_model1.update()
-                else:
-                    t_s_master.train_model2.update()
-            else:
-                t_s_master.train_model1.update()
-                t_s_master.train_model2.update()
-
-            end_event_allreduce.record()
-            torch.cuda.synchronize()
-            t_allreduce = start_event_allreduce.elapsed_time(end_event_allreduce) / 1000
-            t_allreduce = time.time() - t_allreduce_temp
-
-            if mpi_comm_second.local_rank == comm_size - 1:
-                None
-                # print("Step",i," LOSS",temp_loss, " Global loss:",loss/(i+1), " Acc:",temp_correct)
-
-            if ENABLE_MASTER_OPT:
-                torch.distributed.barrier()
-
-            end_event.record()
-            torch.cuda.synchronize()
-            t = start_event.elapsed_time(end_event) / 1000
-            if mpi_comm_second.local_rank == 0:
-                None
-                print(
-                    "images per sec:",
-                    batch_size / t,
-                    "Time:",
-                    t,
-                    " Time Allreduce:",
-                    t_allreduce,
-                )
-                perf.append(batch_size / t)
-
-            t = time.time()
-        if mpi_comm_second.local_rank == comm_size - 1:
-            print("epoch", i_e, " Global loss:", loss, " acc", correct / i)
-
-
-run_epoch()
-
-if mpi_comm_second.local_rank == 0:
-    print("Mean {} Median {}".format(sum(perf) / len(perf), np.median(perf)))
-# y, _= t_s.forward_pass(x,y,part_number=0)
-# t_s.backward_pass(y,part_number=0)
-exit()
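Note on the script removed above: it reads `--num-spatial-parts` as a comma-separated string and keeps two forms of it, a scalar-or-list `num_spatial_parts` and an always-list `num_spatial_parts_list`. The same parsing remains in the renamed benchmark. A minimal sketch of that logic as a standalone helper follows; the function name `parse_num_spatial_parts` is hypothetical and does not exist in the repository.

```python
def parse_num_spatial_parts(raw):
    """Split a comma-separated --num-spatial-parts value.

    Mirrors the inline logic of the benchmark script:
    - one value  -> (int, [int])
    - many values -> (list, same list)
    The helper name is an assumption made for illustration only.
    """
    values = [int(token) for token in raw.split(",")]
    if len(values) == 1:
        # Single entry: scalar form plus a one-element list.
        return values[0], values
    # Several entries: both names refer to the same list.
    return values, values


num_spatial_parts, num_spatial_parts_list = parse_num_spatial_parts("4")
print(num_spatial_parts, num_spatial_parts_list)
```

Returning both forms keeps callers that expect a scalar (the single-partition case) working alongside callers that index `num_spatial_parts_list[0]`.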
diff --git a/benchmarks/spatial_parallelism/README.md b/benchmarks/spatial_parallelism/README.md
new file mode 100644
index 00000000..474590dd
--- /dev/null
+++ b/benchmarks/spatial_parallelism/README.md
@@ -0,0 +1,76 @@
+# Spatial Parallelism Benchmarks
+
+Spatial parallelism benchmarks also use model parallelism. Use `--split-size` to set the number of model partitions and `--spatial-size` to set how many of those partitions use spatial parallelism.
+
+## Run spatial parallelism:
+
+#### Generic command:
+```bash
+
+$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python ${sp_model_script} --halo-D2 --num-spatial-parts ${num_spatial_parts} --image-size ${image_size} --batch-size ${batch_size} --slice-method ${partition}
+
+```
+#### Examples
+
+- With 5 GPUs [split size: 2, num_spatial_parts: 4, spatial_size: 1]
+
+Example to run the AmoebaNet model with a model split size of 2 (i.e., the number of partitions for MP), 4 spatial partitions (i.e., the number of image partitions), and a spatial size of 1 (i.e., the number of model partitions that use spatial parallelism). In this configuration, the model is split into two parts and the first part uses spatial parallelism.
+
+```bash
+$MV2_HOME/bin/mpirun_rsh --export-all -np 5 --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py --image-size 512 --num-spatial-parts 4 --slice-method "vertical" --split-size 2 --spatial-size 1
+```
+- With 9 GPUs [split size: 3, num_spatial_parts: 4, spatial_size: 2]
+In this configuration, the model is split into three parts and the first two parts use spatial parallelism.
+
+```bash
+$MV2_HOME/bin/mpirun_rsh --export-all -np 9 --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py --image-size 512 --num-spatial-parts 4 --slice-method "vertical" --split-size 3 --spatial-size 2
+```
+
+- Similarly, we can run the benchmark for the ResNet model.
+The example below runs ResNet with halo-D2 enabled to reduce communication operations. To learn more about halo-D2, refer to [Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters](https://dl.acm.org/doi/abs/10.1007/978-3-031-07312-0_6).
+```bash
+$MV2_HOME/bin/mpirun_rsh --export-all -np 5 --hostfile ${HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial_parallelism/benchmark_resnet_sp.py --halo-D2 --num-spatial-parts 4 --image-size 1024 --batch-size 2 --slice-method "square"
+``` 
+
+Below are the available configuration options:
+
+<pre>
+usage: benchmark_amoebanet_sp.py [-h] [-v] [--batch-size BATCH_SIZE] [--parts PARTS] [--split-size SPLIT_SIZE] [--num-spatial-parts NUM_SPATIAL_PARTS]
+                        [--spatial-size SPATIAL_SIZE] [--times TIMES] [--image-size IMAGE_SIZE] [--num-epochs NUM_EPOCHS] [--num-layers NUM_LAYERS]
+                        [--num-filters NUM_FILTERS] [--balance BALANCE] [--halo-D2] [--fused-layers FUSED_LAYERS] [--local-DP LOCAL_DP] [--slice-method SLICE_METHOD]
+                        [--app APP] [--datapath DATAPATH]
+
+SP-MP-DP Configuration Script
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -v, --verbose         Prints performance numbers or logs (default: False)
+  --batch-size BATCH_SIZE
+                        input batch size (default: 32)
+  --parts PARTS         Number of parts for MP (default: 1)
+  --split-size SPLIT_SIZE
+                        Number of process for MP (default: 2)
+  --num-spatial-parts NUM_SPATIAL_PARTS
+                        Number of partitions in spatial parallelism (default: 4)
+  --spatial-size SPATIAL_SIZE
+                        Number splits for spatial parallelism (default: 1)
+  --times TIMES         Number of times to repeat MASTER 1: 2 repications, 2: 4 replications (default: 1)
+  --image-size IMAGE_SIZE
+                        Image size for synthetic benchmark (default: 32)
+  --num-epochs NUM_EPOCHS
+                        Number of epochs (default: 1)
+  --num-layers NUM_LAYERS
+                        Number of layers in amoebanet (default: 18)
+  --num-filters NUM_FILTERS
+                        Number of layers in amoebanet (default: 416)
+  --balance BALANCE     length of list equals to number of partitions and sum should be equal to num layers (default: None)
+  --halo-D2             Enable design2 (do halo exhange on few convs) for spatial conv. (default: False)
+  --fused-layers FUSED_LAYERS
+                        When D2 design is enables for halo exchange, number of blocks to fuse in ResNet model (default: 1)
+  --local-DP LOCAL_DP   LBANN intergration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two gpus for a given split (uses DP)
+                        (default: 1)
+  --slice-method SLICE_METHOD
+                        Slice method (square, vertical, and horizontal) in Spatial parallelism (default: square)
+  --app APP             Application type (1.medical, 2.cifar, and synthetic) in Spatial parallelism (default: 3)
+  --datapath DATAPATH   local Dataset path (default: ./train)
+  </pre>
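Note on the README added above: both AmoebaNet launch commands are consistent with a process count of `spatial_size * num_spatial_parts + (split_size - spatial_size)`. Each spatially parallel model partition takes `num_spatial_parts` GPUs and every remaining partition takes one. This formula is inferred from the two examples (5 and 9 GPUs) and is not stated in the README. The helper name `required_processes` is hypothetical, and the sketch assumes `--local-DP 1`.

```python
def required_processes(split_size, spatial_size, num_spatial_parts):
    """Value to pass to `-np` for a given SP + MP configuration.

    Assumption: inferred from the README examples, not taken from
    the torchgems source. Assumes --local-DP 1 (no extra replicas).
    """
    if not 0 <= spatial_size <= split_size:
        raise ValueError("spatial_size must be between 0 and split_size")
    if num_spatial_parts < 1:
        raise ValueError("num_spatial_parts must be at least 1")
    # Partitions using spatial parallelism: one process per image slice.
    spatial_procs = spatial_size * num_spatial_parts
    # Remaining model partitions: one process each.
    plain_procs = split_size - spatial_size
    return spatial_procs + plain_procs


# The two AmoebaNet configurations from the README.
first = required_processes(split_size=2, spatial_size=1, num_spatial_parts=4)
second = required_processes(split_size=3, spatial_size=2, num_spatial_parts=4)
print(first, second)
```

Checking the `-np` value this way before launching can catch a mismatch between the hostfile size and the requested split and spatial sizes.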
diff --git a/benchmarks/spatial/model/amoebanet_run.py b/benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py
similarity index 79%
rename from benchmarks/spatial/model/amoebanet_run.py
rename to benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py
index 5ac43868..93875931 100644
--- a/benchmarks/spatial/model/amoebanet_run.py
+++ b/benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py
@@ -52,36 +52,30 @@ def init_processes(backend="tcp"):
     return size, rank
 
 
-def get_depth(version, n):
-    if version == 1:
-        return n * 6 + 2
-    elif version == 2:
-        return n * 9 + 2
-
-
 sys.stdout = Unbuffered(sys.stdout)
 
+ENABLE_ASYNC = True
+
 np.random.seed(seed=1405)
 parts = args.parts
 batch_size = args.batch_size
-resnet_n = 12
 epoch = args.num_epochs
-ENABLE_ASYNC = True
+image_size = int(args.image_size)
+num_layers = args.num_layers
+num_filters = args.num_filters
+balance = args.balance
+split_size = args.split_size
+spatial_size = args.spatial_size
+times = args.times
+datapath = args.datapath
 
+LOCAL_DP_LP = args.local_DP
 # APP
 # 1: Medical
 # 2: Cifar
 # 3: synthetic
 APP = args.app
-amoebanet_test = False
-image_size = int(args.image_size)
-print("image size", image_size)
 steps = 100
-num_layers = args.num_layers
-num_filters = args.num_filters
-balance = args.balance
-split_size = args.split_size
-spatial_size = args.spatial_size
 
 temp_num_spatial_parts = args.num_spatial_parts.split(",")
 
@@ -92,11 +86,7 @@ def get_depth(version, n):
     num_spatial_parts = [int(i) for i in temp_num_spatial_parts]
     num_spatial_parts_list = num_spatial_parts
 
-times = 1
-num_classes = 1000
-LOCAL_DP_LP = args.local_DP
-
-# DDP support
+spatial_part_size = num_spatial_parts_list[0]  # Partition size for spatial parallelism
 
 
 def isPowerTwo(num):
@@ -127,16 +117,32 @@ def verify_config():
 
     if args.slice_method == "square":
         assert isPowerTwo(
-            int(image_size / math.sqrt(num_spatial_parts))
+            int(image_size / math.sqrt(spatial_part_size))
         ), "Image size of each partition should be power of Two"
     else:
         assert isPowerTwo(
-            int(image_size / num_spatial_parts)
+            int(image_size / spatial_part_size)
         ), "Image size of each partition should be power of Two"
 
+    for each_part_size in num_spatial_parts_list:
+        assert (
+            each_part_size == spatial_part_size
+        ), "Size of each SP partition should be the same"
+
 
 verify_config()
 
+##################### AmoebaNet model specific parameters #####################
+
+"""
+"image_size_seq" is required to determine the output shape after spatial partitioning of images. 
+The shape of the output will be determined for each model partition based on the values in "image_size_seq."
+These values will then be used to calculate the output shape for a given input size and spatial partition.
+"""
+image_size_seq = 512
+num_classes = 1000
+
+###############################################################################
 
 mpi_comm = gems_comm.MPIComm(
     split_size=split_size,
@@ -155,37 +161,36 @@ def verify_config():
 split_rank = mpi_comm.split_rank
 
 
-if args.balance != None:
-    balance = args.balance.split(",")
+if balance is not None:
+    balance = balance.split(",")
     balance = [int(j) for j in balance]
 else:
     balance = None
 
-"""
-"image_size_seq" is required to determine the output shape after spatial partitioning of images. 
-The shape of the output will be determined for each model partition based on the values in "image_size_seq."
-These values will then be used to calculate the output shape for a given input size and spatial partition.
-"""
-image_size_seq = 512
 
+# Initialize AmoebaNet model
 model_seq = amoebanet.amoebanetd(
     num_layers=num_layers, num_filters=num_filters, num_classes=num_classes
 )
-print("length", len(model_seq), balance)
+
+# Initialize parameters for Model Parallelism
 model_gen_seq = model_generator(
     model=model_seq,
     split_size=split_size,
     input_size=(int(batch_size / parts), 3, image_size_seq, image_size_seq),
     balance=balance,
 )
+# Get the shape of the model on each split rank for image_size_seq and move it to the device
+# Note: we take the shape w.r.t. image_size_seq because the model w.r.t. image_size may
+# not fit in memory
 model_gen_seq.ready_model(split_rank=split_rank, GET_SHAPES_ON_CUDA=True)
 
-image_size_times = int(image_size / image_size_seq)
-
 
+# Get the shape of model on each split rank for image_size and number of spatial parts
+image_size_times = int(image_size / image_size_seq)
 temp_count = 0
 if args.slice_method == "square":
-    resnet_shapes_list = []
+    amoebanet_shapes_list = []
     for output_shape in model_gen_seq.shape_list:
         if isinstance(output_shape, list):
             temp_shape = []
@@ -207,11 +212,11 @@ def verify_config():
                         int(shape_tuple[3] * image_size_times),
                     )
                     temp_shape.append(x)
-            resnet_shapes_list.append(temp_shape)
+            amoebanet_shapes_list.append(temp_shape)
         else:
             if len(output_shape) == 2:
                 x = (int(output_shape[0]), output_shape[1])
-                resnet_shapes_list.append(x)
+                amoebanet_shapes_list.append(x)
             else:
                 if temp_count < spatial_size:
                     x = (
@@ -220,7 +225,7 @@ def verify_config():
                         int(output_shape[2] * image_size_times / 2),
                         int(output_shape[3] * image_size_times / 2),
                     )
-                    resnet_shapes_list.append(x)
+                    amoebanet_shapes_list.append(x)
                 else:
                     x = (
                         int(output_shape[0]),
@@ -228,11 +233,11 @@ def verify_config():
                         int(output_shape[2] * image_size_times),
                         int(output_shape[3] * image_size_times),
                     )
-                    resnet_shapes_list.append(x)
+                    amoebanet_shapes_list.append(x)
         temp_count += 1
 
 elif args.slice_method == "vertical":
-    resnet_shapes_list = []
+    amoebanet_shapes_list = []
     for output_shape in model_gen_seq.shape_list:
         if isinstance(output_shape, list):
             temp_shape = []
@@ -257,11 +262,11 @@ def verify_config():
                         int(shape_tuple[3] * image_size_times),
                     )
                     temp_shape.append(x)
-            resnet_shapes_list.append(temp_shape)
+            amoebanet_shapes_list.append(temp_shape)
         else:
             if len(output_shape) == 2:
                 x = (int(output_shape[0]), output_shape[1])
-                resnet_shapes_list.append(x)
+                amoebanet_shapes_list.append(x)
             else:
                 if temp_count < spatial_size:
                     x = (
@@ -274,7 +279,7 @@ def verify_config():
                             / num_spatial_parts_list[temp_count]
                         ),
                     )
-                    resnet_shapes_list.append(x)
+                    amoebanet_shapes_list.append(x)
                 else:
                     x = (
                         int(output_shape[0]),
@@ -282,12 +287,12 @@ def verify_config():
                         int(output_shape[2] * image_size_times),
                         int(output_shape[3] * image_size_times),
                     )
-                    resnet_shapes_list.append(x)
+                    amoebanet_shapes_list.append(x)
         temp_count += 1
 
 
 elif args.slice_method == "horizontal":
-    resnet_shapes_list = []
+    amoebanet_shapes_list = []
     for output_shape in model_gen_seq.shape_list:
         if isinstance(output_shape, list):
             temp_shape = []
@@ -312,11 +317,11 @@ def verify_config():
                         int(shape_tuple[3] * image_size_times),
                     )
                     temp_shape.append(x)
-            resnet_shapes_list.append(temp_shape)
+            amoebanet_shapes_list.append(temp_shape)
         else:
             if len(output_shape) == 2:
                 x = (int(output_shape[0]), output_shape[1])
-                resnet_shapes_list.append(x)
+                amoebanet_shapes_list.append(x)
             else:
                 if temp_count < spatial_size:
                     x = (
@@ -329,7 +334,7 @@ def verify_config():
                         ),
                         int(output_shape[3] * image_size_times / 1),
                     )
-                    resnet_shapes_list.append(x)
+                    amoebanet_shapes_list.append(x)
                 else:
                     x = (
                         int(output_shape[0]),
@@ -337,16 +342,14 @@ def verify_config():
                         int(output_shape[2] * image_size_times),
                         int(output_shape[3] * image_size_times),
                     )
-                    resnet_shapes_list.append(x)
+                    amoebanet_shapes_list.append(x)
         temp_count += 1
 
-
-print(model_gen_seq.shape_list, resnet_shapes_list)
-
 del model_seq
 del model_gen_seq
 torch.cuda.ipc_collect()
 
+# Initialize AmoebaNet model with Spatial and Model Parallelism support
 if args.halo_d2:
     model = amoebanet_d2.amoebanetd_spatial(
         local_rank=local_rank % mpi_comm.total_spatial_processes,
@@ -378,17 +381,18 @@ def verify_config():
     split_size=split_size,
     input_size=(int(batch_size / parts), 3, image_size, image_size),
     balance=balance,
-    shape_list=resnet_shapes_list,
+    shape_list=amoebanet_shapes_list,
 )
 
-
+# Move the model to its respective devices
 model_gen.ready_model(split_rank=split_rank)
 model_gen.DDP_model(mpi_comm, num_spatial_parts, spatial_size, bucket_size=0)
 
-
-print("Shape list", resnet_shapes_list)
+logging.info(f"Shape of model on local_rank {local_rank} : {model_gen.shape_list}")
 
 
+# Initialize the parameters required for training the model with Spatial and Model
+# Parallelism support
 t_s = train_model_spatial(
     model_gen,
     local_rank,
@@ -411,6 +415,7 @@ def verify_config():
 )
 y = torch.zeros((batch_size,), dtype=torch.long, device="cuda")
 
+############################## Dataset Definition ##############################
 
 transform = transforms.Compose(
     [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
@@ -419,7 +424,7 @@ def verify_config():
 
 if APP == 1:
     trainset = torchvision.datasets.ImageFolder(
-        "/usr/workspace/jain8/project/cancer/1024_1024_5/train",
+        datapath,
         transform=transform,
         target_transform=None,
     )
@@ -433,7 +438,7 @@ def verify_config():
     size_dataset = 1030
 elif APP == 2:
     trainset = torchvision.datasets.CIFAR10(
-        root="./data", train=True, download=True, transform=transform
+        root=datapath, train=True, download=True, transform=transform
     )
     my_dataloader = torch.utils.data.DataLoader(
         trainset,
@@ -461,20 +466,18 @@ def verify_config():
     )
     size_dataset = 10 * batch_size
 
-
-# sync_allreduce.sync_model_spatial(model_gen)
-perf = []
+################################################################################
 
 
 def split_input(inputs):
     if args.slice_method == "square":
-        image_height_local = int(image_size / math.sqrt(num_spatial_parts))
-        image_width_local = int(image_size / math.sqrt(num_spatial_parts))
+        image_height_local = int(image_size / math.sqrt(spatial_part_size))
+        image_width_local = int(image_size / math.sqrt(spatial_part_size))
 
-        total_rows = int(math.sqrt(num_spatial_parts))
-        total_cols = int(math.sqrt(num_spatial_parts))
+        total_rows = int(math.sqrt(spatial_part_size))
+        total_cols = int(math.sqrt(spatial_part_size))
 
-        # current position of rank in matrix of math.sqrt(num_spatial_parts) * math.sqrt(num_spatial_parts)
+        # current position of rank in matrix of math.sqrt(spatial_part_size) * math.sqrt(spatial_part_size)
         row = int(local_rank / total_cols)
         col = int(local_rank % total_cols)
 
@@ -487,13 +490,13 @@ def split_input(inputs):
         return inputs[:, :, start_top:end_bottom, start_left:end_right]
 
     elif args.slice_method == "vertical":
-        image_height_local = int(image_size / num_spatial_parts)
-        image_width_local = int(image_size / num_spatial_parts)
+        image_height_local = int(image_size / spatial_part_size)
+        image_width_local = int(image_size / spatial_part_size)
 
         start_left = local_rank * image_width_local
         end_right = (local_rank + 1) * image_width_local
 
-        if local_rank == num_spatial_parts - 1:
+        if local_rank == spatial_part_size - 1:
             # In case of GPU count, partition size will be uneven and last
             # rank will receive remaining image
             return inputs[:, :, :, start_left:]
@@ -501,13 +504,13 @@ def split_input(inputs):
             return inputs[:, :, :, start_left:end_right]
 
     elif args.slice_method == "horizontal":
-        image_height_local = int(image_size / num_spatial_parts)
-        image_width_local = int(image_size / num_spatial_parts)
+        image_height_local = int(image_size / spatial_part_size)
+        image_width_local = int(image_size / spatial_part_size)
 
         start_top = local_rank * image_height_local
         end_bottom = (local_rank + 1) * image_height_local
 
-        if local_rank == num_spatial_parts - 1:
+        if local_rank == spatial_part_size - 1:
             # In case of odd GPU count, partition size will be uneven and last
             # rank will receive remaining image
             return inputs[:, :, start_top:, :]
@@ -515,6 +518,11 @@ def split_input(inputs):
             return inputs[:, :, start_top:end_bottom, :]
 
 
+################################# Train Model ##################################
+
+perf = []
+
+
 def run_epoch():
     for i_e in range(epoch):
         loss = 0
@@ -528,7 +536,7 @@ def run_epoch():
                 break
             inputs, labels = data
 
-            if local_rank < num_spatial_parts_list[0]:
+            if local_rank < spatial_part_size:
                 x = split_input(inputs)
             else:
                 x = inputs
@@ -549,18 +557,19 @@ def run_epoch():
             torch.cuda.synchronize()
             t = start_event.elapsed_time(end_event) / 1000
             if local_rank == 0:
-                None
-                print("images per sec:", batch_size / t)
+                print(f"Epoch: {i_e} images per sec: {batch_size / t}")
                 perf.append(batch_size / t)
 
             t = time.time()
         if local_rank == comm_size - 1:
-            print("epoch", i_e, " Global loss:", loss, " acc", correct / i)
+            print(f"Epoch {i_e} Global loss: {loss} Acc {correct / i}")
 
 
 run_epoch()
 
 if local_rank == 0:
-    print("Mean {} Median {}".format(sum(perf) / len(perf), np.median(perf)))
+    print(f"Mean {sum(perf) / len(perf)} Median {np.median(perf)}")
+
+################################################################################
 
 exit()
diff --git a/benchmarks/spatial/model/resnet_model.py b/benchmarks/spatial_parallelism/benchmark_resnet_sp.py
similarity index 50%
rename from benchmarks/spatial/model/resnet_model.py
rename to benchmarks/spatial_parallelism/benchmark_resnet_sp.py
index 0a4bd639..e4aab4fa 100644
--- a/benchmarks/spatial/model/resnet_model.py
+++ b/benchmarks/spatial_parallelism/benchmark_resnet_sp.py
@@ -51,35 +51,26 @@ def init_processes(backend="mpi"):
     return size, rank
 
 
-def get_depth(version, n):
-    if version == 1:
-        return n * 6 + 2
-    elif version == 2:
-        return n * 9 + 2
-
-
 sys.stdout = Unbuffered(sys.stdout)
 
 np.random.seed(seed=1405)
+
+ENABLE_ASYNC = True
 parts = args.parts
 batch_size = args.batch_size
-resnet_n = 12
 epoch = args.num_epochs
-ENABLE_ASYNC = True
+image_size = int(args.image_size)
+balance = args.balance
+split_size = args.split_size
+spatial_size = args.spatial_size
+times = args.times
+datapath = args.datapath
 
 # APP
 # 1: Medical
 # 2: Cifar
 # 3: synthetic
 APP = args.app
-amoebanet_test = False
-image_size = int(args.image_size)
-print("image size", image_size)
-steps = 100
-balance = args.balance
-split_size = args.split_size
-spatial_size = args.spatial_size
-
 temp_num_spatial_parts = args.num_spatial_parts.split(",")
 
 if len(temp_num_spatial_parts) == 1:
@@ -89,10 +80,31 @@ def get_depth(version, n):
     num_spatial_parts = [int(i) for i in temp_num_spatial_parts]
     num_spatial_parts_list = num_spatial_parts
 
-times = 1
+spatial_part_size = num_spatial_parts_list[0]  # Partition size for spatial parallelism
+steps = 100
+
+################## ResNet model specific parameters/functions ##################
+
+"""
+"image_size_seq" is required to determine the output shape after spatial partitioning of images. 
+The shape of the output will be determined for each model partition based on the values in "image_size_seq."
+These values will then be used to calculate the output shape for a given input size and spatial partition.
+"""
+image_size_seq = 32
+resnet_n = 12
 num_classes = 10
 
 
+def get_depth(version, n):
+    if version == 1:
+        return n * 6 + 2
+    elif version == 2:
+        return n * 9 + 2
+
+
+###############################################################################
+
+
 def isPowerTwo(num):
     return not (num & (num - 1))
 
@@ -120,13 +132,18 @@ def verify_config():
 
     if args.slice_method == "square":
         assert isPowerTwo(
-            int(image_size / math.sqrt(num_spatial_parts))
+            int(image_size / math.sqrt(spatial_part_size))
         ), "Image size of each partition should be power of Two"
     else:
         assert isPowerTwo(
-            int(image_size / num_spatial_parts)
+            int(image_size / spatial_part_size)
         ), "Image size of each partition should be power of Two"
 
+    for each_part_size in num_spatial_parts_list:
+        assert (
+            each_part_size == spatial_part_size
+        ), "Size of each SP partition should be same"
+
 
 verify_config()
 
@@ -139,93 +156,210 @@ def verify_config():
 )
 sync_allreduce = gems_comm.SyncAllreduce(mpi_comm)
 rank = mpi_comm.rank
-
+comm_size = mpi_comm.size
 local_rank = rank
 split_rank = mpi_comm.split_rank
 
 
-if args.balance != None:
-    balance = args.balance.split(",")
+if balance is not None:
+    balance = balance.split(",")
     balance = [int(j) for j in balance]
 else:
     balance = None
 
-"""
-"image_size_seq" is required to determine the output shape after spatial partitioning of images. 
-The shape of the output will be determined for each model partition based on the values in "image_size_seq."
-These values will then be used to calculate the output shape for a given input size and spatial partition.
-"""
-image_size_seq = 32
-
+# Initialize ResNet model
 model_seq = resnet_cifar_torch.get_resnet_v2(
     (int(batch_size / parts), 3, image_size_seq, image_size_seq), depth=get_depth(2, 12)
 )
-print("length", len(model_seq), balance)
+
 model_gen_seq = model_generator(
     model=model_seq,
     split_size=split_size,
     input_size=(int(batch_size / parts), 3, image_size_seq, image_size_seq),
     balance=balance,
 )
+
+# Get the shape of model on each split rank for image_size_seq and move it to device
+# Note : we take shape w.r.t image_size_seq as model w.r.t image_size may not be
+# able to fit in memory
 model_gen_seq.ready_model(split_rank=split_rank, GET_SHAPES_ON_CUDA=True)
 
+
+# Get the shape of model on each split rank for image_size and number of spatial parts
 image_size_times = int(image_size / image_size_seq)
+temp_count = 0
 if args.slice_method == "square":
-    resnet_shapes_list = [
-        (
-            model_gen_seq.shape_list[0][0],
-            model_gen_seq.shape_list[0][1],
-            int(
-                model_gen_seq.shape_list[0][2]
-                * image_size_times
-                / math.sqrt(num_spatial_parts)
-            ),
-            int(
-                model_gen_seq.shape_list[0][3]
-                * image_size_times
-                / math.sqrt(num_spatial_parts)
-            ),
-        ),
-        model_gen_seq.shape_list[1],
-    ]
+    resnet_shapes_list = []
+    for output_shape in model_gen_seq.shape_list:
+        if isinstance(output_shape, list):
+            temp_shape = []
+            for shape_tuple in output_shape:
+                if temp_count < spatial_size:
+                    # reduce shape only when it is smaller than spatial size
+                    x = (
+                        int(shape_tuple[0]),
+                        shape_tuple[1],
+                        int(shape_tuple[2] * image_size_times / 2),
+                        int(shape_tuple[3] * image_size_times / 2),
+                    )
+                    temp_shape.append(x)
+                else:
+                    x = (
+                        int(shape_tuple[0]),
+                        shape_tuple[1],
+                        int(shape_tuple[2] * image_size_times),
+                        int(shape_tuple[3] * image_size_times),
+                    )
+                    temp_shape.append(x)
+            resnet_shapes_list.append(temp_shape)
+        else:
+            if len(output_shape) == 2:
+                x = (int(output_shape[0]), output_shape[1])
+                resnet_shapes_list.append(x)
+            else:
+                if temp_count < spatial_size:
+                    x = (
+                        int(output_shape[0]),
+                        output_shape[1],
+                        int(output_shape[2] * image_size_times / 2),
+                        int(output_shape[3] * image_size_times / 2),
+                    )
+                    resnet_shapes_list.append(x)
+                else:
+                    x = (
+                        int(output_shape[0]),
+                        output_shape[1],
+                        int(output_shape[2] * image_size_times),
+                        int(output_shape[3] * image_size_times),
+                    )
+                    resnet_shapes_list.append(x)
+        temp_count += 1
 
 elif args.slice_method == "vertical":
-    resnet_shapes_list = [
-        (
-            model_gen_seq.shape_list[0][0],
-            model_gen_seq.shape_list[0][1],
-            int(model_gen_seq.shape_list[0][2] * image_size_times / 1),
-            int(model_gen_seq.shape_list[0][3] * image_size_times / num_spatial_parts),
-        ),
-        model_gen_seq.shape_list[1],
-    ]
+    resnet_shapes_list = []
+    for output_shape in model_gen_seq.shape_list:
+        if isinstance(output_shape, list):
+            temp_shape = []
+            for shape_tuple in output_shape:
+                if temp_count < spatial_size:
+                    x = (
+                        int(shape_tuple[0]),
+                        shape_tuple[1],
+                        int(shape_tuple[2] * image_size_times / 1),
+                        int(
+                            shape_tuple[3]
+                            * image_size_times
+                            / num_spatial_parts_list[temp_count]
+                        ),
+                    )
+                    temp_shape.append(x)
+                else:
+                    x = (
+                        int(shape_tuple[0]),
+                        shape_tuple[1],
+                        int(shape_tuple[2] * image_size_times),
+                        int(shape_tuple[3] * image_size_times),
+                    )
+                    temp_shape.append(x)
+            resnet_shapes_list.append(temp_shape)
+        else:
+            if len(output_shape) == 2:
+                x = (int(output_shape[0]), output_shape[1])
+                resnet_shapes_list.append(x)
+            else:
+                if temp_count < spatial_size:
+                    x = (
+                        int(output_shape[0]),
+                        output_shape[1],
+                        int(output_shape[2] * image_size_times / 1),
+                        int(
+                            output_shape[3]
+                            * image_size_times
+                            / num_spatial_parts_list[temp_count]
+                        ),
+                    )
+                    resnet_shapes_list.append(x)
+                else:
+                    x = (
+                        int(output_shape[0]),
+                        output_shape[1],
+                        int(output_shape[2] * image_size_times),
+                        int(output_shape[3] * image_size_times),
+                    )
+                    resnet_shapes_list.append(x)
+        temp_count += 1
+
 
 elif args.slice_method == "horizontal":
-    resnet_shapes_list = [
-        (
-            model_gen_seq.shape_list[0][0],
-            model_gen_seq.shape_list[0][1],
-            int(model_gen_seq.shape_list[0][2] * image_size_times / num_spatial_parts),
-            int(model_gen_seq.shape_list[0][3] * image_size_times / 1),
-        ),
-        model_gen_seq.shape_list[1],
-    ]
+    resnet_shapes_list = []
+    for output_shape in model_gen_seq.shape_list:
+        if isinstance(output_shape, list):
+            temp_shape = []
+            for shape_tuple in output_shape:
+                if temp_count < spatial_size:
+                    x = (
+                        int(shape_tuple[0]),
+                        shape_tuple[1],
+                        int(
+                            shape_tuple[2]
+                            * image_size_times
+                            / num_spatial_parts_list[temp_count]
+                        ),
+                        int(shape_tuple[3] * image_size_times / 1),
+                    )
+                    temp_shape.append(x)
+                else:
+                    x = (
+                        int(shape_tuple[0]),
+                        shape_tuple[1],
+                        int(shape_tuple[2] * image_size_times),
+                        int(shape_tuple[3] * image_size_times),
+                    )
+                    temp_shape.append(x)
+            resnet_shapes_list.append(temp_shape)
+        else:
+            if len(output_shape) == 2:
+                x = (int(output_shape[0]), output_shape[1])
+                resnet_shapes_list.append(x)
+            else:
+                if temp_count < spatial_size:
+                    x = (
+                        int(output_shape[0]),
+                        output_shape[1],
+                        int(
+                            output_shape[2]
+                            * image_size_times
+                            / num_spatial_parts_list[temp_count]
+                        ),
+                        int(output_shape[3] * image_size_times / 1),
+                    )
+                    resnet_shapes_list.append(x)
+                else:
+                    x = (
+                        int(output_shape[0]),
+                        output_shape[1],
+                        int(output_shape[2] * image_size_times),
+                        int(output_shape[3] * image_size_times),
+                    )
+                    resnet_shapes_list.append(x)
+        temp_count += 1
 
 
 del model_seq
 del model_gen_seq
 torch.cuda.ipc_collect()
 
+# Initialize ResNet model with Spatial and Model Parallelism support
 if args.halo_d2:
     model, balance = resnet_cifar_torch_spatial.get_resnet_v2(
         input_shape=(batch_size / parts, 3, image_size, image_size),
         depth=get_depth(2, 12),
-        local_rank=local_rank % num_spatial_parts,
+        local_rank=local_rank % spatial_part_size,
         mp_size=split_size,
         balance=balance,
         spatial_size=spatial_size,
         num_spatial_parts=num_spatial_parts,
-        num_classes=10,
+        num_classes=num_classes,
         fused_layers=args.fused_layers,
         slice_method=args.slice_method,
     )
@@ -233,12 +367,12 @@ def verify_config():
     model = resnet_cifar_torch_spatial.get_resnet_v2(
         input_shape=(batch_size / parts, 3, image_size, image_size),
         depth=get_depth(2, 12),
-        local_rank=local_rank % num_spatial_parts,
+        local_rank=local_rank % spatial_part_size,
         mp_size=split_size,
         balance=balance,
         spatial_size=spatial_size,
         num_spatial_parts=num_spatial_parts,
-        num_classes=10,
+        num_classes=num_classes,
         fused_layers=args.fused_layers,
         slice_method=args.slice_method,
     )
@@ -252,15 +386,13 @@ def verify_config():
     shape_list=resnet_shapes_list,
 )
 
-
+# Move model to its respective devices
 model_gen.ready_model(split_rank=split_rank)
 
-print("Shape list", resnet_shapes_list)
-
-if local_rank == num_spatial_parts:
-    print(model_gen.models)
-
+logging.info(f"Shape of model on local_rank {local_rank} : {model_gen.shape_list}")
 
+# Initialize parameters required for training the model with Spatial and Model
+# Parallelism support
 t_s = train_model_spatial(
     model_gen,
     local_rank,
@@ -283,6 +415,7 @@ def verify_config():
 y = torch.zeros((batch_size,), dtype=torch.long, device="cuda")
 
 
+############################## Dataset Definition ##############################
 transform = transforms.Compose(
     [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
 )
@@ -290,7 +423,7 @@ def verify_config():
 
 if APP == 1:
     trainset = torchvision.datasets.ImageFolder(
-        "./train", transform=transform, target_transform=None
+        datapath, transform=transform, target_transform=None
     )
     my_dataloader = torch.utils.data.DataLoader(
         trainset,
@@ -302,7 +435,7 @@ def verify_config():
     size_dataset = 1030
 elif APP == 2:
     trainset = torchvision.datasets.CIFAR10(
-        root="./data", train=True, download=True, transform=transform
+        root=datapath, train=True, download=True, transform=transform
     )
     my_dataloader = torch.utils.data.DataLoader(
         trainset,
@@ -330,20 +463,20 @@ def verify_config():
     )
     size_dataset = 10 * batch_size
 
+################################################################################
 
 sync_allreduce.sync_model_spatial(model_gen)
-perf = []
 
 
 def split_input(inputs):
     if args.slice_method == "square":
-        image_height_local = int(image_size / math.sqrt(num_spatial_parts))
-        image_width_local = int(image_size / math.sqrt(num_spatial_parts))
+        image_height_local = int(image_size / math.sqrt(spatial_part_size))
+        image_width_local = int(image_size / math.sqrt(spatial_part_size))
 
-        total_rows = int(math.sqrt(num_spatial_parts))
-        total_cols = int(math.sqrt(num_spatial_parts))
+        total_rows = int(math.sqrt(spatial_part_size))
+        total_cols = int(math.sqrt(spatial_part_size))
 
-        # current position of rank in matrix of math.sqrt(num_spatial_parts) * math.sqrt(num_spatial_parts)
+        # current position of rank in matrix of math.sqrt(spatial_part_size) * math.sqrt(spatial_part_size)
         row = int(local_rank / total_cols)
         col = int(local_rank % total_cols)
 
@@ -356,13 +489,13 @@ def split_input(inputs):
         return inputs[:, :, start_top:end_bottom, start_left:end_right]
 
     elif args.slice_method == "vertical":
-        image_height_local = int(image_size / num_spatial_parts)
-        image_width_local = int(image_size / num_spatial_parts)
+        image_height_local = int(image_size / spatial_part_size)
+        image_width_local = int(image_size / spatial_part_size)
 
         start_left = local_rank * image_width_local
         end_right = (local_rank + 1) * image_width_local
 
-        if local_rank == num_spatial_parts - 1:
+        if local_rank == spatial_part_size - 1:
             # In case of GPU count, partition size will be uneven and last
             # rank will receive remaining image
             return inputs[:, :, :, start_left:]
@@ -370,13 +503,13 @@ def split_input(inputs):
             return inputs[:, :, :, start_left:end_right]
 
     elif args.slice_method == "horizontal":
-        image_height_local = int(image_size / num_spatial_parts)
-        image_width_local = int(image_size / num_spatial_parts)
+        image_height_local = int(image_size / spatial_part_size)
+        image_width_local = int(image_size / spatial_part_size)
 
         start_top = local_rank * image_height_local
         end_bottom = (local_rank + 1) * image_height_local
 
-        if local_rank == num_spatial_parts - 1:
+        if local_rank == spatial_part_size - 1:
             # In case of odd GPU count, partition size will be uneven and last
             # rank will receive remaining image
             return inputs[:, :, start_top:, :]
@@ -384,6 +517,11 @@ def split_input(inputs):
             return inputs[:, :, start_top:end_bottom, :]
 
 
+################################# Train Model ##################################
+
+perf = []
+
+
 def run_epoch():
     for i_e in range(epoch):
         loss = 0
@@ -397,7 +535,7 @@ def run_epoch():
                 break
             inputs, labels = data
 
-            if local_rank < num_spatial_parts_list[0]:
+            if local_rank < spatial_part_size:
                 x = split_input(inputs)
             else:
                 x = inputs
@@ -405,14 +543,14 @@ def run_epoch():
             temp_loss, temp_correct = t_s.run_step(x, labels)
             loss += temp_loss
             correct += temp_correct
-            if local_rank < spatial_size * num_spatial_parts:
+            if local_rank < spatial_size * spatial_part_size:
                 sync_allreduce.apply_allreduce(
                     model_gen, mpi_comm.spatial_allreduce_grp
                 )
             torch.cuda.synchronize()
 
             t_s.update()
-            if local_rank == num_spatial_parts:
+            if local_rank == spatial_part_size:
                 logging.info(
                     f"Step :{i}, LOSS: {temp_loss}, Global loss: {loss/(i+1)} Acc: {temp_correct}"
                 )
@@ -421,17 +559,19 @@ def run_epoch():
             torch.cuda.synchronize()
             t = start_event.elapsed_time(end_event) / 1000
             if local_rank == 0:
-                print("images per sec:", batch_size / t)
+                print(f"Epoch: {i_e} images per sec: {batch_size / t}")
                 perf.append(batch_size / t)
 
             t = time.time()
-        if local_rank == num_spatial_parts:
-            print("epoch", i_e, " Global loss:", loss, " acc", correct / i)
+        if local_rank == comm_size - 1:
+            print(f"Epoch {i_e} Global loss: {loss} Acc {correct / i}")
 
 
 run_epoch()
 
 if local_rank == 0:
-    print("Mean {} Median {}".format(sum(perf) / len(perf), np.median(perf)))
+    print(f"Mean {sum(perf) / len(perf)} Median {np.median(perf)}")
+
+################################################################################
 
 exit()
diff --git a/docs/assets/images/AmeobaNet_img_size_1024.png b/docs/assets/images/AmeobaNet_img_size_1024.png
new file mode 100644
index 00000000..35574ff4
Binary files /dev/null and b/docs/assets/images/AmeobaNet_img_size_1024.png differ
diff --git a/docs/assets/images/AmeobaNet_img_size_2048.png b/docs/assets/images/AmeobaNet_img_size_2048.png
new file mode 100644
index 00000000..4b9a8244
Binary files /dev/null and b/docs/assets/images/AmeobaNet_img_size_2048.png differ
diff --git a/docs/assets/images/DP_MP_SP_Vs_Memory.png b/docs/assets/images/DP_MP_SP_Vs_Memory.png
new file mode 100644
index 00000000..ea0e0fa5
Binary files /dev/null and b/docs/assets/images/DP_MP_SP_Vs_Memory.png differ
diff --git a/docs/assets/images/GEMS_MAST.jpg b/docs/assets/images/GEMS_MAST.jpg
deleted file mode 100644
index 85f0d891..00000000
Binary files a/docs/assets/images/GEMS_MAST.jpg and /dev/null differ
diff --git a/docs/assets/images/Halo_Exchange.jpg b/docs/assets/images/Halo_Exchange.jpg
new file mode 100644
index 00000000..d011882b
Binary files /dev/null and b/docs/assets/images/Halo_Exchange.jpg differ
diff --git a/docs/assets/images/ResNet_img_size_1024.png b/docs/assets/images/ResNet_img_size_1024.png
new file mode 100644
index 00000000..1f9a2f39
Binary files /dev/null and b/docs/assets/images/ResNet_img_size_1024.png differ
diff --git a/docs/assets/images/ResNet_img_size_2048.png b/docs/assets/images/ResNet_img_size_2048.png
new file mode 100644
index 00000000..80a9b953
Binary files /dev/null and b/docs/assets/images/ResNet_img_size_2048.png differ
diff --git a/docs/assets/images/SpatialParallelism.jpg b/docs/assets/images/SpatialParallelism.jpg
deleted file mode 100644
index 10ebd4b5..00000000
Binary files a/docs/assets/images/SpatialParallelism.jpg and /dev/null differ
diff --git a/docs/assets/images/Spatial_Parallelism.jpg b/docs/assets/images/Spatial_Parallelism.jpg
new file mode 100644
index 00000000..caba45dc
Binary files /dev/null and b/docs/assets/images/Spatial_Parallelism.jpg differ
diff --git a/docs/assets/images/halo-exchange.png b/docs/assets/images/halo-exchange.png
new file mode 100644
index 00000000..ca1e14d2
Binary files /dev/null and b/docs/assets/images/halo-exchange.png differ
diff --git a/docs/assets/images/halo-exchange_with_compute.png b/docs/assets/images/halo-exchange_with_compute.png
new file mode 100644
index 00000000..21be8cdb
Binary files /dev/null and b/docs/assets/images/halo-exchange_with_compute.png differ
diff --git a/docs/installation/MVAPICH_INSTALLATION_GUIDE.md b/docs/installation/MVAPICH_INSTALLATION_GUIDE.md
new file mode 100644
index 00000000..ae8a58e5
--- /dev/null
+++ b/docs/installation/MVAPICH_INSTALLATION_GUIDE.md
@@ -0,0 +1,103 @@
+# Installation Guide for MVAPICH2-GDR
+
+**To install MVAPICH2-GDR, refer to https://mvapich.cse.ohio-state.edu/userguide/gdr/**
+
+
+<div align="center">
+  <b>OR</b> 
+</div>
+
+**Follow the instructions below to install MVAPICH2-GDR from RPMs**
+
+
+### Get the appropriate RPMs for your system
+
+In this example, we use the MOFED 5.5 RPM:
+```bash
+wget https://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3.7/mofed5.5/mvapich2-gdr-cuda11.6.mofed5.5.gnu8.5.0-2.3.7-1.el8.x86_64.rpm
+```
+### Unpack the rpm
+```bash
+rpm2cpio mvapich2-gdr-cuda11.6.mofed5.5.gnu8.5.0-2.3.7-1.el8.x86_64.rpm | cpio -id
+```
+
+### Note the path of the unpacked RPM. It should look as follows:
+```
+<directory>/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0
+```
+
+### Add the rpm to your path (this needs to be run every time you start a new job)
+
+```bash
+export RPM_HOME=<directory>/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0
+export PATH=$RPM_HOME/bin:$PATH
+export LD_LIBRARY_PATH=$RPM_HOME/lib64:$LD_LIBRARY_PATH
+export CPATH=$RPM_HOME/include:$CPATH
+```
+
+### Load the GCC and CUDA versions that match the RPM
+```bash
+module load cuda/11.6 gcc/8.5.0
+```
+
+### Update the compiler paths to be absolute instead of relative 
+```bash
+vi $RPM_HOME/bin/mpicc
+```
+
+#### Incorrect paths look like:
+```
+prefix=/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0
+exec_prefix=/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0
+sysconfdir=/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0/etc
+includedir=/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0/include
+libdir=/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0/lib64
+```
+#### Change these so that each path starts with the directory where the RPM was unpacked, i.e. the value of $RPM_HOME:
+```
+prefix=<directory>/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0
+exec_prefix=<directory>/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0
+sysconfdir=<directory>/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0/etc
+includedir=<directory>/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0/include
+libdir=<directory>/opt/mvapich2/gdr/2.3.7/no-mpittool/no-openacc/cuda11.6/mofed5.5/mpirun/gnu8.5.0/lib64
+```
+
+
+### Check the installation by running the OSU Micro-Benchmarks (OMB)
+
+OMB performs common MPI operations like allreduce, bcast, send/recv, etc. OMB is located in **$RPM_HOME/libexec**
+
+#### Run an allreduce benchmark on 2 GPUs.
+
+The following output is expected:
+
+```
+[gulhane.2@a100-01 libexec]$ $MV2_HOME/bin/mpirun_rsh --export-all -np 2 a100-01 a100-01 MV2_USE_CUDA=1 osu-micro-benchmarks/mpi/collective/osu_allreduce -d
+ cuda
+[a100-01.cluster:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
+[a100-01.cluster:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
+
+# OSU MPI-CUDA Allreduce Latency Test v5.9
+# Size       Avg Latency(us)
+4                       1.98
+8                       2.04
+16                      1.60
+[a100-01.cluster:mpi_rank_0][dreg_register] [Performance Impact Warning]: Entries are being evicted from the InfiniBand registration cache. This can lead to degraded performance. Consider increasing MV2_NDREG_ENTRIES_MAX (current value: 16384) and MV2_NDREG_ENTRIES (current value: 8196)
+[a100-01.cluster:mpi_rank_1][dreg_register] [Performance Impact Warning]: Entries are being evicted from the InfiniBand registration cache. This can lead to degraded performance. Consider increasing MV2_NDREG_ENTRIES_MAX (current value: 16384) and MV2_NDREG_ENTRIES (current value: 8196)
+32                     13.30
+64                     13.47
+128                    13.48
+256                    15.15
+512                    13.82
+1024                   13.93
+2048                   14.31
+4096                   15.40
+8192                   16.65
+16384                  19.72
+32768                 456.73
+65536                 503.67
+131072                500.02
+262144                457.37
+524288                631.29
+1048576               630.67
+```
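+
+MVAPICH2 warning lines can be interleaved with the latency table, as in the output above. The following is a minimal Python sketch, not part of OMB, for extracting the size/latency rows from such output when comparing runs. The function name `parse_omb_latency` and the abbreviated sample text are illustrative assumptions.
+
+```python
+import re
+
+# A data row is a message size followed by a latency, e.g. "4096   15.40".
+ROW = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$")
+
+
+def parse_omb_latency(text):
+    """Return {message_size_bytes: avg_latency_us}, skipping headers and warnings."""
+    results = {}
+    for line in text.splitlines():
+        match = ROW.match(line)
+        if match:
+            results[int(match.group(1))] = float(match.group(2))
+    return results
+
+
+# Abbreviated sample in the shape of the output above (warning text shortened).
+sample = """\
+# OSU MPI-CUDA Allreduce Latency Test v5.9
+# Size       Avg Latency(us)
+16                      1.60
+[a100-01.cluster:mpi_rank_0][dreg_register] [Performance Impact Warning]: ...
+32                     13.30
+"""
+latencies = parse_omb_latency(sample)
+print(latencies)
+```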
diff --git a/docs/installation/PYTORCH_INSTALLATION_GUIDE.md b/docs/installation/PYTORCH_INSTALLATION_GUIDE.md
new file mode 100644
index 00000000..8540fde1
--- /dev/null
+++ b/docs/installation/PYTORCH_INSTALLATION_GUIDE.md
@@ -0,0 +1,102 @@
+# Installation Guide for PyTorch with MPI support
+
+*Note: To enable MPI support, PyTorch must be installed from source.*
+
+## Install PyTorch from source
+### Install Miniconda and activate conda environment on Linux
+
+```bash
+bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
+source $HOME/miniconda/bin/activate
+conda create -n PyTorch_env python=3.9.16
+conda activate PyTorch_env
+export PYTHONNOUSERSITE=true
+```
+
+### Clone PyTorch repository
+```bash
+git clone https://github.com/pytorch/pytorch
+cd pytorch
+git checkout v1.12.1
+```
+
+### Add CUDA-aware MPI support
+Modify `pytorch/caffe2/mpi/mpi_ops_gpu.cc` so that both macros are defined as 1:
+```cpp
+#define CAFFE2_HAS_CUDA_MPI_BASICS 1
+#define CAFFE2_HAS_CUDA_MPI_ALLREDUCE 1
+```
+
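+Modify `pytorch/torch/csrc/distributed/c10d/ProcessGroupMPI.cpp` so that the CUDA-aware MPI check returns `true` on every path:
+```cpp
+bool cudaAwareMpiCheck() {
+// Report CUDA-aware MPI support regardless of the runtime query result
+#if defined(MPIX_CUDA_AWARE_SUPPORT)
+  if (MPIX_Query_cuda_support() == 1) {
+    return true;
+  } else {
+    return true;
+  }
+#else // !defined(MPIX_CUDA_AWARE_SUPPORT)
+  return true;
+#endif // MPIX_CUDA_AWARE_SUPPORT
+}
+```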
+
+#### Commit the MPI support changes on a separate branch
+
+```bash
+git checkout -b v1.12.1-cudaMPI
+git add .
+git commit -m "Support for CUDA-aware MPI"
+```
+
+### Install Dependencies
+```bash
+# Install PyTorch build dependencies
+conda install astunparse numpy ninja pyyaml setuptools cmake typing_extensions six requests dataclasses
+conda install mkl mkl-include
+```
+### Set environment variables
+```bash
+export CUDA_HOME=/opt/cuda/$CUDA_VERSION
+export CPATH=$CUDA_HOME/include:$CPATH
+# Example paths: point these at your own cuDNN installation
+export CUDNN_LIB_DIR=/home/gulhane.2/cuda/lib64
+export CUDNN_INCLUDE_DIR=/home/gulhane.2/cuda/include
+```
+
+### Install PyTorch
+```bash
+git submodule sync
+git submodule update --init --recursive
+python setup.py develop
+```
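+
+Once the build finishes, it can be useful to confirm that the MPI backend was actually compiled in. The following is a minimal Python sketch, assuming it is run inside the conda environment created above. `mpi_backend_status` is a hypothetical helper name, while `torch.distributed.is_available()` and `torch.distributed.is_mpi_available()` are existing PyTorch functions.
+
+```python
+def mpi_backend_status():
+    """Describe whether the installed PyTorch build includes the MPI backend."""
+    try:
+        import torch
+        import torch.distributed as dist
+    except ImportError:
+        return "PyTorch is not importable in this environment"
+    if not dist.is_available():
+        return f"PyTorch {torch.__version__}: distributed package not available"
+    if dist.is_mpi_available():
+        return f"PyTorch {torch.__version__}: MPI backend available"
+    return f"PyTorch {torch.__version__}: built without MPI backend"
+
+
+print(mpi_backend_status())
+```
+
+A "built without MPI backend" result usually means that the MPI installation was not detected at build time.
+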
+### For more information, refer to the PyTorch installation guide
+- https://github.com/pytorch/pytorch
+
+
+## Install Torchvision from source
+When PyTorch is built from source, the torchvision package is not installed with it, so torchvision must be installed separately.
+
+### Clone Torchvision repository
+```bash
+git clone https://github.com/pytorch/vision
+```
+
+### Check out the appropriate version
+
+*Note: The torchvision version must be compatible with the PyTorch version. Refer to https://github.com/pytorch/vision#installation to find the torchvision version that corresponds to your PyTorch version.*
+```bash
+cd vision
+git checkout v0.13.0
+```
+
+### Install Torchvision
+
+```bash
+python setup.py install
+```
+### For more information, refer to the Torchvision installation guide
+- https://github.com/pytorch/vision
diff --git a/models/inception_v4.py b/models/inception_v4.py
deleted file mode 100644
index d3c379e0..00000000
--- a/models/inception_v4.py
+++ /dev/null
@@ -1,664 +0,0 @@
-import torch
-import torch.nn as nn
-from torchgems.spatial_new import conv_spatial, halo_exchange_layer
-import math
-
-global_info = {}
-
-
-def set_basic_informations(local_rank, spatial_size, num_spatial_parts, slice_method):
-    global_info["local_rank"] = local_rank
-    global_info["spatial_size"] = spatial_size
-    global_info["num_spatial_parts"] = num_spatial_parts
-    global_info["slice_method"] = slice_method
-
-
-class BasicConv2d(nn.Module):
-    def __init__(
-        self,
-        in_planes,
-        out_planes,
-        kernel_size,
-        stride,
-        padding=0,
-        ENABLE_SPATIAL=False,
-    ):
-        super(BasicConv2d, self).__init__()
-        if ENABLE_SPATIAL:
-            self.conv = conv_spatial(
-                local_rank=global_info["local_rank"],
-                spatial_size=global_info["spatial_size"],
-                num_spatial_parts=global_info["num_spatial_parts"],
-                in_channels=in_planes,
-                out_channels=out_planes,
-                kernel_size=kernel_size,
-                stride=stride,
-                padding=padding,
-                slice_method=global_info["slice_method"],
-            )
-        else:
-            self.conv = nn.Conv2d(
-                in_planes,
-                out_planes,
-                kernel_size=kernel_size,
-                stride=stride,
-                padding=padding,
-                bias=False,
-            )  # verify bias false
-        self.bn = nn.BatchNorm2d(out_planes, eps=0.001, momentum=0, affine=True)
-        self.relu = nn.ReLU(inplace=True)
-
-    def forward(self, x):
-        x = self.conv(x)
-        x = self.bn(x)
-        x = self.relu(x)
-        return x
-
-
-class Mixed_3a(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Mixed_3a, self).__init__()
-
-        if ENABLE_SPATIAL:
-            self.halo_len_layer = halo_exchange_layer(
-                local_rank=global_info["local_rank"],
-                spatial_size=global_info["spatial_size"],
-                num_spatial_parts=global_info["num_spatial_parts"],
-                halo_len=1,
-                slice_method=global_info["slice_method"],
-            )
-            self.maxpool = nn.MaxPool2d(3, stride=2, padding=0)
-        else:
-            self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)
-
-        self.conv = BasicConv2d(
-            64, 96, kernel_size=3, stride=2, ENABLE_SPATIAL=ENABLE_SPATIAL, padding=1
-        )
-
-        self.ENABLE_SPATIAL = ENABLE_SPATIAL
-
-    def forward(self, x):
-        if self.ENABLE_SPATIAL:
-            x_halo = self.halo_len_layer(x)
-            x0 = self.maxpool(x_halo)
-        else:
-            x0 = self.maxpool(x)
-
-        x1 = self.conv(x)
-        out = torch.cat((x0, x1), 1)
-        return out
-
-
-class Mixed_4a(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Mixed_4a, self).__init__()
-
-        self.branch0 = nn.Sequential(
-            BasicConv2d(
-                160, 64, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                64,
-                96,
-                kernel_size=3,
-                stride=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-                padding=1,
-            ),
-        )
-
-        self.branch1 = nn.Sequential(
-            BasicConv2d(
-                160, 64, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                64,
-                64,
-                kernel_size=(1, 7),
-                stride=1,
-                padding=(0, 3),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                64,
-                64,
-                kernel_size=(7, 1),
-                stride=1,
-                padding=(3, 0),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                64,
-                96,
-                kernel_size=(3, 3),
-                stride=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-                padding=1,
-            ),
-        )
-
-    def forward(self, x):
-        x0 = self.branch0(x)
-        x1 = self.branch1(x)
-        out = torch.cat((x0, x1), 1)
-        return out
-
-
-class Mixed_5a(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Mixed_5a, self).__init__()
-
-        if ENABLE_SPATIAL:
-            self.halo_len_layer = halo_exchange_layer(
-                local_rank=global_info["local_rank"],
-                spatial_size=global_info["spatial_size"],
-                num_spatial_parts=global_info["num_spatial_parts"],
-                halo_len=1,
-                slice_method=global_info["slice_method"],
-            )
-            self.maxpool = nn.MaxPool2d(3, stride=2, padding=0)
-        else:
-            self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)
-
-        self.conv = BasicConv2d(
-            192, 192, kernel_size=3, stride=2, ENABLE_SPATIAL=ENABLE_SPATIAL, padding=1
-        )
-        self.ENABLE_SPATIAL = ENABLE_SPATIAL
-
-    def forward(self, x):
-        x0 = self.conv(x)
-        if self.ENABLE_SPATIAL:
-            x_halo = self.halo_len_layer(x)
-            x1 = self.maxpool(x_halo)
-        else:
-            x1 = self.maxpool(x)
-        out = torch.cat((x0, x1), 1)
-        return out
-
-
-class branch_pooling(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(branch_pooling, self).__init__()
-
-        if ENABLE_SPATIAL:
-            self.halo_len_layer = halo_exchange_layer(
-                local_rank=global_info["local_rank"],
-                spatial_size=global_info["spatial_size"],
-                num_spatial_parts=global_info["num_spatial_parts"],
-                halo_len=1,
-                slice_method=global_info["slice_method"],
-            )
-            self.pool = nn.AvgPool2d(3, stride=1, padding=0, count_include_pad=False)
-        else:
-            self.pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
-
-        self.ENABLE_SPATIAL = ENABLE_SPATIAL
-
-    def forward(self, x):
-        if self.ENABLE_SPATIAL:
-            x = self.halo_len_layer(x)
-        x = self.pool(x)
-        return x
-
-
-class Inception_A(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Inception_A, self).__init__()
-        self.branch0 = BasicConv2d(
-            384, 96, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-        )
-
-        self.branch1 = nn.Sequential(
-            BasicConv2d(
-                384, 64, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                64,
-                96,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-        )
-
-        self.branch2 = nn.Sequential(
-            BasicConv2d(
-                384, 64, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                64,
-                96,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                96,
-                96,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-        )
-
-        self.branch3 = nn.Sequential(
-            branch_pooling(),
-            BasicConv2d(
-                384, 96, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-        )
-
-    def forward(self, x):
-        x0 = self.branch0(x)
-        x1 = self.branch1(x)
-        x2 = self.branch2(x)
-        x3 = self.branch3(x)
-        out = torch.cat((x0, x1, x2, x3), 1)
-        return out
-
-
-class Reduction_A(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Reduction_A, self).__init__()
-        self.branch0 = BasicConv2d(
-            384, 384, kernel_size=3, stride=2, ENABLE_SPATIAL=ENABLE_SPATIAL, padding=1
-        )
-
-        self.branch1 = nn.Sequential(
-            BasicConv2d(
-                384, 192, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                192,
-                224,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                224,
-                256,
-                kernel_size=3,
-                stride=2,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-                padding=1,
-            ),
-        )
-
-        self.branch2 = nn.MaxPool2d(3, stride=2, padding=1)
-
-    def forward(self, x):
-        x0 = self.branch0(x)
-        x1 = self.branch1(x)
-        x2 = self.branch2(x)
-        out = torch.cat((x0, x1, x2), 1)
-        return out
-
-
-class Inception_B(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Inception_B, self).__init__()
-        self.branch0 = BasicConv2d(
-            1024, 384, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-        )
-
-        self.branch1 = nn.Sequential(
-            BasicConv2d(
-                1024, 192, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                192,
-                224,
-                kernel_size=(1, 7),
-                stride=1,
-                padding=(0, 3),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                224,
-                256,
-                kernel_size=(7, 1),
-                stride=1,
-                padding=(3, 0),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-        )
-
-        self.branch2 = nn.Sequential(
-            BasicConv2d(
-                1024, 192, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                192,
-                192,
-                kernel_size=(7, 1),
-                stride=1,
-                padding=(3, 0),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                192,
-                224,
-                kernel_size=(1, 7),
-                stride=1,
-                padding=(0, 3),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                224,
-                224,
-                kernel_size=(7, 1),
-                stride=1,
-                padding=(3, 0),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                224,
-                256,
-                kernel_size=(1, 7),
-                stride=1,
-                padding=(0, 3),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-        )
-
-        self.branch3 = nn.Sequential(
-            branch_pooling(),
-            BasicConv2d(
-                1024, 128, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-        )
-
-    def forward(self, x):
-        x0 = self.branch0(x)
-        x1 = self.branch1(x)
-        x2 = self.branch2(x)
-        x3 = self.branch3(x)
-        out = torch.cat((x0, x1, x2, x3), 1)
-        return out
-
-
-class Reduction_B(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Reduction_B, self).__init__()
-
-        self.branch0 = nn.Sequential(
-            BasicConv2d(
-                1024, 192, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                192,
-                192,
-                kernel_size=3,
-                stride=2,
-                padding=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-        )
-
-        self.branch1 = nn.Sequential(
-            BasicConv2d(
-                1024, 256, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-            BasicConv2d(
-                256,
-                256,
-                kernel_size=(1, 7),
-                stride=1,
-                padding=(0, 3),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                256,
-                320,
-                kernel_size=(7, 1),
-                stride=1,
-                padding=(3, 0),
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-            BasicConv2d(
-                320,
-                320,
-                kernel_size=3,
-                stride=2,
-                padding=1,
-                ENABLE_SPATIAL=ENABLE_SPATIAL,
-            ),
-        )
-
-        self.branch2 = nn.MaxPool2d(3, stride=2, padding=1)
-
-    def forward(self, x):
-        x0 = self.branch0(x)
-        x1 = self.branch1(x)
-        x2 = self.branch2(x)
-        out = torch.cat((x0, x1, x2), 1)
-        return out
-
-
-class Inception_C(nn.Module):
-    def __init__(self, ENABLE_SPATIAL=False):
-        super(Inception_C, self).__init__()
-
-        self.branch0 = BasicConv2d(
-            1536, 256, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-        )
-
-        self.branch1_0 = BasicConv2d(
-            1536, 384, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-        )
-        self.branch1_1a = BasicConv2d(
-            384,
-            256,
-            kernel_size=(1, 3),
-            stride=1,
-            padding=(0, 1),
-            ENABLE_SPATIAL=ENABLE_SPATIAL,
-        )
-        self.branch1_1b = BasicConv2d(
-            384,
-            256,
-            kernel_size=(3, 1),
-            stride=1,
-            padding=(1, 0),
-            ENABLE_SPATIAL=ENABLE_SPATIAL,
-        )
-
-        self.branch2_0 = BasicConv2d(
-            1536, 384, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-        )
-        self.branch2_1 = BasicConv2d(
-            384,
-            448,
-            kernel_size=(3, 1),
-            stride=1,
-            padding=(1, 0),
-            ENABLE_SPATIAL=ENABLE_SPATIAL,
-        )
-        self.branch2_2 = BasicConv2d(
-            448,
-            512,
-            kernel_size=(1, 3),
-            stride=1,
-            padding=(0, 1),
-            ENABLE_SPATIAL=ENABLE_SPATIAL,
-        )
-        self.branch2_3a = BasicConv2d(
-            512,
-            256,
-            kernel_size=(1, 3),
-            stride=1,
-            padding=(0, 1),
-            ENABLE_SPATIAL=ENABLE_SPATIAL,
-        )
-        self.branch2_3b = BasicConv2d(
-            512,
-            256,
-            kernel_size=(3, 1),
-            stride=1,
-            padding=(1, 0),
-            ENABLE_SPATIAL=ENABLE_SPATIAL,
-        )
-
-        self.branch3 = nn.Sequential(
-            branch_pooling(),
-            BasicConv2d(
-                1536, 256, kernel_size=1, stride=1, ENABLE_SPATIAL=ENABLE_SPATIAL
-            ),
-        )
-
-    def forward(self, x):
-        x0 = self.branch0(x)
-
-        x1_0 = self.branch1_0(x)
-        x1_1a = self.branch1_1a(x1_0)
-        x1_1b = self.branch1_1b(x1_0)
-        x1 = torch.cat((x1_1a, x1_1b), 1)
-
-        x2_0 = self.branch2_0(x)
-        x2_1 = self.branch2_1(x2_0)
-        x2_2 = self.branch2_2(x2_1)
-        x2_3a = self.branch2_3a(x2_2)
-        x2_3b = self.branch2_3b(x2_2)
-        x2 = torch.cat((x2_3a, x2_3b), 1)
-
-        x3 = self.branch3(x)
-
-        out = torch.cat((x0, x1, x2, x3), 1)
-        return out
-
-
-class end_part(nn.Module):
-    def __init__(self, image_size, num_classes, ENABLE_SPATIAL=False):
-        super(end_part, self).__init__()
-
-        self.padding = 3
-
-        self.pool = nn.AvgPool2d(8, padding=self.padding, count_include_pad=False)
-
-        h_in = math.floor(image_size / 32)
-        h_out = math.floor((h_in + 2 * self.padding - 8) / 8 + 1)
-
-        self.classif = nn.Linear(1536 * h_out * h_out, num_classes)
-
-        self.ENABLE_SPATIAL = ENABLE_SPATIAL
-
-        if ENABLE_SPATIAL:
-            self.halo_len_layer = halo_exchange_layer(
-                local_rank=global_info["local_rank"],
-                spatial_size=global_info["spatial_size"],
-                num_spatial_parts=global_info["num_spatial_parts"],
-                halo_len=1,
-                slice_method=global_info["slice_method"],
-            )
-
-    def forward(self, x):
-        if self.ENABLE_SPATIAL:
-            x = self.halo_len_layer(x)
-        x = self.pool(x)
-        x = x.view(x.size(0), -1)
-        out = self.classif(x)
-
-        return out
-
-
-def get_inceptionv4(image_size, num_classes=1001):
-    model = nn.Sequential(
-        BasicConv2d(3, 32, kernel_size=3, stride=2, padding=1),
-        BasicConv2d(32, 32, kernel_size=3, stride=1, padding=1),
-        BasicConv2d(32, 64, kernel_size=3, stride=1, padding=1),
-        Mixed_3a(),
-        Mixed_4a(),
-        Mixed_5a(),
-        Inception_A(),
-        Inception_A(),
-        Inception_A(),
-        Inception_A(),
-        Reduction_A(),  # Mixed_6a
-        Inception_B(),
-        Inception_B(),
-        Inception_B(),
-        Inception_B(),
-        Inception_B(),
-        Inception_B(),
-        Inception_B(),
-        Reduction_B(),  # Mixed_7a
-        Inception_C(),
-        Inception_C(),
-        Inception_C(),
-        end_part(image_size, num_classes),
-    )
-    return model
-
-
-def get_inceptionv4_spatial(
-    image_size, num_classes, local_rank, spatial_size, num_spatial_parts
-):
-    set_basic_informations(local_rank, spatial_size, num_spatial_parts, "square")
-
-    model = nn.Sequential(
-        BasicConv2d(3, 32, kernel_size=3, stride=2, padding=1, ENABLE_SPATIAL=True),
-        BasicConv2d(32, 32, kernel_size=3, stride=1, padding=1, ENABLE_SPATIAL=True),
-        BasicConv2d(32, 64, kernel_size=3, stride=1, padding=1, ENABLE_SPATIAL=True),
-        Mixed_3a(ENABLE_SPATIAL=True),
-        Mixed_4a(ENABLE_SPATIAL=True),
-        Mixed_5a(ENABLE_SPATIAL=True),
-        Inception_A(ENABLE_SPATIAL=True),
-        Inception_A(ENABLE_SPATIAL=True),
-        Inception_A(ENABLE_SPATIAL=True),
-        Inception_A(ENABLE_SPATIAL=True),
-        Reduction_A(ENABLE_SPATIAL=True),  # Mixed_6a
-        Inception_B(ENABLE_SPATIAL=True),
-        Inception_B(ENABLE_SPATIAL=True),
-        Inception_B(ENABLE_SPATIAL=True),
-        Inception_B(ENABLE_SPATIAL=True),
-        Inception_B(ENABLE_SPATIAL=True),
-        Inception_B(),
-        Inception_B(),
-        Reduction_B(),  # Mixed_7a
-        Inception_C(),
-        Inception_C(),
-        Inception_C(),
-        end_part(image_size, num_classes),
-    )
-    return model
-
-
-class InceptionV4(nn.Module):
-    def __init__(self, num_classes=1001):
-        super(InceptionV4, self).__init__()
-        self.features = nn.Sequential(
-            BasicConv2d(3, 32, kernel_size=3, stride=2, ENABLE_SPATIAL=True),
-            BasicConv2d(32, 32, kernel_size=3, stride=1, ENABLE_SPATIAL=True),
-            BasicConv2d(
-                32, 64, kernel_size=3, stride=1, padding=1, ENABLE_SPATIAL=True
-            ),
-            Mixed_3a(ENABLE_SPATIAL=True),
-            Mixed_4a(ENABLE_SPATIAL=True),
-            Mixed_5a(ENABLE_SPATIAL=True),
-            Inception_A(ENABLE_SPATIAL=True),
-            Inception_A(ENABLE_SPATIAL=True),
-            Inception_A(ENABLE_SPATIAL=True),
-            Inception_A(ENABLE_SPATIAL=True),
-            Reduction_A(ENABLE_SPATIAL=True),  # Mixed_6a
-            Inception_B(ENABLE_SPATIAL=True),
-            Inception_B(ENABLE_SPATIAL=True),
-            Inception_B(ENABLE_SPATIAL=True),
-            Inception_B(ENABLE_SPATIAL=True),
-            Inception_B(ENABLE_SPATIAL=True),
-            Inception_B(),
-            Inception_B(),
-            Reduction_B(),  # Mixed_7a
-            Inception_C(),
-            Inception_C(),
-            Inception_C(),
-            end_part(),
-        )
-        self.classif = nn.Linear(1536, num_classes)
-
-    def forward(self, x):
-        return self
diff --git a/setup.py b/setup.py
new file mode 100644
index 00000000..f22ce15f
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,8 @@
+from setuptools import setup, find_packages
+
+setup(
+    name="mpi4dl",
+    version="1.0",
+    packages=find_packages(where="src"),
+    package_dir={"": "src"},
+)
diff --git a/models/__init__.py b/src/models/__init__.py
similarity index 100%
rename from models/__init__.py
rename to src/models/__init__.py
diff --git a/models/amoebanet.py b/src/models/amoebanet.py
similarity index 96%
rename from models/amoebanet.py
rename to src/models/amoebanet.py
index 15bf2f9b..05272892 100644
--- a/models/amoebanet.py
+++ b/src/models/amoebanet.py
@@ -1,7 +1,7 @@
 import torch
 import torch.nn as nn
 from collections import OrderedDict
-from torchgems.spatial_new import conv_spatial, Pool
+from torchgems.spatial import conv_spatial, Pool
 from typing import Any, TYPE_CHECKING, Iterator, List, Tuple, Union, cast
 from torch import Tensor
 
@@ -583,8 +583,6 @@ def normal_cells(ENABLE_SPATIAL) -> Iterator[Tuple[int, Cell]]:
     )
     layers["cell2_reduction"] = reduction_cell(ENABLE_SPATIAL)
 
-    print("Layers:", len(layers))
-
     layers.update(
         (f"cell3_normal{i+1}", cell) for i, cell in normal_cells(ENABLE_SPATIAL)
     )
@@ -596,7 +594,6 @@ def normal_cells(ENABLE_SPATIAL) -> Iterator[Tuple[int, Cell]]:
     # Finally, classifier
     layers["classify"] = Classify(channels_prev, num_classes)
 
-    print("Layers:", len(layers))
     return nn.Sequential(layers)
 
 
@@ -685,7 +682,6 @@ def normal_cells() -> Iterator[Tuple[int, Cell]]:
     layers.update((f"cell1_normal{i+1}", cell) for i, cell in normal_cells())
     layers["cell2_reduction"] = reduction_cell()
 
-    print("Layers")
     layers.update((f"cell3_normal{i+1}", cell) for i, cell in normal_cells())
     layers["cell4_reduction"] = reduction_cell()
     layers.update((f"cell5_normal{i+1}", cell) for i, cell in normal_cells())
diff --git a/models/amoebanet_d2.py b/src/models/amoebanet_d2.py
similarity index 96%
rename from models/amoebanet_d2.py
rename to src/models/amoebanet_d2.py
index 1fb6ad49..7769bc43 100644
--- a/models/amoebanet_d2.py
+++ b/src/models/amoebanet_d2.py
@@ -1,7 +1,7 @@
 import torch
 import torch.nn as nn
 from collections import OrderedDict
-from torchgems.spatial_new import conv_spatial, halo_exchange_layer, Pool
+from torchgems.spatial import conv_spatial, halo_exchange_layer, Pool
 from typing import Any, TYPE_CHECKING, Iterator, List, Tuple, Union, cast
 from torch import Tensor
 
@@ -813,8 +813,6 @@ def normal_cells(ENABLE_SPATIAL) -> Iterator[Tuple[int, Cell]]:
     )
     layers["cell2_reduction"] = reduction_cell(ENABLE_SPATIAL)
 
-    print("Layers:", len(layers))
-
     layers.update(
         (f"cell3_normal{i+1}", cell) for i, cell in normal_cells(ENABLE_SPATIAL)
     )
@@ -826,7 +824,6 @@ def normal_cells(ENABLE_SPATIAL) -> Iterator[Tuple[int, Cell]]:
     # Finally, classifier
     layers["classify"] = Classify(channels_prev, num_classes)
 
-    print("Layers:", len(layers))
     return nn.Sequential(layers)
 
 
@@ -926,7 +923,6 @@ def normal_cells() -> Iterator[Tuple[int, Cell]]:
     layers.update((f"cell1_normal{i+1}", cell) for i, cell in normal_cells())
     layers["cell2_reduction"] = reduction_cell()
 
-    print("Layers")
     layers.update((f"cell3_normal{i+1}", cell) for i, cell in normal_cells())
     layers["cell4_reduction"] = reduction_cell()
     layers.update((f"cell5_normal{i+1}", cell) for i, cell in normal_cells())
diff --git a/models/resnet_cifar_torch.py b/src/models/resnet_cifar_torch.py
similarity index 94%
rename from models/resnet_cifar_torch.py
rename to src/models/resnet_cifar_torch.py
index 4659ff7a..935c41a6 100644
--- a/models/resnet_cifar_torch.py
+++ b/src/models/resnet_cifar_torch.py
@@ -113,7 +113,6 @@ def __init__(self, kernel_size, batch_size, num_filters, image_size):
             * int(image_size / (4 * kernel_size))
             * int(image_size / (4 * kernel_size))
         )
-        # print("flatten_size",self.flatten_size,image_size,kernel_size,num_filters, int(image_size/(4*kernel_size)))
         self.fc1 = nn.Linear(self.flatten_size, 10)
 
     def forward(self, x):
@@ -135,13 +134,10 @@ def get_resnet_v1(input_shape, depth, num_classes=10):
     num_filters = 16
     num_res_blocks = int((depth - 2) / 6)
 
-    # inputs = Input(shape=input_shape)
     layers[str(name)] = resnet_layer(in_num_filters=in_filters)
     name += 1
-    # in_filters = num_filters
 
     in_filters = num_filters
-    # return nn.Sequential(layers)
     for stack in range(3):
         for res_block in range(num_res_blocks):
             strides = 1
diff --git a/models/resnet_cifar_torch_spatial.py b/src/models/resnet_cifar_torch_spatial.py
similarity index 93%
rename from models/resnet_cifar_torch_spatial.py
rename to src/models/resnet_cifar_torch_spatial.py
index 95814cb7..8b8bc3c7 100644
--- a/models/resnet_cifar_torch_spatial.py
+++ b/src/models/resnet_cifar_torch_spatial.py
@@ -1,7 +1,7 @@
 import torch.nn as nn
 import torch.nn.functional as F
 from collections import OrderedDict
-from torchgems.spatial_new import conv_spatial
+from torchgems.spatial import conv_spatial
 
 
 class resnet_layer(nn.Module):
@@ -236,7 +236,6 @@ def __init__(self, kernel_size, batch_size, num_filters, image_size):
             * int(image_size / (4 * kernel_size))
             * int(image_size / (4 * kernel_size))
         )
-        # print("flatten_size",self.flatten_size,image_size,kernel_size,num_filters, int(image_size/(4*kernel_size)))
         self.fc1 = nn.Linear(self.flatten_size, 10)
 
     def forward(self, x):
@@ -299,7 +298,6 @@ def get_resnet_v1(
     _, end_layer = get_start_end_layer_index(
         num_layers, balance, mp_size, local_rank=spatial_size - 1
     )
-    print("end_layer:", end_layer)
 
     # inputs = Input(shape=input_shape)
     layers[str(name)] = resnet_layer_spatial(
@@ -467,7 +465,7 @@ def forward(self, x):
         # Check for vertical and horzontal slicing
         shapes = y.shape
         if shapes[2] != shapes[3]:
-            print("ERROR: YOU ARE IN TROUBLE (SHAPES ARE UNEQUAL)")
+            print("ERROR: SHAPES ARE UNEQUAL")
 
         y = self.r2(y)
         y = self.r3(y)
@@ -542,10 +540,7 @@ def get_resnet_v2(
     _, end_layer = get_start_end_layer_index(
         num_layers, balance, mp_size, local_rank=spatial_size - 1
     )
-    print("end_layer:", end_layer)
 
-    # inputs = Input(shape=input_shape)
-    # layers[str(name)] = resnet_layer(in_num_filters=in_filters,conv_first=True)
     layers[str(name)] = resnet_layer_spatial(
         local_rank,
         spatial_size,
@@ -554,10 +549,8 @@ def get_resnet_v2(
         slice_method=slice_method,
     )
     name += 1
-    # in_filters = num_filters
 
     in_filters = num_filters_in
-    # return nn.Sequential(layers)
     for stage in range(3):
         for res_block in range(num_res_blocks):
             strides = 1
diff --git a/models/resnet_cifar_torch_spatial_d2.py b/src/models/resnet_cifar_torch_spatial_d2.py
similarity index 93%
rename from models/resnet_cifar_torch_spatial_d2.py
rename to src/models/resnet_cifar_torch_spatial_d2.py
index 88452229..1871a2a6 100644
--- a/models/resnet_cifar_torch_spatial_d2.py
+++ b/src/models/resnet_cifar_torch_spatial_d2.py
@@ -1,7 +1,7 @@
 import torch.nn as nn
 import torch.nn.functional as F
 from collections import OrderedDict
-from torchgems.spatial_new import conv_spatial, halo_exchange_layer
+from torchgems.spatial import conv_spatial, halo_exchange_layer
 
 
 class resnet_layer(nn.Module):
@@ -255,7 +255,6 @@ def __init__(self, kernel_size, batch_size, num_filters, image_size):
             * int(image_size / (4 * kernel_size))
             * int(image_size / (4 * kernel_size))
         )
-        # print("flatten_size",self.flatten_size,image_size,kernel_size,num_filters, int(image_size/(4*kernel_size)))
         self.fc1 = nn.Linear(self.flatten_size, 10)
 
     def forward(self, x):
@@ -267,6 +266,14 @@ def forward(self, x):
         return x
 
 
+def get_balance(num_layers, mp_size):
+    part_layer = int(num_layers / mp_size)
+    balance = [part_layer] * mp_size
+    # add remaining layers to last split if split is uneven
+    balance[mp_size - 1] = balance[mp_size - 1] + (num_layers - sum(balance))
+    return balance
+
+
 def get_start_end_layer_index(num_layers, balance, mp_size, local_rank=0):
     # return the index of start and end layer for the model
     # based on the size of model parallelism and local rank of the process
@@ -318,7 +325,6 @@ def get_resnet_v1(
     _, end_layer = get_start_end_layer_index(
         num_layers, balance, mp_size, local_rank=spatial_size - 1
     )
-    print("end_layer:", end_layer)
 
     # inputs = Input(shape=input_shape)
     layers[str(name)] = resnet_layer_spatial(
@@ -326,10 +332,9 @@ def get_resnet_v1(
     )
 
     name += 1
-    # in_filters = num_filters
 
     in_filters = num_filters
-    # return nn.Sequential(layers)
+
     for stack in range(3):
         for res_block in range(num_res_blocks):
             strides = 1
@@ -446,9 +451,6 @@ def forward(self, x):
         if self.resblock == 0:
             temp = self.r4(temp)
 
-        # print(y.shape, )
-        # print("Rank {} Y {} temp {} Filters {}".format(self.local_rank,y.shape, temp.shape, self.in_filters))
-
         temp = temp + y
         # x = F.relu(x)
         return temp
@@ -501,7 +503,7 @@ def forward(self, x):
         # Check for vertical and horzontal slicing
         shapes = y.shape
         if shapes[2] != shapes[3]:
-            print("ERROR: YOU ARE IN TROUBLE (SHAPES ARE UNEQUAL)")
+            print("ERROR: SHAPES ARE UNEQUAL")
 
         y = self.r2(y)
         y = self.r3(y)
@@ -572,13 +574,13 @@ def get_resnet_v2(
 
     num_layers = num_res_blocks * 3 + 2
 
+    if balance is None:
+        balance = get_balance(num_layers, mp_size)
+
     _, end_layer = get_start_end_layer_index(
         num_layers, balance, mp_size, local_rank=spatial_size - 1
     )
-    print("end_layer:", end_layer)
 
-    # inputs = Input(shape=input_shape)
-    # layers[str(name)] = resnet_layer(in_num_filters=in_filters,conv_first=True)
     layers[str(name)] = resnet_layer_spatial(
         local_rank,
         spatial_size,
@@ -587,7 +589,6 @@ def get_resnet_v2(
         slice_method=slice_method,
     )
     name += 1
-    # in_filters = num_filters
 
     in_filters = num_filters_in
 
diff --git a/torchgems/__init__.py b/src/torchgems/__init__.py
similarity index 100%
rename from torchgems/__init__.py
rename to src/torchgems/__init__.py
diff --git a/torchgems/comm.py b/src/torchgems/comm.py
similarity index 90%
rename from torchgems/comm.py
rename to src/torchgems/comm.py
index e7ef04cb..aecc4985 100644
--- a/torchgems/comm.py
+++ b/src/torchgems/comm.py
@@ -84,9 +84,6 @@ def __init__(
                 self.total_spatial_processes = num_spatial_parts
         if ENABLE_SPATIAL == True:
             self.spatial_allreduce_grp = self.create_allreduce_comm_spatial()
-            print("success:", self.total_spatial_processes)
-            # if(self.local_rank < self.total_spatial_processes):
-            # 	self.test_allreduce_comm(self.spatial_allreduce_grp)
         else:
             self.spatial_allreduce_grp = None
 
@@ -97,7 +94,6 @@ def __init__(
                     num_spatial_parts, self.local_rank
                 )
             else:
-                # self.split_rank = self.local_rank - self.total_spatial_processes + spatial_size
                 self.split_rank = (
                     math.floor(
                         (self.local_rank - self.total_spatial_processes)
@@ -120,8 +116,6 @@ def __init__(
             self.LOCAL_DP_MP_Comm = None
 
         self.allreduce_grp = self.create_allreduce_comm()
-
-        # BUG: if allreduce is called after send and recv it results in unexpected behaviour
         self.test_allreduce_comm(self.allreduce_grp)
 
     def get_split_rank(self, num_spatial_parts_list, local_rank):
@@ -166,8 +160,6 @@ def create_allreduce_comm_master(self):
                 ranks = [temp_first_rank, temp_second_rank]
                 temp_allreduce_grp = torch.distributed.new_group(ranks=ranks)
 
-                print("This is good!!!!!!!!!!!!!!!!!!!!!!")
-
                 if self.first_local_rank in ranks:
                     self.first_LP_master_group = temp_allreduce_grp
                 if self.second_local_rank in ranks:
@@ -185,14 +177,11 @@ def create_allreduce_comm_master(self):
         return allreduce_grp
 
     def create_allreduce_comm_spatial(self):
-        # if(self.local_rank < self.num_spatial_parts * self.spatial_size):
         if self.ENABLE_MASTER:
             first_local_rank = self.mp_size - 1 - self.local_rank
             second_local_rank = self.local_rank
         spatial_allreduce_grp = None
         for j in range(self.spatial_size):
-            # multiplier = math.floor(self.local_rank / self.num_spatial_parts)
-
             if self.spatial_size == 1:
                 ranks = [
                     (self.num_spatial_parts * j) + i
@@ -203,11 +192,9 @@ def create_allreduce_comm_spatial(self):
                     (sum(self.num_spatial_parts_list[:j])) + i
                     for i in range(self.num_spatial_parts_list[j])
                 ]
-            print(ranks)
 
             if self.ENABLE_MASTER:
                 for i in range(len(ranks)):
-                    # ranks[i] = self.mp_size - 1 - ranks[i]
                     ranks.append(self.mp_size - 1 - ranks[i])
 
             temp_spatial_allreduce_grp = torch.distributed.new_group(ranks=ranks)
@@ -243,7 +230,6 @@ def create_allreduce_comm_spatial(self):
         return spatial_allreduce_grp
 
     def create_scatter_gather_spatial_MP_comm(self):
-        # if(self.local_rank < self.num_spatial_parts * self.spatial_size):
         if self.spatial_size == 1:
             prev_num_spatial_parts = self.num_spatial_parts
         else:
@@ -258,15 +244,11 @@ def create_scatter_gather_spatial_MP_comm(self):
         SP_LP_group = None
 
         for j in range(prev_num_spatial_parts):
-            # multiplier = math.floor(self.local_rank / self.num_spatial_parts)
             temp_ranks = [start_last_spatial_rank + j] + LP_ranks
 
             if self.ENABLE_MASTER:
                 for i in range(len(temp_ranks)):
                     temp_ranks[i] = self.mp_size - 1 - temp_ranks[i]
-
-            print(temp_ranks)
-
             temp_LP_SP_grp = torch.distributed.new_group(ranks=temp_ranks)
             LP_SP_Groups.append(temp_LP_SP_grp)
 
@@ -280,12 +262,8 @@ def create_local_DP_in_MP_comm(self):
         LOCAL_DP_MP_Comm = None
 
         for j in range(int(num_mp_ranks / self.LOCAL_DP_LP)):
-            # multiplier = math.floor(self.local_rank / self.num_spatial_parts)
-
             start_rank = self.total_spatial_processes + (j * self.LOCAL_DP_LP)
             temp_ranks = [start_rank + i for i in range(self.LOCAL_DP_LP)]
-            print(temp_ranks)
-
             if self.ENABLE_MASTER:
                 for i in range(len(temp_ranks)):
                     temp_ranks[i] = self.mp_size - 1 - temp_ranks[i]
@@ -444,11 +422,9 @@ def get_grad_flatten(self, model, back=False):
     def modify_grads(self, model, flat_grad, grad_num_element_list, grad_shape_list):
         temp_count_elements = 0
         temp_count_index = 0
-        # print("flat_grad",flat_grad)
 
         for param in model.parameters():
             if param.grad is not None:
-                # print(temp_count_elements,grad_num_element_list[temp_count_index])
                 param.grad.data = (
                     flat_grad[
                         temp_count_elements : temp_count_elements
@@ -513,11 +489,7 @@ def apply_allreduce(self, model_gen, allreduce_grp):
         torch.cuda.synchronize()
         models1 = model_gen.models
         flat_grad1 = self.get_grad_flatten(models1, back=False)
-        # torch.cuda.synchronize()
-        # print("Before allreduce local_rank:",self.local_rank,flat_grad1.abs().sum())
         dist.all_reduce(flat_grad1.data, op=dist.reduce_op.SUM, group=allreduce_grp)
-        # print("After allreduce local_rank:",self.local_rank,flat_grad1.abs().sum())
-        # torch.cuda.synchronize()
         self.modify_grads(
             models1, flat_grad1, self.grad_num_element_list1, self.grad_shape_list1
         )
diff --git a/torchgems/gems_master.py b/src/torchgems/gems_master.py
similarity index 77%
rename from torchgems/gems_master.py
rename to src/torchgems/gems_master.py
index 2d3c1ded..5234250f 100644
--- a/torchgems/gems_master.py
+++ b/src/torchgems/gems_master.py
@@ -43,10 +43,6 @@ def __init__(
             GEMS_INVERSE=True,
         )
 
-        # self.train_model1.models = self.train_model1.models.to('cpu')
-
-        # self.train_model2.models = self.train_model2.models.to('cpu')
-
         self.parts = parts
         self.epochs = epochs
         self.local_rank = local_rank
@@ -55,33 +51,17 @@ def __init__(
 
         self.replications = replications
 
-        # self.initialize_recv_buffers()
-        # self.initialize_send_recv_ranks()
-
     def run_step(self, inputs, labels):
         loss, correct = 0, 0
-        # torch.cuda.empty_cache()
-
-        # self.train_model1.models = self.train_model1.models.to('cuda')
         temp_loss, temp_correct = self.train_model1.run_step(
             inputs[: self.batch_size], labels[: self.batch_size]
         )
         loss += temp_loss
         correct += temp_correct
-
-        # torch.cuda.empty_cache()
-
-        # self.train_model1.models = self.train_model1.models.to('cpu')
-        # self.train_model2.models = self.train_model2.models.to('cuda')
         temp_loss, temp_correct = self.train_model2.run_step(
             inputs[self.batch_size : 2 * self.batch_size],
             labels[self.batch_size : 2 * self.batch_size],
         )
-
-        # self.train_model2.models = self.train_model2.models.to('cpu')
-
-        # torch.cuda.empty_cache()
-
         loss += temp_loss
         correct += temp_correct
 
diff --git a/torchgems/mp_pipeline.py b/src/torchgems/mp_pipeline.py
similarity index 94%
rename from torchgems/mp_pipeline.py
rename to src/torchgems/mp_pipeline.py
index c4a76a48..a64ee2ad 100644
--- a/torchgems/mp_pipeline.py
+++ b/src/torchgems/mp_pipeline.py
@@ -71,7 +71,6 @@ def ready_model(self, split_rank, GET_SHAPES_ON_CUDA=False):
         temp_model = self.get_model(split_rank=split_rank)
         t = time.time()
         self.models = temp_model.to("cuda:0")
-        print("time to move model", time.time() - t)
 
     def DDP_model(
         self, mpi_comm, num_spatial_parts, spatial_size, bucket_size=25, local_rank=None
@@ -80,8 +79,6 @@ def DDP_model(
             local_rank = mpi_comm.local_rank
 
         if local_rank < mpi_comm.total_spatial_processes:
-            print("Allreduce spatial grp", mpi_comm.spatial_allreduce_grp)
-
             self.models = DDP(
                 self.models,
                 device_ids=[0],
@@ -337,7 +334,6 @@ def send_input_async(self, y):
         if self.MULTIPLE_OUTPUT:
             reqs = []
             for one_output in y:
-                # print("To send, Local Rank:",self.local_rank, " dst:",self.to_send_forward)
                 req = dist.isend(
                     tensor=one_output, dst=self.to_send_forward, tag=tag_forward
                 )
@@ -347,8 +343,6 @@ def send_input_async(self, y):
             for req in reqs:
                 req.wait()
         else:
-            # print("------------------------------------------------------------------")
-            # print("LOCAL RANK:",self.local_rank, "Sending to:",self.to_send_forward)
             dist.send(tensor=y, dst=self.to_send_forward, tag=tag_forward)
 
     def receive_grad_sync(self):
@@ -440,7 +434,6 @@ def forward_pass(self, data_x, data_y, part_number=0):
                 input_x = self.input_x_list[part_number]
 
         # Apply forward pass
-        # BUG: without cuda synchronize incorrect values are sent and recv
         torch.cuda.synchronize()
 
         y = self.models(input_x)
@@ -482,7 +475,6 @@ def backward_pass(self, y, part_number=0):
             else:
                 self.send_grad_sync(self.input_x_list[part_number])
 
-        # BUG: using persistant buffer results in NAN value after some steps. Below is the fix
         if self.split_rank != 0:
             if self.MULTIPLE_INPUT:
                 self.input_x_list[part_number] = list(self.input_x_list[part_number])
diff --git a/torchgems/parser.py b/src/torchgems/parser.py
similarity index 62%
rename from torchgems/parser.py
rename to src/torchgems/parser.py
index a2655733..757513e9 100644
--- a/torchgems/parser.py
+++ b/src/torchgems/parser.py
@@ -3,82 +3,48 @@
 
 def get_parser():
     parser = argparse.ArgumentParser(
-        description="MP-DP ResNet Script",
+        description="SP-MP-DP Configuration Script",
         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
     )
+
     parser.add_argument(
-        "--fp16-allreduce",
+        "-v",
+        "--verbose",
+        help="Prints performance numbers or logs",
         action="store_true",
-        default=False,
-        help="use fp16 compression during allreduce",
     )
 
-    parser.add_argument(
-        "--model", type=str, default="resnet50", help="model to benchmark"
-    )
     parser.add_argument("--batch-size", type=int, default=32, help="input batch size")
 
-    parser.add_argument(
-        "--learning-rate",
-        type=float,
-        default=0.001,
-        help="learning rate for the optimizer",
-    )
-    parser.add_argument(
-        "--num-gpus-mp", type=int, default=1, help="number of GPUS per node for MP"
-    )
-    parser.add_argument(
-        "--mem-per-process", type=float, default=1, help="TF GPU memory per GPU"
-    )
-
     parser.add_argument("--parts", type=int, default=1, help="Number of parts for MP")
 
     parser.add_argument(
         "--split-size", type=int, default=2, help="Number of process for MP"
     )
+
     parser.add_argument(
         "--num-spatial-parts",
         type=str,
         default="4",
         help="Number of partitions in spatial parallelism",
     )
+
     parser.add_argument(
         "--spatial-size",
         type=int,
         default=1,
         help="Number splits for spatial parallelism",
     )
+
     parser.add_argument(
         "--times",
         type=int,
         default=1,
         help="Number of times to repeat MASTER 1: 2 repications, 2: 4 replications",
     )
-    parser.add_argument(
-        "--image-size", type=int, default=32, help="Image size for synthetic benchmark"
-    )
-
-    parser.add_argument(
-        "--dp-per-node", type=int, default=1, help="Number of DP modes per node"
-    )
-
-    parser.add_argument(
-        "--enable-dp",
-        dest="enable_dp",
-        action="store_true",
-        help="Enable DP for pytorch scripts",
-    )
 
     parser.add_argument(
-        "--enable-master-comm-opt",
-        dest="enable_master_comm_opt",
-        action="store_true",
-        default=False,
-        help="Enable communication optimization for MASTER in Spatial",
-    )
-
-    parser.add_argument(
-        "--num-gpu-per-node", type=int, default=4, help="Number of GPUs per node"
+        "--image-size", type=int, default=32, help="Image size for synthetic benchmark"
     )
 
     parser.add_argument("--num-epochs", type=int, default=1, help="Number of epochs")
@@ -86,19 +52,18 @@ def get_parser():
     parser.add_argument(
         "--num-layers", type=int, default=18, help="Number of layers in amoebanet"
     )
+
     parser.add_argument(
         "--num-filters", type=int, default=416, help="Number of layers in amoebanet"
     )
-    parser.add_argument("--unet-b", type=int, default=6, help="B hyperparamter in unet")
-    parser.add_argument(
-        "--unet-c", type=int, default=72, help="C hyperparamter in unet"
-    )
+
     parser.add_argument(
         "--balance",
         type=str,
         default=None,
         help="length of list equals to number of partitions and sum should be equal to num layers",
     )
+
     parser.add_argument(
         "--halo-D2",
         dest="halo_d2",
@@ -106,23 +71,40 @@ def get_parser():
         default=False,
         help="Enable design2 (do halo exhange on few convs) for spatial conv. ",
     )
+
     parser.add_argument(
         "--fused-layers",
         type=int,
         default=1,
         help="When D2 design is enables for halo exchange, number of blocks to fuse in ResNet model ",
     )
+
     parser.add_argument(
         "--local-DP",
         type=int,
         default=1,
         help="LBANN intergration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two gpus for a given split (uses DP)",
     )
+
     parser.add_argument(
         "--slice-method",
         type=str,
         default="square",
         help="Slice method (square, vertical, and horizontal) in Spatial parallelism",
     )
-    parser.set_defaults(enable_dp=False)
+
+    parser.add_argument(
+        "--app",
+        type=int,
+        default=3,
+        help="Application type (1: medical, 2: cifar, 3: synthetic) in Spatial parallelism",
+    )
+
+    parser.add_argument(
+        "--datapath",
+        type=str,
+        default="./train",
+        help="Local dataset path",
+    )
+
     return parser
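
Review note on the new parser flags: the sketch below is a trimmed, standalone copy of only the three options this diff adds (`-v/--verbose`, `--app`, `--datapath`), so it can be run without the rest of `torchgems`. The function name `build_demo_parser` and the sample argument values are hypothetical. The real entry point remains `get_parser()` in `src/torchgems/parser.py`.

```python
import argparse


def build_demo_parser():
    # Hypothetical reduced parser: only the options introduced by this diff.
    parser = argparse.ArgumentParser(
        description="SP-MP-DP Configuration Script",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "-v",
        "--verbose",
        help="Prints performance numbers or logs",
        action="store_true",
    )
    parser.add_argument("--app", type=int, default=3, help="Application type")
    parser.add_argument(
        "--datapath", type=str, default="./train", help="Local dataset path"
    )
    return parser


# Defaults, as declared in the diff.
defaults = build_demo_parser().parse_args([])

# Example invocation with made-up values.
args = build_demo_parser().parse_args(["-v", "--app", "2", "--datapath", "/tmp/cifar"])

if __name__ == "__main__":
    print(defaults.app, defaults.datapath, defaults.verbose)
    print(args.app, args.datapath, args.verbose)
```

Note that `--app` is parsed as an integer, so callers select the application by number rather than by name.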
diff --git a/torchgems/spatial_new.py b/src/torchgems/spatial.py
similarity index 94%
rename from torchgems/spatial_new.py
rename to src/torchgems/spatial.py
index db90f16b..c8bd7593 100644
--- a/torchgems/spatial_new.py
+++ b/src/torchgems/spatial.py
@@ -66,7 +66,7 @@ def __init__(
                 elif self.spatial_local_rank == 1:
                     padding_left, padding_right, padding_top, padding_bottom = (
                         self.halo_len_width,
-                        padding,
+                        padding[1],
                         padding[0],
                         self.halo_len_height,
                     )
@@ -130,7 +130,6 @@ def __init__(
             self.get_neighbours_rank()
             self.get_index_locations()
 
-            print(self.neighbours)
         self.shapes_recv = None
         self.recv_tensors = []
         self.send_tensors = []
@@ -320,7 +319,6 @@ def start_halo_exchange(self, halo_input):
         req = []
         for i in range(9):
             if self.neighbours[i] == 1:
-                # print("Local rank:",self.local_rank, " to:",self.local_rank + self.rank_neighbours[i], " I:",i)
                 temp = (
                     halo_input[
                         :,
@@ -719,10 +717,8 @@ def compute_halo_exchange(self, horizontal_tensor, vertical_tensor):
         return res_horizontal, res_vertical
 
     def compute_halo_exchange_one(self, horizontal_tensor, vertical_tensor, halo_input):
-        # print("LOCAL RANK:",self.local_rank, " Sucess")
         if self.spatial_local_rank == 0:
             None
-            # print(horizontal_tensor,vertical_tensor)
 
         if self.spatial_local_rank == 0:
             halo_input[:, :, 5:6, :] = horizontal_tensor[:, :, 2:3, :]
@@ -1014,41 +1010,6 @@ def forward(self, tensor):
 
         return res_final
 
-    """
-
-	def forward(self,input):
-		#print("Awesome",self.neighbours, self.rank_neighbours)
-		s = torch.cuda.Stream()
-		halo_input = self.padding_layer(input)
-
-		#self.weight = torch.nn.Parameter(self.weight.int())
-
-		#self.bias= torch.nn.Parameter(self.bias.int())
-		torch.cuda.synchronize()
-
-		if(self.halo_len>0):
-
-			with torch.cuda.stream(s):
-				torch.cuda.synchronize()
-				reqs = self.start_halo_exchange(halo_input)
-				self.end_halo_exchange(reqs)
-				s.synchronize()
-				#self.copy_halo_exchange_values(halo_input)
-
-				horizontal_tensor, vertical_tensor = self.make_tensor_halo_compute(halo_input)
-				s.synchronize()
-				res_final = self.compute_halo_exchange_one(horizontal_tensor,vertical_tensor,halo_input)
-
-
-			s.synchronize()
-			torch.cuda.synchronize()
-			
-			return res_final
-		else:
-			res_final = super(conv_spatial,self).forward(halo_input)
-			return res_final
-	"""
-
 
 class halo_exchange_layer(nn.Module):
     def __init__(
@@ -1229,7 +1190,6 @@ def start_halo_exchange(self, halo_input):
         req = []
         for i in range(9):
             if self.neighbours[i] == 1:
-                # print("Local rank:",self.local_rank, " to:",self.local_rank + self.rank_neighbours[i], " I:",i)
                 temp = (
                     halo_input[
                         :,
diff --git a/torchgems/train_spatial.py b/src/torchgems/train_spatial.py
similarity index 86%
rename from torchgems/train_spatial.py
rename to src/torchgems/train_spatial.py
index 3e159926..1e2263f4 100644
--- a/torchgems/train_spatial.py
+++ b/src/torchgems/train_spatial.py
@@ -164,68 +164,6 @@ def get_shapes_spatial(
     return spatial_shapes_list
 
 
-def split_input_2(inputs, image_size, slice_method, local_rank):
-    image_height_local = int(image_size / 2)
-    image_width_local = int(image_size / 2)
-
-    # square == vertical
-
-    if slice_method == "square" or slice_method == "vertical":
-        if local_rank == 0:
-            return inputs[:, :, :, :image_width_local]
-        elif local_rank == 1:
-            return inputs[:, :, :, image_width_local : 2 * image_width_local]
-
-    elif slice_method == "horizontal":
-        if local_rank == 0:
-            return inputs[:, :, :image_height_local, :]
-        elif local_rank == 1:
-            return inputs[:, :, image_height_local : 2 * image_height_local, :]
-
-
-def split_input_4(inputs, image_size, slice_method, local_rank):
-    image_height_local = int(image_size / 4)
-    image_width_local = int(image_size / 4)
-
-    if slice_method == "square":
-        if local_rank == 0:
-            return inputs[:, :, : int(image_size / 2), : int(image_size / 2)]
-        elif local_rank == 1:
-            return inputs[:, :, : int(image_size / 2), int(image_size / 2) :]
-        elif local_rank == 2:
-            return inputs[:, :, int(image_size / 2) :, : int(image_size / 2)]
-        elif local_rank == 3:
-            return inputs[:, :, int(image_size / 2) :, int(image_size / 2) :]
-
-    elif slice_method == "vertical":
-        if local_rank == 0:
-            return inputs[:, :, :, :image_width_local]
-        elif local_rank == 1:
-            return inputs[:, :, :, image_width_local : 2 * image_width_local]
-        elif local_rank == 2:
-            return inputs[:, :, :, 2 * image_width_local : 3 * image_width_local]
-        elif local_rank == 3:
-            return inputs[:, :, :, 3 * image_width_local : 4 * image_width_local]
-
-    elif slice_method == "horizontal":
-        if local_rank == 0:
-            return inputs[:, :, :image_height_local, :]
-        elif local_rank == 1:
-            return inputs[:, :, image_height_local : 2 * image_height_local, :]
-        elif local_rank == 2:
-            return inputs[:, :, 2 * image_height_local : 3 * image_height_local, :]
-        elif local_rank == 3:
-            return inputs[:, :, 3 * image_height_local : 4 * image_height_local, :]
-
-
-def split_input(inputs, image_size, slice_method, local_rank, num_spatial_parts_list):
-    if num_spatial_parts_list[0] == 2:
-        return split_input_2(inputs, image_size, slice_method, local_rank)
-
-    elif num_spatial_parts_list[0] == 4:
-        return split_input_4(inputs, image_size, slice_method, local_rank)
-
-
 class train_model_spatial(train_model):
     def __init__(
         self,
@@ -357,7 +295,6 @@ def update_shape_list_Local_DP_LP(self):
 
                 temp_tuple[0] = int(temp_tuple[0] / self.LOCAL_DP_LP)
                 self.shape_list[i] = tuple(temp_tuple)
-        print("Updated shape list", self.shape_list)
 
     def get_split_rank(self, num_spatial_parts_list, local_rank):
         if isinstance(num_spatial_parts_list, list):
@@ -456,11 +393,6 @@ def initialize_recv_buffers_joint(self):
                 input_x = []
                 if self.split_rank != 0:
                     # multiple inputs
-                    print(
-                        "Initialize recv buffer shape",
-                        self.shape_list[self.split_rank - 1],
-                        list,
-                    )
                     if isinstance(self.shape_list[self.split_rank - 1], list):
                         for i in range(len(self.shape_list[self.split_rank - 1])):
                             one_input = torch.zeros(
@@ -564,15 +496,11 @@ def initialize_send_recv_ranks(self):
             self.decrease_rank = self.LOCAL_DP_LP
 
         if self.GEMS_INVERSE == False:
-            print("initialize_send_recv_ranks Local rank", self.local_rank)
             self.to_send_forward = self.local_rank + self.increase_rank
             self.to_recv_forward = self.local_rank - self.decrease_rank
             self.to_send_backward = self.local_rank - self.decrease_rank
             self.to_recv_backward = self.local_rank + self.increase_rank
         else:
-            print(
-                "initialize_send_recv_ranks Local rank(GEMS_INVERSE)", self.local_rank
-            )
             self.to_send_forward = (
                 self.mp_size - 1 - self.local_rank - self.increase_rank
             )
@@ -666,39 +594,6 @@ def receive_input_async_joint(self, part_number, ranks):
         for req in reqs:
             req.wait()
 
-    # def send_grad_async_joint(self,input_x_list):
-    # 	# No need for writing modification for slide_methods as input_list are used and there is no slicing of image
-
-    # 	ranks = [self.local_rank -1 - i for i in range(self.num_spatial_parts-1,-1,-1)]
-
-    # 	reqs = []
-
-    # 	for partition_i in range(int(math.sqrt(self.num_spatial_parts))):
-    # 		for partition_j in range(int(math.sqrt(self.num_spatial_parts))):
-    # 			tag_forward = 0
-
-    # 			if (self.MULTIPLE_INPUT):
-
-    # 				for i in range(len(self.shape_list[self.split_rank-1])):
-    # 					shape = self.shape_list[self.split_rank-1][i]
-    # 					# to_send =input_x_list[:,:,partition_i*shape[2]:(partition_i+1)*shape[2],  partition_j*shape[3]:(partition_j+1)*shape[3]].grad.data.clone().detach()
-    # 					to_send = input_x_list[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j][i].grad.data.clone().detach()
-    # 					req = dist.isend(tensor=to_send, dst=ranks[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j], tag = tag_forward)
-    # 					tag_forward+=1
-    # 					reqs.append(req)
-
-    # 			else:
-    # 				shape = self.shape_list[self.split_rank-1]
-    # 				#to_send = joint_input[i][:,:,partition_i*shape[2]:(partition_i+1)*shape[2],  partition_j*shape[3]:(partition_j+1)*shape[3]].grad.data.clone().detach()
-    # 				to_send = input_x_list[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j].grad.data.clone().detach()
-    # 				#print("to send dst:",ranks[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j],to_send.abs().sum() )
-    # 				torch.cuda.synchronize()
-    # 				req= dist.isend(tensor=to_send, dst=ranks[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j], tag = tag_forward)
-    # 				reqs.append(req)
-
-    # 	for req in reqs:
-    # 		req.wait()
-
     def send_grad_async_joint(self, input_x_list):
         # No need for writing modification for slide_methods as input_list are used and there is no slicing of image
 
@@ -718,7 +613,6 @@ def send_grad_async_joint(self, input_x_list):
             if self.MULTIPLE_INPUT:
                 for i in range(len(self.shape_list[self.split_rank - 1])):
                     shape = self.shape_list[self.split_rank - 1][i]
-                    # to_send =input_x_list[:,:,partition_i*shape[2]:(partition_i+1)*shape[2],  partition_j*shape[3]:(partition_j+1)*shape[3]].grad.data.clone().detach()
                     to_send = input_x_list[partition][i].grad.data.clone().detach()
                     req = dist.isend(
                         tensor=to_send, dst=ranks[partition], tag=tag_forward
@@ -728,9 +622,7 @@ def send_grad_async_joint(self, input_x_list):
 
             else:
                 shape = self.shape_list[self.split_rank - 1]
-                # to_send = joint_input[i][:,:,partition_i*shape[2]:(partition_i+1)*shape[2],  partition_j*shape[3]:(partition_j+1)*shape[3]].grad.data.clone().detach()
                 to_send = input_x_list[partition].grad.data.clone().detach()
-                # print("to send dst:",ranks[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j],to_send.abs().sum() )
                 torch.cuda.synchronize()
                 req = dist.isend(tensor=to_send, dst=ranks[partition], tag=tag_forward)
                 reqs.append(req)
@@ -769,7 +661,6 @@ def send_grad_async_spatial(self, input_x_list):
             if self.MULTIPLE_INPUT:
                 for i in range(len(self.shape_list[self.split_rank - 1])):
                     shape = self.shape_list[self.split_rank - 1][i]
-                    # to_send =input_x_list[:,:,partition_i*shape[2]:(partition_i+1)*shape[2],  partition_j*shape[3]:(partition_j+1)*shape[3]].grad.data.clone().detach()
                     to_send = input_x_list[partition][i].grad.data.clone().detach()
                     req = dist.isend(
                         tensor=to_send, dst=recv_ranks[partition], tag=tag_forward
@@ -779,9 +670,7 @@ def send_grad_async_spatial(self, input_x_list):
 
             else:
                 shape = self.shape_list[self.split_rank - 1]
-                # to_send = joint_input[i][:,:,partition_i*shape[2]:(partition_i+1)*shape[2],  partition_j*shape[3]:(partition_j+1)*shape[3]].grad.data.clone().detach()
                 to_send = input_x_list[partition].grad.data.clone().detach()
-                # print("to send dst:",ranks[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j],to_send.abs().sum() )
                 torch.cuda.synchronize()
                 req = dist.isend(
                     tensor=to_send, dst=recv_ranks[partition], tag=tag_forward
@@ -927,9 +816,7 @@ def send_grad_MP_joint_LP_DP(self, input_x_list):
 
             else:
                 shape = self.shape_list[self.split_rank - 1]
-                # to_send = joint_input[i][:,:,partition_i*shape[2]:(partition_i+1)*shape[2],  partition_j*shape[3]:(partition_j+1)*shape[3]].grad.data.clone().detach()
                 to_send = input_x_list[rank].grad.data.clone().detach()
-                # print("to send dst:",ranks[partition_i * int(math.sqrt(self.num_spatial_parts)) + partition_j],to_send.abs().sum() )
                 torch.cuda.synchronize()
 
                 dst = start_rank_last_spatial + rank
@@ -1246,7 +1133,6 @@ def forward_pass(self, data_x, data_y, part_number=0):
         # part_number: part number between 0 and self.parts-1 used to find right input recv buffer
 
         # Receive inputs if local is not 0
-        # print("LOCAL RANK:", self.local_rank, " Forward Start")
         if self.split_rank == 0:
             input_x = data_x
         else:
@@ -1280,7 +1166,6 @@ def forward_pass(self, data_x, data_y, part_number=0):
                     input_x = self.input_x_list[part_number]
 
         # Apply forward pass
-        # BUG: without cuda synchronize incorrect values are sent and recv
 
         torch.cuda.synchronize()
 
@@ -1296,10 +1181,7 @@ def forward_pass(self, data_x, data_y, part_number=0):
         else:
             y = self.models(input_x)
 
-        # print("LOCAL RANK:", self.local_rank, " Forward comp complete")
-
         torch.cuda.synchronize()
-        # print("LOCAL RANK:", self.local_rank, " Forward comp complete")
 
         if self.split_rank != self.split_size - 1:
             if self.ENABLE_ASYNC == True:
@@ -1331,8 +1213,6 @@ def forward_pass(self, data_x, data_y, part_number=0):
             else:
                 loss = self.criterion(y, data_y)
 
-        # print("LOCAL RANK:", self.local_rank, " Forward complete finally ")
-
         if self.split_rank == self.split_size - 1:
             corrects = (data_y.eq(torch.argmax(y, dim=-1).long())).sum().float()
             return loss, corrects / self.batch_size
@@ -1352,7 +1232,6 @@ def backward_pass(self, y, part_number=0):
                 else:
                     self.receive_grad_sync()
 
-        # print("LOCAL RANK:", self.local_rank, " Backward comp started")
         torch.cuda.synchronize()
         if (
             isinstance(
@@ -1372,8 +1251,6 @@ def backward_pass(self, y, part_number=0):
                 torch.autograd.backward(y, self.grad_overhead)
         torch.cuda.synchronize()
 
-        # print("LOCAL RANK:", self.local_rank, " Backward comp complete")
-
         if self.split_rank != 0:
             if self.split_rank == self.spatial_size:
                 if self.ENABLE_LOCAL_DP_LP:
@@ -1388,7 +1265,6 @@ def backward_pass(self, y, part_number=0):
                 else:
                     self.send_input_sync(self.input_x_list[part_number])
 
-        # BUG: using persistant buffer results in NAN value after some steps. Below is the fix
         if self.split_rank != 0:
             if self.split_rank == self.spatial_size:
                 if self.MULTIPLE_INPUT:
@@ -1454,5 +1330,3 @@ def backward_pass(self, y, part_number=0):
                     self.input_x_list[part_number] = (
                         self.input_x_list[part_number].detach().requires_grad_()
                     )
-
-        # print("LOCAL RANK:", self.local_rank, " BACKWARD COMPLETE")
diff --git a/torchgems/train_spatial_master.py b/src/torchgems/train_spatial_master.py
similarity index 94%
rename from torchgems/train_spatial_master.py
rename to src/torchgems/train_spatial_master.py
index 16d5bac1..f9894c33 100644
--- a/torchgems/train_spatial_master.py
+++ b/src/torchgems/train_spatial_master.py
@@ -288,18 +288,9 @@ def run_step_allreduce(self, inputs, labels, odd_iteration):
                 loss += temp_y.item()
                 corrects += temp_correct.item()
 
-        # start_event = torch.cuda.Event(enable_timing=True, blocking=True)
-        # end_event = torch.cuda.Event(enable_timing=True, blocking=True)
-        # start_event.record()
-
         if tm1.local_rank != self.mp_size - 1:
             self.send_recv_params(odd_iteration)
 
-        # end_event.record()
-        # torch.cuda.synchronize()
-        # t = start_event.elapsed_time(end_event) / 1000
-        # print("LocalRank {} Param send time:{}".format(self.local_rank,t))
-
         for i in range(self.parts):
             None
             tm1.backward_pass(y_list[i], part_number=i)
diff --git a/torchgems/spatial.py b/torchgems/spatial.py
deleted file mode 100644
index dd843745..00000000
--- a/torchgems/spatial.py
+++ /dev/null
@@ -1,822 +0,0 @@
-import torch.nn as nn
-import torch
-import torch.distributed as dist
-import time
-
-
-class conv_spatial(nn.Conv2d):
-    def __init__(
-        self,
-        local_rank,
-        spatial_size,
-        num_spatial_parts,
-        in_channels,
-        out_channels,
-        kernel_size,
-        stride=1,
-        padding=0,
-        dilation=1,
-        groups=1,
-        bias=True,
-        padding_mode="zeros",
-    ):
-        self.local_rank = local_rank
-        # number of parts in one image
-        self.num_spatial_parts = num_spatial_parts
-        # number of sequential spatial layers, Most of the times I expect it to be 1
-        self.spatial_size = spatial_size
-        self.get_neighbours()
-        self.get_neighbours_rank()
-
-        self.halo_len = (kernel_size - 1) / 2
-
-        assert (
-            self.halo_len == padding
-        ), "Spatial not supported yet for this configuration"
-        self.halo_len = padding
-        super(conv_spatial, self).__init__(
-            in_channels,
-            out_channels,
-            kernel_size,
-            stride=stride,
-            padding=0,
-            dilation=1,
-            groups=1,
-            bias=True,
-            padding_mode="zeros",
-        )
-
-        self.padding_layer = nn.ZeroPad2d(padding)
-        self.get_index_locations()
-        self.shapes_recv = None
-        self.recv_tensors = []
-        self.send_tensors = []
-
-        self.set_tags()
-
-    def set_tags(self):
-        self.send_tag = [100, 200, 300, 400, 500, 600, 700, 800, 900]
-        self.recv_tag = [900, 800, 700, 600, 500, 400, 300, 200, 100]
-
-        # self.send_tag = [0]*9
-        # self.recv_tag = [0]*9
-
-    def get_index_locations(self):
-        locations_recv = []
-        locations_recv.append([[None, self.halo_len], [None, self.halo_len]])  # 1
-        locations_recv.append(
-            [[None, self.halo_len], [self.halo_len, -self.halo_len]]
-        )  # 2
-        locations_recv.append([[None, self.halo_len], [-self.halo_len, None]])  # 3
-        locations_recv.append(
-            [[self.halo_len, -self.halo_len], [None, self.halo_len]]
-        )  # 4
-        locations_recv.append([[None, None], [None, None]])  # 5
-        locations_recv.append(
-            [[self.halo_len, -self.halo_len], [-self.halo_len, None]]
-        )  # 6
-        locations_recv.append([[-self.halo_len, None], [None, self.halo_len]])  # 7
-        locations_recv.append(
-            [[-self.halo_len, None], [self.halo_len, -self.halo_len]]
-        )  # 8
-        locations_recv.append([[-self.halo_len, None], [-self.halo_len, None]])  # 9
-
-        self.locations_recv = locations_recv
-
-        locations_send = []
-        locations_send.append(
-            [[self.halo_len, 2 * self.halo_len], [self.halo_len, 2 * self.halo_len]]
-        )  # 1
-        locations_send.append(
-            [[self.halo_len, 2 * self.halo_len], [self.halo_len, -self.halo_len]]
-        )  # 2
-        locations_send.append(
-            [
-                [self.halo_len, 2 * self.halo_len],
-                [-2 * self.halo_len, -1 * self.halo_len],
-            ]
-        )  # 3
-        locations_send.append(
-            [[self.halo_len, -self.halo_len], [self.halo_len, 2 * self.halo_len]]
-        )  # 4
-        locations_send.append([[None, None], [None, None]])  # 5
-        locations_send.append(
-            [[self.halo_len, -self.halo_len], [-2 * self.halo_len, -1 * self.halo_len]]
-        )  # 6
-        locations_send.append(
-            [
-                [-2 * self.halo_len, -1 * self.halo_len],
-                [self.halo_len, 2 * self.halo_len],
-            ]
-        )  # 7
-        locations_send.append(
-            [[-2 * self.halo_len, -1 * self.halo_len], [self.halo_len, -self.halo_len]]
-        )  # 8
-        locations_send.append(
-            [
-                [-2 * self.halo_len, -1 * self.halo_len],
-                [-2 * self.halo_len, -1 * self.halo_len],
-            ]
-        )  # 9
-        self.locations_send = locations_send
-
-    def get_shapes_recv(self, shapes):
-        shapes_recv = []
-
-        shapes_recv.append([self.halo_len, self.halo_len])  # 1
-        shapes_recv.append([self.halo_len, shapes[3] - 2 * self.halo_len])  # 2
-        shapes_recv.append([self.halo_len, self.halo_len])  # 3
-
-        shapes_recv.append([shapes[2] - 2 * self.halo_len, self.halo_len])  # 4
-        shapes_recv.append([None, None])  # 5
-        shapes_recv.append([shapes[2] - 2 * self.halo_len, self.halo_len])  # 6
-
-        shapes_recv.append([self.halo_len, self.halo_len])  # 7
-        shapes_recv.append([self.halo_len, shapes[3] - 2 * self.halo_len])  # 8
-        shapes_recv.append([self.halo_len, self.halo_len])  # 9
-
-        return shapes_recv
-
-    def start_halo_exchange(self, halo_input):
-        req = []
-
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                # print("Local rank:",self.local_rank, " to:",self.local_rank + self.rank_neighbours[i], " I:",i)
-                to_send = (
-                    halo_input[
-                        :,
-                        :,
-                        self.locations_send[i][0][0] : self.locations_send[i][0][1],
-                        self.locations_send[i][1][0] : self.locations_send[i][1][1],
-                    ]
-                    .clone()
-                    .detach()
-                )
-                # torch.cuda.synchronize()
-
-                t = time.time()
-                temp_req = dist.isend(
-                    to_send, self.rank_neighbours[i], tag=self.send_tag[i]
-                )
-                if self.local_rank == 0:
-                    None
-                    # print("sending to:",self.rank_neighbours[i], " Shape:", to_send.shape, " Time taken:" ,time.time()-t)
-                req.append(temp_req)
-                self.send_tag[i] += 1
-
-        # self.recv_tensors = []
-        if len(self.recv_tensors) == 0:
-            flag_recv_tensors_init = 0
-        else:
-            flag_recv_tensors_init = 1
-
-        shapes = halo_input.shape
-        self.halo_input_shape = shapes
-        if self.shapes_recv == None:
-            self.shapes_recv = self.get_shapes_recv(shapes)
-
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                if flag_recv_tensors_init == 0:
-                    temp_tensor = torch.zeros(
-                        shapes[0],
-                        shapes[1],
-                        self.shapes_recv[i][0],
-                        self.shapes_recv[i][1],
-                        dtype=torch.float,
-                        device="cuda",
-                    )
-                    self.recv_tensors.append(temp_tensor)
-
-                temp_req = dist.irecv(
-                    tensor=self.recv_tensors[i],
-                    src=self.rank_neighbours[i],
-                    tag=self.recv_tag[i],
-                )
-                req.append(temp_req)
-                self.recv_tag[i] += 1
-
-                # self.recv_tensors.append(temp_tensor)
-            else:
-                if flag_recv_tensors_init == 0:
-                    self.recv_tensors.append([])
-
-        return req
-
-    def end_halo_exchange(self, reqs):
-        for req in reqs:
-            req.wait()
-
-    def copy_halo_exchange_values(self, halo_input):
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                halo_input[
-                    :,
-                    :,
-                    self.locations_recv[i][0][0] : self.locations_recv[i][0][1],
-                    self.locations_recv[i][1][0] : self.locations_recv[i][1][1],
-                ] = self.recv_tensors[i]
-
-    def make_tensor_halo_compute(self, halo_input):
-        self.halo_input_range = 2 * self.halo_len
-
-        # 0 1 2
-        # 3 4 5
-        # 6 7 8
-
-        # horizontal
-
-        horizontal_tensor_up = None
-        if self.neighbours[0] == 1:
-            # concat both 0 with 3 and 1 with 4 position
-            horizontal_tensor_up = torch.cat(
-                (
-                    self.recv_tensors[0],
-                    self.recv_tensors[3][:, :, : 2 * self.halo_len, :],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[1],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : 3 * self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_up = torch.cat(
-                (horizontal_tensor_up, horizontal_tensor_temp), axis=3
-            )
-        elif self.neighbours[1] == 1:
-            # concat 1 with 4 position
-            horizontal_tensor_up = torch.cat(
-                (
-                    self.recv_tensors[1],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : 3 * self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[2] == 1:
-            horizontal_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[2],
-                    self.recv_tensors[5][:, :, : 2 * self.halo_len, :],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_up = torch.cat(
-                (horizontal_tensor_up, horizontal_tensor_temp), axis=3
-            )
-
-        horizontal_tensor_down = None
-        if self.neighbours[6] == 1:
-            # concat both 6 with 3 and 7 with 4 position
-            horizontal_tensor_down = torch.cat(
-                (
-                    self.recv_tensors[3][:, :, -2 * self.halo_len :, :],
-                    self.recv_tensors[6],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_temp = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        -3 * self.halo_len : -self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[7],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_down = torch.cat(
-                (horizontal_tensor_down, horizontal_tensor_temp), axis=3
-            )
-        elif self.neighbours[7] == 1:
-            # concat 7 with 4 position
-            horizontal_tensor_down = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        -3 * self.halo_len : -self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[7],
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[8] == 1:
-            horizontal_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[5][:, :, -2 * self.halo_len :, :],
-                    self.recv_tensors[8],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_down = torch.cat(
-                (horizontal_tensor_down, horizontal_tensor_temp), axis=3
-            )
-
-        # Vertical
-
-        vertical_tensor_left = None
-        if self.neighbours[0] == 1:
-            vertical_tensor_left = torch.cat(
-                (
-                    self.recv_tensors[0],
-                    self.recv_tensors[1][:, :, :, : 2 * self.halo_len],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[3],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        self.halo_len : 3 * self.halo_len,
-                    ],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_left = torch.cat(
-                (vertical_tensor_left, vertical_tensor_temp), axis=2
-            )
-        elif self.neighbours[3] == 1:
-            vertical_tensor_left = torch.cat(
-                (
-                    self.recv_tensors[3],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        self.halo_len : 3 * self.halo_len,
-                    ],
-                ),
-                axis=3,
-            )
-
-        if self.neighbours[6] == 1:
-            vertical_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[6],
-                    self.recv_tensors[7][:, :, :, : 2 * self.halo_len],
-                ),
-                axis=3,
-            )
-            vertical_tensor_left = torch.cat(
-                (vertical_tensor_left, vertical_tensor_temp), axis=2
-            )
-
-        vertical_tensor_right = None
-        if self.neighbours[2] == 1:
-            vertical_tensor_right = torch.cat(
-                (
-                    self.recv_tensors[1][:, :, :, -2 * self.halo_len :],
-                    self.recv_tensors[2],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_temp = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        -3 * self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[5],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_right = torch.cat(
-                (vertical_tensor_right, vertical_tensor_temp), axis=2
-            )
-        elif self.neighbours[5] == 1:
-            vertical_tensor_right = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        -3 * self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[5],
-                ),
-                axis=3,
-            )
-
-        if self.neighbours[8] == 1:
-            vertical_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[7][:, :, :, -2 * self.halo_len :],
-                    self.recv_tensors[8],
-                ),
-                axis=3,
-            )
-            vertical_tensor_right = torch.cat(
-                (vertical_tensor_right, vertical_tensor_temp), axis=2
-            )
-
-        if self.neighbours[1] != 1 and self.neighbours[5] == 1:
-            padding_vertical_top_right = halo_input[
-                :, :, 0 : self.halo_len, -3 * self.halo_len :
-            ]
-            vertical_tensor_right = torch.cat(
-                (padding_vertical_top_right.data, vertical_tensor_right), axis=2
-            )
-
-        if self.neighbours[1] != 1 and self.neighbours[3] == 1:
-            padding_vertical_top_left = halo_input[
-                :, :, 0 : self.halo_len, : 3 * self.halo_len
-            ]
-            vertical_tensor_left = torch.cat(
-                (padding_vertical_top_left.data, vertical_tensor_left), axis=2
-            )
-
-        if self.neighbours[7] != 1 and self.neighbours[5] == 1:
-            padding_vertical_down_right = halo_input[
-                :, :, -self.halo_len :, -3 * self.halo_len :
-            ]
-            vertical_tensor_right = torch.cat(
-                (vertical_tensor_right, padding_vertical_down_right), axis=2
-            )
-
-        if self.neighbours[7] != 1 and self.neighbours[3] == 1:
-            padding_vertical_down_left = halo_input[
-                :, :, -self.halo_len :, : 3 * self.halo_len
-            ]
-            vertical_tensor_left = torch.cat(
-                (vertical_tensor_left, padding_vertical_down_left), axis=2
-            )
-
-        if self.neighbours[3] != 1 and self.neighbours[1] == 1:
-            padding_horizontal_up_left = halo_input[
-                :, :, : 3 * self.halo_len, : self.halo_len
-            ]
-            horizontal_tensor_up = torch.cat(
-                (padding_horizontal_up_left, horizontal_tensor_up), axis=3
-            )
-
-        if self.neighbours[3] != 1 and self.neighbours[7] == 1:
-            padding_horizontal_down_left = halo_input[
-                :, :, -3 * self.halo_len :, : self.halo_len
-            ]
-            horizontal_tensor_down = torch.cat(
-                (padding_horizontal_down_left, horizontal_tensor_down), axis=3
-            )
-
-        if self.neighbours[5] != 1 and self.neighbours[1] == 1:
-            padding_horizontal_up_right = halo_input[
-                :, :, : 3 * self.halo_len, -self.halo_len :
-            ]
-            horizontal_tensor_up = torch.cat(
-                (horizontal_tensor_up, padding_horizontal_up_right), axis=3
-            )
-
-        if self.neighbours[5] != 1 and self.neighbours[7] == 1:
-            padding_horizontal_down_right = halo_input[
-                :, :, -3 * self.halo_len :, -self.halo_len :
-            ]
-            horizontal_tensor_down = torch.cat(
-                (horizontal_tensor_down, padding_horizontal_down_right), axis=3
-            )
-
-        if horizontal_tensor_up == None and horizontal_tensor_down == None:
-            horizontal_tensor = None
-        elif horizontal_tensor_down == None:
-            horizontal_tensor = horizontal_tensor_up
-        elif horizontal_tensor_up == None:
-            horizontal_tensor = horizontal_tensor_down
-        else:
-            horizontal_tensor = torch.cat(
-                (horizontal_tensor_up, horizontal_tensor_down), axis=2
-            )
-
-        if vertical_tensor_left == None and vertical_tensor_right == None:
-            vertical_tensor = None
-        elif vertical_tensor_left == None:
-            vertical_tensor = vertical_tensor_right
-        elif vertical_tensor_right == None:
-            vertical_tensor = vertical_tensor_left
-        else:
-            vertical_tensor = torch.cat(
-                (vertical_tensor_left, vertical_tensor_right), axis=3
-            )
-
-        return horizontal_tensor, vertical_tensor
-
-    def compute_halo_exchange(self, horizontal_tensor, vertical_tensor):
-        res_horizontal, res_vertical = None, None
-        if horizontal_tensor != None:
-            res_horizontal = super(conv_spatial, self).forward(horizontal_tensor)
-        if vertical_tensor != None:
-            res_vertical = super(conv_spatial, self).forward(vertical_tensor)
-
-        return res_horizontal, res_vertical
-
-    def compute_halo_exchange_one(self, horizontal_tensor, vertical_tensor, halo_input):
-        # print("LOCAL RANK:",self.local_rank, " Sucess")
-        if self.local_rank == 0:
-            None
-            # print(horizontal_tensor,vertical_tensor)
-
-        if self.local_rank == 0:
-            halo_input[:, :, 5:6, :] = horizontal_tensor[:, :, 2:3, :]
-            halo_input[:, :, :, -1:] = vertical_tensor[:, :, :, -1:]
-
-        if self.local_rank == 1:
-            halo_input[:, :, 5:6, :] = horizontal_tensor[:, :, 2:3, :]
-            halo_input[:, :, :, 0:1] = vertical_tensor[:, :, :, 0:1]
-
-        if self.local_rank == 2:
-            halo_input[:, :, 0:1, :] = horizontal_tensor[:, :, 0:1, :]
-            halo_input[:, :, :, -1:] = vertical_tensor[:, :, :, -1:]
-
-        if self.local_rank == 3:
-            halo_input[:, :, 0:1, :] = horizontal_tensor[:, :, 0:1, :]
-            halo_input[:, :, :, 0:1] = vertical_tensor[:, :, :, 0:1]
-
-        torch.cuda.synchronize()
-        res = super(conv_spatial, self).forward(halo_input)
-
-        return res
-
-    def merge_final_image(self, res_final, res_horizontal, res_vertical):
-        if self.neighbours[3] == 1:
-            start = self.halo_len
-        else:
-            start = 0
-
-        if self.neighbours[5] == 1:
-            end = -self.halo_len
-        else:
-            end = None
-
-        if self.neighbours[1] == 1:
-            res_final = torch.cat(
-                (res_horizontal[:, :, : self.halo_len, start:end], res_final), axis=2
-            )
-
-        if self.neighbours[7] == 1:
-            if self.neighbours[1] == 1:
-                res_final = torch.cat(
-                    (res_final, res_horizontal[:, :, self.halo_len :, start:end]),
-                    axis=2,
-                )
-            else:
-                res_final = torch.cat(
-                    (res_final, res_horizontal[:, :, :, start:end]), axis=2
-                )
-
-        if self.neighbours[3] == 1:
-            res_final = torch.cat(
-                (res_vertical[:, :, : self.halo_len, :], res_final), axis=3
-            )
-
-        if self.neighbours[5] == 1:
-            if self.neighbours[3] == 1:
-                res_final = torch.cat(
-                    (res_final, res_vertical[:, :, self.halo_len :, :]), axis=3
-                )
-            else:
-                res_final = torch.cat((res_final, res_vertical[:, :, :, :]), axis=3)
-        return res_final
-
-    def copy_final_image(self, res_final, res_horizontal, res_vertical):
-        shapes = res_final.shape
-        if self.neighbours[1] == 1:
-            res_final[:, :, : self.halo_len, :] = res_horizontal[
-                :, :, : self.halo_len, :
-            ]
-
-        if self.neighbours[7] == 1:
-            if self.neighbours[1] == 1:
-                res_final[:, :, -self.halo_len :, :] = res_horizontal[
-                    :, :, self.halo_len :, :
-                ]
-            else:
-                res_final[:, :, -self.halo_len :, :] = res_horizontal
-
-        if self.neighbours[3] == 1:
-            res_final[:, :, :, : self.halo_len] = res_vertical[:, :, : shapes[2], :]
-
-        if self.neighbours[5] == 1:
-            if self.neighbours[3] == 1:
-                res_final[:, :, :, -self.halo_len :] = res_vertical[
-                    :, :, shapes[2] :, :
-                ]
-            else:
-                res_final[:, :, :, -self.halo_len :] = res_vertical
-        return res_final
-
-    def start_halo_exchange_nochange(self, halo_input):
-        req = []
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                temp_req = dist.isend(
-                    halo_input[
-                        :,
-                        :,
-                        self.locations_send[i][0][0] : self.locations_send[i][0][1],
-                        self.locations_send[i][1][0] : self.locations_send[i][1][1],
-                    ],
-                    self.rank_neighbours[i],
-                    tag=self.send_tag[i],
-                )
-                req.append(temp_req)
-                self.send_tag[i] += 1
-
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                temp_req = dist.irecv(
-                    tensor=halo_input[
-                        :,
-                        :,
-                        self.locations_recv[i][0][0] : self.locations_recv[i][0][1],
-                        self.locations_recv[i][1][0] : self.locations_recv[i][1][1],
-                    ],
-                    src=self.rank_neighbours[i],
-                    tag=self.recv_tag[i],
-                )
-                req.append(temp_req)
-                self.recv_tag[i] += 1
-
-        return req
-
-    def end_halo_exchange_nochange(self, reqs):
-        for req in reqs:
-            req.wait()
-
-    def get_neighbours_rank(self):
-        self.rank_neighbours = []
-        if self.num_spatial_parts == 2:
-            rank_offset = [0, 0, 0, -1, 0, +1, 0, 0, 0]
-        elif self.num_spatial_parts == 4:
-            rank_offset = [-3, -2, -1, -1, 0, +1, +1, +2, +3]
-        elif self.num_spatial_parts == 9:
-            rank_offset = [-4, -3, -2, -1, 0, +1, +2, +3, +4]
-
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                self.rank_neighbours.append(self.local_rank + rank_offset[i])
-            else:
-                self.rank_neighbours.append(-1)
-
-    def get_neighbours(self):
-        if self.local_rank < self.num_spatial_parts * self.spatial_size:
-            self.ENABLE_SPATIAL = True
-        else:
-            self.ENABLE_SPATIAL = False
-            self.neighbours = None
-            return
-
-        self.spatial_rank = self.local_rank % self.num_spatial_parts
-
-        # Neighbour
-        #  0   1   2
-        #  3   4   5
-        #  6   7   8
-        if self.num_spatial_parts == 2:
-            # 0 | 1
-            if self.spatial_rank == 0:
-                self.neighbours = [0, 0, 0, 0, 0, 1, 0, 0, 0]
-            else:
-                self.neighbours = [0, 0, 0, 1, 0, 0, 0, 0, 0]
-        elif self.num_spatial_parts == 4:
-            # 0 | 1
-            # -----
-            # 2 | 3
-            if self.spatial_rank == 0:
-                self.neighbours = [0, 0, 0, 0, 0, 1, 0, 1, 1]
-            elif self.spatial_rank == 1:
-                self.neighbours = [0, 0, 0, 1, 0, 0, 1, 1, 0]
-            elif self.spatial_rank == 2:
-                self.neighbours = [0, 1, 1, 0, 0, 1, 0, 0, 0]
-            elif self.spatial_rank == 3:
-                self.neighbours = [1, 1, 0, 1, 0, 0, 0, 0, 0]
-
-        elif self.num_spatial_parts == 9:
-            # 0 | 1 | 2
-            # -----------
-            # 3 | 4 | 5
-            # -----------
-            # 6 | 7 | 8
-            if self.spatial_rank == 0:
-                self.neighbours = [0, 0, 0, 0, 0, 1, 0, 1, 1]
-            elif self.spatial_rank == 1:
-                self.neighbours = [0, 0, 0, 1, 0, 1, 1, 1, 1]
-            elif self.spatial_rank == 2:
-                self.neighbours = [0, 0, 0, 1, 0, 0, 1, 1, 0]
-            elif self.spatial_rank == 3:
-                self.neighbours = [0, 1, 1, 0, 0, 1, 0, 1, 1]
-            elif self.spatial_rank == 4:
-                self.neighbours = [1, 1, 1, 1, 0, 1, 1, 1, 1]
-            elif self.spatial_rank == 5:
-                self.neighbours = [1, 1, 0, 1, 0, 0, 1, 1, 0]
-            elif self.spatial_rank == 6:
-                self.neighbours = [0, 1, 1, 0, 0, 1, 0, 0, 0]
-            elif self.spatial_rank == 7:
-                self.neighbours = [1, 1, 1, 1, 0, 1, 0, 0, 0]
-            elif self.spatial_rank == 8:
-                self.neighbours = [1, 1, 0, 1, 0, 0, 0, 0, 0]
-
-    def forward(self, input):
-        s = torch.cuda.Stream()
-        s2 = torch.cuda.Stream()
-        halo_input = self.padding_layer(input)
-        torch.cuda.synchronize()
-
-        if self.halo_len > 0:
-            with torch.cuda.stream(s):
-                # torch.cuda.synchronize()
-                t = time.time()
-
-                reqs = self.start_halo_exchange(halo_input)
-                # print(time.time()-t)
-
-                self.end_halo_exchange(reqs)
-
-                if self.local_rank == 0:
-                    print(time.time() - t)
-
-                res_final = super(conv_spatial, self).forward(halo_input)
-
-                s.synchronize()
-                # self.copy_halo_exchange_values(halo_input)
-                # horizontal_tensor, vertical_tensor = self.make_tensor_halo_compute(halo_input)
-                # s.synchronize()
-                # res_horizontal, res_vertical = self.compute_halo_exchange(horizontal_tensor,vertical_tensor)
-
-            # s.synchronize()
-            # torch.cuda.synchronize()
-
-            # s.synchronize()
-
-            # print("Local rank:",self.local_rank,res_final.shape, res_horizontal.shape, res_vertical.shape)
-            # res_final = self.copy_final_image(res_final,res_horizontal,res_vertical)
-
-            return res_final
-        else:
-            res_final = super(conv_spatial, self).forward(halo_input)
-            return res_final
-
-    """
-
-	def forward(self,input):
-		#print("Awesome",self.neighbours, self.rank_neighbours)
-		s = torch.cuda.Stream()
-		halo_input = self.padding_layer(input)
-
-		#self.weight = torch.nn.Parameter(self.weight.int())
-
-		#self.bias= torch.nn.Parameter(self.bias.int())
-		torch.cuda.synchronize()
-
-		if(self.halo_len>0):
-
-			with torch.cuda.stream(s):
-				torch.cuda.synchronize()
-				reqs = self.start_halo_exchange(halo_input)
-				self.end_halo_exchange(reqs)
-				s.synchronize()
-				#self.copy_halo_exchange_values(halo_input)
-
-				horizontal_tensor, vertical_tensor = self.make_tensor_halo_compute(halo_input)
-				s.synchronize()
-				res_final = self.compute_halo_exchange_one(horizontal_tensor,vertical_tensor,halo_input)
-
-
-			s.synchronize()
-			torch.cuda.synchronize()
-			
-			return res_final
-		else:
-			res_final = super(conv_spatial,self).forward(halo_input)
-			return res_final
-	"""
diff --git a/torchgems/spatial_debug.py b/torchgems/spatial_debug.py
deleted file mode 100644
index 6fa964fd..00000000
--- a/torchgems/spatial_debug.py
+++ /dev/null
@@ -1,835 +0,0 @@
-import torch.nn as nn
-import torch
-import torch.distributed as dist
-
-
-class conv_spatial(nn.Conv2d):
-    def __init__(
-        self,
-        local_rank,
-        spatial_size,
-        num_spatial_parts,
-        in_channels,
-        out_channels,
-        kernel_size,
-        stride=1,
-        padding=0,
-        dilation=1,
-        groups=1,
-        bias=True,
-        padding_mode="zeros",
-    ):
-        self.local_rank = local_rank
-        # number of parts in one image
-        self.num_spatial_parts = num_spatial_parts
-        # number of sequential spatial layers, Most of the times I expect it to be 1
-        self.spatial_size = spatial_size
-        self.get_neighbours()
-        self.get_neighbours_rank()
-
-        self.halo_len = (kernel_size - 1) / 2
-
-        assert (
-            self.halo_len == padding
-        ), "Spatial not supported yet for this configuration"
-        self.halo_len = padding
-        super(conv_spatial, self).__init__(
-            in_channels,
-            out_channels,
-            kernel_size,
-            stride=stride,
-            padding=0,
-            dilation=1,
-            groups=1,
-            bias=True,
-            padding_mode="zeros",
-        )
-
-        self.padding_layer = nn.ZeroPad2d(padding)
-        self.get_index_locations()
-        self.shapes_recv = None
-        self.recv_tensors = []
-
-        self.set_tags()
-
-    def set_tags(self):
-        self.send_tag = [100, 200, 300, 400, 500, 600, 700, 800, 900]
-        self.recv_tag = [900, 800, 700, 600, 500, 400, 300, 200, 100]
-
-        # self.send_tag = [0]*9
-        # self.recv_tag = [0]*9
-
-    def get_index_locations(self):
-        locations_recv = []
-        locations_recv.append([[None, self.halo_len], [None, self.halo_len]])  # 1
-        locations_recv.append(
-            [[None, self.halo_len], [self.halo_len, -self.halo_len]]
-        )  # 2
-        locations_recv.append([[None, self.halo_len], [-self.halo_len, None]])  # 3
-        locations_recv.append(
-            [[self.halo_len, -self.halo_len], [None, self.halo_len]]
-        )  # 4
-        locations_recv.append([[None, None], [None, None]])  # 5
-        locations_recv.append(
-            [[self.halo_len, -self.halo_len], [-self.halo_len, None]]
-        )  # 6
-        locations_recv.append([[-self.halo_len, None], [None, self.halo_len]])  # 7
-        locations_recv.append(
-            [[-self.halo_len, None], [self.halo_len, -self.halo_len]]
-        )  # 8
-        locations_recv.append([[-self.halo_len, None], [-self.halo_len, None]])  # 9
-
-        self.locations_recv = locations_recv
-
-        locations_send = []
-        locations_send.append(
-            [[self.halo_len, 2 * self.halo_len], [self.halo_len, 2 * self.halo_len]]
-        )  # 1
-        locations_send.append(
-            [[self.halo_len, 2 * self.halo_len], [self.halo_len, -self.halo_len]]
-        )  # 2
-        locations_send.append(
-            [
-                [self.halo_len, 2 * self.halo_len],
-                [-2 * self.halo_len, -1 * self.halo_len],
-            ]
-        )  # 3
-        locations_send.append(
-            [[self.halo_len, -self.halo_len], [self.halo_len, 2 * self.halo_len]]
-        )  # 4
-        locations_send.append([[None, None], [None, None]])  # 5
-        locations_send.append(
-            [[self.halo_len, -self.halo_len], [-2 * self.halo_len, -1 * self.halo_len]]
-        )  # 6
-        locations_send.append(
-            [
-                [-2 * self.halo_len, -1 * self.halo_len],
-                [self.halo_len, 2 * self.halo_len],
-            ]
-        )  # 7
-        locations_send.append(
-            [[-2 * self.halo_len, -1 * self.halo_len], [self.halo_len, -self.halo_len]]
-        )  # 8
-        locations_send.append(
-            [
-                [-2 * self.halo_len, -1 * self.halo_len],
-                [-2 * self.halo_len, -1 * self.halo_len],
-            ]
-        )  # 9
-        self.locations_send = locations_send
-
-    def get_shapes_recv(self, shapes):
-        shapes_recv = []
-
-        shapes_recv.append([self.halo_len, self.halo_len])  # 1
-        shapes_recv.append([self.halo_len, shapes[3] - 2 * self.halo_len])  # 2
-        shapes_recv.append([self.halo_len, self.halo_len])  # 3
-
-        shapes_recv.append([shapes[2] - 2 * self.halo_len, self.halo_len])  # 4
-        shapes_recv.append([None, None])  # 5
-        shapes_recv.append([shapes[2] - 2 * self.halo_len, self.halo_len])  # 6
-
-        shapes_recv.append([self.halo_len, self.halo_len])  # 7
-        shapes_recv.append([self.halo_len, shapes[3] - 2 * self.halo_len])  # 8
-        shapes_recv.append([self.halo_len, self.halo_len])  # 9
-
-        return shapes_recv
-
-    def start_halo_exchange(self, halo_input):
-        req = []
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                # print("Local rank:",self.local_rank, " to:",self.local_rank + self.rank_neighbours[i], " I:",i)
-                temp_req = dist.isend(
-                    halo_input[
-                        :,
-                        :,
-                        self.locations_send[i][0][0] : self.locations_send[i][0][1],
-                        self.locations_send[i][1][0] : self.locations_send[i][1][1],
-                    ]
-                    .clone()
-                    .detach(),
-                    self.rank_neighbours[i],
-                    tag=self.send_tag[i],
-                )
-                req.append(temp_req)
-                self.send_tag[i] += 1
-
-        self.recv_tensors = []
-
-        shapes = halo_input.shape
-        self.halo_input_shape = shapes
-        if self.shapes_recv == None:
-            self.shapes_recv = self.get_shapes_recv(shapes)
-
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                temp_tensor = torch.zeros(
-                    shapes[0],
-                    shapes[1],
-                    self.shapes_recv[i][0],
-                    self.shapes_recv[i][1],
-                    dtype=torch.float,
-                    device="cuda",
-                )
-
-                temp_req = dist.irecv(
-                    tensor=temp_tensor,
-                    src=self.rank_neighbours[i],
-                    tag=self.recv_tag[i],
-                )
-                req.append(temp_req)
-                self.recv_tag[i] += 1
-
-                self.recv_tensors.append(temp_tensor)
-            else:
-                self.recv_tensors.append([])
-
-        return req
-
-    def end_halo_exchange(self, reqs):
-        for req in reqs:
-            req.wait()
-
-    def copy_halo_exchange_values(self, halo_input):
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                halo_input[
-                    :,
-                    :,
-                    self.locations_recv[i][0][0] : self.locations_recv[i][0][1],
-                    self.locations_recv[i][1][0] : self.locations_recv[i][1][1],
-                ] = self.recv_tensors[i]
-
-    def make_tensor_halo_compute(self, halo_input):
-        self.halo_input_range = 2 * self.halo_len
-
-        # 0 1 2
-        # 3 4 5
-        # 6 7 8
-
-        # horizontal
-
-        horizontal_tensor_up = None
-        if self.neighbours[0] == 1:
-            # concat both 0 with 3 and 1 with 4 position
-            horizontal_tensor_up = torch.cat(
-                (
-                    self.recv_tensors[0],
-                    self.recv_tensors[3][:, :, : 2 * self.halo_len, :],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[1],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : 3 * self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_up = torch.cat(
-                (horizontal_tensor_up, horizontal_tensor_temp), axis=3
-            )
-        elif self.neighbours[1] == 1:
-            # concat 1 with 4 position
-            horizontal_tensor_up = torch.cat(
-                (
-                    self.recv_tensors[1],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : 3 * self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[2] == 1:
-            horizontal_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[2],
-                    self.recv_tensors[5][:, :, : 2 * self.halo_len, :],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_up = torch.cat(
-                (horizontal_tensor_up, horizontal_tensor_temp), axis=3
-            )
-
-        horizontal_tensor_down = None
-        if self.neighbours[6] == 1:
-            # concat both 6 with 3 and 7 with 4 position
-            horizontal_tensor_down = torch.cat(
-                (
-                    self.recv_tensors[3][:, :, -2 * self.halo_len :, :],
-                    self.recv_tensors[6],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_temp = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        -3 * self.halo_len : -self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[7],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_down = torch.cat(
-                (horizontal_tensor_down, horizontal_tensor_temp), axis=3
-            )
-        elif self.neighbours[7] == 1:
-            # concat 7 with 4 position
-            horizontal_tensor_down = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        -3 * self.halo_len : -self.halo_len,
-                        self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[7],
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[8] == 1:
-            horizontal_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[5][:, :, -2 * self.halo_len :, :],
-                    self.recv_tensors[8],
-                ),
-                axis=2,
-            )
-
-            horizontal_tensor_down = torch.cat(
-                (horizontal_tensor_down, horizontal_tensor_temp), axis=3
-            )
-
-        # Vertical
-
-        vertical_tensor_left = None
-        if self.neighbours[0] == 1:
-            vertical_tensor_left = torch.cat(
-                (
-                    self.recv_tensors[0],
-                    self.recv_tensors[1][:, :, :, : 2 * self.halo_len],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[3],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        self.halo_len : 3 * self.halo_len,
-                    ],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_left = torch.cat(
-                (vertical_tensor_left, vertical_tensor_temp), axis=2
-            )
-        elif self.neighbours[3] == 1:
-            vertical_tensor_left = torch.cat(
-                (
-                    self.recv_tensors[3],
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        self.halo_len : 3 * self.halo_len,
-                    ],
-                ),
-                axis=3,
-            )
-
-        if self.neighbours[6] == 1:
-            vertical_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[6],
-                    self.recv_tensors[7][:, :, :, : 2 * self.halo_len],
-                ),
-                axis=3,
-            )
-            vertical_tensor_left = torch.cat(
-                (vertical_tensor_left, vertical_tensor_temp), axis=2
-            )
-
-        vertical_tensor_right = None
-        if self.neighbours[2] == 1:
-            vertical_tensor_right = torch.cat(
-                (
-                    self.recv_tensors[1][:, :, :, -2 * self.halo_len :],
-                    self.recv_tensors[2],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_temp = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        -3 * self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[5],
-                ),
-                axis=3,
-            )
-
-            vertical_tensor_right = torch.cat(
-                (vertical_tensor_right, vertical_tensor_temp), axis=2
-            )
-        elif self.neighbours[5] == 1:
-            vertical_tensor_right = torch.cat(
-                (
-                    halo_input[
-                        :,
-                        :,
-                        self.halo_len : -self.halo_len,
-                        -3 * self.halo_len : -self.halo_len,
-                    ],
-                    self.recv_tensors[5],
-                ),
-                axis=3,
-            )
-
-        if self.neighbours[8] == 1:
-            vertical_tensor_temp = torch.cat(
-                (
-                    self.recv_tensors[7][:, :, :, -2 * self.halo_len :],
-                    self.recv_tensors[8],
-                ),
-                axis=3,
-            )
-            vertical_tensor_right = torch.cat(
-                (vertical_tensor_right, vertical_tensor_temp), axis=2
-            )
-
-        if self.neighbours[1] != 1 and self.neighbours[5] == 1:
-            padding_vertical_top_right = halo_input[
-                :, :, 0 : self.halo_len, -3 * self.halo_len :
-            ]
-            vertical_tensor_right = torch.cat(
-                (
-                    padding_vertical_top_right.data.fill_(0.0000000000000000),
-                    vertical_tensor_right,
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[1] != 1 and self.neighbours[3] == 1:
-            padding_vertical_top_left = halo_input[
-                :, :, 0 : self.halo_len, : 3 * self.halo_len
-            ]
-            vertical_tensor_left = torch.cat(
-                (
-                    padding_vertical_top_left.data.fill_(0.0000000000000000),
-                    vertical_tensor_left,
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[7] != 1 and self.neighbours[5] == 1:
-            padding_vertical_down_right = halo_input[
-                :, :, -self.halo_len :, -3 * self.halo_len :
-            ]
-            vertical_tensor_right = torch.cat(
-                (
-                    vertical_tensor_right,
-                    padding_vertical_down_right.data.fill_(0.0000000000000000),
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[7] != 1 and self.neighbours[3] == 1:
-            padding_vertical_down_left = halo_input[
-                :, :, -self.halo_len :, : 3 * self.halo_len
-            ]
-            vertical_tensor_left = torch.cat(
-                (
-                    vertical_tensor_left,
-                    padding_vertical_down_left.data.fill_(0.0000000000000000),
-                ),
-                axis=2,
-            )
-
-        if self.neighbours[3] != 1 and self.neighbours[1] == 1:
-            padding_horizontal_up_left = halo_input[
-                :, :, : 3 * self.halo_len, : self.halo_len
-            ]
-            horizontal_tensor_up = torch.cat(
-                (
-                    padding_horizontal_up_left.data.fill_(0.0000000000000000),
-                    horizontal_tensor_up,
-                ),
-                axis=3,
-            )
-
-        if self.neighbours[3] != 1 and self.neighbours[7] == 1:
-            padding_horizontal_down_left = halo_input[
-                :, :, -3 * self.halo_len :, : self.halo_len
-            ]
-            horizontal_tensor_down = torch.cat(
-                (
-                    padding_horizontal_down_left.data.fill_(0.0000000000000000),
-                    horizontal_tensor_down,
-                ),
-                axis=3,
-            )
-
-        if self.neighbours[5] != 1 and self.neighbours[1] == 1:
-            padding_horizontal_up_right = halo_input[
-                :, :, : 3 * self.halo_len, -self.halo_len :
-            ]
-            horizontal_tensor_up = torch.cat(
-                (
-                    horizontal_tensor_up,
-                    padding_horizontal_up_right.data.fill_(0.0000000000000000),
-                ),
-                axis=3,
-            )
-
-        if self.neighbours[5] != 1 and self.neighbours[7] == 1:
-            padding_horizontal_down_right = halo_input[
-                :, :, -3 * self.halo_len :, -self.halo_len :
-            ]
-            horizontal_tensor_down = torch.cat(
-                (
-                    horizontal_tensor_down,
-                    padding_horizontal_down_right.data.fill_(0.0000000000000000),
-                ),
-                axis=3,
-            )
-
-        if horizontal_tensor_up == None and horizontal_tensor_down == None:
-            horizontal_tensor = None
-        elif horizontal_tensor_down == None:
-            horizontal_tensor = horizontal_tensor_up
-        elif horizontal_tensor_up == None:
-            horizontal_tensor = horizontal_tensor_down
-        else:
-            horizontal_tensor = torch.cat(
-                (horizontal_tensor_up, horizontal_tensor_down), axis=2
-            )
-
-        if vertical_tensor_left == None and vertical_tensor_right == None:
-            vertical_tensor = None
-        elif vertical_tensor_left == None:
-            vertical_tensor = vertical_tensor_right
-        elif vertical_tensor_right == None:
-            vertical_tensor = vertical_tensor_left
-        else:
-            vertical_tensor = torch.cat(
-                (vertical_tensor_left, vertical_tensor_right), axis=3
-            )
-
-        return horizontal_tensor, vertical_tensor
-
-    def compute_halo_exchange(self, horizontal_tensor, vertical_tensor):
-        if self.local_rank == 0:
-            None
-            # print(horizontal_tensor,vertical_tensor)
-
-        res_horizontal, res_vertical = None, None
-        if horizontal_tensor != None:
-            res_horizontal = super(conv_spatial, self).forward(horizontal_tensor)
-        if vertical_tensor != None:
-            res_vertical = super(conv_spatial, self).forward(vertical_tensor)
-
-        return res_horizontal, res_vertical
-
-    def compute_halo_exchange_one(self, horizontal_tensor, vertical_tensor, halo_input):
-        # print("LOCAL RANK:",self.local_rank, " Sucess")
-        if self.local_rank == 0:
-            None
-            # print(horizontal_tensor,vertical_tensor)
-
-        if self.local_rank == 0:
-            halo_input[:, :, 5:6, :] = horizontal_tensor[:, :, 2:3, :]
-            halo_input[:, :, :, -1:] = vertical_tensor[:, :, :, -1:]
-
-        if self.local_rank == 1:
-            halo_input[:, :, 5:6, :] = horizontal_tensor[:, :, 2:3, :]
-            halo_input[:, :, :, 0:1] = vertical_tensor[:, :, :, 0:1]
-
-        if self.local_rank == 2:
-            halo_input[:, :, 0:1, :] = horizontal_tensor[:, :, 0:1, :]
-            halo_input[:, :, :, -1:] = vertical_tensor[:, :, :, -1:]
-
-        if self.local_rank == 3:
-            halo_input[:, :, 0:1, :] = horizontal_tensor[:, :, 0:1, :]
-            halo_input[:, :, :, 0:1] = vertical_tensor[:, :, :, 0:1]
-
-        torch.cuda.synchronize()
-        res = super(conv_spatial, self).forward(halo_input)
-
-        return res
-
-    def merge_final_image(self, res_final, res_horizontal, res_vertical):
-        if self.neighbours[3] == 1:
-            start = self.halo_len
-        else:
-            start = 0
-
-        if self.neighbours[5] == 1:
-            end = -self.halo_len
-        else:
-            end = None
-
-        if self.neighbours[1] == 1:
-            res_final = torch.cat(
-                (res_horizontal[:, :, : self.halo_len, start:end], res_final), axis=2
-            )
-
-        if self.neighbours[7] == 1:
-            if self.neighbours[1] == 1:
-                res_final = torch.cat(
-                    (res_final, res_horizontal[:, :, self.halo_len :, start:end]),
-                    axis=2,
-                )
-            else:
-                res_final = torch.cat(
-                    (res_final, res_horizontal[:, :, :, start:end]), axis=2
-                )
-
-        if self.neighbours[3] == 1:
-            res_final = torch.cat(
-                (res_vertical[:, :, : self.halo_len, :], res_final), axis=3
-            )
-
-        if self.neighbours[5] == 1:
-            if self.neighbours[3] == 1:
-                res_final = torch.cat(
-                    (res_final, res_vertical[:, :, self.halo_len :, :]), axis=3
-                )
-            else:
-                res_final = torch.cat((res_final, res_vertical[:, :, :, :]), axis=3)
-        return res_final
-
-    def copy_final_image(self, res_final, res_horizontal, res_vertical):
-        shapes = res_final.shape
-        if self.neighbours[1] == 1:
-            res_final[:, :, : self.halo_len, :] = res_horizontal[
-                :, :, : self.halo_len, :
-            ]
-
-        if self.neighbours[7] == 1:
-            if self.neighbours[1] == 1:
-                res_final[:, :, -self.halo_len :, :] = res_horizontal[
-                    :, :, self.halo_len :, :
-                ]
-            else:
-                res_final[:, :, -self.halo_len :, :] = res_horizontal
-
-        if self.neighbours[3] == 1:
-            res_final[:, :, :, : self.halo_len] = res_vertical[:, :, : shapes[2], :]
-
-        if self.neighbours[5] == 1:
-            if self.neighbours[3] == 1:
-                res_final[:, :, :, -self.halo_len :] = res_vertical[
-                    :, :, shapes[2] :, :
-                ]
-            else:
-                res_final[:, :, :, -self.halo_len :] = res_vertical
-        return res_final
-
-    def start_halo_exchange_nochange(self, halo_input):
-        req = []
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                temp_req = dist.isend(
-                    halo_input[
-                        :,
-                        :,
-                        self.locations_send[i][0][0] : self.locations_send[i][0][1],
-                        self.locations_send[i][1][0] : self.locations_send[i][1][1],
-                    ],
-                    self.rank_neighbours[i],
-                    tag=self.send_tag[i],
-                )
-                req.append(temp_req)
-                self.send_tag[i] += 1
-
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                temp_req = dist.irecv(
-                    tensor=halo_input[
-                        :,
-                        :,
-                        self.locations_recv[i][0][0] : self.locations_recv[i][0][1],
-                        self.locations_recv[i][1][0] : self.locations_recv[i][1][1],
-                    ],
-                    src=self.rank_neighbours[i],
-                    tag=self.recv_tag[i],
-                )
-                req.append(temp_req)
-                self.recv_tag[i] += 1
-
-        return req
-
-    def end_halo_exchange_nochange(self, reqs):
-        for req in reqs:
-            req.wait()
-
-    def get_neighbours_rank(self):
-        self.rank_neighbours = []
-        if self.num_spatial_parts == 2:
-            rank_offset = [0, 0, 0, -1, 0, +1, 0, 0, 0]
-        elif self.num_spatial_parts == 4:
-            rank_offset = [-3, -2, -1, -1, 0, +1, +1, +2, +3]
-        elif self.num_spatial_parts == 9:
-            rank_offset = [-4, -3, -2, -1, 0, +1, +2, +3, +4]
-
-        for i in range(9):
-            if self.neighbours[i] == 1:
-                self.rank_neighbours.append(self.local_rank + rank_offset[i])
-            else:
-                self.rank_neighbours.append(-1)
-
-    def get_neighbours(self):
-        if self.local_rank < self.num_spatial_parts * self.spatial_size:
-            self.ENABLE_SPATIAL = True
-        else:
-            self.ENABLE_SPATIAL = False
-            self.neighbours = None
-            return
-
-        self.spatial_rank = self.local_rank % self.num_spatial_parts
-
-        # Neighbour
-        #  0   1   2
-        #  3   4   5
-        #  6   7   8
-        if self.num_spatial_parts == 2:
-            # 0 | 1
-            if self.spatial_rank == 0:
-                self.neighbours = [0, 0, 0, 0, 0, 1, 0, 0, 0]
-            else:
-                self.neighbours = [0, 0, 0, 1, 0, 0, 0, 0, 0]
-        elif self.num_spatial_parts == 4:
-            # 0 | 1
-            # -----
-            # 2 | 3
-            if self.spatial_rank == 0:
-                self.neighbours = [0, 0, 0, 0, 0, 1, 0, 1, 1]
-            elif self.spatial_rank == 1:
-                self.neighbours = [0, 0, 0, 1, 0, 0, 1, 1, 0]
-            elif self.spatial_rank == 2:
-                self.neighbours = [0, 1, 1, 0, 0, 1, 0, 0, 0]
-            elif self.spatial_rank == 3:
-                self.neighbours = [1, 1, 0, 1, 0, 0, 0, 0, 0]
-
-        elif self.num_spatial_parts == 9:
-            # 0 | 1 | 2
-            # -----------
-            # 3 | 4 | 5
-            # -----------
-            # 6 | 7 | 8
-            if self.spatial_rank == 0:
-                self.neighbours = [0, 0, 0, 0, 0, 1, 0, 1, 1]
-            elif self.spatial_rank == 1:
-                self.neighbours = [0, 0, 0, 1, 0, 1, 1, 1, 1]
-            elif self.spatial_rank == 2:
-                self.neighbours = [0, 0, 0, 1, 0, 0, 1, 1, 0]
-            elif self.spatial_rank == 3:
-                self.neighbours = [0, 1, 1, 0, 0, 1, 0, 1, 1]
-            elif self.spatial_rank == 4:
-                self.neighbours = [1, 1, 1, 1, 0, 1, 1, 1, 1]
-            elif self.spatial_rank == 5:
-                self.neighbours = [1, 1, 0, 1, 0, 0, 1, 1, 0]
-            elif self.spatial_rank == 6:
-                self.neighbours = [0, 1, 1, 0, 0, 1, 0, 0, 0]
-            elif self.spatial_rank == 7:
-                self.neighbours = [1, 1, 1, 1, 0, 1, 0, 0, 0]
-            elif self.spatial_rank == 8:
-                self.neighbours = [1, 1, 0, 1, 0, 0, 0, 0, 0]
-
-    """
-	def forward(self,input):
-		print("Awesome",self.neighbours, self.rank_neighbours)
-		s = torch.cuda.Stream()
-		halo_input = self.padding_layer(input)
-		self.weight.data.fill_(1.0)
-		#self.weight = torch.nn.Parameter(self.weight.int())
-		self.bias.data.fill_(1.0)
-		#self.bias= torch.nn.Parameter(self.bias.int())
-		torch.cuda.synchronize()
-
-		with torch.cuda.stream(s):
-			torch.cuda.synchronize()
-			reqs = self.start_halo_exchange(halo_input)
-			self.end_halo_exchange(reqs)
-			s.synchronize()
-			#self.copy_halo_exchange_values(halo_input)
-			horizontal_tensor, vertical_tensor = self.make_tensor_halo_compute(halo_input)
-			s.synchronize()
-			res_horizontal, res_vertical = self.compute_halo_exchange(horizontal_tensor,vertical_tensor)
-
-
-		s.synchronize()
-		torch.cuda.synchronize()
-		res_final = super(conv_spatial,self).forward(halo_input)
-
-		torch.cuda.synchronize()
-
-		print("Local rank:",self.local_rank,res_final.shape, res_horizontal.shape, res_vertical.shape)
-		res_final = self.copy_final_image(res_final,res_horizontal,res_vertical)
-		torch.cuda.synchronize()
-		return res_final
-
-	"""
-
-    def forward(self, input):
-        # print("Awesome",self.neighbours, self.rank_neighbours)
-        s = torch.cuda.Stream()
-        halo_input = self.padding_layer(input)
-
-        # self.weight = torch.nn.Parameter(self.weight.int())
-
-        # self.bias= torch.nn.Parameter(self.bias.int())
-        torch.cuda.synchronize()
-
-        if self.halo_len > 0:
-            with torch.cuda.stream(s):
-                torch.cuda.synchronize()
-                reqs = self.start_halo_exchange(halo_input)
-                self.end_halo_exchange(reqs)
-                s.synchronize()
-                # self.copy_halo_exchange_values(halo_input)
-
-                horizontal_tensor, vertical_tensor = self.make_tensor_halo_compute(
-                    halo_input
-                )
-                s.synchronize()
-                res_final = self.compute_halo_exchange_one(
-                    horizontal_tensor, vertical_tensor, halo_input
-                )
-
-            s.synchronize()
-            torch.cuda.synchronize()
-
-            return res_final
-        else:
-            res_final = super(conv_spatial, self).forward(halo_input)
-            return res_final
diff --git a/workflows/cla.yml b/workflows/cla.yml
deleted file mode 100644
index 6a550e6b..00000000
--- a/workflows/cla.yml
+++ /dev/null
@@ -1,43 +0,0 @@
-name: "CLA Assistant"
-on:
-  issue_comment:
-    types: [created]
-  pull_request_target:
-    types: [opened,closed,synchronize]
-
-# explicitly configure permissions, in case your GITHUB_TOKEN workflow permissions are set to read-only in repository settings
-permissions:
-  actions: write
-  contents: write
-  pull-requests: write
-  statuses: write
-
-jobs:
-  CLAAssistant:
-    runs-on: ubuntu-latest
-    steps:
-      - name: "CLA Assistant"
-        if: (github.event.comment.body == 'recheck' || github.event.comment.body == 'I have read the CLA Document and I hereby sign the CLA') || github.event_name == 'pull_request_target'
-        uses: contributor-assistant/github-action@v2.3.0
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          # the token below should have repo scope and must be added manually to the repository's secrets
-          # This token is required only if you have configured the signatures to be stored in a remote repository/organization
-          PERSONAL_ACCESS_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
-        with:
-          path-to-signatures: 'signatures/version1/cla.json'
-          path-to-document: 'https://gist.github.com/Quentin-Anthony/cf001eacbf75c98fdff63a75f2a7bddf' # e.g. a CLA or a DCO document
-          # branch should not be protected
-          branch: 'main'
-          allowlist: user1,bot*
-
-         # the following inputs are optional - if they are not given, the default values are used
-          #remote-organization-name: enter the remote organization name where the signatures should be stored (Default is storing the signatures in the same repository)
-          #remote-repository-name: enter the remote repository name where the signatures should be stored (Default is storing the signatures in the same repository)
-          #create-file-commit-message: 'For example: Creating file for storing CLA Signatures'
-          #signed-commit-message: 'For example: $contributorName has signed the CLA in $owner/$repo#$pullRequestNo'
-          #custom-notsigned-prcomment: 'pull request comment with Introductory message to ask new contributors to sign'
-          #custom-pr-sign-comment: 'The signature to be committed in order to sign the CLA'
-          #custom-allsigned-prcomment: 'pull request comment when all contributors have signed, defaults to **CLA Assistant Lite bot** All Contributors have signed the CLA.'
-          #lock-pullrequest-aftermerge: false - if you don't want this bot to automatically lock the pull request after merging (default - true)
-          #use-dco-flag: true - If you are using DCO instead of CLA
\ No newline at end of file
diff --git a/workflows/pre-commit.yml b/workflows/pre-commit.yml
deleted file mode 100644
index e3affdb4..00000000
--- a/workflows/pre-commit.yml
+++ /dev/null
@@ -1,13 +0,0 @@
-name: pre-commit
-
-on:
-  pull_request:
-  push: { branches: [main] }
-
-jobs:
-  pre-commit:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-      - uses: actions/setup-python@v3
-      - uses: pre-commit/action@v3.0.0
\ No newline at end of file