diff --git a/.travis.yml b/.travis.yml index 64857b9..9ce063a 100644 --- a/.travis.yml +++ b/.travis.yml @@ -20,7 +20,7 @@ install: popd # Install CODES - | - git clone https://xgitlab.cels.anl.gov/codes/codes.git ${TRAVIS_BUILD_DIR}/ci-build-deps/CODES + git clone https://github.com/codes-org/codes.git ${TRAVIS_BUILD_DIR}/ci-build-deps/CODES pushd ${TRAVIS_BUILD_DIR}/ci-build-deps/CODES ./prepare.sh mkdir build diff --git a/docs/UserWriteUp.txt b/docs/UserWriteUp.txt deleted file mode 100644 index 485e559..0000000 --- a/docs/UserWriteUp.txt +++ /dev/null @@ -1,174 +0,0 @@ -This is a work in progress and will eventually be converted to a more readable -format. - -TraceR is a replay tool targeted to simulate control flow of application on -prototype systems, i.e., if control flow of an application, which includes -expected computation tasks, communication routines, and their dependencies, is -provided to TraceR, it will mimic the flow on a hypothetical system with a given -compute and communication capability. As of now, the control flow is captured by -either emulating applications using BigSim or by linking with Score-P. CODES -is used for simulating the communication on the network. - -Expected work flow: - -1) Write an MPI application. (Avoid global variables so that the application be -run with virtualization if using BigSim). - -If using BigSim follows steps 2-4, else follow step 5. -2) Compile BigSim/Charm++ for emulation. Use any one of the following commands: - -- To use UDP as BigSim/Charm++'s communication layer: - ./build bgampi net-linux-x86_64 bigemulator --with-production --enable-tracing - ./build bgampi net-darwin-x86_64 bigemulator --with-production --enable-tracing - - or explicitly provide the compiler optimization level - ./build bgampi net-linux-x86_64 bigemulator -O2 - -- To use MPI as BigSim/Charm++'s communication layer: - ./build bgampi mpi-linux-x86_64 bigemulator --with-production --enable-tracing - -Note that this build is used to compile MPI applications so that traces can be -generated. Hence, the communication layer used by BigSim/Charm++ is not -important. During simulation, the communication will be replayed using the -network simulator from CODES. However, the computation time captured here can be -important if it is not being explicitly replaced at simulation time using -configuration options. So using appropriate compiler flags is important. - -3) Compile the MPI application from Step 1 using BigSim/Charm++ from Step 2. - -Example commands: -$CHARM_DIR/bin/ampicc -O2 simplePrg.c -o simplePrg_c -$CHARM_DIR/bin/ampiCC -O2 simplePrg.cc -o simplePrg_cxx - -4) Emulation to generate traces. When the binary generated in Step 3 is run, -BigSim/Charm++ runs the program on the allocated cores as if it would run in the -usual case. Users should provide a few additional arguments to specify the -number of MPI processes in the prototype systems. - -If using UDP as the BigSim/Charm++'s communication layer: -./charmrun +p ++nodelist ./pgm +vp +x +y +z +bglog - -If using MPI as the BigSim/Charm++'s communication layer: -mpirun -n ./pgm +vp +x +y +z +bglog - -Number of real processes is typically equal to the number cores the emulation -is being run on. - -machine file is the list of systems the emulation should be run on (similar to -machine file for MPI; refer to Charm++ website for more details). - -vp is the number of MPI ranks that are to be emulated. 
For simple tests, it can -be same as the number of real processes, in which case one MPI rank is run on -each real processes (as it happens when a regular program is run). When the -number of vp (virtual processes) is higher, BigSim launches user level threads -to execute multiple MPI ranks with a process. - -+x +y +z defines a 3D grid of the virtual processes. The product of these three -dimensions must match the number of vp's. These arguments do not have any -effect on the emulation, but exist due to historical reasons. - -+bglog instructs bigsim to write the logs to files. - -When this run finished, you should see many files named bgTrace* in the -directory. The total number of such files equals the number of real processes -plus one. Their names are bgTrace, bgTrace0, bgTrace1, so on. - -Create a new folder and move all bgTrace to that folder. - -5) Following instructions in README.OTF to generate OTF2 traces. - -6) Simulation. To run a simulation, 2 files are needed: a tracer config file, -and a codes config file. Optionally, mapping files can also be provided. - -Tracer config file: sample found at examples/jacobi2d-bigsim/tracer_config (BigSim) or examples/stencil4d-otf/tracer_config (OTF) Format (expected content on each line of the file): - - - - -... -``` -If is not needed, use NA for it and . -For generating simple global and job map file, use the code in utils. - -CODES config files: samples in examples/conf - -Additional documentation on format of the CODES config file can be found in the -CODES wiki at https://xgitlab.cels.anl.gov/codes/codes/wikis/home - -Brief summary follows: - -LPGROUPS, MODELNET_GRP, PARAMS are keywords and should be used as is. - -MODELNET_GRP: -repetition = number of routers that have nodes connecting to them. - -server = number of MPI processes/cores per router - -modelnet_* = number of NICs. For torus, this value has to be 1; for dragonfly, -it should be router radix divided by 4; for the fat-tree, it should be router -radix divided by 2. For the dragonfly network, modelnet_dragonfly_router should -also be specified (as 1). For express mesh, modelnet_express_mesh_router should -also be specified as 1. - -Similarly, the fat-tree config file requires specifying fattree_switch which -can be 2 or 3, depending on the number of levels in the fat-tree. Note that the -total number of cores specified in the CODES config file can be greater than -the number of MPI processes being simulated (specified in the tracer config -file). - -Other common parameters: -packet_size/chunk_size (both should have the same value): size of the packets -created by NIC for transmission on the network. Smaller the packet size, longer -the time for which simulation will run (in real time). Larger the packet size, -the less accurate the predictions are expected to be (in virtual time). Packet -sizes of 512 bytes to 4096 bytes are commonly used. - -modelnet_order = torus/dragonfly/fattree/slimfly/express_mesh - -modelnet_scheduler = -fcfs : packetize messages one by one. -round-robin : packetize message in a round robin manner. - -message_size = PDES parameter (keep constant at 512) - -router_delay = delay at each router for packet transmission (in nano seconds) - -soft_delay = delay caused by software stack such as that of MPI (in nano -seconds) - -link_bandwidth = bandwidth of each link in the system (in GB/s) - -cn_bandwidth = bandwidth of connection between NIC and router (in GB/s) - -buffer_size/vc_size = size of channels used to store transient packets at routers (in -bytes). 
Typical value is 64*packet_size. - -routing = how are packets being routed. Options depend on the network: -torus = static/adaptive -dragonfly = minimal/nonminimal/adaptive -fat-tree = adaptive/static - -Network specific parameters: - -Torus: n_dims - number of dimensions in the torus -dim_length - length of each dimension - -Dragonfly: num_routers - number of routers within a group. -global_bandwidth - bandwidth of the links that connect groups. - -Fat-tree: ft_type - always choose 1 -num_levels - number of levels in the fat-tree (2 or 3) -switch_radix - radix of the switch being used -switch_count - number of switches at leaf level. - -Publications that describe implementation of TraceR in detail: -Nikhil Jain, Abhinav Bhatele, Sam White, Todd Gamblin, and Laxmikant Kale. -Evaluating HPC Networks via Simulation of Parallel Workloads. SC 2016. - -Bilge Acun, Nikhil Jain, Abhinav Bhatele, Misbah Mubarak, Christopher Carothers, -Laxmikant Kale. Preliminary Evaluation of a Parallel Trace Replay Tool for HPC -Network Simulations. Workshop on Parallel and Distributed Agent-Based -Simulations at EURO-PAR 2015. - -More details can be found in Chapter 5 of this thesis: -http://charm.cs.illinois.edu/newPapers/16-02/Jain_Thesis.pdf diff --git a/docs/code-examples/scorep_user_calls.c b/docs/code-examples/scorep_user_calls.c index 898b318..24c47db 100644 --- a/docs/code-examples/scorep_user_calls.c +++ b/docs/code-examples/scorep_user_calls.c @@ -6,20 +6,28 @@ int main(int argc, char **argv, char **envp) SCOREP_RECORDING_OFF(); //turn recording off for initialization/regions not of interest ... SCOREP_RECORDING_ON(); + //use verbatim to facilitate looping over the traces in simulation when simulating multiple jobs SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_Loop", SCOREP_USER_REGION_TYPE_COMMON); // at least add this BEGIN timer call - called from only one rank // you can add more calls later with region names TRACER_WallTime_ + if(myRank == 0) - SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_MainLoop", SCOREP_USER_REGION_TYPE_COMMON); + SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_Loop", SCOREP_USER_REGION_TYPE_COMMON); + // Application main work LOOP for ( int itscf = 0; itscf < nitscf_; itscf++ ) { + // time call to mark start of loop iteration + SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_Loop_Iter", SCOREP_USER_REGION_TYPE_COMMON); ... + SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_Loop_Iter"); } + // time call to mark END of work - called from only one rank if(myRank == 0) - SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_MainLoop"); + SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_Loop"); + // use verbatim - mark end of trace loop SCOREP_USER_REGION_BY_NAME_END("TRACER_Loop"); SCOREP_RECORDING_OFF();//turn off recording again diff --git a/docs/index.rst b/docs/index.rst index 573d209..90b0d12 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,10 +13,11 @@ Computing applications on interconnection networks. .. toctree:: :maxdepth: 2 - :caption: Contents: + :caption: Contents install userguide + workflow tutorial autogen/doxygen diff --git a/docs/install.rst b/docs/install.rst index 098dd62..d314891 100644 --- a/docs/install.rst +++ b/docs/install.rst @@ -6,7 +6,7 @@ TraceR can be downloaded from `GitHub `_. Dependencies ------------ -TraceR depends on `CODES `_ and `ROSS `_. +TraceR depends on `CODES `_ and `ROSS `_. Build ----- @@ -42,5 +42,5 @@ TraceR supports two different trace formats as input. For each format, you need 2. 
AMPI-based BigSim format: To use BigSim traces as input to TraceR, you need to
 download and build `Charm++ `_. The instructions to build Charm++ are in the `Charm++ manual
-`_. You should use
+`_. You should use
 the "charm++" target and pass "bigemulator" as a build option.
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
index b23b9e5..5379340 100644
--- a/docs/tutorial.rst
+++ b/docs/tutorial.rst
@@ -1,2 +1,35 @@
+.. _tutorial:
+
 Tutorial
 ========
+
+.. rubric:: Slides
+
+.. figure:: tutorial/hoti25-slide-preview.png
+   :target: http://www.hoti.org/tutorials/HOTI25_Tutorial_2c.pdf
+   :height: 72px
+   :align: left
+   :alt: Slide preview
+
+`Download Slides <http://www.hoti.org/tutorials/HOTI25_Tutorial_2c.pdf>`_.
+
+**Full citation:** Nikhil Jain and Misbah Mubarak.
+CODES-TRACER Tutorial: Enabling HPC Design Space
+Exploration via Discrete-Event Simulation.
+Tutorial presented at the 25th Annual Symposium on High Performance
+Interconnects (HOTI). Aug 28, 2017, Santa Clara, CA, USA.
+
+.. rubric:: Guides
+
+These guides cover the basics needed to use TraceR.
+
+ 1. :ref:`tutorial-network-models`
+ 2. :ref:`tutorial-simulation-basics`
+ 3. :ref:`tutorial-workflow`
+
+Full contents:
+
+.. toctree::
+   tutorial/network_models
+   tutorial/simulation_basics
+   tutorial/workflow
\ No newline at end of file
diff --git a/docs/tutorial/hoti25-slide-preview.png b/docs/tutorial/hoti25-slide-preview.png
new file mode 100644
index 0000000..88b8bc1
Binary files /dev/null and b/docs/tutorial/hoti25-slide-preview.png differ
diff --git a/docs/tutorial/network_models.rst b/docs/tutorial/network_models.rst
new file mode 100644
index 0000000..d288af9
--- /dev/null
+++ b/docs/tutorial/network_models.rst
@@ -0,0 +1,393 @@
+.. _tutorial-network-models:
+
+Network Models
+==============
+
+This guide gives an overview of some of the network models
+supported by TraceR, as presented in the HOTI 25 tutorial (slides 22-39).
+For a more detailed guide, see the CODES wiki pages on network
+models at https://github.com/codes-org/codes/wiki/codes-networks.
+Any commands/examples in this section refer to files
+included in the `CODES git repository <https://github.com/codes-org/codes>`_ (not TraceR).
+
+Overview
+--------
+
+Multiple network models are supported, including dragonfly, fat
+tree, express mesh, hyperX, torus, slim fly, and LogP. An abstraction
+layer, ``model-net``, sits on top of the network models; it breaks
+messages into packets and offers FIFO, round robin, and priority
+queues. To try different networks, simply switch the network configuration
+files used when running TraceR. Storage models, MPI simulation, and
+workload replay layers are independent of the underlying network
+model used.
+
+Simplenet
+---------
+
+The Simplenet model uses a latency/bandwidth model where messages are
+sent directly from the source to the destination. It uses infinite
+queueing and is easy to set up: only a startup delay and a link
+bandwidth are needed for configuration. This model is mostly for debugging and
+testing purposes and can be used as a starting point when replaying
+MPI traces. It can serve as a baseline network model with no contention
+and no routing.
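+
+As a rough first-order sketch (our reading of the model, not code from CODES),
+the delivery time for a message of size ``S`` under Simplenet is approximately::
+
+    time(S) = net_startup_ns + S / net_bw_mbps
+
+using the two parameters shown in the configuration below, with no contention
+or queueing terms (mind the units: the startup delay is in nanoseconds and the
+bandwidth in MiB/s).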
+
+Configuring
+^^^^^^^^^^^
+
+Consider this Simplenet configuration file, which can be
+found in *codes/tests/conf/modelnet-test.conf*::
+
+   LPGROUPS
+   {
+      MODELNET_GRP
+      {
+         repetitions="16";
+         server="1";
+         modelnet_simplenet="1";
+      }
+   }
+   PARAMS
+   {
+      packet_size="512";
+      message_size="384";
+      modelnet_order=( "simplenet" );
+      # scheduler options
+      modelnet_scheduler="fcfs";
+      net_startup_ns="1.5";
+      # bandwidth is in MiB/s
+      net_bw_mbps="20000";
+   }
+
+The MODELNET_GRP section is used for mapping entities to
+ROSS MPI processes.
+
+Messages are broken into packets by the ``model-net`` layer,
+with a size that can be set by the ``packet_size`` param.
+
+The ``message_size`` parameter is a ROSS-specific parameter
+that is used to set the event size.
+
+``net_startup_ns`` sets the startup delay in nanoseconds.
+
+``net_bw_mbps`` sets the link bandwidth in MiB/s between nodes.
+There is one link between each pair of nodes.
+
+Running
+^^^^^^^
+
+The model shown above can be run in CODES with::
+
+   ./tests/modelnet-test --sync=1 -- tests/conf/modelnet-test.conf
+
+The command runs a simple test in which a simulated MPI rank
+sends a message to the next rank, which replies back. This
+continues until a certain number of messages is reached.
+
+Dragonfly
+---------
+
+The dragonfly network model has a hierarchy with a set of
+groups connected by all-to-all links. Within a group there
+can be several routers connected with local links, and routers
+can have links to routers in other groups for intergroup
+connections. Routers also have compute nodes connected to
+them. The CODES wiki explains dragonfly networks in much
+greater detail, and the slides from the HOTI 25 tutorial have
+images showing examples of possible dragonfly networks.
+
+Dragonfly networks support minimal, adaptive, non-minimal, and
+progressive adaptive routing. They use packet-based simulation
+with credit-based flow control, and use multiple virtual channels
+for deadlock prevention.
+
+Configuring
+^^^^^^^^^^^
+
+Consider this example configuration that can be found with the
+CODES source, *codes/src/network-workloads/dragonfly-custom*::
+
+   LPGROUPS
+   {
+      MODELNET_GRP
+      {
+         repetitions="2400";
+         # name of this lp changes according to the model
+         nw-lp="4";
+         # these lp names will be the same for dragonfly custom model
+         modelnet_dragonfly_custom="4";
+         modelnet_dragonfly_custom_router="1";
+      }
+   }
+
+``nw-lp`` is a simulated MPI process. For simulating multiple MPI
+processes per node, set this to the number of processes times the
+number of network nodes.
+
+``modelnet_dragonfly_custom`` is a simulated dragonfly network node.
+
+``modelnet_dragonfly_custom_router`` is a simulated dragonfly network router.
+
+Self messages are messages sent to the same network node. The overhead for sending
+self messages can be configured.
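+
+As a worked example of how these counts compose (our arithmetic, derived from
+the MODELNET_GRP above), the simulated system contains::
+
+    routers       = 2400 x 1 = 2400   (modelnet_dragonfly_custom_router="1")
+    network nodes = 2400 x 4 = 9600   (modelnet_dragonfly_custom="4")
+    MPI processes = 2400 x 4 = 9600   (nw-lp="4", one per node)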
+
+Continuing in the same configuration file, look at the PARAMS section::
+
+   PARAMS
+   {
+      # packet size in the network
+      packet_size="4096";
+      modelnet_order=( "dragonfly_custom","dragonfly_custom_router" );
+      # scheduler options
+      modelnet_scheduler="fcfs";
+      # chunk size in the network (when chunk size = packet size, packets will not be
+      # divided into chunks)
+      chunk_size="4096";
+      # number of router rows within each group
+      # this is dictated by the dragonfly configuration files
+      num_router_rows="6";
+      # number of router columns
+      num_router_cols="16";
+      # number of groups in the network
+      num_groups="25";
+      # buffer size in bytes for local virtual channels
+      local_vc_size="8192";
+      # buffer size in bytes for global virtual channels
+      global_vc_size="16384";
+      # buffer size in bytes for compute node virtual channels
+      cn_vc_size="8192";
+      # bandwidth in GiB/s for local channels
+      local_bandwidth="5.25";
+      # bandwidth in GiB/s for global channels
+      global_bandwidth="4.69";
+      # bandwidth in GiB/s for compute node-router channels
+      cn_bandwidth="16.0";
+      # ROSS message size
+      message_size="592";
+      # number of compute nodes connected to router, dictated by dragonfly configuration
+      # file
+      num_cns_per_router="4";
+      # number of global channels per router
+      num_global_channels="4";
+      # network config file for intra-group connections
+      intra-group-connections="../src/network-workloads/conf/dragonfly-custom/intra-9K-custom";
+      # network config file for inter-group connections
+      inter-group-connections="../src/network-workloads/conf/dragonfly-custom/inter-9K-custom";
+      # routing protocol to be used
+      routing="prog-adaptive";
+   }
+
+``num_router_rows`` and ``num_router_cols`` control the router arrangement within a group
+and should match the input network configuration.
+
+``local_vc_size``, ``global_vc_size``, and ``cn_vc_size`` are used to configure the buffer
+sizes of virtual channels.
+
+``num_cns_per_router`` is used to set the number of compute nodes per router.
+
+``intra-group-connections`` and ``inter-group-connections`` are set to network configuration
+files that can be custom generated (see scripts/gen-cray-topo/README.txt).
+
+Running
+^^^^^^^
+
+To run a dragonfly network simulation, try the following:
+
+1. Download the traces::
+
+    wget https://portal.nersc.gov/project/CAL/doe-miniapps-mpi-traces/AMG/df_AMG_n1728_dumpi.tar.gz
+
+2. Run the simulation::
+
+    ./src/network-workloads/model-net-mpi-replay --sync=1 --disable_compute=1 --workload_type="dumpi" --workload_file=df_AMG_n1728_dumpi/dumpi-2014.03.03.14.55.50- --num_net_traces=1728 -- ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-edison.conf
+
+Fat Tree
+--------
+
+The Fat Tree network model can simulate two- and three-level fat tree networks.
+The width of the tree (number of pods) can also be configured. Two forms of
+routing are supported: static routing, which uses destination-based look-up tables,
+and adaptive routing, which selects the least congested output port. The simulation
+is packet-based with credit-based flow control.
+
+Tapering can be used in a fat tree network configuration to connect more nodes to leaf
+switches, which reduces the bandwidth, switches, and links at the higher levels.
+
+To get higher bandwidth, nodes can connect to multiple ports (multi-rail) in one
+or more planes (multi-plane). These configurations can also be tapered to reduce
+switches and links at higher levels.
+
+The model supports configurations with multiple rails, multiple planes, and tapering.
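+
+As a concrete sizing example (our arithmetic, based on the configuration shown
+in the next section): with ``repetitions="198"`` and ``modelnet_fattree="18"``,
+the simulated system has::
+
+    network nodes = 198 x 18 = 3564
+
+which matches the "summit-3564" name used for the topology files in that
+configuration.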
+
+Configuring
+^^^^^^^^^^^
+
+Consider the first part of this configuration file::
+
+   LPGROUPS
+   {
+      MODELNET_GRP
+      {
+         repetitions="198";
+         nw-lp="144";
+         modelnet_fattree="18";
+         fattree_switch="3";
+      }
+   }
+
+``nw-lp`` is a simulated MPI process.
+
+``modelnet_fattree`` is a simulated fat tree network node.
+
+``fattree_switch`` sets the number of simulated fat tree network
+switches. In the above example it is set to 3 (one in each level
+of the network).
+
+Now, consider the next section in the configuration file::
+
+   PARAMS
+   {
+      packet_size="4096";
+      message_size="624";
+      chunk_size="4096";
+      modelnet_scheduler="fcfs";
+      modelnet_order=( "fattree" );
+      ft_type="0";
+      num_levels="3";
+      switch_count="198";
+      switch_radix="36";
+      vc_size="65536";
+      cn_vc_size="65536";
+      link_bandwidth="12.5";
+      cn_bandwidth="12.5";
+      routing="static";
+      routing_folder="/Fat-Tree/summit";
+      dot_file="summit-3564";
+      dump_topo="0";
+   }
+
+The switch arrangement set with ``ft_type``, ``num_levels``, and ``switch_count``
+should match the input network configuration.
+
+``switch_radix`` can be configured.
+
+Static routing requires precomputed destination routing tables; for details
+see https://github.com/codes-org/codes/wiki/codes-fattree#enabling-static-routing.
+
+Slim Fly
+--------
+
+The Slim Fly network model has a topology of interconnected router groups built
+using MMS graphs. The maximum network diameter is always 2. It uses packet-based
+simulation with credit-based flow control. The forms of routing supported are
+minimal with 2 virtual channels, non-minimal with 4 virtual channels, and adaptive
+with 4 virtual channels.
+
+Configuring
+^^^^^^^^^^^
+
+Consider the PARAMS section of a slim fly network configuration file::
+
+   PARAMS
+   {
+      packet_size="4096";
+      chunk_size="4096";
+      message_size="592";
+      modelnet_order=( "slimfly" );
+      modelnet_scheduler="fcfs";
+      num_routers="13";
+      num_terminals="9";
+      global_channels="13";
+      local_channels="6";
+      generator_set_x=("1","10","9","12","3","4");
+      generator_set_x_prime=("6","8","2","7","5","11");
+      local_vc_size="25600";
+      global_vc_size="25600";
+      cn_vc_size="25600";
+      local_bandwidth="12.5";
+      global_bandwidth="12.5";
+      cn_bandwidth="12.5";
+      routing="minimal";
+      num_vcs="4";
+   }
+
+``num_routers``, ``num_terminals``, ``global_channels``, and ``local_channels`` can
+be used to configure the router arrangement within a group.
+
+Generator sets are sets of indices used to calculate connections between routers
+in the same subgraph. They must be precomputed. The params ``generator_set_x`` and
+``generator_set_x_prime`` are set based on the precomputed indices.
+
+Torus
+-----
+
+A torus network is based on an n-dimensional k-ary network topology. The number of
+torus dimensions and the length of each dimension can be configured. The network model
+supports dimension order routing.
+
+Express Mesh and HyperX
+-----------------------
+
+The express mesh topology is a low-diameter, densely connected grid. The model allows
+for specifying the connection gap; a gap of 1 is a HyperX network. A bubble escape
+virtual channel is used for deadlock prevention.
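+
+As a minimal sketch (the counts below are invented for illustration; the
+requirement that the router LP be set to 1 follows the user guide's notes on
+express mesh), a MODELNET_GRP for an express mesh could look like::
+
+   MODELNET_GRP
+   {
+      repetitions="64";
+      nw-lp="2";
+      modelnet_express_mesh="2";
+      modelnet_express_mesh_router="1";
+   }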
+
+Interpreting Simulation Output
+------------------------------
+
+Using the example run on the dragonfly network given above, we get the following output::
+
+    Total GVT Computations 0
+    Total All Reduce Calls 0
+    Average Reduction/GVT nan
+
+    Total bytes sent 13584368 recvd 13584368
+    max runtime 449332.124035 ns avg runtime 443706.882419
+    max comm time 449332.124035 avg comm time 443706.882419
+    max send time 5142770.436275 avg send time 2779472.247926
+    max recv time 4149449.596308 avg recv time 2335071.940672
+    max wait time 432820.362362 avg wait time 430457.043452
+    _P-IO: writing output to dragonfly-simple-33405-1499374633/
+    _P-IO: data files:
+      dragonfly-simple-33488-1499374633/dragonfly-router-traffic
+      dragonfly-simple-33488-1499374633/dragonfly-router-net_stats
+      dragonfly-simple-33488-1499374633/dragonfly-msg-stats
+      dragonfly-simple-33488-1499374633/model-net-category-all
+      dragonfly-simple-33488-1499374633/model-net-category-test
+      dragonfly-simple-33488-1499374633/mpi-replay-stats
+    Average number of hops traversed 1.709869 average chunk latency 0.925252 us maximum chunk latency 9.312357 us avg message size 812.563110 bytes finished messages 16820 finished chunks 65012
+
+    ADAPTIVE ROUTING STATS 65012 chunks routed minimally 0 chunks routed non-minimally completed packets 65012
+
+    Total packets generated 39722 finished 39722
+
+As shown in the sample output, average and maximum times are reported
+for all application runs, with statistics on time spent in overall execution,
+communication, wait operations, the amount of data transferred, and so on.
+The network statistics (hops traversed, latency, routing, etc.) in this summary
+are reported for the entire network.
+
+Detailed statistics for each MPI rank, network node, router, and port can
+additionally be generated by passing ``--lp-io-dir=my-dir``, which makes each
+LP write its statistics to files in the given directory.
+
+Statistics Reported by LP-IO
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``dragonfly-msg-stats`` has the number of hops, packet latency, packets
+sent/received, and link saturation time reported for each network node.
+
+``dragonfly-router-stats`` has the link saturation time for each router port.
+
+``dragonfly-router-traffic`` has the traffic sent for each router port.
+
+Fat tree and slim fly networks have similar statistics files.
+
+``mpi-replay-stats`` (generated for any network model) has the bytes
+sent/received per MPI process, the time spent in communication per
+MPI process, and the number of sends and receives per MPI process.
\ No newline at end of file
diff --git a/docs/tutorial/simulation_basics.rst b/docs/tutorial/simulation_basics.rst
new file mode 100644
index 0000000..b80e5a6
--- /dev/null
+++ b/docs/tutorial/simulation_basics.rst
@@ -0,0 +1,131 @@
+.. _tutorial-simulation-basics:
+
+Four Steps to Simulations
+=========================
+
+Creating a network simulation can be broken down into four steps:
+
+.. contents::
+   :depth: 1
+   :local:
+
+1. Prototype the system design
+------------------------------
+
+An overview of setup using network parameters was given
+in the :ref:`tutorial-network-models` guide.
+
+2. Workload selection
+---------------------
+
+There are two types of workloads that can be used in a simulation:
+synthetic workloads and HPC application traces.
+
+Synthetic Workloads
+^^^^^^^^^^^^^^^^^^^
+
+Synthetic workloads follow specific communication patterns with a
+constant injection rate.
+Often they are used to stress the network
+topology to identify best- and worst-case performance. Examples of
+synthetic workloads include uniform random, all to all, bisection
+pairing, and bit permutation. These workloads do not require simulation
+of MPI operations. They can also be used to generate background traffic,
+simulating the interference an application trace would experience on a
+production HPC system where a significant fraction of the network nodes
+is occupied by other jobs.
+
+**Uniform Random**: A network node is equally likely to send to any other
+network node (traffic is distributed throughout the network).
+
+**All to All**: Each network node communicates with all other network nodes.
+
+**Nearest Neighbor**: A network node communicates with nearby network nodes
+(i.e., those reachable in a minimal number of hops).
+
+**Permutation Traffic**: Each source node sends all of its traffic to a single
+destination based on a permutation matrix.
+
+**Bisection Pairing**: Node 0 communicates with Node 'n', Node 1 with 'n-1',
+and so on.
+
+HPC Application Traces
+^^^^^^^^^^^^^^^^^^^^^^
+
+Application traces are captured by running an MPI program. They are
+useful for network performance prediction of production HPC applications.
+Trace sizes can be large for long-running or communication-intensive
+applications, but traces have the potential to capture computation-communication
+interplay. These workloads require accurate simulation of MPI operations, and
+simulation results can be complex to analyze.
+
+3. Workload creation
+--------------------
+
+A workload can be created by capturing application traces from
+running an MPI program. Options for capturing a trace include
+using DUMPI, :ref:`userguide-score-p`, and :ref:`userguide-bigsim`.
+
+Information in a Typical Trace
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A typical trace captured for an MPI program (e.g., in DUMPI, OTF2, or
+BigSim format) records the operations that occur at different times,
+along with the critical information for each operation.
+The table below gives an example of a typical trace.
+
+=========================== ================ =========================================================
+Time stamp, t (rounded off) Operation type   Operation data (only critical information is highlighted)
+=========================== ================ =========================================================
+t = 10                      MPI_Bcast        root, size of bcast, communicator
+t = 10.5                    MPI_Irecv        source, tag, communicator, req ID
+t = 10.51                   user_computation optional region name - "boundary updates"
+t = 12.51                   MPI_Isend        dest, tag, communicator, req ID
+t = 12.53                   user_computation optional region name - "core updates"
+t = 22.53                   MPI_Waitall      req IDs
+t = 25                      MPI_Barrier      communicator
+=========================== ================ =========================================================
+
+Effect of Replaying Traces
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As shown in the table below, replaying a trace can produce results that
+differ from the original run, because a different configuration makes
+operations take more or less time. In the first and second-to-last table
+entries, the MPI_Bcast and MPI_Waitall operations are faster in the
+replayed trace, so subsequent operations happen at earlier times than
+when the trace was captured.
+
+==================== ================= =============== ============ ================
+Original time stamps Original duration New time stamps New duration Operation type
+==================== ================= =============== ============ ================
+10                   0.5               10              0.2          MPI_Bcast
+10.5                 0.01              10.2            0.01         MPI_Irecv
+10.51                2                 10.21           2            user_computation
+12.51                0.02              12.21           0.02         MPI_Isend
+12.53                10                12.23           10           user_computation
+22.53                2.47              22.23           0.03         MPI_Waitall
+25                   1                 22.26           1.7          MPI_Barrier
+==================== ================= =============== ============ ================
+
+In addition to the effect of the network configuration, different trace
+formats may produce different results.
+
+As an example, DUMPI traces store all the information passed to MPI calls. The
+simulation then decides which request to fulfill, allowing accurate resolution
+for the target system. If the control flow of the program can change
+significantly due to the ordering of operations, such simulations are not
+entirely correct.
+
+On the other hand, OTF2 traces store only the information that is used (e.g., which
+request was satisfied). This accurately mimics the control flow of the trace
+run, but does not accurately represent execution for the target system.
+
+These differences are artifacts of leveraging existing tools not originally
+intended for Parallel Discrete Event Simulation (PDES).
+
+4. Execution
+------------
+
+The user guide :ref:`userguide-quickstart` section shows the
+arguments taken by TraceR and some of the options available
+to control execution of a simulation.
diff --git a/docs/tutorial/workflow.rst b/docs/tutorial/workflow.rst
new file mode 100644
index 0000000..902c932
--- /dev/null
+++ b/docs/tutorial/workflow.rst
@@ -0,0 +1,40 @@
+.. _tutorial-workflow:
+
+Expected Workflow
+=================
+
+This guide will walk you through the expected workflow for using TraceR.
+It will direct you to resources on generating BigSim and OTF2 traces, the
+format of a TraceR configuration file, and a basic command for running a
+simulation.
+
+TraceR is a replay tool that simulates the control flow of applications on
+prototype systems: if the control flow of an application, which includes its
+expected computation tasks, communication routines, and their dependencies, is
+provided to TraceR, it will mimic that flow on a hypothetical system with a given
+compute and communication capability. As of now, the control flow is captured
+either by emulating applications using BigSim or by linking with Score-P. CODES
+is used for simulating the communication on the network.
+
+1. Write an MPI application. Included in the TraceR repository are
+   two examples: jacobi2d-bigsim and stencil4d-otf. The jacobi2d-bigsim example
+   shows how a program would be compiled to generate BigSim traces, and the
+   stencil4d-otf example shows how to compile a program for generating OTF2 traces.
+
+   .. note::
+      If you're using BigSim, avoid global variables in your MPI application so that it can be run with virtualization.
+
+2. Generate traces. For instructions on generating OTF2 traces, see the user guide
+   section on using :ref:`userguide-score-p`; for BigSim traces, see the section in
+   the user guide about :ref:`userguide-bigsim`.
+
+3. After generating traces, two files are needed: a TraceR config file and a CODES config file.
+   Optionally, mapping files can also be provided.
See :ref:`userguide-tracer-config-file`, :ref:`userguide-codes-config-file`, + and :ref:`userguide-job-placement-file` in the user guide for instructions on creating the files. + +4. Run the simulation using ``mpirun``. For details on options available, see the + :ref:`userguide-quickstart` section in the user guide. This command will + run a simulation in optimistic mode:: + + mpirun -np
<num procs>
../traceR --sync=3 -- diff --git a/docs/userguide.rst b/docs/userguide.rst index 6baef74..92827a4 100644 --- a/docs/userguide.rst +++ b/docs/userguide.rst @@ -4,6 +4,8 @@ User Guide Below, we provide detailed instructions for how to start doing network simulations using TraceR. +.. _userguide-quickstart: + Quickstart ---------- @@ -20,242 +22,25 @@ Some useful options to use with TraceR: --max-opt-lookahead leash on optimistic execution in nanoseconds (1 microsecond is a good value) --timer-frequency frequency with which PE0 should print current virtual time -Creating a TraceR configuration file ------------------------------------- - -This is the format for the TraceR config file:: - - - - - - ... - - - -If you do not intend to create global or per-job map files, you can use ``NA`` -instead of them. - -Sample TraceR config files can be found in examples/jacobi2d-bigsim/tracer_config (BigSim) or examples/stencil4d-otf/tracer_config (OTF) - -See `Creating the job placement file`_ below for how to generate global or per-job map files. - -Creating the network (CODES) configuration file ------------------------------------------------ -Sample network configuration files can be found in examples/conf - -Additional documentation on the format of the CODES config file can be found in the -CODES wiki at https://xgitlab.cels.anl.gov/codes/codes/wikis/home - -A brief summary of the format follows. - -LPGROUPS, MODELNET_GRP, PARAMS are keywords and should be used as is. - -MODELNET_GRP:: - - repetition = number of routers that have nodes connecting to them. - - server = number of MPI processes/cores per router - - modelnet_* = number of NICs. For torus, this value has to be 1; for dragonfly, - it should be router radix divided by 4; for the fat-tree, it should be router - radix divided by 2. For the dragonfly network, modelnet_dragonfly_router should - also be specified (as 1). For express mesh, modelnet_express_mesh_router should - also be specified as 1. - - Similarly, the fat-tree config file requires specifying fattree_switch which - can be 2 or 3, depending on the number of levels in the fat-tree. Note that the - total number of cores specified in the CODES config file can be greater than - the number of MPI processes being simulated (specified in the tracer config - file). - -Other common parameters:: - - packet_size/chunk_size (both should have the same value) = size of the packets - created by NIC for transmission on the network. Smaller the packet size, longer - the time for which simulation will run (in real time). Larger the packet size, - the less accurate the predictions are expected to be (in virtual time). Packet - sizes of 512 bytes to 4096 bytes are commonly used. +Setting up a Simulation +----------------------- - modelnet_order = torus/dragonfly/fattree/slimfly/express_mesh +.. _userguide-tracer-config-file: +.. include:: userguide/tracer-config-file.rst - modelnet_scheduler = - fcfs: packetize messages one by one. - round-robin: packetize message in a round robin manner. +See :ref:`userguide-job-placement-file` below for how to generate global or per-job map files. - message_size = PDES parameter (keep constant at 512) +.. _userguide-codes-config-file: +.. 
include:: userguide/codes-config-file.rst - router_delay = delay at each router for packet transmission (in nanoseconds) - - soft_delay = delay caused by software stack such as that of MPI (in nanoseconds) - - link_bandwidth = bandwidth of each link in the system (in GB/s) - - cn_bandwidth = bandwidth of connection between NIC and router (in GB/s) - - buffer_size/vc_size = size of channels used to store transient packets at routers (in - bytes). Typical value is 64*packet_size. - - routing = how are packets being routed. Options depend on the network. - torus: static/adaptive - dragonfly: minimal/nonminimal/adaptive - fat-tree: adaptive/static - -Network specific parameters:: - - Torus: - n_dims = number of dimensions in the torus - dim_length = length of each dimension - - Dragonfly: - num_routers = number of routers within a group. - global_bandwidth = bandwidth of the links that connect groups. - - Fat-tree: - ft_type = always choose 1 - num_levels = number of levels in the fat-tree (2 or 3) - switch_radix = radix of the switch being used - switch_count = number of switches at leaf level. - -Creating the job placement file -------------------------------- - -See the README in utils for instructions on using the tools to generate the global and job mapping files. +.. _userguide-job-placement-file: +.. include:: userguide/job-placement-file.rst Generating Traces ----------------- -Score-P -^^^^^^^ - -Installation of Score-P -""""""""""""""""""""""" - -1. Download from http://www.vi-hps.org/projects/score-p/ -#. tar -xvzf scorep-3.0.tar.gz -#. cd scorep-3.0 -#. CC=mpicc CFLAGS="-O2" CXX=mpicxx CXXFLAGS="-O2" FC=mpif77 ./configure --without-gui --prefix= -#. make -#. make install - -Generating OTF2 traces with an MPI program using Score-P -"""""""""""""""""""""""""""""""""""""""""""""""""""""""" - -Detailed instructions are available at https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf. - -1. Add $SCOREP_INSTALL/bin to your PATH for convenience. Example:: - - export SCOREP_INSTALL=$HOME/workspace/scoreP/scorep-3.0/install - export PATH=$SCOREP_INSTALL/bin:$PATH - -2. Add the following compile time flags to the application:: - - -I$SCOREP_INSTALL/include -I$SCOREP_INSTALL/include/scorep -DSCOREP_USER_ENABLE - -3. Add #include to all files where you plan to add any of the following Score-P calls (optional step):: - - SCOREP_RECORDING_OFF(); - stop recording - SCOREP_RECORDING_ON(); - start recording - - Marking special regions: SCOREP_USER_REGION_BY_NAME_BEGIN(regionname, SCOREP_USER_REGION_TYPE_COMMON) and SCOREP_USER_REGION_BY_NAME_END(regionname). - - Region names beginning with TRACER_WallTime\_ are special: using TRACER_WallTime_ prints current time during simulation with tag . - - An example using these features is given below: - - .. literalinclude:: code-examples/scorep_user_calls.c - :language: c - -4. For the link step, prefix the linker line with the following:: - - LD = scorep --user --nocompiler --noopenmp --nopomp --nocuda --noopenacc --noopencl --nomemory - -5. For running, set:: - - export SCOREP_ENABLE_TRACING=1 - export SCOREP_ENABLE_PROFILING=0 - export SCOREP_REDUCE_PROBE_TEST=1 - export SCOREP_MPI_ENABLE_GROUPS=ENV,P2P,COLL,XNONBLOCK - - If Score-P prints a warning about flushing traces during the run, you may avoid them using:: - - export SCOREP_TOTAL_MEMORY=256M - export SCOREP_EXPERIMENT_DIRECTORY=/p/lscratchd//... - -6. Run the binary and traces should be generated in a folder named scorep-\*. 
- -BigSim -^^^^^^ - -Installation of BigSim -"""""""""""""""""""""" - -Compile BigSim/Charm++ for emulation (see http://charm.cs.illinois.edu/manuals/html/bigsim/manual-1p.html -for more detail). Use any one of the following commands: - -- To use UDP as BigSim/Charm++'s communication layer:: - - ./build bgampi net-linux-x86_64 bigemulator --with-production --enable-tracing - ./build bgampi net-darwin-x86_64 bigemulator --with-production --enable-tracing - - Or explicitly provide the compiler optimization level:: - - ./build bgampi net-linux-x86_64 bigemulator -O2 - -- To use MPI as BigSim/Charm++'s communication layer:: - - ./build bgampi mpi-linux-x86_64 bigemulator --with-production --enable-tracing - -.. note:: - This build is used to compile MPI applications so that traces can be - generated. Hence, the communication layer used by BigSim/Charm++ is not - important. During simulation, the communication will be replayed using the - network simulator from CODES. However, the computation time captured here can be - important if it is not being explicitly replaced at simulation time using - configuration options. So using appropriate compiler flags is important. - -Generating AMPI traces with an MPI program using BigSim -""""""""""""""""""""""""""""""""""""""""""""""""""""""" - -1. Compile your MPI application using BigSim/Charm++. - - Example commands:: - - $CHARM_DIR/bin/ampicc -O2 simplePrg.c -o simplePrg_c - $CHARM_DIR/bin/ampiCC -O2 simplePrg.cc -o simplePrg_cxx - -2. Emulation to generate traces. When the binary generated is run, - BigSim/Charm++ runs the program on the allocated cores as if it were - running as usual. Users should provide a few additional arguments to - specify the number of MPI processes in the prototype systems. - - If using UDP as the BigSim/Charm++'s communication layer:: - - ./charmrun +p ++nodelist ./pgm +vp +x +y +z +bglog - - If using MPI as the BigSim/Charm++'s communication layer:: - - mpirun -n ./pgm +vp +x +y +z +bglog - - Number of real processes is typically equal to the number cores the emulation - is being run on. - - *machine file* is the list of systems the emulation should be run on (similar to - machine file for MPI; refer to Charm++ website for more details). - - *vp* is the number of MPI ranks that are to be emulated. For simple tests, it can - be the same as the number of real processes, in which case one MPI rank is run on - each real process (as it happens when a regular program is run). When the - number of vp (virtual processes) is higher, BigSim launches user level threads - to execute multiple MPI ranks within a process. - - *+x +y +z* defines a 3D grid of the virtual processes. The product of these three - dimensions must match the number of vp's. These arguments do not have any - effect on the emulation, but exist due to historical reasons. - - *+bglog* instructs bigsim to write the logs to files. +.. _userguide-score-p: +.. include:: userguide/score-p.rst -3. When this run is finished, you should see many files named *bgTrace\** in the - directory. The total number of such files equals the number of real processes - plus one. Their names are bgTrace, bgTrace0, bgTrace1, and so on. - Create a new folder and move all *bgTrace* files to that folder. +.. _userguide-bigsim: +.. 
include:: userguide/bigsim.rst
diff --git a/docs/userguide/bigsim.rst b/docs/userguide/bigsim.rst
new file mode 100644
index 0000000..70af1e3
--- /dev/null
+++ b/docs/userguide/bigsim.rst
@@ -0,0 +1,75 @@
+BigSim
+^^^^^^
+
+Installation of BigSim
+""""""""""""""""""""""
+
+Compile BigSim/Charm++ for emulation (see the `BigSim manual <http://charm.cs.illinois.edu/manuals/html/bigsim/manual-1p.html>`_
+for more detail). Use any one of the following commands:
+
+- To use UDP as BigSim/Charm++'s communication layer::
+
+    ./build bgampi net-linux-x86_64 bigemulator --with-production --enable-tracing
+    ./build bgampi net-darwin-x86_64 bigemulator --with-production --enable-tracing
+
+  Or explicitly provide the compiler optimization level::
+
+    ./build bgampi net-linux-x86_64 bigemulator -O2
+
+- To use MPI as BigSim/Charm++'s communication layer::
+
+    ./build bgampi mpi-linux-x86_64 bigemulator --with-production --enable-tracing
+
+.. note::
+   This build is used to compile MPI applications so that traces can be
+   generated. Hence, the communication layer used by BigSim/Charm++ is not
+   important. During simulation, the communication will be replayed using the
+   network simulator from CODES. However, the computation time captured here can be
+   important if it is not being explicitly replaced at simulation time using
+   configuration options. So using appropriate compiler flags is important.
+
+Generating AMPI traces with an MPI program using BigSim
+"""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+1. Compile your MPI application using BigSim/Charm++.
+
+   Example commands::
+
+    $CHARM_DIR/bin/ampicc -O2 simplePrg.c -o simplePrg_c
+    $CHARM_DIR/bin/ampiCC -O2 simplePrg.cc -o simplePrg_cxx
+
+2. Run the emulation to generate traces. When the generated binary is run,
+   BigSim/Charm++ runs the program on the allocated cores as if it were
+   running as usual. Users should provide a few additional arguments to
+   specify the number of MPI processes in the prototype system.
+
+   If using UDP as BigSim/Charm++'s communication layer::
+
+    ./charmrun +p <p> ++nodelist <machine file> ./pgm +vp <vp> +x <x> +y <y> +z <z> +bglog
+
+   If using MPI as BigSim/Charm++'s communication layer::
+
+    mpirun -n <p> ./pgm +vp <vp> +x <x> +y <y> +z <z> +bglog
+
+   The number of real processes is typically equal to the number of cores the
+   emulation is being run on.
+
+   *machine file* is the list of systems the emulation should be run on (similar to
+   a machine file for MPI; refer to the Charm++ website for more details).
+
+   *vp* is the number of MPI ranks that are to be emulated. For simple tests, it can
+   be the same as the number of real processes, in which case one MPI rank is run on
+   each real process (as it happens when a regular program is run). When the
+   number of vp (virtual processes) is higher, BigSim launches user-level threads
+   to execute multiple MPI ranks within a process.
+
+   *+x +y +z* defines a 3D grid of the virtual processes. The product of these three
+   dimensions must match the number of vp's. These arguments do not have any
+   effect on the emulation, but exist for historical reasons.
+
+   *+bglog* instructs BigSim to write the logs to files.
+
+3. When this run is finished, you should see many files named *bgTrace\** in the
+   directory. The total number of such files equals the number of real processes
+   plus one. Their names are bgTrace, bgTrace0, bgTrace1, and so on.
+   Create a new folder and move all *bgTrace* files to that folder.
\ No newline at end of file
diff --git a/docs/userguide/codes-config-file.rst b/docs/userguide/codes-config-file.rst
new file mode 100644
index 0000000..2447d8e
--- /dev/null
+++ b/docs/userguide/codes-config-file.rst
@@ -0,0 +1,110 @@
+Creating the network (CODES) configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sample network configuration files can be found in examples/conf
+
+Additional documentation on the format of the CODES config file can be found in the
+CODES wiki at https://github.com/codes-org/codes/wiki
+
+A brief summary of the format follows.
+
+LPGROUPS, MODELNET_GRP, PARAMS are keywords and should be used as is.
+
+MODELNET_GRP
+""""""""""""
+repetitions
+    number of routers that have nodes connecting to them.
+
+server
+    number of MPI processes/cores per router
+
+modelnet_*
+    number of NICs. For torus, this value has to be 1; for dragonfly,
+    it should be the router radix divided by 4; for the fat-tree, it should be the
+    router radix divided by 2. For the dragonfly network, modelnet_dragonfly_router should
+    also be specified (as 1). For express mesh, modelnet_express_mesh_router should
+    also be specified as 1.
+
+    Similarly, the fat-tree config file requires specifying fattree_switch, which
+    can be 2 or 3, depending on the number of levels in the fat-tree. Note that the
+    total number of cores specified in the CODES config file can be greater than
+    the number of MPI processes being simulated (specified in the tracer config
+    file).
+
+Common parameters (PARAMS)
+""""""""""""""""""""""""""
+
+packet_size/chunk_size (both should have the same value)
+    size of the packets created by the NIC for transmission on the network. The
+    smaller the packet size, the longer the simulation runs (in real time). The
+    larger the packet size, the less accurate the predictions are expected to be
+    (in virtual time). Packet sizes of 512 bytes to 4096 bytes are commonly used.
+
+modelnet_order
+    torus/dragonfly/fattree/slimfly/express_mesh
+
+modelnet_scheduler
+    fcfs: packetize messages one by one.
+
+    round-robin: packetize messages in a round-robin manner.
+
+message_size
+    PDES parameter (keep constant at 512)
+
+router_delay
+    delay at each router for packet transmission (in nanoseconds)
+
+soft_delay
+    delay caused by the software stack, such as that of MPI (in nanoseconds)
+
+link_bandwidth
+    bandwidth of each link in the system (in GB/s)
+
+cn_bandwidth
+    bandwidth of the connection between NIC and router (in GB/s)
+
+buffer_size/vc_size
+    size of the channels used to store transient packets at routers (in
+    bytes). A typical value is 64*packet_size.
+
+routing
+    how packets are routed. Options depend on the network.
+
+    torus: static/adaptive
+
+    dragonfly: minimal/nonminimal/adaptive
+
+    fat-tree: adaptive/static
+
+Network specific parameters (PARAMS)
+""""""""""""""""""""""""""""""""""""
+
+Torus:
+
+    n_dims
+        number of dimensions in the torus
+
+    dim_length
+        length of each dimension
+
+Dragonfly:
+
+    num_routers
+        number of routers within a group.
+
+    global_bandwidth
+        bandwidth of the links that connect groups.
+
+Fat-tree:
+
+    ft_type
+        always choose 1
+
+    num_levels
+        number of levels in the fat-tree (2 or 3)
+
+    switch_radix
+        radix of the switch being used
+
+    switch_count
+        number of switches at the leaf level.
\ No newline at end of file
diff --git a/docs/userguide/job-placement-file.rst b/docs/userguide/job-placement-file.rst
new file mode 100644
index 0000000..debcbac
--- /dev/null
+++ b/docs/userguide/job-placement-file.rst
@@ -0,0 +1,106 @@
+Creating the job placement file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Ranking basics
+""""""""""""""
+
+TraceR requires two sets of mapping files (with somewhat redundant information).
+Both types of files provide information about the mapping of global ranks to jobs and
+their local ranks. The global rank of a server/core is the logical rank that server LPs
+get inside CODES. It increases linearly across the servers/cores connected to one switch
+after another. Due to the way the default server-to-node mapping works within CODES, if
+more than one node is connected to a switch, servers/cores are distributed in a
+cyclic manner.
+
+Consider this example config file::
+
+   MODELNET_GRP
+   {
+      repetitions="8";
+      server="4";
+      modelnet_dragonfly="4";
+      modelnet_dragonfly_router="1";
+   }
+
+Servers residing in nodes connected to the first router get global ranks 0-3,
+nodes connected to the second router get global ranks 4-7, and so on.
+
+Now, consider another case::
+
+   MODELNET_GRP
+   {
+      repetitions="8";
+      server="8";
+      modelnet_dragonfly="4";
+      modelnet_dragonfly_router="1";
+   }
+
+Servers residing in nodes connected to the first router get global ranks 0-7,
+nodes connected to the second router get global ranks 8-15, and so on. However,
+there are 8 servers but only 4 nodes, so each node hosts 2 servers. The servers
+are distributed in a cyclic manner within a router, i.e. in router 0, server 0
+is on node 0, 1 is on node 1, 2 is on node 2, 3 is on node 3, 4 is on node 0, 5
+is on node 1, 6 is on node 2, and 7 is on node 3. A similar cyclic distribution is
+done within every switch.
+
+Map file requirements
+"""""""""""""""""""""
+
+Map files are divided into two sets: global map files and individual job files.
+The global file specifies how the global ranks are mapped to individual jobs and
+ranks within those jobs. It is a binary file structured as sets of 3 integers,
+<global rank, local rank, job id>. A typical write routine looks like this
+(a complete, compilable example is given after the list of job mappers below):
+
+.. code::
+
+    for(...)
+      fwrite(&global_rank, sizeof(int), 1, binout);
+      fwrite(&local_rank, sizeof(int), 1, binout);
+      fwrite(&jobid, sizeof(int), 1, binout);
+    endfor
+
+For each job, individual job map files are needed. A map file for a job is also a
+binary file filled with a series of global ranks. The global ranks are ordered by
+using the local ranks as the key. So, if the series of integers is loaded into an
+array called local_to_global, local_to_global[i] will contain the global rank of
+local rank i.
+
+Job mappers
+"""""""""""
+
+In the utils subfolder of the TraceR repository, there are several job mappers
+written in C that can be used to generate job map files with various layouts.
+Eventually these will likely be rewritten as a Python script. A brief summary
+of the provided generators follows.
+
+def_lin_mapping.C
+    Generates a linear mapping, which is also the default mapping
+    when no mapping is specified. If nodes per router is more than 1, then this
+    mapping will spread the ranks in a round-robin fashion among the nodes.
+
+node_mapping.C
+    Generates a mapping that always places servers with contiguous
+    global ranks on a node. That is, if there are 2 servers per node, ranks 0-1 are
+    on node 0, ranks 2-3 are on node 1, and so on.
+
+multi_job.C
+    Various router-based mapping schemes.
+
+many_job.C
+    Various node-based mapping schemes.
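+
+As a self-contained illustration of the record format above (a hypothetical
+helper for a single job, not one of the bundled mappers, which remain the
+supported tools), the following C program writes a linear global map:
+
+.. code:: c
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    /* Write <global rank, local rank, job id> records for one job of
+       `nranks` ranks, mapped linearly (global rank == local rank). */
+    int main(int argc, char **argv)
+    {
+        if (argc != 3) {
+            fprintf(stderr, "usage: %s <global map file> <num ranks>\n", argv[0]);
+            return 1;
+        }
+        int nranks = atoi(argv[2]);
+        FILE *binout = fopen(argv[1], "wb");
+        if (!binout) { perror("fopen"); return 1; }
+        for (int r = 0; r < nranks; r++) {
+            int global_rank = r, local_rank = r, jobid = 0;
+            fwrite(&global_rank, sizeof(int), 1, binout);
+            fwrite(&local_rank, sizeof(int), 1, binout);
+            fwrite(&jobid, sizeof(int), 1, binout);
+        }
+        fclose(binout);
+        return 0;
+    }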
+
+Commands for execution
+""""""""""""""""""""""
+
+./def_lin_mapping <global map file> <ranks in job0> <ranks in job1> ...
+
+./node_mapping <global map file> <ranks in job0> <ranks in job1> ... [optional argument]
+
+The output from these commands will be a global map file, and job{0,1..} files in binary format.
+
+Example::
+
+    ./def_lin_mapping global.bin 32 32 64
+
+The above command generates global.bin with 128 ranks, where the first 32 are mapped to job0,
+the next 32 to job1, and the last 64 to job2. It also generates files job0, job1, and job2 that
+map ranks from these jobs to global ranks.
\ No newline at end of file
diff --git a/docs/userguide/score-p.rst b/docs/userguide/score-p.rst
new file mode 100644
index 0000000..e37f48f
--- /dev/null
+++ b/docs/userguide/score-p.rst
@@ -0,0 +1,62 @@
+Score-P
+^^^^^^^
+
+Installation of Score-P
+"""""""""""""""""""""""
+
+1. Download from http://www.vi-hps.org/projects/score-p/
+#. tar -xvzf scorep-5.0.tar.gz
+#. cd scorep-5.0
+#. CC=mpicc CFLAGS="-O2" CXX=mpicxx CXXFLAGS="-O2" FC=mpif77 ./configure --without-gui --prefix=<install dir>
+#. make
+#. make install
+
+Generating OTF2 traces with an MPI program using Score-P
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Detailed instructions are available at https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf.
+
+1. Add $SCOREP_INSTALL/bin to your PATH for convenience. Example::
+
+    export SCOREP_INSTALL=$HOME/workspace/scoreP/scorep-5.0/install
+    export PATH=$SCOREP_INSTALL/bin:$PATH
+
+2. Add the following compile time flags to the application::
+
+    -I$SCOREP_INSTALL/include -I$SCOREP_INSTALL/include/scorep -DSCOREP_USER_ENABLE
+
+3. Add #include <scorep/SCOREP_User.h> to all files where you plan to add any of the following Score-P calls (optional step)::
+
+    SCOREP_RECORDING_OFF(); - stop recording
+    SCOREP_RECORDING_ON(); - start recording
+
+   Marking special regions: SCOREP_USER_REGION_BY_NAME_BEGIN(regionname, SCOREP_USER_REGION_TYPE_COMMON) and SCOREP_USER_REGION_BY_NAME_END(regionname).
+
+   Region names beginning with TRACER_WallTime\_ are special: using TRACER_WallTime_<name> prints the current time during simulation with the tag <name>.
+
+   An example using these features is given below:
+
+   .. literalinclude:: code-examples/scorep_user_calls.c
+      :language: c
+
+4. For the link step, prefix the linker line with the following::
+
+    LD = scorep --user --nocompiler --noopenmp --nopomp --nocuda --noopenacc --noopencl --nomemory
+
+5. For running, set::
+
+    export SCOREP_ENABLE_TRACING=1
+    export SCOREP_ENABLE_PROFILING=0
+    export SCOREP_MPI_ENABLE_GROUPS=ENV,P2P,COLL,XNONBLOCK
+
+   If Score-P prints a warning about flushing traces during the run, you may avoid it using::
+
+    export SCOREP_TOTAL_MEMORY=256M
+    export SCOREP_EXPERIMENT_DIRECTORY=/p/lscratchd//...
+
+   .. note::
+      For larger simulations, performance can degrade. There is a :download:`patch for Score-P 5.0 ` that
+      adds an option to reduce the number of MPI probes. After applying the patch, it can be enabled like the other Score-P
+      options with ``export SCOREP_REDUCE_PROBE_TEST=1``.
+
+6. Run the binary, and traces should be generated in a folder named scorep-\*.
\ No newline at end of file
diff --git a/docs/userguide/tracer-config-file.rst b/docs/userguide/tracer-config-file.rst
new file mode 100644
index 0000000..c67c9f3
--- /dev/null
+++ b/docs/userguide/tracer-config-file.rst
@@ -0,0 +1,17 @@
+Creating a TraceR configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This is the format for the TraceR config file::
+
+    <global map file>
+    <num jobs>
+    <trace folder for job0> <map file for job0> <num ranks in job0> <iterations for job0>
+    <trace folder for job1> <map file for job1> <num ranks in job1> <iterations for job1>
+    ...
+
+If you do not intend to create global or per-job map files, you can use ``NA``
+instead of them.
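+
+For illustration, a hypothetical two-job configuration (the file and folder
+names below are made up; the layout follows the format above, with 1 iteration
+per job for default-mode runs) might look like::
+
+    global.bin
+    2
+    ../traces/job0 job0 32 1
+    ../traces/job1 job1 32 1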
+ +Sample TraceR config files can be found in examples/jacobi2d-bigsim/tracer_config (BigSim) or examples/stencil4d-otf/tracer_config (OTF) \ No newline at end of file diff --git a/utils/README b/utils/README deleted file mode 100644 index aad1306..0000000 --- a/utils/README +++ /dev/null @@ -1,90 +0,0 @@ -Ranking basics: ---------------------- -TraceR requires two sets of mapping files (with some what redundant information). -Both types files provide information about mapping of global rank to jobs and -their local rank. Global rank of a server/core is simply the logical rank that -server LPs get inside CODES. It increases linearly from servers/cores connected -to one switch to another. Due to the way default server to node mapping works -within CODES, if more than one node is connected to a switch, server/cores are -distributed in a cyclic manner. - -Example: Consider the following config file -MODELNET_GRP -{ - repetitions="8 - server="4"; - modelnet_dragonfly="4"; - modelnet_dragonfly_router="1"; -} - -Servers residing in nodes connected to the first router gets global rank 0-3, -second router gets global rank 4-7, and so on. - -Now consider this case: -MODELNET_GRP -{ - repetitions="8 - server="8"; - modelnet_dragonfly="4"; - modelnet_dragonfly_router="1"; -} - -Servers residing in nodes connected to the first router gets global rank 0-7, -second router gets global rank 8-15, and so on. However, there are 8 servers -but only 4 nodes, so each node hosts 2 servers. The servers are distributed in -a cyclic manner within a router, i.e. in router 0, server 0 is on node 0, 1 is -on node 1, 2 is on node 2, 3 is node 3, 4 is on node 0, 5 is on node 1, 6 is on -node 2, and 7 is on node 3. Similar cyclic distribution is done within every -switch. - -Map file requirements: ---------------------- -Map files are divided into two sets: global map file and individual job files. -The global file specifies how the global rank are mapped to individual jobs and -ranks within those jobs. It is a binary file structured as sets of 3 integers: - . Typical write routine look like: - -for(....) - fwrite(&global_rank, sizeof(int), 1, binout); - fwrite(&local_rank, sizeof(int), 1, binout); - fwrite(&jobid, sizeof(int), 1, binout); -endfor - -For each job, individual job map files are needed. A map file for a job is also a -binary file filled with a series of global ranks. The global ranks are ordered -by using the local ranks as the key. So, if the series of integers is loaded -into an array called local_to_global, local_to_global[i] will contain the global -rank of local rank i. - -Note for author: Eliminate individual job map files and make life easier for -users. - -Job mappers ------------------- -def_lin_mapping.C : generate linear mapping which is also the default mapping -when no mapping is specified. If nodes per router is more than 1, then this -mapping will spread the ranks in a round-robin fashion among the nodes. - -node_mapping.C : generates mapping that always places server with contiguous -global ranks on a node. That, if there 2 servers per node, ranks 0-1 are on node -0, ranks 2-3 are on node 1, and so on. - -multi_job.C : Router based various schemes for mapping. -many_job.C : Nodes based various schemes for mapping. 
- -Commands for execution ----------------------- -./def_lin_mapping -./node_mapping [optional ] - -Output - - in binary format -job{0,1..} files in binary format - -Example: -./def_lin_mapping global.bin 32 32 64 - -generates global.bin with 128 ranks, where first 32 are mapped to job0, next 32 -to job1, and last 64 to job2. Also generates job0, job1, job2 that maps ranks -from these jobs to global ranks. -