diff --git a/.travis.yml b/.travis.yml index 64857b9..9ce063a 100644 --- a/.travis.yml +++ b/.travis.yml @@ -20,7 +20,7 @@ install: popd # Install CODES - | - git clone https://xgitlab.cels.anl.gov/codes/codes.git ${TRAVIS_BUILD_DIR}/ci-build-deps/CODES + git clone https://github.com/codes-org/codes.git ${TRAVIS_BUILD_DIR}/ci-build-deps/CODES pushd ${TRAVIS_BUILD_DIR}/ci-build-deps/CODES ./prepare.sh mkdir build diff --git a/docs/UserWriteUp.txt b/docs/UserWriteUp.txt deleted file mode 100644 index 485e559..0000000 --- a/docs/UserWriteUp.txt +++ /dev/null @@ -1,174 +0,0 @@ -This is a work in progress and will eventually be converted to a more readable -format. - -TraceR is a replay tool targeted to simulate control flow of application on -prototype systems, i.e., if control flow of an application, which includes -expected computation tasks, communication routines, and their dependencies, is -provided to TraceR, it will mimic the flow on a hypothetical system with a given -compute and communication capability. As of now, the control flow is captured by -either emulating applications using BigSim or by linking with Score-P. CODES -is used for simulating the communication on the network. - -Expected work flow: - -1) Write an MPI application. (Avoid global variables so that the application be -run with virtualization if using BigSim). - -If using BigSim follows steps 2-4, else follow step 5. -2) Compile BigSim/Charm++ for emulation. Use any one of the following commands: - -- To use UDP as BigSim/Charm++'s communication layer: - ./build bgampi net-linux-x86_64 bigemulator --with-production --enable-tracing - ./build bgampi net-darwin-x86_64 bigemulator --with-production --enable-tracing - - or explicitly provide the compiler optimization level - ./build bgampi net-linux-x86_64 bigemulator -O2 - -- To use MPI as BigSim/Charm++'s communication layer: - ./build bgampi mpi-linux-x86_64 bigemulator --with-production --enable-tracing - -Note that this build is used to compile MPI applications so that traces can be -generated. Hence, the communication layer used by BigSim/Charm++ is not -important. During simulation, the communication will be replayed using the -network simulator from CODES. However, the computation time captured here can be -important if it is not being explicitly replaced at simulation time using -configuration options. So using appropriate compiler flags is important. - -3) Compile the MPI application from Step 1 using BigSim/Charm++ from Step 2. - -Example commands: -$CHARM_DIR/bin/ampicc -O2 simplePrg.c -o simplePrg_c -$CHARM_DIR/bin/ampiCC -O2 simplePrg.cc -o simplePrg_cxx - -4) Emulation to generate traces. When the binary generated in Step 3 is run, -BigSim/Charm++ runs the program on the allocated cores as if it would run in the -usual case. Users should provide a few additional arguments to specify the -number of MPI processes in the prototype systems. - -If using UDP as the BigSim/Charm++'s communication layer: -./charmrun +p ++nodelist ./pgm +vp +x +y +z +bglog - -If using MPI as the BigSim/Charm++'s communication layer: -mpirun -n ./pgm +vp +x +y +z +bglog - -Number of real processes is typically equal to the number cores the emulation -is being run on. - -machine file is the list of systems the emulation should be run on (similar to -machine file for MPI; refer to Charm++ website for more details). - -vp is the number of MPI ranks that are to be emulated. 
For simple tests, it can -be same as the number of real processes, in which case one MPI rank is run on -each real processes (as it happens when a regular program is run). When the -number of vp (virtual processes) is higher, BigSim launches user level threads -to execute multiple MPI ranks with a process. - -+x +y +z defines a 3D grid of the virtual processes. The product of these three -dimensions must match the number of vp's. These arguments do not have any -effect on the emulation, but exist due to historical reasons. - -+bglog instructs bigsim to write the logs to files. - -When this run finished, you should see many files named bgTrace* in the -directory. The total number of such files equals the number of real processes -plus one. Their names are bgTrace, bgTrace0, bgTrace1, so on. - -Create a new folder and move all bgTrace to that folder. - -5) Following instructions in README.OTF to generate OTF2 traces. - -6) Simulation. To run a simulation, 2 files are needed: a tracer config file, -and a codes config file. Optionally, mapping files can also be provided. - -Tracer config file: sample found at examples/jacobi2d-bigsim/tracer_config (BigSim) or examples/stencil4d-otf/tracer_config (OTF) Format (expected content on each line of the file): - - - - -... -``` -If is not needed, use NA for it and . -For generating simple global and job map file, use the code in utils. - -CODES config files: samples in examples/conf - -Additional documentation on format of the CODES config file can be found in the -CODES wiki at https://xgitlab.cels.anl.gov/codes/codes/wikis/home - -Brief summary follows: - -LPGROUPS, MODELNET_GRP, PARAMS are keywords and should be used as is. - -MODELNET_GRP: -repetition = number of routers that have nodes connecting to them. - -server = number of MPI processes/cores per router - -modelnet_* = number of NICs. For torus, this value has to be 1; for dragonfly, -it should be router radix divided by 4; for the fat-tree, it should be router -radix divided by 2. For the dragonfly network, modelnet_dragonfly_router should -also be specified (as 1). For express mesh, modelnet_express_mesh_router should -also be specified as 1. - -Similarly, the fat-tree config file requires specifying fattree_switch which -can be 2 or 3, depending on the number of levels in the fat-tree. Note that the -total number of cores specified in the CODES config file can be greater than -the number of MPI processes being simulated (specified in the tracer config -file). - -Other common parameters: -packet_size/chunk_size (both should have the same value): size of the packets -created by NIC for transmission on the network. Smaller the packet size, longer -the time for which simulation will run (in real time). Larger the packet size, -the less accurate the predictions are expected to be (in virtual time). Packet -sizes of 512 bytes to 4096 bytes are commonly used. - -modelnet_order = torus/dragonfly/fattree/slimfly/express_mesh - -modelnet_scheduler = -fcfs : packetize messages one by one. -round-robin : packetize message in a round robin manner. - -message_size = PDES parameter (keep constant at 512) - -router_delay = delay at each router for packet transmission (in nano seconds) - -soft_delay = delay caused by software stack such as that of MPI (in nano -seconds) - -link_bandwidth = bandwidth of each link in the system (in GB/s) - -cn_bandwidth = bandwidth of connection between NIC and router (in GB/s) - -buffer_size/vc_size = size of channels used to store transient packets at routers (in -bytes). 
Typical value is 64*packet_size. - -routing = how are packets being routed. Options depend on the network: -torus = static/adaptive -dragonfly = minimal/nonminimal/adaptive -fat-tree = adaptive/static - -Network specific parameters: - -Torus: n_dims - number of dimensions in the torus -dim_length - length of each dimension - -Dragonfly: num_routers - number of routers within a group. -global_bandwidth - bandwidth of the links that connect groups. - -Fat-tree: ft_type - always choose 1 -num_levels - number of levels in the fat-tree (2 or 3) -switch_radix - radix of the switch being used -switch_count - number of switches at leaf level. - -Publications that describe implementation of TraceR in detail: -Nikhil Jain, Abhinav Bhatele, Sam White, Todd Gamblin, and Laxmikant Kale. -Evaluating HPC Networks via Simulation of Parallel Workloads. SC 2016. - -Bilge Acun, Nikhil Jain, Abhinav Bhatele, Misbah Mubarak, Christopher Carothers, -Laxmikant Kale. Preliminary Evaluation of a Parallel Trace Replay Tool for HPC -Network Simulations. Workshop on Parallel and Distributed Agent-Based -Simulations at EURO-PAR 2015. - -More details can be found in Chapter 5 of this thesis: -http://charm.cs.illinois.edu/newPapers/16-02/Jain_Thesis.pdf diff --git a/docs/code-examples/scorep_user_calls.c b/docs/code-examples/scorep_user_calls.c index 898b318..24c47db 100644 --- a/docs/code-examples/scorep_user_calls.c +++ b/docs/code-examples/scorep_user_calls.c @@ -6,20 +6,28 @@ int main(int argc, char **argv, char **envp) SCOREP_RECORDING_OFF(); //turn recording off for initialization/regions not of interest ... SCOREP_RECORDING_ON(); + //use verbatim to facilitate looping over the traces in simulation when simulating multiple jobs SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_Loop", SCOREP_USER_REGION_TYPE_COMMON); // at least add this BEGIN timer call - called from only one rank // you can add more calls later with region names TRACER_WallTime_ + if(myRank == 0) - SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_MainLoop", SCOREP_USER_REGION_TYPE_COMMON); + SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_Loop", SCOREP_USER_REGION_TYPE_COMMON); + // Application main work LOOP for ( int itscf = 0; itscf < nitscf_; itscf++ ) { + // time call to mark start of loop iteration + SCOREP_USER_REGION_BY_NAME_BEGIN("TRACER_WallTime_Loop_Iter", SCOREP_USER_REGION_TYPE_COMMON); ... + SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_Loop_Iter"); } + // time call to mark END of work - called from only one rank if(myRank == 0) - SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_MainLoop"); + SCOREP_USER_REGION_BY_NAME_END("TRACER_WallTime_Loop"); + // use verbatim - mark end of trace loop SCOREP_USER_REGION_BY_NAME_END("TRACER_Loop"); SCOREP_RECORDING_OFF();//turn off recording again diff --git a/docs/index.rst b/docs/index.rst index 573d209..90b0d12 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,10 +13,11 @@ Computing applications on interconnection networks. .. toctree:: :maxdepth: 2 - :caption: Contents: + :caption: Contents install userguide + workflow tutorial autogen/doxygen diff --git a/docs/install.rst b/docs/install.rst index 098dd62..d314891 100644 --- a/docs/install.rst +++ b/docs/install.rst @@ -6,7 +6,7 @@ TraceR can be downloaded from `GitHub `_. Dependencies ------------ -TraceR depends on `CODES `_ and `ROSS `_. +TraceR depends on `CODES `_ and `ROSS `_. Build ----- @@ -42,5 +42,5 @@ TraceR supports two different trace formats as input. For each format, you need 2. 
AMPI-based BigSim format: To use BigSim traces as input to TraceR, you need to
 download and build `Charm++ `_. The instructions to build Charm++ are in the `Charm++ manual
-`_. You should use
+`_. You should use
 the "charm++" target and pass "bigemulator" as a build option.
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
index b23b9e5..5379340 100644
--- a/docs/tutorial.rst
+++ b/docs/tutorial.rst
@@ -1,2 +1,35 @@
+.. _tutorial:
+
 Tutorial
 ========
+
+.. rubric:: Slides
+
+.. figure:: tutorial/hoti25-slide-preview.png
+   :target: http://www.hoti.org/tutorials/HOTI25_Tutorial_2c.pdf
+   :height: 72px
+   :align: left
+   :alt: Slide preview
+
+`Download Slides <http://www.hoti.org/tutorials/HOTI25_Tutorial_2c.pdf>`_.
+
+**Full citation:** Nikhil Jain and Misbah Mubarak.
+CODES-TRACER Tutorial: Enabling HPC Design Space
+Exploration via Discrete-Event Simulation.
+Tutorial presented at the 25th Annual Symposium on High Performance
+Interconnects (HOTI). Aug 28, 2017, Santa Clara, CA, USA.
+
+.. rubric:: Guides
+
+These guides cover the basics needed to use TraceR.
+
+ 1. :ref:`tutorial-network-models`
+ 2. :ref:`tutorial-simulation-basics`
+ 3. :ref:`tutorial-workflow`
+
+Full contents:
+
+.. toctree::
+   tutorial/network_models
+   tutorial/simulation_basics
+   tutorial/workflow
\ No newline at end of file
diff --git a/docs/tutorial/hoti25-slide-preview.png b/docs/tutorial/hoti25-slide-preview.png
new file mode 100644
index 0000000..88b8bc1
Binary files /dev/null and b/docs/tutorial/hoti25-slide-preview.png differ
diff --git a/docs/tutorial/network_models.rst b/docs/tutorial/network_models.rst
new file mode 100644
index 0000000..d288af9
--- /dev/null
+++ b/docs/tutorial/network_models.rst
@@ -0,0 +1,393 @@
+.. _tutorial-network-models:
+
+Network Models
+==============
+
+This guide gives an overview of some of the network models
+supported by TraceR, as presented in the HOTI 25 tutorial (slides 22-39).
+For a more detailed guide, see the CODES wiki pages on network
+models at https://github.com/codes-org/codes/wiki/codes-networks.
+Any commands/examples in this section refer to files
+included in the `CODES git repository <https://github.com/codes-org/codes>`_ (not TraceR).
+
+Overview
+--------
+
+Multiple network models are supported, including dragonfly, fat
+tree, express mesh, hyperX, torus, slim fly, and LogP. An abstraction
+layer, ``model-net``, sits on top of the network models; it breaks
+messages into packets and offers FIFO, round robin, and priority
+queues. To try different networks, simply switch the network configuration
+files used when running TraceR. Storage models, MPI simulation, and
+workload replay layers are independent of the underlying network
+model used.
+
+Simplenet
+---------
+
+The Simplenet model uses a latency/bandwidth model where messages are
+sent directly from the source to the destination. It uses infinite
+queueing and is easy to set up: only a startup delay and a link
+bandwidth are needed for configuration. This model is mostly for debugging and
+testing purposes and can be used as a starting point when replaying
+MPI traces. It can serve as a baseline network model with no contention
+and no routing.
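+
+As a rough first-order sketch (our reading of the model, not code from CODES),
+the delivery time for a message of size ``S`` under Simplenet is approximately::
+
+    time(S) = net_startup_ns + S / net_bw_mbps
+
+using the two parameters shown in the configuration below, with no contention
+or queueing terms (mind the units: the startup delay is in nanoseconds and the
+bandwidth in MiB/s).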
+
+Configuring
+^^^^^^^^^^^
+
+Consider this Simplenet configuration file, which can be
+found in *codes/tests/conf/modelnet-test.conf*::
+
+   LPGROUPS
+   {
+      MODELNET_GRP
+      {
+         repetitions="16";
+         server="1";
+         modelnet_simplenet="1";
+      }
+   }
+   PARAMS
+   {
+      packet_size="512";
+      message_size="384";
+      modelnet_order=( "simplenet" );
+      # scheduler options
+      modelnet_scheduler="fcfs";
+      net_startup_ns="1.5";
+      # bandwidth is in MiB/s
+      net_bw_mbps="20000";
+   }
+
+The MODELNET_GRP section is used for mapping entities to
+ROSS MPI processes.
+
+Messages are broken into packets by the ``model-net`` layer,
+with a size that can be set by the ``packet_size`` param.
+
+The ``message_size`` parameter is a ROSS-specific parameter
+that is used to set the event size.
+
+``net_startup_ns`` sets the startup delay in nanoseconds.
+
+``net_bw_mbps`` sets the link bandwidth in MiB/s between nodes.
+There is one link between each pair of nodes.
+
+Running
+^^^^^^^
+
+The model shown above can be run in CODES with::
+
+   ./tests/modelnet-test --sync=1 -- tests/conf/modelnet-test.conf
+
+The command runs a simple test in which a simulated MPI rank
+sends a message to the next rank, which replies back. This
+continues until a certain number of messages is reached.
+
+Dragonfly
+---------
+
+The dragonfly network model has a hierarchy with a set of
+groups connected by all-to-all links. Within a group there
+can be several routers connected with local links, and routers
+can have links to routers in other groups for intergroup
+connections. Routers also have compute nodes connected to
+them. The CODES wiki explains dragonfly networks in much
+greater detail, and the slides from the HOTI 25 tutorial have
+images showing examples of possible dragonfly networks.
+
+Dragonfly networks support minimal, adaptive, non-minimal, and
+progressive adaptive routing. They use packet-based simulation
+with credit-based flow control, and use multiple virtual channels
+for deadlock prevention.
+
+Configuring
+^^^^^^^^^^^
+
+Consider this example configuration that can be found with the
+CODES source, *codes/src/network-workloads/dragonfly-custom*::
+
+   LPGROUPS
+   {
+      MODELNET_GRP
+      {
+         repetitions="2400";
+         # name of this lp changes according to the model
+         nw-lp="4";
+         # these lp names will be the same for dragonfly custom model
+         modelnet_dragonfly_custom="4";
+         modelnet_dragonfly_custom_router="1";
+      }
+   }
+
+``nw-lp`` is a simulated MPI process. For simulating multiple MPI
+processes per node, set this to the number of processes times the
+number of network nodes.
+
+``modelnet_dragonfly_custom`` is a simulated dragonfly network node.
+
+``modelnet_dragonfly_custom_router`` is a simulated dragonfly network router.
+
+Self messages are messages sent to the same network node. The overhead for sending
+self messages can be configured.
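+
+As a worked example of how these counts compose (our arithmetic, derived from
+the MODELNET_GRP above), the simulated system contains::
+
+    routers       = 2400 x 1 = 2400   (modelnet_dragonfly_custom_router="1")
+    network nodes = 2400 x 4 = 9600   (modelnet_dragonfly_custom="4")
+    MPI processes = 2400 x 4 = 9600   (nw-lp="4", one per node)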
+
+Continuing in the same configuration file, look at the PARAMS section::
+
+   PARAMS
+   {
+      # packet size in the network
+      packet_size="4096";
+      modelnet_order=( "dragonfly_custom","dragonfly_custom_router" );
+      # scheduler options
+      modelnet_scheduler="fcfs";
+      # chunk size in the network (when chunk size = packet size, packets will not be
+      # divided into chunks)
+      chunk_size="4096";
+      # number of router rows within each group
+      # this is dictated by the dragonfly configuration files
+      num_router_rows="6";
+      # number of router columns
+      num_router_cols="16";
+      # number of groups in the network
+      num_groups="25";
+      # buffer size in bytes for local virtual channels
+      local_vc_size="8192";
+      # buffer size in bytes for global virtual channels
+      global_vc_size="16384";
+      # buffer size in bytes for compute node virtual channels
+      cn_vc_size="8192";
+      # bandwidth in GiB/s for local channels
+      local_bandwidth="5.25";
+      # bandwidth in GiB/s for global channels
+      global_bandwidth="4.69";
+      # bandwidth in GiB/s for compute node-router channels
+      cn_bandwidth="16.0";
+      # ROSS message size
+      message_size="592";
+      # number of compute nodes connected to router, dictated by dragonfly configuration
+      # file
+      num_cns_per_router="4";
+      # number of global channels per router
+      num_global_channels="4";
+      # network config file for intra-group connections
+      intra-group-connections="../src/network-workloads/conf/dragonfly-custom/intra-9K-custom";
+      # network config file for inter-group connections
+      inter-group-connections="../src/network-workloads/conf/dragonfly-custom/inter-9K-custom";
+      # routing protocol to be used
+      routing="prog-adaptive";
+   }
+
+``num_router_rows`` and ``num_router_cols`` control the router arrangement within a group
+and should match the input network configuration.
+
+``local_vc_size``, ``global_vc_size``, and ``cn_vc_size`` are used to configure the buffer
+sizes of virtual channels.
+
+``num_cns_per_router`` is used to set the number of compute nodes per router.
+
+``intra-group-connections`` and ``inter-group-connections`` are set to network configuration
+files that can be custom generated (see scripts/gen-cray-topo/README.txt).
+
+Running
+^^^^^^^
+
+To run a dragonfly network simulation, try the following:
+
+1. Download the traces::
+
+    wget https://portal.nersc.gov/project/CAL/doe-miniapps-mpi-traces/AMG/df_AMG_n1728_dumpi.tar.gz
+
+2. Run the simulation::
+
+    ./src/network-workloads/model-net-mpi-replay --sync=1 --disable_compute=1 --workload_type="dumpi" --workload_file=df_AMG_n1728_dumpi/dumpi-2014.03.03.14.55.50- --num_net_traces=1728 -- ../src/network-workloads/conf/dragonfly-custom/modelnet-test-dragonfly-edison.conf
+
+Fat Tree
+--------
+
+The Fat Tree network model can simulate two- and three-level fat tree networks.
+The width of the tree (number of pods) can also be configured. Two forms of
+routing are supported: static routing, which uses destination-based look-up tables,
+and adaptive routing, which selects the least congested output port. The simulation
+is packet-based with credit-based flow control.
+
+Tapering can be used in a fat tree network configuration to connect more nodes to leaf
+switches, which reduces the bandwidth, switches, and links at the higher levels.
+
+To get higher bandwidth, nodes can connect to multiple ports (multi-rail) in one
+or more planes (multi-plane). These configurations can also be tapered to reduce
+switches and links at higher levels.
+
+The model supports configurations with multiple rails, multiple planes, and tapering.
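+
+As a concrete sizing example (our arithmetic, based on the configuration shown
+in the next section): with ``repetitions="198"`` and ``modelnet_fattree="18"``,
+the simulated system has::
+
+    network nodes = 198 x 18 = 3564
+
+which matches the "summit-3564" name used for the topology files in that
+configuration.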
+
+Configuring
+^^^^^^^^^^^
+
+Consider the first part of this configuration file::
+
+   LPGROUPS
+   {
+      MODELNET_GRP
+      {
+         repetitions="198";
+         nw-lp="144";
+         modelnet_fattree="18";
+         fattree_switch="3";
+      }
+   }
+
+``nw-lp`` is a simulated MPI process.
+
+``modelnet_fattree`` is a simulated fat tree network node.
+
+``fattree_switch`` sets the number of simulated fat tree network
+switches. In the above example it is set to 3 (one in each level
+of the network).
+
+Now, consider the next section in the configuration file::
+
+   PARAMS
+   {
+      packet_size="4096";
+      message_size="624";
+      chunk_size="4096";
+      modelnet_scheduler="fcfs";
+      modelnet_order=( "fattree" );
+      ft_type="0";
+      num_levels="3";
+      switch_count="198";
+      switch_radix="36";
+      vc_size="65536";
+      cn_vc_size="65536";
+      link_bandwidth="12.5";
+      cn_bandwidth="12.5";
+      routing="static";
+      routing_folder="/Fat-Tree/summit";
+      dot_file="summit-3564";
+      dump_topo="0";
+   }
+
+The switch arrangement set with ``ft_type``, ``num_levels``, and ``switch_count``
+should match the input network configuration.
+
+``switch_radix`` can be configured.
+
+Static routing requires precomputed destination routing tables; for details
+see https://github.com/codes-org/codes/wiki/codes-fattree#enabling-static-routing.
+
+Slim Fly
+--------
+
+The Slim Fly network model has a topology of interconnected router groups built
+using MMS graphs. The maximum network diameter is always 2. It uses packet-based
+simulation with credit-based flow control. The forms of routing supported are
+minimal with 2 virtual channels, non-minimal with 4 virtual channels, and adaptive
+with 4 virtual channels.
+
+Configuring
+^^^^^^^^^^^
+
+Consider the PARAMS section of a slim fly network configuration file::
+
+   PARAMS
+   {
+      packet_size="4096";
+      chunk_size="4096";
+      message_size="592";
+      modelnet_order=( "slimfly" );
+      modelnet_scheduler="fcfs";
+      num_routers="13";
+      num_terminals="9";
+      global_channels="13";
+      local_channels="6";
+      generator_set_x=("1","10","9","12","3","4");
+      generator_set_x_prime=("6","8","2","7","5","11");
+      local_vc_size="25600";
+      global_vc_size="25600";
+      cn_vc_size="25600";
+      local_bandwidth="12.5";
+      global_bandwidth="12.5";
+      cn_bandwidth="12.5";
+      routing="minimal";
+      num_vcs="4";
+   }
+
+``num_routers``, ``num_terminals``, ``global_channels``, and ``local_channels`` can
+be used to configure the router arrangement within a group.
+
+Generator sets are sets of indices used to calculate connections between routers
+in the same subgraph. They must be precomputed. The params ``generator_set_x`` and
+``generator_set_x_prime`` are set based on the precomputed indices.
+
+Torus
+-----
+
+A torus network is based on an n-dimensional k-ary network topology. The number of
+torus dimensions and the length of each dimension can be configured. The network model
+supports dimension order routing.
+
+Express Mesh and HyperX
+-----------------------
+
+The express mesh topology is a low-diameter, densely connected grid. The model allows
+for specifying the connection gap; a gap of 1 is a HyperX network. A bubble escape
+virtual channel is used for deadlock prevention.
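+
+As a minimal sketch (the counts below are invented for illustration; the
+requirement that the router LP be set to 1 follows the user guide's notes on
+express mesh), a MODELNET_GRP for an express mesh could look like::
+
+   MODELNET_GRP
+   {
+      repetitions="64";
+      nw-lp="2";
+      modelnet_express_mesh="2";
+      modelnet_express_mesh_router="1";
+   }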
+
+Interpreting Simulation Output
+------------------------------
+
+Using the example run on the dragonfly network given above, we get the following output::
+
+    Total GVT Computations 0
+    Total All Reduce Calls 0
+    Average Reduction/GVT nan
+
+    Total bytes sent 13584368 recvd 13584368
+    max runtime 449332.124035 ns avg runtime 443706.882419
+    max comm time 449332.124035 avg comm time 443706.882419
+    max send time 5142770.436275 avg send time 2779472.247926
+    max recv time 4149449.596308 avg recv time 2335071.940672
+    max wait time 432820.362362 avg wait time 430457.043452
+    _P-IO: writing output to dragonfly-simple-33405-1499374633/
+    _P-IO: data files:
+      dragonfly-simple-33488-1499374633/dragonfly-router-traffic
+      dragonfly-simple-33488-1499374633/dragonfly-router-net_stats
+      dragonfly-simple-33488-1499374633/dragonfly-msg-stats
+      dragonfly-simple-33488-1499374633/model-net-category-all
+      dragonfly-simple-33488-1499374633/model-net-category-test
+      dragonfly-simple-33488-1499374633/mpi-replay-stats
+    Average number of hops traversed 1.709869 average chunk latency 0.925252 us maximum chunk latency 9.312357 us avg message size 812.563110 bytes finished messages 16820 finished chunks 65012
+
+    ADAPTIVE ROUTING STATS 65012 chunks routed minimally 0 chunks routed non-minimally completed packets 65012
+
+    Total packets generated 39722 finished 39722
+
+As shown in the sample output, average and maximum times are reported
+for all application runs, with statistics on time spent in overall execution,
+communication, wait operations, the amount of data transferred, and so on.
+The network statistics (hops traversed, latency, routing, etc.) in this summary
+are reported for the entire network.
+
+Detailed statistics for each MPI rank, network node, router, and port can
+additionally be generated by passing ``--lp-io-dir=my-dir``, which makes each
+LP write its statistics to files in the given directory.
+
+Statistics Reported by LP-IO
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``dragonfly-msg-stats`` has the number of hops, packet latency, packets
+sent/received, and link saturation time reported for each network node.
+
+``dragonfly-router-stats`` has the link saturation time for each router port.
+
+``dragonfly-router-traffic`` has the traffic sent for each router port.
+
+Fat tree and slim fly networks have similar statistics files.
+
+``mpi-replay-stats`` (generated for any network model) has the bytes
+sent/received per MPI process, the time spent in communication per
+MPI process, and the number of sends and receives per MPI process.
\ No newline at end of file
diff --git a/docs/tutorial/simulation_basics.rst b/docs/tutorial/simulation_basics.rst
new file mode 100644
index 0000000..b80e5a6
--- /dev/null
+++ b/docs/tutorial/simulation_basics.rst
@@ -0,0 +1,131 @@
+.. _tutorial-simulation-basics:
+
+Four Steps to Simulations
+=========================
+
+Creating a network simulation can be broken down into four steps:
+
+.. contents::
+   :depth: 1
+   :local:
+
+1. Prototype the system design
+------------------------------
+
+An overview of setup using network parameters was given
+in the :ref:`tutorial-network-models` guide.
+
+2. Workload selection
+---------------------
+
+There are two types of workloads that can be used in a simulation:
+synthetic workloads and HPC application traces.
+
+Synthetic Workloads
+^^^^^^^^^^^^^^^^^^^
+
+Synthetic workloads follow specific communication patterns with a
+constant injection rate.
+Often they are used to stress the network
+topology to identify best- and worst-case performance. Examples of
+synthetic workloads include uniform random, all to all, bisection
+pairing, and bit permutation. These workloads do not require simulation
+of MPI operations. They can also be used to generate background traffic,
+simulating the interference an application trace would experience on a
+production HPC system where a significant fraction of the network nodes
+is occupied by other jobs.
+
+**Uniform Random**: A network node is equally likely to send to any other
+network node (traffic is distributed throughout the network).
+
+**All to All**: Each network node communicates with all other network nodes.
+
+**Nearest Neighbor**: A network node communicates with nearby network nodes
+(i.e., those reachable in a minimal number of hops).
+
+**Permutation Traffic**: Each source node sends all of its traffic to a single
+destination based on a permutation matrix.
+
+**Bisection Pairing**: Node 0 communicates with Node 'n', Node 1 with 'n-1',
+and so on.
+
+HPC Application Traces
+^^^^^^^^^^^^^^^^^^^^^^
+
+Application traces are captured by running an MPI program. They are
+useful for network performance prediction of production HPC applications.
+Trace sizes can be large for long-running or communication-intensive
+applications, but traces have the potential to capture computation-communication
+interplay. These workloads require accurate simulation of MPI operations, and
+simulation results can be complex to analyze.
+
+3. Workload creation
+--------------------
+
+A workload can be created by capturing application traces from
+running an MPI program. Options for capturing a trace include
+using DUMPI, :ref:`userguide-score-p`, and :ref:`userguide-bigsim`.
+
+Information in a Typical Trace
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A typical trace captured for an MPI program (e.g., in DUMPI, OTF2, or
+BigSim format) records the operations that occur at different times,
+along with the critical information for each operation.
+The table below gives an example of a typical trace.
+
+=========================== ================ =========================================================
+Time stamp, t (rounded off) Operation type   Operation data (only critical information is highlighted)
+=========================== ================ =========================================================
+t = 10                      MPI_Bcast        root, size of bcast, communicator
+t = 10.5                    MPI_Irecv        source, tag, communicator, req ID
+t = 10.51                   user_computation optional region name - "boundary updates"
+t = 12.51                   MPI_Isend        dest, tag, communicator, req ID
+t = 12.53                   user_computation optional region name - "core updates"
+t = 22.53                   MPI_Waitall      req IDs
+t = 25                      MPI_Barrier      communicator
+=========================== ================ =========================================================
+
+Effect of Replaying Traces
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As shown in the table below, replaying a trace can produce results that
+differ from the original run, because a different configuration makes
+operations take more or less time. In the first and second-to-last table
+entries, the MPI_Bcast and MPI_Waitall operations are faster in the
+replayed trace, so subsequent operations happen at earlier times than
+when the trace was captured.
+
+==================== ================= =============== ============ ================
+Original time stamps Original duration New time stamps New duration Operation type
+==================== ================= =============== ============ ================
+10                   0.5               10              0.2          MPI_Bcast
+10.5                 0.01              10.2            0.01         MPI_Irecv
+10.51                2                 10.21           2            user_computation
+12.51                0.02              12.21           0.02         MPI_Isend
+12.53                10                12.23           10           user_computation
+22.53                2.47              22.23           0.03         MPI_Waitall
+25                   1                 22.26           1.7          MPI_Barrier
+==================== ================= =============== ============ ================
+
+In addition to the effect of the network configuration, different trace
+formats may produce different results.
+
+As an example, DUMPI traces store all the information passed to MPI calls. The
+simulation then decides which request to fulfill, allowing accurate resolution
+for the target system. If the control flow of the program can change
+significantly due to the ordering of operations, such simulations are not
+entirely correct.
+
+On the other hand, OTF2 traces store only the information that is used (e.g., which
+request was satisfied). This accurately mimics the control flow of the trace
+run, but does not accurately represent execution for the target system.
+
+These differences are artifacts of leveraging existing tools not originally
+intended for Parallel Discrete Event Simulation (PDES).
+
+4. Execution
+------------
+
+The user guide :ref:`userguide-quickstart` section shows the
+arguments taken by TraceR and some of the options available
+to control execution of a simulation.
diff --git a/docs/tutorial/workflow.rst b/docs/tutorial/workflow.rst
new file mode 100644
index 0000000..902c932
--- /dev/null
+++ b/docs/tutorial/workflow.rst
@@ -0,0 +1,40 @@
+.. _tutorial-workflow:
+
+Expected Workflow
+=================
+
+This guide will walk you through the expected workflow for using TraceR.
+It will direct you to resources on generating BigSim and OTF2 traces, the
+format of a TraceR configuration file, and a basic command for running a
+simulation.
+
+TraceR is a replay tool that simulates the control flow of applications on
+prototype systems: if the control flow of an application, which includes its
+expected computation tasks, communication routines, and their dependencies, is
+provided to TraceR, it will mimic that flow on a hypothetical system with a given
+compute and communication capability. As of now, the control flow is captured
+either by emulating applications using BigSim or by linking with Score-P. CODES
+is used for simulating the communication on the network.
+
+1. Write an MPI application. Included in the TraceR repository are
+   two examples: jacobi2d-bigsim and stencil4d-otf. The jacobi2d-bigsim example
+   shows how a program would be compiled to generate BigSim traces, and the
+   stencil4d-otf example shows how to compile a program for generating OTF2 traces.
+
+   .. note::
+      If you're using BigSim, avoid global variables in your MPI application so that it can be run with virtualization.
+
+2. Generate traces. For instructions on generating OTF2 traces, see the user guide
+   section on using :ref:`userguide-score-p`; for BigSim traces, see the section in
+   the user guide about :ref:`userguide-bigsim`.
+
+3. After generating traces, two files are needed: a TraceR config file and a CODES config file.
+   Optionally, mapping files can also be provided.
See :ref:`userguide-tracer-config-file`, :ref:`userguide-codes-config-file`, + and :ref:`userguide-job-placement-file` in the user guide for instructions on creating the files. + +4. Run the simulation using ``mpirun``. For details on options available, see the + :ref:`userguide-quickstart` section in the user guide. This command will + run a simulation in optimistic mode:: + + mpirun -np
<num procs>
../traceR --sync=3 -- diff --git a/docs/userguide.rst b/docs/userguide.rst index 6baef74..92827a4 100644 --- a/docs/userguide.rst +++ b/docs/userguide.rst @@ -4,6 +4,8 @@ User Guide Below, we provide detailed instructions for how to start doing network simulations using TraceR. +.. _userguide-quickstart: + Quickstart ---------- @@ -20,242 +22,25 @@ Some useful options to use with TraceR: --max-opt-lookahead leash on optimistic execution in nanoseconds (1 microsecond is a good value) --timer-frequency frequency with which PE0 should print current virtual time -Creating a TraceR configuration file ------------------------------------- - -This is the format for the TraceR config file:: - - - - - - ... - - - -If you do not intend to create global or per-job map files, you can use ``NA`` -instead of them. - -Sample TraceR config files can be found in examples/jacobi2d-bigsim/tracer_config (BigSim) or examples/stencil4d-otf/tracer_config (OTF) - -See `Creating the job placement file`_ below for how to generate global or per-job map files. - -Creating the network (CODES) configuration file ------------------------------------------------ -Sample network configuration files can be found in examples/conf - -Additional documentation on the format of the CODES config file can be found in the -CODES wiki at https://xgitlab.cels.anl.gov/codes/codes/wikis/home - -A brief summary of the format follows. - -LPGROUPS, MODELNET_GRP, PARAMS are keywords and should be used as is. - -MODELNET_GRP:: - - repetition = number of routers that have nodes connecting to them. - - server = number of MPI processes/cores per router - - modelnet_* = number of NICs. For torus, this value has to be 1; for dragonfly, - it should be router radix divided by 4; for the fat-tree, it should be router - radix divided by 2. For the dragonfly network, modelnet_dragonfly_router should - also be specified (as 1). For express mesh, modelnet_express_mesh_router should - also be specified as 1. - - Similarly, the fat-tree config file requires specifying fattree_switch which - can be 2 or 3, depending on the number of levels in the fat-tree. Note that the - total number of cores specified in the CODES config file can be greater than - the number of MPI processes being simulated (specified in the tracer config - file). - -Other common parameters:: - - packet_size/chunk_size (both should have the same value) = size of the packets - created by NIC for transmission on the network. Smaller the packet size, longer - the time for which simulation will run (in real time). Larger the packet size, - the less accurate the predictions are expected to be (in virtual time). Packet - sizes of 512 bytes to 4096 bytes are commonly used. +Setting up a Simulation +----------------------- - modelnet_order = torus/dragonfly/fattree/slimfly/express_mesh +.. _userguide-tracer-config-file: +.. include:: userguide/tracer-config-file.rst - modelnet_scheduler = - fcfs: packetize messages one by one. - round-robin: packetize message in a round robin manner. +See :ref:`userguide-job-placement-file` below for how to generate global or per-job map files. - message_size = PDES parameter (keep constant at 512) +.. _userguide-codes-config-file: +.. 
include:: userguide/codes-config-file.rst - router_delay = delay at each router for packet transmission (in nanoseconds) - - soft_delay = delay caused by software stack such as that of MPI (in nanoseconds) - - link_bandwidth = bandwidth of each link in the system (in GB/s) - - cn_bandwidth = bandwidth of connection between NIC and router (in GB/s) - - buffer_size/vc_size = size of channels used to store transient packets at routers (in - bytes). Typical value is 64*packet_size. - - routing = how are packets being routed. Options depend on the network. - torus: static/adaptive - dragonfly: minimal/nonminimal/adaptive - fat-tree: adaptive/static - -Network specific parameters:: - - Torus: - n_dims = number of dimensions in the torus - dim_length = length of each dimension - - Dragonfly: - num_routers = number of routers within a group. - global_bandwidth = bandwidth of the links that connect groups. - - Fat-tree: - ft_type = always choose 1 - num_levels = number of levels in the fat-tree (2 or 3) - switch_radix = radix of the switch being used - switch_count = number of switches at leaf level. - -Creating the job placement file -------------------------------- - -See the README in utils for instructions on using the tools to generate the global and job mapping files. +.. _userguide-job-placement-file: +.. include:: userguide/job-placement-file.rst Generating Traces ----------------- -Score-P -^^^^^^^ - -Installation of Score-P -""""""""""""""""""""""" - -1. Download from http://www.vi-hps.org/projects/score-p/ -#. tar -xvzf scorep-3.0.tar.gz -#. cd scorep-3.0 -#. CC=mpicc CFLAGS="-O2" CXX=mpicxx CXXFLAGS="-O2" FC=mpif77 ./configure --without-gui --prefix= -#. make -#. make install - -Generating OTF2 traces with an MPI program using Score-P -"""""""""""""""""""""""""""""""""""""""""""""""""""""""" - -Detailed instructions are available at https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf. - -1. Add $SCOREP_INSTALL/bin to your PATH for convenience. Example:: - - export SCOREP_INSTALL=$HOME/workspace/scoreP/scorep-3.0/install - export PATH=$SCOREP_INSTALL/bin:$PATH - -2. Add the following compile time flags to the application:: - - -I$SCOREP_INSTALL/include -I$SCOREP_INSTALL/include/scorep -DSCOREP_USER_ENABLE - -3. Add #include to all files where you plan to add any of the following Score-P calls (optional step):: - - SCOREP_RECORDING_OFF(); - stop recording - SCOREP_RECORDING_ON(); - start recording - - Marking special regions: SCOREP_USER_REGION_BY_NAME_BEGIN(regionname, SCOREP_USER_REGION_TYPE_COMMON) and SCOREP_USER_REGION_BY_NAME_END(regionname). - - Region names beginning with TRACER_WallTime\_ are special: using TRACER_WallTime_ prints current time during simulation with tag . - - An example using these features is given below: - - .. literalinclude:: code-examples/scorep_user_calls.c - :language: c - -4. For the link step, prefix the linker line with the following:: - - LD = scorep --user --nocompiler --noopenmp --nopomp --nocuda --noopenacc --noopencl --nomemory - -5. For running, set:: - - export SCOREP_ENABLE_TRACING=1 - export SCOREP_ENABLE_PROFILING=0 - export SCOREP_REDUCE_PROBE_TEST=1 - export SCOREP_MPI_ENABLE_GROUPS=ENV,P2P,COLL,XNONBLOCK - - If Score-P prints a warning about flushing traces during the run, you may avoid them using:: - - export SCOREP_TOTAL_MEMORY=256M - export SCOREP_EXPERIMENT_DIRECTORY=/p/lscratchd//... - -6. Run the binary and traces should be generated in a folder named scorep-\*. 
- -BigSim -^^^^^^ - -Installation of BigSim -"""""""""""""""""""""" - -Compile BigSim/Charm++ for emulation (see http://charm.cs.illinois.edu/manuals/html/bigsim/manual-1p.html -for more detail). Use any one of the following commands: - -- To use UDP as BigSim/Charm++'s communication layer:: - - ./build bgampi net-linux-x86_64 bigemulator --with-production --enable-tracing - ./build bgampi net-darwin-x86_64 bigemulator --with-production --enable-tracing - - Or explicitly provide the compiler optimization level:: - - ./build bgampi net-linux-x86_64 bigemulator -O2 - -- To use MPI as BigSim/Charm++'s communication layer:: - - ./build bgampi mpi-linux-x86_64 bigemulator --with-production --enable-tracing - -.. note:: - This build is used to compile MPI applications so that traces can be - generated. Hence, the communication layer used by BigSim/Charm++ is not - important. During simulation, the communication will be replayed using the - network simulator from CODES. However, the computation time captured here can be - important if it is not being explicitly replaced at simulation time using - configuration options. So using appropriate compiler flags is important. - -Generating AMPI traces with an MPI program using BigSim -""""""""""""""""""""""""""""""""""""""""""""""""""""""" - -1. Compile your MPI application using BigSim/Charm++. - - Example commands:: - - $CHARM_DIR/bin/ampicc -O2 simplePrg.c -o simplePrg_c - $CHARM_DIR/bin/ampiCC -O2 simplePrg.cc -o simplePrg_cxx - -2. Emulation to generate traces. When the binary generated is run, - BigSim/Charm++ runs the program on the allocated cores as if it were - running as usual. Users should provide a few additional arguments to - specify the number of MPI processes in the prototype systems. - - If using UDP as the BigSim/Charm++'s communication layer:: - - ./charmrun +p ++nodelist ./pgm +vp +x +y +z +bglog - - If using MPI as the BigSim/Charm++'s communication layer:: - - mpirun -n ./pgm +vp +x +y +z +bglog - - Number of real processes is typically equal to the number cores the emulation - is being run on. - - *machine file* is the list of systems the emulation should be run on (similar to - machine file for MPI; refer to Charm++ website for more details). - - *vp* is the number of MPI ranks that are to be emulated. For simple tests, it can - be the same as the number of real processes, in which case one MPI rank is run on - each real process (as it happens when a regular program is run). When the - number of vp (virtual processes) is higher, BigSim launches user level threads - to execute multiple MPI ranks within a process. - - *+x +y +z* defines a 3D grid of the virtual processes. The product of these three - dimensions must match the number of vp's. These arguments do not have any - effect on the emulation, but exist due to historical reasons. - - *+bglog* instructs bigsim to write the logs to files. +.. _userguide-score-p: +.. include:: userguide/score-p.rst -3. When this run is finished, you should see many files named *bgTrace\** in the - directory. The total number of such files equals the number of real processes - plus one. Their names are bgTrace, bgTrace0, bgTrace1, and so on. - Create a new folder and move all *bgTrace* files to that folder. +.. _userguide-bigsim: +.. 
include:: userguide/bigsim.rst
diff --git a/docs/userguide/bigsim.rst b/docs/userguide/bigsim.rst
new file mode 100644
index 0000000..70af1e3
--- /dev/null
+++ b/docs/userguide/bigsim.rst
@@ -0,0 +1,75 @@
+BigSim
+^^^^^^
+
+Installation of BigSim
+""""""""""""""""""""""
+
+Compile BigSim/Charm++ for emulation (see the `BigSim manual <http://charm.cs.illinois.edu/manuals/html/bigsim/manual-1p.html>`_
+for more detail). Use any one of the following commands:
+
+- To use UDP as BigSim/Charm++'s communication layer::
+
+    ./build bgampi net-linux-x86_64 bigemulator --with-production --enable-tracing
+    ./build bgampi net-darwin-x86_64 bigemulator --with-production --enable-tracing
+
+  Or explicitly provide the compiler optimization level::
+
+    ./build bgampi net-linux-x86_64 bigemulator -O2
+
+- To use MPI as BigSim/Charm++'s communication layer::
+
+    ./build bgampi mpi-linux-x86_64 bigemulator --with-production --enable-tracing
+
+.. note::
+   This build is used to compile MPI applications so that traces can be
+   generated. Hence, the communication layer used by BigSim/Charm++ is not
+   important. During simulation, the communication will be replayed using the
+   network simulator from CODES. However, the computation time captured here can be
+   important if it is not being explicitly replaced at simulation time using
+   configuration options. So using appropriate compiler flags is important.
+
+Generating AMPI traces with an MPI program using BigSim
+"""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+1. Compile your MPI application using BigSim/Charm++.
+
+   Example commands::
+
+    $CHARM_DIR/bin/ampicc -O2 simplePrg.c -o simplePrg_c
+    $CHARM_DIR/bin/ampiCC -O2 simplePrg.cc -o simplePrg_cxx
+
+2. Run the emulation to generate traces. When the generated binary is run,
+   BigSim/Charm++ runs the program on the allocated cores as if it were
+   running as usual. Users should provide a few additional arguments to
+   specify the number of MPI processes in the prototype system.
+
+   If using UDP as BigSim/Charm++'s communication layer::
+
+    ./charmrun +p <p> ++nodelist <machine file> ./pgm +vp <vp> +x <x> +y <y> +z <z> +bglog
+
+   If using MPI as BigSim/Charm++'s communication layer::
+
+    mpirun -n <p> ./pgm +vp <vp> +x <x> +y <y> +z <z> +bglog
+
+   The number of real processes is typically equal to the number of cores the
+   emulation is being run on.
+
+   *machine file* is the list of systems the emulation should be run on (similar to
+   a machine file for MPI; refer to the Charm++ website for more details).
+
+   *vp* is the number of MPI ranks that are to be emulated. For simple tests, it can
+   be the same as the number of real processes, in which case one MPI rank is run on
+   each real process (as it happens when a regular program is run). When the
+   number of vp (virtual processes) is higher, BigSim launches user-level threads
+   to execute multiple MPI ranks within a process.
+
+   *+x +y +z* defines a 3D grid of the virtual processes. The product of these three
+   dimensions must match the number of vp's. These arguments do not have any
+   effect on the emulation, but exist for historical reasons.
+
+   *+bglog* instructs BigSim to write the logs to files.
+
+3. When this run is finished, you should see many files named *bgTrace\** in the
+   directory. The total number of such files equals the number of real processes
+   plus one. Their names are bgTrace, bgTrace0, bgTrace1, and so on.
+   Create a new folder and move all *bgTrace* files to that folder.
\ No newline at end of file
diff --git a/docs/userguide/codes-config-file.rst b/docs/userguide/codes-config-file.rst
new file mode 100644
index 0000000..2447d8e
--- /dev/null
+++ b/docs/userguide/codes-config-file.rst
@@ -0,0 +1,110 @@
+Creating the network (CODES) configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sample network configuration files can be found in examples/conf
+
+Additional documentation on the format of the CODES config file can be found in the
+CODES wiki at https://github.com/codes-org/codes/wiki
+
+A brief summary of the format follows.
+
+LPGROUPS, MODELNET_GRP, PARAMS are keywords and should be used as is.
+
+MODELNET_GRP
+""""""""""""
+repetitions
+    number of routers that have nodes connecting to them.
+
+server
+    number of MPI processes/cores per router
+
+modelnet_*
+    number of NICs. For torus, this value has to be 1; for dragonfly,
+    it should be the router radix divided by 4; for the fat-tree, it should be the
+    router radix divided by 2. For the dragonfly network, modelnet_dragonfly_router should
+    also be specified (as 1). For express mesh, modelnet_express_mesh_router should
+    also be specified as 1.
+
+    Similarly, the fat-tree config file requires specifying fattree_switch, which
+    can be 2 or 3, depending on the number of levels in the fat-tree. Note that the
+    total number of cores specified in the CODES config file can be greater than
+    the number of MPI processes being simulated (specified in the tracer config
+    file).
+
+Common parameters (PARAMS)
+""""""""""""""""""""""""""
+
+packet_size/chunk_size (both should have the same value)
+    size of the packets created by the NIC for transmission on the network. The
+    smaller the packet size, the longer the simulation runs (in real time). The
+    larger the packet size, the less accurate the predictions are expected to be
+    (in virtual time). Packet sizes of 512 bytes to 4096 bytes are commonly used.
+
+modelnet_order
+    torus/dragonfly/fattree/slimfly/express_mesh
+
+modelnet_scheduler
+    fcfs: packetize messages one by one.
+
+    round-robin: packetize messages in a round-robin manner.
+
+message_size
+    PDES parameter (keep constant at 512)
+
+router_delay
+    delay at each router for packet transmission (in nanoseconds)
+
+soft_delay
+    delay caused by the software stack, such as that of MPI (in nanoseconds)
+
+link_bandwidth
+    bandwidth of each link in the system (in GB/s)
+
+cn_bandwidth
+    bandwidth of the connection between NIC and router (in GB/s)
+
+buffer_size/vc_size
+    size of the channels used to store transient packets at routers (in
+    bytes). A typical value is 64*packet_size.
+
+routing
+    how packets are routed. Options depend on the network.
+
+    torus: static/adaptive
+
+    dragonfly: minimal/nonminimal/adaptive
+
+    fat-tree: adaptive/static
+
+Network specific parameters (PARAMS)
+""""""""""""""""""""""""""""""""""""
+
+Torus:
+
+    n_dims
+        number of dimensions in the torus
+
+    dim_length
+        length of each dimension
+
+Dragonfly:
+
+    num_routers
+        number of routers within a group.
+
+    global_bandwidth
+        bandwidth of the links that connect groups.
+
+Fat-tree:
+
+    ft_type
+        always choose 1
+
+    num_levels
+        number of levels in the fat-tree (2 or 3)
+
+    switch_radix
+        radix of the switch being used
+
+    switch_count
+        number of switches at the leaf level.
\ No newline at end of file
diff --git a/docs/userguide/job-placement-file.rst b/docs/userguide/job-placement-file.rst
new file mode 100644
index 0000000..debcbac
--- /dev/null
+++ b/docs/userguide/job-placement-file.rst
@@ -0,0 +1,106 @@
+Creating the job placement file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Ranking basics
+""""""""""""""
+
+TraceR requires two sets of mapping files (with somewhat redundant information).
+Both types of files provide information about the mapping of global ranks to jobs and
+their local ranks. The global rank of a server/core is the logical rank that server LPs
+get inside CODES. It increases linearly across the servers/cores connected to one switch
+after another. Due to the way the default server-to-node mapping works within CODES, if
+more than one node is connected to a switch, servers/cores are distributed in a
+cyclic manner.
+
+Consider this example config file::
+
+   MODELNET_GRP
+   {
+      repetitions="8";
+      server="4";
+      modelnet_dragonfly="4";
+      modelnet_dragonfly_router="1";
+   }
+
+Servers residing in nodes connected to the first router get global ranks 0-3,
+nodes connected to the second router get global ranks 4-7, and so on.
+
+Now, consider another case::
+
+   MODELNET_GRP
+   {
+      repetitions="8";
+      server="8";
+      modelnet_dragonfly="4";
+      modelnet_dragonfly_router="1";
+   }
+
+Servers residing in nodes connected to the first router get global ranks 0-7,
+nodes connected to the second router get global ranks 8-15, and so on. However,
+there are 8 servers but only 4 nodes, so each node hosts 2 servers. The servers
+are distributed in a cyclic manner within a router, i.e. in router 0, server 0
+is on node 0, 1 is on node 1, 2 is on node 2, 3 is on node 3, 4 is on node 0, 5
+is on node 1, 6 is on node 2, and 7 is on node 3. A similar cyclic distribution is
+done within every switch.
+
+Map file requirements
+"""""""""""""""""""""
+
+Map files are divided into two sets: global map files and individual job files.
+The global file specifies how the global ranks are mapped to individual jobs and
+ranks within those jobs. It is a binary file structured as sets of 3 integers,
+<global rank, local rank, job id>. A typical write routine looks like this
+(a complete, compilable example is given after the list of job mappers below):
+
+.. code::
+
+    for(...)
+      fwrite(&global_rank, sizeof(int), 1, binout);
+      fwrite(&local_rank, sizeof(int), 1, binout);
+      fwrite(&jobid, sizeof(int), 1, binout);
+    endfor
+
+For each job, individual job map files are needed. A map file for a job is also a
+binary file filled with a series of global ranks. The global ranks are ordered by
+using the local ranks as the key. So, if the series of integers is loaded into an
+array called local_to_global, local_to_global[i] will contain the global rank of
+local rank i.
+
+Job mappers
+"""""""""""
+
+In the utils subfolder of the TraceR repository, there are several job mappers
+written in C that can be used to generate job map files with various layouts.
+Eventually these will likely be rewritten as a Python script. A brief summary
+of the provided generators follows.
+
+def_lin_mapping.C
+    Generates a linear mapping, which is also the default mapping
+    when no mapping is specified. If nodes per router is more than 1, then this
+    mapping will spread the ranks in a round-robin fashion among the nodes.
+
+node_mapping.C
+    Generates a mapping that always places servers with contiguous
+    global ranks on a node. That is, if there are 2 servers per node, ranks 0-1 are
+    on node 0, ranks 2-3 are on node 1, and so on.
+
+multi_job.C
+    Various router-based mapping schemes.
+
+many_job.C
+    Various node-based mapping schemes.
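+
+As a self-contained illustration of the record format above (a hypothetical
+helper for a single job, not one of the bundled mappers, which remain the
+supported tools), the following C program writes a linear global map:
+
+.. code:: c
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    /* Write <global rank, local rank, job id> records for one job of
+       `nranks` ranks, mapped linearly (global rank == local rank). */
+    int main(int argc, char **argv)
+    {
+        if (argc != 3) {
+            fprintf(stderr, "usage: %s <global map file> <num ranks>\n", argv[0]);
+            return 1;
+        }
+        int nranks = atoi(argv[2]);
+        FILE *binout = fopen(argv[1], "wb");
+        if (!binout) { perror("fopen"); return 1; }
+        for (int r = 0; r < nranks; r++) {
+            int global_rank = r, local_rank = r, jobid = 0;
+            fwrite(&global_rank, sizeof(int), 1, binout);
+            fwrite(&local_rank, sizeof(int), 1, binout);
+            fwrite(&jobid, sizeof(int), 1, binout);
+        }
+        fclose(binout);
+        return 0;
+    }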
+
+Commands for execution
+""""""""""""""""""""""
+
+./def_lin_mapping <global map file> <ranks in job0> <ranks in job1> ...
+
+./node_mapping <global map file> <ranks in job0> <ranks in job1> ... [optional argument]
+
+The output from these commands will be a global map file, and job{0,1..} files in binary format.
+
+Example::
+
+    ./def_lin_mapping global.bin 32 32 64
+
+The above command generates global.bin with 128 ranks, where the first 32 are mapped to job0,
+the next 32 to job1, and the last 64 to job2. It also generates files job0, job1, and job2 that
+map ranks from these jobs to global ranks.
\ No newline at end of file
diff --git a/docs/userguide/score-p.rst b/docs/userguide/score-p.rst
new file mode 100644
index 0000000..e37f48f
--- /dev/null
+++ b/docs/userguide/score-p.rst
@@ -0,0 +1,62 @@
+Score-P
+^^^^^^^
+
+Installation of Score-P
+"""""""""""""""""""""""
+
+1. Download from http://www.vi-hps.org/projects/score-p/
+#. tar -xvzf scorep-5.0.tar.gz
+#. cd scorep-5.0
+#. CC=mpicc CFLAGS="-O2" CXX=mpicxx CXXFLAGS="-O2" FC=mpif77 ./configure --without-gui --prefix=<install dir>
+#. make
+#. make install
+
+Generating OTF2 traces with an MPI program using Score-P
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Detailed instructions are available at https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf.
+
+1. Add $SCOREP_INSTALL/bin to your PATH for convenience. Example::
+
+    export SCOREP_INSTALL=$HOME/workspace/scoreP/scorep-5.0/install
+    export PATH=$SCOREP_INSTALL/bin:$PATH
+
+2. Add the following compile time flags to the application::
+
+    -I$SCOREP_INSTALL/include -I$SCOREP_INSTALL/include/scorep -DSCOREP_USER_ENABLE
+
+3. Add #include <scorep/SCOREP_User.h> to all files where you plan to add any of the following Score-P calls (optional step)::
+
+    SCOREP_RECORDING_OFF(); - stop recording
+    SCOREP_RECORDING_ON(); - start recording
+
+   Marking special regions: SCOREP_USER_REGION_BY_NAME_BEGIN(regionname, SCOREP_USER_REGION_TYPE_COMMON) and SCOREP_USER_REGION_BY_NAME_END(regionname).
+
+   Region names beginning with TRACER_WallTime\_ are special: using TRACER_WallTime_<name> prints the current time during simulation with the tag <name>.
+
+   An example using these features is given below:
+
+   .. literalinclude:: code-examples/scorep_user_calls.c
+      :language: c
+
+4. For the link step, prefix the linker line with the following::
+
+    LD = scorep --user --nocompiler --noopenmp --nopomp --nocuda --noopenacc --noopencl --nomemory
+
+5. For running, set::
+
+    export SCOREP_ENABLE_TRACING=1
+    export SCOREP_ENABLE_PROFILING=0
+    export SCOREP_MPI_ENABLE_GROUPS=ENV,P2P,COLL,XNONBLOCK
+
+   If Score-P prints a warning about flushing traces during the run, you may avoid it using::
+
+    export SCOREP_TOTAL_MEMORY=256M
+    export SCOREP_EXPERIMENT_DIRECTORY=/p/lscratchd//...
+
+   .. note::
+      For larger simulations, performance can degrade. There is a :download:`patch for Score-P 5.0 ` that
+      adds an option to reduce the number of MPI probes. After applying the patch, it can be enabled like the other Score-P
+      options with ``export SCOREP_REDUCE_PROBE_TEST=1``.
+
+6. Run the binary, and traces should be generated in a folder named scorep-\*.
\ No newline at end of file
diff --git a/docs/userguide/tracer-config-file.rst b/docs/userguide/tracer-config-file.rst
new file mode 100644
index 0000000..c67c9f3
--- /dev/null
+++ b/docs/userguide/tracer-config-file.rst
@@ -0,0 +1,17 @@
+Creating a TraceR configuration file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This is the format for the TraceR config file::
+
+    <global map file>
+    <num jobs>
+    <trace folder for job0> <map file for job0> <num ranks in job0> <iterations for job0>
+    <trace folder for job1> <map file for job1> <num ranks in job1> <iterations for job1>
+    ...
+
+If you do not intend to create global or per-job map files, you can use ``NA``
+instead of them.
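+
+For illustration, a hypothetical two-job configuration (the file and folder
+names below are made up; the layout follows the format above, with 1 iteration
+per job for default-mode runs) might look like::
+
+    global.bin
+    2
+    ../traces/job0 job0 32 1
+    ../traces/job1 job1 32 1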
+ +Sample TraceR config files can be found in examples/jacobi2d-bigsim/tracer_config (BigSim) or examples/stencil4d-otf/tracer_config (OTF) \ No newline at end of file diff --git a/utils/README b/utils/README deleted file mode 100644 index aad1306..0000000 --- a/utils/README +++ /dev/null @@ -1,90 +0,0 @@ -Ranking basics: ---------------------- -TraceR requires two sets of mapping files (with some what redundant information). -Both types files provide information about mapping of global rank to jobs and -their local rank. Global rank of a server/core is simply the logical rank that -server LPs get inside CODES. It increases linearly from servers/cores connected -to one switch to another. Due to the way default server to node mapping works -within CODES, if more than one node is connected to a switch, server/cores are -distributed in a cyclic manner. - -Example: Consider the following config file -MODELNET_GRP -{ - repetitions="8 - server="4"; - modelnet_dragonfly="4"; - modelnet_dragonfly_router="1"; -} - -Servers residing in nodes connected to the first router gets global rank 0-3, -second router gets global rank 4-7, and so on. - -Now consider this case: -MODELNET_GRP -{ - repetitions="8 - server="8"; - modelnet_dragonfly="4"; - modelnet_dragonfly_router="1"; -} - -Servers residing in nodes connected to the first router gets global rank 0-7, -second router gets global rank 8-15, and so on. However, there are 8 servers -but only 4 nodes, so each node hosts 2 servers. The servers are distributed in -a cyclic manner within a router, i.e. in router 0, server 0 is on node 0, 1 is -on node 1, 2 is on node 2, 3 is node 3, 4 is on node 0, 5 is on node 1, 6 is on -node 2, and 7 is on node 3. Similar cyclic distribution is done within every -switch. - -Map file requirements: ---------------------- -Map files are divided into two sets: global map file and individual job files. -The global file specifies how the global rank are mapped to individual jobs and -ranks within those jobs. It is a binary file structured as sets of 3 integers: - . Typical write routine look like: - -for(....) - fwrite(&global_rank, sizeof(int), 1, binout); - fwrite(&local_rank, sizeof(int), 1, binout); - fwrite(&jobid, sizeof(int), 1, binout); -endfor - -For each job, individual job map files are needed. A map file for a job is also a -binary file filled with a series of global ranks. The global ranks are ordered -by using the local ranks as the key. So, if the series of integers is loaded -into an array called local_to_global, local_to_global[i] will contain the global -rank of local rank i. - -Note for author: Eliminate individual job map files and make life easier for -users. - -Job mappers ------------------- -def_lin_mapping.C : generate linear mapping which is also the default mapping -when no mapping is specified. If nodes per router is more than 1, then this -mapping will spread the ranks in a round-robin fashion among the nodes. - -node_mapping.C : generates mapping that always places server with contiguous -global ranks on a node. That, if there 2 servers per node, ranks 0-1 are on node -0, ranks 2-3 are on node 1, and so on. - -multi_job.C : Router based various schemes for mapping. -many_job.C : Nodes based various schemes for mapping. 
- -Commands for execution ----------------------- -./def_lin_mapping -./node_mapping [optional ] - -Output - - in binary format -job{0,1..} files in binary format - -Example: -./def_lin_mapping global.bin 32 32 64 - -generates global.bin with 128 ranks, where first 32 are mapped to job0, next 32 -to job1, and last 64 to job2. Also generates job0, job1, job2 that maps ranks -from these jobs to global ranks. -