Initial commit

fbshipit-source-id: 8e11243d48d1abf2f035216bf1d75e5563c86bcb
TaekyungHeo · Oct 1, 2020 · commit 7671b66

Showing 17 changed files with 3,419 additions and 0 deletions.
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,4 @@
+[submodule "train/workloads/dlrm"]
+	path = train/workloads/dlrm
+	url = https://github.com/facebookresearch/dlrm.git
+	branch = dist_exp
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -0,0 +1,76 @@
+# Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to make participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies within all project spaces, and it also applies when
+an individual is representing the project or its community in public spaces.
+Examples of representing a project or community include using an official
+project e-mail address, posting via an official social media account, or acting
+as an appointed representative at an online or offline event. Representation of
+a project may be further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at <[email protected]>. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,31 @@
+# Contributing to PARAM_Bench
+We want to make contributing to this project as easy and transparent as
+possible.
+
+## Pull Requests
+We actively welcome your pull requests.
+
+1. Fork the repo and create your branch from `master`.
+2. If you've added code that should be tested, add tests.
+3. If you've changed APIs, update the documentation.
+4. Ensure the test suite passes.
+5. Make sure your code lints.
+6. If you haven't already, complete the Contributor License Agreement ("CLA").
+
+## Contributor License Agreement ("CLA")
+In order to accept your pull request, we need you to submit a CLA. You only need
+to do this once to work on any of Facebook's open source projects.
+
+Complete your CLA here: <https://code.facebook.com/cla>
+
+## Issues
+We use GitHub issues to track public bugs. Please ensure your description is
+clear and has sufficient instructions to be able to reproduce the issue.
+
+Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
+disclosure of security bugs. In those cases, please go through the process
+outlined on that page and do not file a public issue.
+
+## License
+By contributing to PARAM-Bench, you agree that your contributions will be licensed
+under the LICENSE file in the root directory of this source tree.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) Facebook, Inc. and its affiliates.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,32 @@
+# PARAM
+
+PARAM Benchmarks is a repository of communication and compute micro-benchmarks as well as full workloads for evaluating training and inference platforms.
+
+PARAM complements two broad categories of commonly used benchmarks:
+1. C++ based stand-alone compute and communication benchmarks using cuDNN, MKL, NCCL, MPI libraries - e.g. NCCL tests (https://github.com/NVIDIA/nccl-tests), OSU MPI benchmarks (https://mvapich.cse.ohio-state.edu/benchmarks/), and DeepBench (https://github.com/baidu-research/DeepBench).
+2. Application benchmarks such as Deep Learning Recommendation Model (DLRM) and the broader MLPerf benchmarks. MLPerf is a de facto industry-standard benchmark covering a wide range of AI workloads, including computer vision and natural language processing. The recent addition of DLRM to MLPerf 0.7 is a great step towards making it more representative of FB’s AI workloads.
+
+Our initial release of PARAM benchmarks focuses on AI training and comprises:
+1. Communication: PyTorch-based collective benchmarks across arbitrary message sizes, effectiveness of compute-communication overlap, and DLRM communication patterns in the fwd/bwd pass
+2. Compute: PyTorch-based GEMM, embedding lookup, and linear layer
+3. DLRM: tracks the `ext_dist` branch of Facebook's DLRM benchmark (https://github.com/facebookresearch/dlrm). In short, PARAM relies entirely on the DLRM benchmark for end-to-end workload evaluation, with additional extensions as required for scale-out AI training platforms.
+
+In essence, PARAM bridges the gap between stand-alone C++ benchmarks and PyTorch/TensorFlow-based application benchmarks. This enables us to gain deep insights into the inner workings of the system architecture as well as identify framework-level overheads by stressing all subcomponents of a system.
+
+## Version
+
+0.1 : Initial release
+
+## Requirements
+
+pytorch
+future
+numpy
+
+## License
+
+PARAM benchmarks is released under the MIT license. Please see the [`LICENSE`](LICENSE) file for more information.
+
+## Contributing
+
+We actively welcome your pull requests! Please see [`CONTRIBUTING.md`](CONTRIBUTING.md) and [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md) for more info.
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,3 @@
+torch
+future
+numpy
diff --git a/train/comms/pt/README.md b/train/comms/pt/README.md
@@ -0,0 +1,53 @@
+# PARAM benchmark - Communication benchmarks
+
+PARAM-Comms is an effort to develop a unified benchmarking framework to
+characterize training platform backends.
+
+The PARAM-Comms benchmark offers a single-point solution to perform both top-down
+(DLRM application) and bottom-up (collectives) operations for any given
+communication backend.
+
+Currently, the benchmark supports the PyTorch-NCCL and PyTorch-XLA backends.
+
+The bottom-up benchmark (`comms.py`) is designed similarly to nccl-tests, and the
+top-down benchmark (`dlrm.py`) is similar to the open-source DLRM benchmark except
+that it only implements communication primitives.
+
+## Usage:
+
+### The bottom-up benchmark (`comms.py`)
+```bash
+mpirun -np <num-processes> -N <processes per node> --hostfile <file contains host list> ./comms.py \
+    --master-ip 127.0.0.1 \
+    --b <begin-size> \
+    --e <end-size> \
+    --n <num-iters> \
+    --f <step-factor> \
+    --z <blocking/non-blocking> \
+    --collective <collective-to-test>
+```
+Example:
+```bash
+mpirun -np 16 -N 8 --hostfile ./hfile ./comms.py --master-ip 127.0.0.1 --b 8 --e 256M --n 100 \
+    --f 2 --z 1 --collective all_to_all
+```
+
+### The top-down benchmark (`dlrm.py`)
+```bash
+mpirun -np <num-processes> -N <processes per node> --hostfile <file contains host list> ./dlrm.py \
+    --master-ip <master-node-ip-address> \
+    --arch-sparse-feature-size <sparse-feature-size> \
+    --arch-embedding-size <embedding-table-sizes> \
+    --arch-mlp-bot <layer-dimensions of bottom layers> \
+    --arch-mlp-top <layer-dimensions of top layers> \
+    --mini-batch-size <mini-batch-sizes> \
+    --num-batches <number-of-batches>
+```
+Example:
+```bash
+mpirun -np 16 -N 8 --hostfile ./hfile ./dlrm.py --master-ip <node-1-ip> --mini-batch-size 32 \
+    --num-batches 100 \
+    --arch-mlp-bot 1024-256 \
+    --arch-sparse-feature-size 64 \
+    --arch-embedding-size "10000-10000"
+```
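Reviewer note (not part of the commit): the example commands in this README pass compact values such as `--b 8 --e 256M --f 2` to `comms.py` and `--arch-mlp-bot 1024-256` to `dlrm.py`. The sketch below shows one plausible way to interpret them. It assumes nccl-tests-style semantics for the size sweep (binary `K`/`M`/`G` suffixes, sizes multiplied by the step factor) and the hyphen-separated dimension lists used by the open-source DLRM benchmark. The helper names `parse_size`, `sweep_sizes`, and `parse_dims` are hypothetical and do not come from `comms.py` or `dlrm.py`, which may parse their arguments differently.

```python
from typing import List

# Hypothetical helpers for illustration only; not taken from comms.py or dlrm.py.

# Assumption: suffixes are binary multiples, as in nccl-tests.
_SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as "8" or "256M" to an integer (bytes assumed)."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def sweep_sizes(begin: str, end: str, factor: int) -> List[int]:
    """List sizes from begin to end (inclusive), multiplying by factor each step."""
    if factor < 2:
        raise ValueError("step factor must be at least 2")
    size, stop = parse_size(begin), parse_size(end)
    if size < 1:
        raise ValueError("begin size must be positive")
    sizes = []
    while size <= stop:
        sizes.append(size)
        size *= factor
    return sizes


def parse_dims(spec: str) -> List[int]:
    """Split a hyphen-separated spec such as "1024-256" into integers."""
    return [int(part) for part in spec.split("-")]


if __name__ == "__main__":
    sizes = sweep_sizes("8", "256M", 2)
    print(len(sizes), sizes[0], sizes[-1])
    print(parse_dims("1024-256"))
    print(parse_dims("10000-10000"))
```

Under these assumptions, the `comms.py` example would visit 26 message sizes, from 8 up to 268435456 (256 MiB), and the `dlrm.py` example would describe a bottom MLP with layer dimensions 1024 and 256 and two embedding tables of 10000 rows each.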
@@ -0,0 +1,3 @@
+    torch
+    future
+    numpy