This repository was archived by the owner on Jan 3, 2023. It is now read-only.
Langjian/distributed json #366
Merged
Changes from 39 commits
41 commits
3878e43
Fix new klocwork error reports on 11/20/2018
jianyinglang 4decf57
Fix the FusedBatchNorm with the Bessel correction in variance
jianyinglang cfb0a83
Fix the format
jianyinglang 76ab3fa
Merge remote-tracking branch 'origin/master' into langjian/BatchNorm_…
jianyinglang 289a43d
Add distributed macro
jianyinglang 2163ac1
Add multi-node .json file output
jianyinglang 34c4ddf
Merge remote-tracking branch 'origin/master' into langjian/distribute…
jianyinglang 2e70b97
Change CMake file to be consistent with master
jianyinglang 1c34c3d
Format change
jianyinglang 70449be
Add a simple distributed mnist model
jianyinglang 5ea9070
Merge remote-tracking branch 'origin/master' into langjian/distribute…
jianyinglang 73cf9d0
Add distributed option for Makefile
jianyinglang 2c594ee
modify distributed example
jianyinglang d788247
Add distributed flags for multi-process graph dumps
jianyinglang 0371935
Changes the Makefile to enable distributed build
jianyinglang d16f2d9
Merge remote-tracking branch 'origin/master' into langjian/distribute…
jianyinglang f205cb2
Format fix
jianyinglang 0ecd43b
Fix the typo and delete debug comment and add run command
jianyinglang 100c129
Fix the python file format
jianyinglang dda11aa
Fix python file format
jianyinglang 6caf977
Add mnist data directory
jianyinglang e2ba875
Merge branch 'master' into langjian/distributed_json
avijit-nervana 59ffb37
Fix the typo
jianyinglang 7f6c498
Merge branch 'langjian/distributed_json' of https://github.com/Nervan…
jianyinglang 6507c3a
Set the default distributed build as false
jianyinglang dbb302e
Add initialization if not initialized in MPI
jianyinglang b163c5c
Merge remote-tracking branch 'origin/master' into langjian/distribute…
jianyinglang 35af774
Fix the format
jianyinglang 97bb170
Fix the format using python2
jianyinglang 7515794
Merge branch 'master' into langjian/distributed_json
avijit-nervana 2f515af
Merge remote-tracking branch 'origin/master' into langjian/distribute…
jianyinglang e97fd82
Merge branch 'langjian/distributed_json' of https://github.com/Nervan…
jianyinglang 1f9f611
Fix the build with no specified mpi library
jianyinglang 7922d0b
Comment out the unused lines in CMakeLists.txt
jianyinglang e8553cf
Fix some errors
jianyinglang 0499c73
Merge branch 'master' into langjian/distributed_json
avijit-nervana 361c725
Change to if define
jianyinglang 4533c46
Merge branch 'langjian/distributed_json' of https://github.com/Nervan…
jianyinglang e2c3009
Added the source reference for mnist_softmax_distributed.py
jianyinglang ae96c77
Merge branch 'master' into langjian/distributed_json
avijit-nervana 8a941ee
Merge branch 'master' into langjian/distributed_json
avijit-nervana
@@ -0,0 +1,148 @@

    # Copyright 2015 The TensorFlow Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    # ==============================================================================
    """A very simple MNIST classifier.

    See extensive documentation at
    https://www.tensorflow.org/get_started/mnist/beginners
    Reference to the original source code:
    https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/mnist/mnist_softmax.py
    Add distributed feature with horovod
    1. hvd.init()
    2. Add distributed wrapper from hvd.DistributedOptimizer
    3. Broadcast the variables from root rank to the other processes: hvd.BroadcastGlobalVariablesHook(0)
    4. Print the output for root rank only
    """
    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

    import argparse
    import sys
    import time

    from tensorflow.examples.tutorials.mnist import input_data

    import tensorflow as tf
    import ngraph_bridge
    import horovod.tensorflow as hvd
    learn = tf.contrib.learn

    FLAGS = None

    hvd.init()


    def main(_):
        run_mnist(_)


    def run_mnist(_):
        # Import data
        mnist = learn.datasets.mnist.read_data_sets(
            FLAGS.data_dir + 'MNIST-data-%d' % hvd.rank(), one_hot=True)

        # Create the model
        with tf.name_scope("mnist_placholder"):
            x = tf.placeholder(tf.float32, [None, 784])
            W = tf.Variable(tf.zeros([784, 10]))
            b = tf.Variable(tf.zeros([10]))
            y = tf.matmul(x, W) + b

        # Define loss and optimizer
        y_ = tf.placeholder(tf.float32, [None, 10])

        # The raw formulation of cross-entropy,
        #
        #   tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.nn.softmax(y)),
        #                                 reduction_indices=[1]))
        #
        # can be numerically unstable.
        #
        # So here we use tf.nn.softmax_cross_entropy_with_logits on the raw
        # outputs of 'y', and then average across the batch.
        cross_entropy = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
        #global_step = tf.train.get_or_create_global_step()
        global_step = tf.contrib.framework.get_or_create_global_step()
        opt = tf.train.GradientDescentOptimizer(0.5)
        # Add MPI Distributed Optimizer
        with tf.name_scope("horovod_opt"):
            opt = hvd.DistributedOptimizer(opt)
        train_step = opt.minimize(cross_entropy, global_step=global_step)

        # The StopAtStepHook handles stopping after running given steps.
        hooks = [
            hvd.BroadcastGlobalVariablesHook(0),
            tf.train.StopAtStepHook(last_step=10)
        ]

        # Test trained model
        correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

        # Enable soft placement and tracing as needed
        config = tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=True,
            inter_op_parallelism_threads=1)

        #config.graph_options.optimizer_options.global_jit_level = jit_level
        run_metadata = tf.RunMetadata()

        #init_op = tf.global_variables_initializer()
        print("Variables initialized ...")

        # The MonitoredTrainingSession takes care of session initialization
        with tf.train.MonitoredTrainingSession(
                hooks=hooks, config=config) as mon_sess:
            start = time.time()
            train_writer = tf.summary.FileWriter(FLAGS.log_dir, mon_sess.graph)
            while not mon_sess.should_stop():
                # Train
                batch_xs, batch_ys = mnist.train.next_batch(100)
                mon_sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

                # Test trained model
                if not mon_sess.should_stop():
                    print("Accuracy: ",
                          mon_sess.run(
                              accuracy,
                              feed_dict={
                                  x: mnist.test.images,
                                  y_: mnist.test.labels
                              }))

            end = time.time()

        if hvd.rank() == 0:
            print("Training time: %f seconds" % (end - start))


    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument(
            '--data_dir',
            type=str,
            default='/tmp/tensorflow/mnist/input_data',
            help='Directory for storing input data')
        parser.add_argument(
            '--log_dir',
            type=str,
            default='/tmp/tensorflow/mnist/logs/mnist_with_summaries',
            help='Summaries log directory')
        FLAGS, unparsed = parser.parse_known_args()
        tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
    # run command for this distributed script
    # mpirun -np 2 python mnist_softmax_distributed.py --data_dir=/mnt/data/mnist
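The Horovod steps listed in the docstring (broadcast the variables from rank 0, then train through a distributed optimizer wrapper) can be illustrated without TensorFlow or MPI. The sketch below is a toy single-process simulation, not Horovod's implementation. It assumes each "rank" is just a list entry, uses a one-parameter least-squares fit instead of the MNIST softmax model, and the helper names `broadcast_from_root`, `allreduce_average`, `local_gradient`, and `train` are hypothetical stand-ins for what `hvd.BroadcastGlobalVariablesHook(0)` and `hvd.DistributedOptimizer` arrange across real processes. The learning rate of 0.5 and the 10 steps mirror the values in the script.

```python
# Toy simulation of synchronous data-parallel SGD (illustration only;
# Horovod performs these steps with collectives across real processes).


def broadcast_from_root(weights, root=0):
    """Give every simulated rank a copy of the root rank's weight."""
    return [weights[root] for _ in weights]


def allreduce_average(values):
    """Return the mean of the per-rank values, replicated to every rank."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


def local_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one rank's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def train(shards, initial_weights, learning_rate=0.5, steps=10):
    # Docstring step 3: all ranks start from rank 0's variables.
    weights = broadcast_from_root(initial_weights, root=0)
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Docstring step 2: average the gradients before applying them.
        grads = allreduce_average(grads)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Two simulated ranks, each holding its own shard of points on y = 2x.
shards = [[(1.0, 2.0), (0.5, 1.0)], [(1.0, 2.0), (0.25, 0.5)]]
final_weights = train(shards, initial_weights=[0.0, 7.0])

# The ranks end with identical weights: they started from the same value
# and applied the same averaged gradient at every step.
print(final_weights)
```

Rank 1's differing initial weight (7.0) is discarded by the broadcast, which is why the broadcast hook matters: without it, replicas that apply identical averaged gradients would still drift from different starting points.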
Please provide a summary of modifications made to this file and a reference to the source of this file.
Listed the changes I made from the source file and added the link to the reference source of this file
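Two rank-dependent details of the modified file can be checked without Horovod installed: each process reads MNIST from a directory suffixed with its rank, and only rank 0 prints the training time. The sketch below assumes a plain integer `rank` in place of `hvd.rank()`. The helper names `rank_data_dir`, `rank_data_dir_joined`, and `root_only_message` are hypothetical, and the `os.path.join` variant is a suggestion for comparison, not what the merged file does.

```python
import os


def rank_data_dir(data_dir, rank):
    """Mirror the script's expression. '%' binds tighter than '+', so the
    rank is formatted first and the result is appended to data_dir as-is."""
    return data_dir + 'MNIST-data-%d' % rank


def rank_data_dir_joined(data_dir, rank):
    """Hypothetical alternative that inserts a path separator when needed."""
    return os.path.join(data_dir, 'MNIST-data-%d' % rank)


def root_only_message(rank, seconds):
    """Return the training-time line on rank 0 and None on other ranks."""
    if rank == 0:
        return "Training time: %f seconds" % seconds
    return None


# With the data_dir from the run command, plain concatenation inserts no
# separator, so the rank-specific name is fused onto the last path component.
print(rank_data_dir('/mnt/data/mnist', 1))
print(rank_data_dir_joined('/mnt/data/mnist', 1))
print(root_only_message(0, 1.5))
print(root_only_message(1, 1.5))
```

Either form gives each rank a distinct directory. The difference is only whether the rank-specific name becomes a subdirectory of `--data_dir` or a sibling path with a fused name.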