- Single-host, multi-device synchronous training, which uses MirroredStrategy (a minimal sketch follows below this list). Code in mnist-distributed/mirrored_mnist.py.
- Multi-worker distributed training on a cluster of machines, each hosting one or more GPUs, which uses MultiWorkerMirroredStrategy. Code in mnist-distributed/multiworker_mnist.py.
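The single-host case can be illustrated with the minimal sketch below: a Keras model is built and compiled inside a MirroredStrategy scope so its variables are replicated across the local GPUs. The layer sizes and optimizer are placeholders for illustration, not necessarily what mnist-distributed/mirrored_mnist.py uses.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every GPU visible on this host
# and keeps the replicas in sync with an all-reduce after each step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables (layer weights, optimizer slots) must be created inside the
# scope so that they are mirrored across the devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# A subsequent model.fit() splits each batch across the replicas automatically.
```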
Note: In TensorFlow distributed training, the TensorFlow operator writes the distributed cluster configuration into an environment variable named TF_CONFIG and automatically injects that variable into each worker, chief, and parameter server. This is handled between DKube and the TFJob operator.
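For reference, TF_CONFIG is a JSON string that lists the cluster members and the role of the current process. The snippet below shows roughly what the operator injects for a two-worker job (host names, ports, and the task index are illustrative only) and how MultiWorkerMirroredStrategy consumes it without any cluster-specific code in the training script.

```python
import json
import os

import tensorflow as tf

# Illustrative value only; in DKube the TFJob operator sets TF_CONFIG
# for every pod, so the training script never builds it by hand.
example_tf_config = {
    "cluster": {
        "worker": ["worker-0.example.svc:2222", "worker-1.example.svc:2222"],
    },
    "task": {"type": "worker", "index": 0},  # this process's role and rank
}
os.environ.setdefault("TF_CONFIG", json.dumps(example_tf_config))

# The strategy parses TF_CONFIG when it is constructed.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
```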
- Name: distributed-training
- Project source: Git
- Git URL: https://github.com/pallavi-pannu-oc/distributed-training.git
- Branch: main
- Name: mnist
- Dataset source: Other
- URL: https://s3.amazonaws.com/img-datasets/mnist.pkl.gz
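If the training script reads this dataset from its mount path, the archive is a gzipped pickle in the layout used by older Keras releases. The loader below is a sketch under two assumptions: the dataset repo is mounted at /mnist (as configured in the run below) and the file keeps its mnist.pkl.gz name and unpacks into ((x_train, y_train), (x_test, y_test)).

```python
import gzip
import pickle

# Assumption: mounted at /mnist with its original file name, and the pickle
# contains ((x_train, y_train), (x_test, y_test)) of NumPy arrays.
with gzip.open("/mnist/mnist.pkl.gz", "rb") as f:
    (x_train, y_train), (x_test, y_test) = pickle.load(f, encoding="bytes")

# Scale pixel values to [0, 1] before feeding them to the model.
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
```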
- Name: distributed-model
- Keep the defaults for the other fields
- Runs->+Training Run.
- Code: distributed-training
- Framework: Tensorflow
- Version:
- For mirrored: 2.0.0-gpu
- For multiworker: 2.3.0-gpu
- Start-up script:
- For mirrored: python mnist-distributed/mirrored_mnist.py
- For multiworker: python mnist-distributed/multiworker_mnist.py
- Repos->Inputs->Datasets: select the mnist dataset and enter the mount path as /mnist
- Repos->Outputs->Model: select distributed-model and enter the mount path as /model (an end-to-end sketch that uses both mounts follows after this list)
- Allocate GPUs
- Select Distributed workloads: Automatic Distribution
- Submit
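Putting the run configuration together, a multi-worker script along the following lines would consume the /mnist input mount and write the trained model to the /model output mount. This is a hedged sketch of the overall flow rather than a copy of multiworker_mnist.py; the batch size, epoch count, layer sizes, and output subdirectory are placeholder choices.

```python
import gzip
import pickle

import tensorflow as tf

# TF_CONFIG (injected by the operator) makes this strategy cluster-aware.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Input dataset repo mounted at /mnist (assumed file name and pickle layout).
with gzip.open("/mnist/mnist.pkl.gz", "rb") as f:
    (x_train, y_train), _ = pickle.load(f, encoding="bytes")
x_train = x_train.astype("float32") / 255.0

# Scale the global batch size with the number of replicas in the cluster.
global_batch = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=global_batch, epochs=2)

# Output model repo mounted at /model. In multi-worker training every worker
# should call save(), but only the chief's copy is the one to keep; a fuller
# script would point non-chief workers at a temporary directory.
model.save("/model/1")
```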