From 984748763c7f9314ba0155f4ca9f399e96757660 Mon Sep 17 00:00:00 2001
From: sethah
Date: Fri, 9 Mar 2018 10:27:36 -0800
Subject: [PATCH 1/2] clarify distributed key value store

---
 .../training-with-multiple-machines.ipynb | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/chapter07_distributed-learning/training-with-multiple-machines.ipynb b/chapter07_distributed-learning/training-with-multiple-machines.ipynb
index 7ef7cbc..0dcd061 100644
--- a/chapter07_distributed-learning/training-with-multiple-machines.ipynb
+++ b/chapter07_distributed-learning/training-with-multiple-machines.ipynb
@@ -167,6 +167,24 @@
     "\n",
     "Not only can we synchronize data within a machine, with the key-value store we can facilitate inter-machine communication. To use it, one can create a distributed kvstore by using the following command: (Note: distributed key-value store requires `MXNet` to be compiled with the flag `USE_DIST_KVSTORE=1`, e.g. `make USE_DIST_KVSTORE=1`.)\n",
     "\n",
+    "In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",
+    "\n",
+    "To use the distributed key-value store, we must first start a scheduler process and at least one server process. When the MXNet library is imported in a process, it checks what the process's role is through the `DMLC_ROLE` environment variable. Starting a server or scheduler is as simple as importing MXNet with the appropriate environment variables set.\n",
+    "\n",
+    "```python\n",
+    "# start scheduler\n",
+    "scheduler_env = os.environ.copy()\n",
+    "scheduler_env.update({\"DMLC_ROLE\": \"scheduler\", \"DMLC_PS_ROOT_PORT\": \"9090\", \"DMLC_PS_ROOT_URI\": \"\", \"DMLC_NUM_SERVER\": \"1\", \"DMLC_NUM_WORKER\": \"2\", \"PS_VERBOSE\": \"2\"})\n",
+    "subprocess.Popen('python -c \"import mxnet\"', shell=True, env=scheduler_env)\n",
+    "\n",
+    "# start server\n",
+    "server_env = os.environ.copy()\n",
+    "server_env.update({\"DMLC_ROLE\": \"server\", \"DMLC_PS_ROOT_PORT\": \"9090\", \"DMLC_PS_ROOT_URI\": \"\", \"DMLC_NUM_SERVER\": \"1\", \"DMLC_NUM_WORKER\": \"2\", \"PS_VERBOSE\": \"2\"})\n",
+    "subprocess.Popen('python -c \"import mxnet\"', shell=True, env=server_env)\n",
+    "```\n",
+    "\n",
+    "To use a distributed key-value store from a worker process, just create the store as follows:\n",
+    "\n",
     "```python\n",
     "store = kv.create('dist')\n",
     "```\n",
@@ -178,9 +196,7 @@
     " [ 6.  6.  6.]]\n",
     "```\n",
     "\n",
-    "In the distributed setting, `MXNet` launches three kinds of processes (each time, running `python myprog.py` will create a process). One is a *worker*, which runs the user program, such as the code in the previous section. The other two are the *server*, which maintains the data pushed into the store, and the *scheduler*, which monitors the aliveness of each node.\n",
-    "\n",
-    "It's up to users which machines to run these processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",
+    "It is up to users to decide which machines to run the worker, scheduler, and server processes on. But to simplify the process placement and launching, MXNet provides a tool located at [tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py). \n",
     "\n",
     "Assume there are two machines, A and B. They are ssh-able, and their IPs are saved in a file named `hostfile`. Then we can start one worker in each machine through: \n",
     "\n",

From b14244bf022fd03b32e063563ed35394683504f6 Mon Sep 17 00:00:00 2001
From: sethah
Date: Fri, 9 Mar 2018 10:59:43 -0800
Subject: [PATCH 2/2] set optimizer

---
 .../training-with-multiple-machines.ipynb | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/chapter07_distributed-learning/training-with-multiple-machines.ipynb b/chapter07_distributed-learning/training-with-multiple-machines.ipynb
index 0dcd061..04b0a87 100644
--- a/chapter07_distributed-learning/training-with-multiple-machines.ipynb
+++ b/chapter07_distributed-learning/training-with-multiple-machines.ipynb
@@ -187,6 +187,8 @@
     "\n",
     "```python\n",
     "store = kv.create('dist')\n",
+    "# setting the optimizer instructs the server on how to update weights that are pushed to it\n",
+    "store.set_optimizer(mxnet.optimizer.SGD())\n",
     "```\n",
     "\n",
     "Now if we run the code from the previous section on two machines at the same time, then the store will aggregate the two ndarrays pushed from each machine, and after that, the pulled results will be: \n",
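The scheduler/server snippet added in PATCH 1/2 assumes `os` and `subprocess` are already imported, and the worker side is only shown as `kv.create('dist')` plus `set_optimizer`. A minimal self-contained worker-side sketch of the flow the two patches describe is shown below; the loopback root URI, port `9090`, one-server/two-worker topology, the key name `'weight'`, and the SGD learning rate are illustrative assumptions, not values taken from the patches.

```python
import mxnet as mx

# Assumes a scheduler and one server are already running (as in the snippet added
# by PATCH 1/2) and that this process was launched with worker environment
# variables, for example:
#   DMLC_ROLE=worker DMLC_PS_ROOT_URI=127.0.0.1 DMLC_PS_ROOT_PORT=9090 \
#   DMLC_NUM_SERVER=1 DMLC_NUM_WORKER=2 python worker.py
store = mx.kv.create('dist')

# The server applies this optimizer to the values workers push under each key.
store.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))

shape = (2, 3)
store.init('weight', mx.nd.ones(shape))

# Each worker pushes a local "gradient"; the store aggregates the pushes across
# workers and the server runs the SGD update on the stored value.
store.push('weight', mx.nd.ones(shape))

# Pull the updated value back to this worker.
out = mx.nd.zeros(shape)
store.pull('weight', out=out)
print(out.asnumpy())
```

Both workers run the same script; because an optimizer has been set on the store, the server aggregates the pushed arrays and applies the update before the workers pull the result back, which is the behavior the comment added in PATCH 2/2 describes.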