Merge pull request #15 from oracle/2.6.6_docs
Update docs for distributed training
Showing 28 changed files with 907 additions and 240 deletions.
@@ -1,6 +1,6 @@
-############
-Loading Data
-############
+#########
+Load Data
+#########

 Connecting to Data Sources
28 changes: 28 additions & 0 deletions
docs/source/user_guide/model_training/distributed_training/_profiling.rst
@@ -0,0 +1,28 @@

**Profiling using Nvidia Nsight Systems**

`Nvidia Nsight Systems <https://developer.nvidia.com/nsight-systems>`__ is a system-wide profiling tool from Nvidia that can be used to profile deep learning workloads.

Nsight Systems requires no changes to your training code because it works at the process level. You can enable this experimental feature in your training setup with the following configuration in the runtime YAML file.

.. code-block:: yaml

    - name: PROFILE
      value: 1
    - name: PROFILE_CMD
      value: "nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o /opt/ml/nsight_report -x true"

Refer to the `Nsight Systems user guide <https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profile-command-switch-options>`__ for the ``nsys profile`` command options. You can modify the command in ``PROFILE_CMD``, but remember that this feature is experimental. The profiling reports are generated per node. You need to download the reports to your computer manually or with the ``oci`` command.

.. code-block:: bash

    oci os object bulk-download \
      -ns <namespace> \
      -bn <bucket_name> \
      --download-dir /path/on/your/computer \
      --prefix path/on/bucket/<job_id>

To view the reports, install the Nsight Systems app from `here <https://developer.nvidia.com/nsight-systems>`_. Then open the downloaded reports in the Nsight Systems app.
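
Because ``PROFILE_CMD`` is a plain ``nsys profile`` command line, it can help to compose it from named pieces before pasting it into the runtime YAML file. The following is a minimal sketch, not part of ADS: the helper name ``build_profile_cmd`` and its parameters are hypothetical, and the flag meanings in the comments follow the ``nsys`` CLI documentation linked above.

```python
import shlex
from typing import Sequence


def build_profile_cmd(
    traces: Sequence[str],
    output: str = "/opt/ml/nsight_report",
    sample: str = "none",
) -> str:
    """Compose an ``nsys profile`` command string for PROFILE_CMD.

    Hypothetical helper for illustration; it only builds a string and
    does not run nsys.
    """
    if not traces:
        raise ValueError("at least one trace target is required")
    args = [
        "nsys", "profile",
        "-w", "true",            # --show-output: forward the process output
        "-t", ",".join(traces),  # --trace: APIs to trace
        "-s", sample,            # --sample: CPU sampling mode ("none" disables it)
        "-o", output,            # --output: report file path
        "-x", "true",            # --stop-on-exit: stop when the process exits
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_profile_cmd(["cuda", "nvtx", "osrt", "cudnn", "cublas"]))
```

With the trace list used in this guide, the helper reproduces the ``PROFILE_CMD`` value shown in the configuration; trimming the list (for example to ``cuda,nvtx``) is one way to reduce profiling overhead.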
25 changes: 25 additions & 0 deletions
docs/source/user_guide/model_training/distributed_training/_save_artifacts.rst
@@ -0,0 +1,25 @@

**Saving Artifacts to Object Storage Buckets**

To save the artifacts generated by the training process (model checkpoints, TensorBoard logs, etc.) to an object storage bucket,
you can use the 'sync' feature. The environment variable ``OCI__SYNC_DIR`` exposes the directory location that is automatically synchronized
to the configured object storage bucket location. Use this directory in your training script to save the artifacts.

To configure the destination object storage bucket location, use the following settings in the workload YAML file (``train.yaml``).

.. code-block:: yaml

    - name: SYNC_ARTIFACTS
      value: 1
    - name: WORKSPACE
      value: "<bucket_name>"
    - name: WORKSPACE_PREFIX
      value: "<bucket_prefix>"

**Note**: Change ``SYNC_ARTIFACTS`` to ``0`` to disable this feature.
Use the ``OCI__SYNC_DIR`` environment variable in your code to save the artifacts.
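
As an illustration of writing artifacts into the synchronized directory, here is a minimal, framework-agnostic sketch. The helper names ``artifact_dir`` and ``save_metrics`` are hypothetical, and the fallback to a local temporary directory when ``OCI__SYNC_DIR`` is unset is an assumption made for local testing, not documented behavior.

```python
import json
import os
import tempfile
from pathlib import Path


def artifact_dir() -> Path:
    """Return the directory that is synchronized to Object Storage.

    Assumption: when OCI__SYNC_DIR is not set (for example, outside the
    training container), fall back to a local temporary directory.
    """
    base = os.environ.get("OCI__SYNC_DIR") or os.path.join(
        tempfile.gettempdir(), "sync_dir"
    )
    path = Path(base)
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_metrics(epoch: int, metrics: dict) -> Path:
    """Write per-epoch metrics as JSON into the synchronized directory."""
    target = artifact_dir() / f"metrics_epoch_{epoch:03d}.json"
    target.write_text(json.dumps(metrics, indent=2))
    return target
```

The same pattern applies to model checkpoints or TensorBoard logs: build the output path from ``artifact_dir()`` and pass it to the framework's own save or logging call.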
45 changes: 45 additions & 0 deletions
docs/source/user_guide/model_training/distributed_training/_test_and_submit.rst
@@ -0,0 +1,45 @@
**Test Locally:**

Before submitting the workload to jobs, you can run it locally to test your code, dependencies, configurations, etc.
With the ``-b local`` flag, it uses a local backend. When you need to run this workload on OCI Data Science jobs, simply use the ``-b job`` flag instead.

.. code-block:: bash

    ads opctl run -f train.yaml -b local

If your code needs to use any OCI services (such as an object storage bucket), you need to mount the OCI keys from your local host machine onto the container. This is already done for you, assuming the typical location of the OCI keys, ``~/.oci``. If your keys are in a different location, you can change the mount in the ``config.ini`` file.

.. code-block:: bash

    oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci

**Submit the workload:**

.. code-block:: bash

    ads opctl run -f train.yaml -b job

**Note:** This automatically pushes the Docker image to the
OCI `container registry repo <https://docs.oracle.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm>`_.

Once the workload is running, the terminal shows output similar to the example below. Note that this YAML
can be used as input to ``ads opctl distributed-training show-config -f <info.yaml>``. To both
save and see the run info, use ``tee``. For example:

.. code-block:: bash

    ads opctl run -f train.yaml | tee info.yaml

.. code-block:: yaml
    :caption: info.yaml

    jobId: oci.xxxx.<job_ocid>
    mainJobRunId:
      mainJobRunIdName: oci.xxxx.<job_run_ocid>
    workDir: oci://my-bucket@my-namespace/daskcluster-testing/005
    otherJobRunIds:
      - workerJobRunIdName_1: oci.xxxx.<job_run_ocid>
      - workerJobRunIdName_2: oci.xxxx.<job_run_ocid>
      - workerJobRunIdName_3: oci.xxxx.<job_run_ocid>
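
If you want to pick the job run OCIDs out of the saved ``info.yaml`` in a script, a small reader along these lines may be enough. This is a dependency-free sketch that assumes the file has exactly the layout shown above (a ``mainJobRunId`` mapping and an ``otherJobRunIds`` list of single-key items); the function name ``job_run_ids`` is hypothetical, and a real YAML library would be more robust.

```python
from typing import Dict


def job_run_ids(info_text: str) -> Dict[str, str]:
    """Map each job run name to its OCID.

    Assumes the simple layout of info.yaml shown in this guide; this is
    not a general YAML parser.
    """
    runs: Dict[str, str] = {}
    section = None
    for raw in info_text.splitlines():
        if not raw.strip():
            continue
        # Indented lines and list items belong to the current top-level key.
        is_child = raw[0].isspace() or raw.lstrip().startswith("-")
        entry = raw.strip().lstrip("-").strip()
        # Split on the first colon only, so values such as oci://... stay whole.
        key, _, value = entry.partition(":")
        key, value = key.strip(), value.strip()
        if not is_child:
            section = key
            continue
        if section in ("mainJobRunId", "otherJobRunIds") and value:
            runs[key] = value
    return runs
```

For example, ``job_run_ids(open("info.yaml").read())`` returns one entry for the main job run and one per worker, which can then be passed to whatever monitoring or cleanup step you use.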