Merge pull request #15 from oracle/2.6.6_docs
Update docs for distributed training
mayoor authored Oct 12, 2022
2 parents 3226228 + b04cc80 commit 3fee4c6
Showing 28 changed files with 907 additions and 240 deletions.
7 changes: 3 additions & 4 deletions docs/source/index.rst
@@ -12,16 +12,16 @@ Oracle Accelerated Data Science SDK (ADS)
.. toctree::
:hidden:
:maxdepth: 5
-:caption: History:
+:caption: Getting Started:

release_notes
+user_guide/quick_start/quick_start

.. toctree::
:hidden:
:maxdepth: 5
:caption: Installation and Configuration:

-user_guide/quick_start/quick_start
user_guide/cli/quickstart
user_guide/cli/authentication
user_guide/cli/opctl/configure
@@ -33,19 +33,18 @@ Oracle Accelerated Data Science SDK (ADS)
:caption: Tasks:

user_guide/loading_data/connect
-user_guide/apachespark/spark
user_guide/data_labeling/index
user_guide/data_transformation/data_transformation
user_guide/data_visualization/visualization
user_guide/model_training/index
user_guide/model_registration/introduction
-user_guide/ADSString/index

.. toctree::
:hidden:
:maxdepth: 5
:caption: Integrations:

+user_guide/apachespark/spark
user_guide/big_data_service/index
user_guide/jobs/index
user_guide/logs/logs
11 changes: 0 additions & 11 deletions docs/source/user_guide/ADSString/index.rst
@@ -1,10 +1,5 @@
.. _ADSString:

-######################
-Manipulating Text Data
-######################
-
-
TextStrings
-----------

@@ -18,10 +13,4 @@ TextStrings
regex_match
still_a_string

-Text Extraction
----------------
-
-.. toctree::
-   :maxdepth: 1
-
-   ../text_extraction/text_dataset
6 changes: 3 additions & 3 deletions docs/source/user_guide/apachespark/spark.rst
@@ -1,6 +1,6 @@
-=========================
-Working with Apache Spark
-=========================
+============
+Apache Spark
+============


.. admonition:: DataFlow
2 changes: 1 addition & 1 deletion docs/source/user_guide/cli/opctl/localdev/vscode.rst
@@ -6,7 +6,7 @@ Setting up Visual Studio Code

**Prerequisites**

-1. ADS CLI is :doc:`configured<configure>`
+1. ADS CLI is :doc:`configured<../configure>`
2. Install Visual Studio Code
3. :doc:`Build Development Container Image<jobs_container_image>`
4. Install Visual Studio Code extension for `Remote Development <https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack>`_
6 changes: 3 additions & 3 deletions docs/source/user_guide/data_labeling/index.rst
@@ -1,8 +1,8 @@
.. _data-labeling-8:

-#############
-Labeling Data
-#############
+##########
+Label Data
+##########

The Oracle Cloud Infrastructure (OCI) Data Labeling service allows you to create and browse datasets, view data records (text, images) and apply labels for the purposes of building AI/machine learning (ML) models. The service also provides interactive user interfaces that enable the labeling process. After you label records, you can export the dataset as line-delimited JSON Lines (JSONL) for use in model development.

12 changes: 10 additions & 2 deletions docs/source/user_guide/data_transformation/data_transformation.rst
@@ -1,7 +1,7 @@
.. _data-transformations-8:

-Data Transformations
-####################
+Transform Data
+##############

When datasets are loaded with DatasetFactory, they can be transformed and manipulated easily with the built-in functions. Under the hood, an ``ADSDataset`` object is a Pandas dataframe. Any operation that can be performed on a `Pandas dataframe <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_ can also be applied to an ADS Dataset.

@@ -520,3 +520,11 @@ You can split the dataset right after the ``DatasetFactory.open()`` statement:
ds = DatasetFactory.open("path/data.csv").set_target('target')
train, test = ds.train_test_split(test_size=0.25)

+Text Data
+*********
+
+.. toctree::
+   :maxdepth: 3
+
+   ../ADSString/index
+   ../text_extraction/text_dataset
6 changes: 3 additions & 3 deletions docs/source/user_guide/data_visualization/visualization.rst
@@ -1,8 +1,8 @@
.. _data-visualization-8:

-##################
-Data Visualization
-##################
+##############
+Visualize Data
+##############

Data visualization is an important aspect of data exploration, analysis, and communication. Generally, visualizing the data is one of the first steps in any analysis. It allows analysts to efficiently gain an understanding of the data and guides exploratory data analysis (EDA) and the modeling process.

6 changes: 3 additions & 3 deletions docs/source/user_guide/loading_data/connect.rst
@@ -1,6 +1,6 @@
-############
-Loading Data
-############
+#########
+Load Data
+#########


Connecting to Data Sources
6 changes: 3 additions & 3 deletions docs/source/user_guide/model_registration/introduction.rst
@@ -1,8 +1,8 @@
.. _model-catalog-8:

-#################################
-Model Registration and Deployment
-#################################
+##########################
+Register and Deploy Models
+##########################


You can register your model with the OCI Data Science service through ADS. Alternatively, the Oracle Cloud Infrastructure (OCI) Console can be used by going to the Data Science projects page, selecting a project, then clicking **Models**. The models page shows the model artifacts that are in the model catalog for a given project.
@@ -0,0 +1,28 @@

**Profiling using Nvidia Nsight Systems**


`Nvidia Nsight Systems <https://developer.nvidia.com/nsight-systems>`__ is a system-wide profiling tool from Nvidia that can be used to profile deep learning workloads.

Nsight requires no changes to your training code; it works at the process level. You can enable this experimental feature in your training setup via the following configuration in the runtime yaml file.


.. code-block:: yaml

   - name: PROFILE
     value: 1
   - name: PROFILE_CMD
     value: "nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o /opt/ml/nsight_report -x true"
Refer to the `Nsight Systems user guide <https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cli-profile-command-switch-options>`__ for ``nsys profile`` command options. You can modify the command within ``PROFILE_CMD``, but remember this is all experimental. The profiling reports are generated per node, and you need to download them to your computer manually or via the OCI CLI.

.. code-block:: bash

   oci os object bulk-download \
     -ns <namespace> \
     -bn <bucket_name> \
     --download-dir /path/on/your/computer \
     --prefix path/on/bucket/<job_id>
To view the reports, install the Nsight Systems app from `here <https://developer.nvidia.com/nsight-systems>`_, then open the downloaded reports in the app.
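
Nsight needs no changes to the training code, but because ``nvtx`` tracing is enabled in the ``PROFILE_CMD`` above, you can optionally mark regions of your script so they appear as named ranges in the report. A minimal sketch, assuming a PyTorch training loop; the range name ``train_step`` is arbitrary:

.. code-block:: python

   import torch

   def train_one_epoch(model, loader, optimizer, loss_fn):
       for batch, target in loader:
           # Mark this iteration as a named NVTX range for the Nsight timeline.
           torch.cuda.nvtx.range_push("train_step")
           optimizer.zero_grad()
           loss = loss_fn(model(batch), target)
           loss.backward()
           optimizer.step()
           torch.cuda.nvtx.range_pop()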
@@ -0,0 +1,25 @@

**Saving Artifacts to Object Storage Buckets**


If you want to save the artifacts generated by the training process (model checkpoints, TensorBoard logs, etc.) to an object storage bucket, you can use the 'sync' feature. The environment variable ``OCI__SYNC_DIR`` exposes the directory location that will be automatically synchronized to the configured object storage bucket location. Use this directory in your training script to save the artifacts.

To configure the destination object storage bucket location, use the following settings in the workload yaml file (train.yaml).

.. code-block:: yaml

   - name: SYNC_ARTIFACTS
     value: 1
   - name: WORKSPACE
     value: "<bucket_name>"
   - name: WORKSPACE_PREFIX
     value: "<bucket_prefix>"
**Note**: Change ``SYNC_ARTIFACTS`` to ``0`` to disable this feature. Use the ``OCI__SYNC_DIR`` environment variable in your code to save the artifacts, for example:
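
A minimal sketch of the idea, following the same pattern as the Dask example later in this commit; the file name ``results.txt`` and its contents are illustrative:

.. code-block:: python

   import os

   # OCI__SYNC_DIR is the directory that is automatically synchronized
   # to the configured object storage bucket location.
   sync_dir = os.environ.get("OCI__SYNC_DIR")
   with open(os.path.join(sync_dir, "results.txt"), "w") as rf:
       rf.write("artifacts saved from the training script")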




@@ -0,0 +1,45 @@
**Test Locally:**

Before submitting the workload to jobs, you can run it locally to test your code, dependencies, configurations, etc.
With the ``-b local`` flag, it uses a local backend. When you later need to run this workload on ODSC jobs, simply use the ``-b job`` flag instead.

.. code-block:: bash

   ads opctl run -f train.yaml -b local
If your code requires the use of any OCI services (like an object storage bucket), you need to mount OCI keys from your local host machine onto the container. This is already done for you, assuming the typical location of OCI keys, ``~/.oci``. If your keys are at a different location, you can modify the mount in the config.ini file.

.. code-block:: bash

   oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
**Submit the workload:**



.. code-block:: bash

   ads opctl run -f train.yaml -b job
**Note**: This will automatically push the docker image to the
OCI `container registry repo <https://docs.oracle.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm>`_.

Once running, you will see output on the terminal similar to the example below. Note that this yaml
can be used as input to ``ads opctl distributed-training show-config -f <info.yaml>``. To both
save and see the run info, use ``tee``, for example:

.. code-block:: bash

   ads opctl run -f train.yaml | tee info.yaml
.. code-block:: yaml
   :caption: info.yaml

   jobId: oci.xxxx.<job_ocid>
   mainJobRunId:
     mainJobRunIdName: oci.xxxx.<job_run_ocid>
   workDir: oci://my-bucket@my-namespace/daskcluster-testing/005
   otherJobRunIds:
     - workerJobRunIdName_1: oci.xxxx.<job_run_ocid>
     - workerJobRunIdName_2: oci.xxxx.<job_run_ocid>
     - workerJobRunIdName_3: oci.xxxx.<job_run_ocid>
@@ -17,6 +17,7 @@ You need to use a private subnet for distributed training and configure the secu

* `PyTorch`: By default, ``PyTorch`` uses **29400**.
* `Horovod`: allow TCP traffic on all ports within the subnet.
+* `Tensorflow`: Worker Port: Allow traffic from all source ports to one worker port (default: 12345). If changed, provide this in the train.yaml config.

See also: `Security Lists <https://docs.oracle.com/en-us/iaas/Content/Network/Concepts/securitylists.htm>`_

@@ -73,46 +73,37 @@ example. Now running the command below

Before you can build the image, you must set the following environment variables:

Specify the image name and tag:

.. code-block:: bash

   export IMAGE_NAME=<region.ocir.io/my-tenancy/image-name>
   export TAG=latest
-To build the image:
-
-`without Proxy server:`
-
-.. code-block:: bash
-
-   docker build -t $IMAGE_NAME:$TAG \
-       -f oci_dist_training_artifacts/dask/v1/Dockerfile .
-
-`with Proxy server:`
-
-.. code-block:: bash
-
-   docker build --build-arg no_proxy=$no_proxy \
-       --build-arg http_proxy=$http_proxy \
-       --build-arg https_proxy=$http_proxy \
-       -t $IMAGE_NAME:$TAG \
-       -f oci_dist_training_artifacts/dask/v1/Dockerfile .
+Build the container image.
+
+.. code-block:: bash
+
+   ads opctl distributed-training build-image \
+       -t $TAG \
+       -reg $IMAGE_NAME \
+       -df oci_dist_training_artifacts/dask/v1/Dockerfile
-The code is assumed to be in the current working directory. To override the source code directory:
+The code is assumed to be in the current working directory. To override the source code directory, use the ``-s`` flag and specify the code dir. This folder should be within the current working directory.

.. code-block:: bash

-   docker build --build-arg CODE_DIR=`pwd` \
-       -t $IMAGE_NAME:$TAG \
-       -f oci_dist_training_artifacts/dask/v1/Dockerfile
+   ads opctl distributed-training build-image \
+       -t $TAG \
+       -reg $IMAGE_NAME \
+       -df oci_dist_training_artifacts/dask/v1/Dockerfile \
+       -s <code_dir>

+If you are behind a proxy, ads opctl will automatically use your proxy settings (defined via ``no_proxy``, ``http_proxy`` and ``https_proxy``).

-Finally, push your image using:
-
-.. code-block:: bash
-
-   docker push $IMAGE_NAME:$TAG


**Define your workload yaml:**
@@ -229,31 +220,7 @@ This will give an option similar to this -
OCI__MODE:WORKER
-----------------------------Ending dryrun mode----------------------------------
-Submit the workload -
-
-.. code-block:: bash
-
-   ads opctl run -f train.yaml
-
-Once running you will see on the terminal an output similar to the contents of the ``info.yaml`` below. To both
-save and see the run info use ``tee`` - for example:
-
-.. code-block:: bash
-
-   ads opctl run -f train.yaml | tee info.yaml
-
-.. code-block:: yaml
-   :caption: info.yaml
-
-   jobId: oci.xxxx.<job_ocid>
-   mainJobRunId: oci.xxxx.<job_run_ocid>
-   workDir: oci://my-bucket@my-namespace/daskcluster-testing/005
-   workerJobRunIds:
-       - oci.xxxx.<job_run_ocid>
-       - oci.xxxx.<job_run_ocid>
-       - oci.xxxx.<job_run_ocid>
-
-It is recommended that you save the output to a file.
+.. include:: ../_test_and_submit.rst

**Monitoring the workload logs**

@@ -286,6 +253,12 @@ The alternate approach is to use either a Bastion host on the same subnet as the

For more information about the dashboard, check out https://docs.dask.org/en/stable/diagnostics-distributed.html

+.. include:: ../_save_artifacts.rst
+
+.. code-block:: python
+
+   with open(os.path.join(os.environ.get("OCI__SYNC_DIR"), "results.txt"), "w") as rf:
+       rf.write(f"Best Params are: {grid.best_params_}, Score is {grid.best_score_}")
**Terminating In-Progress Cluster**

To terminate a running cluster, you could run -
