From aa789c7dd0ff43dd06f271309726071b65bc016d Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Sun, 9 Aug 2020 23:47:35 -0700 Subject: [PATCH 01/39] update TF 2.2 smdebug features --- docs/tensorflow.md | 128 ++++++++++++++++++++++++++++++++++++--------- 1 file changed, 103 insertions(+), 25 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 1f8d7e5d9..ee4cc1ac9 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -2,7 +2,9 @@ ## Contents - [Support](#support) -- [How to Use](#how-to-use) +- [How to Use Debugger with TensorFlow](#how-to-use) + - [Debugger with AWS Deep Learning Containers](#debugger-dlc) + - [Debugger with other AWS training containers and custom containers](#debugger-script-change) - [Code Structure Samples](#examples) - [References](#references) @@ -10,48 +12,111 @@ ## Support -**Zero script change experience** — No modification is needed to your training script to enable the Debugger features while using the [official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). - -**Script mode experience** — The smdebug library supports training jobs with the TensorFlow framework and script mode through its API operations. This option requires minimal changes to your training script, and the smdebug library provides you hook features to help implement Debugger and analyze tensors. +### Supported TensorFlow Versions -### Versions +The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can retrieve tensors from your TensorFlow models with either eager or non-eager mode, with Keras API or the pure TensorFlow framework. For a full list of TensorFlow framework versions to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). +**Zero script change experience** — No modification is needed to your training script to enable the Debugger features while using the [official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). + +**Script mode experience** — The smdebug library supports training jobs with the TensorFlow framework and script mode through its API operations. This option requires minimal changes to your training script to register Debugger hooks, and the smdebug library provides you hook features to help implement Debugger and analyze saved tensors. + ### Distributed training supported by Debugger - Horovod and Mirrored Strategy multi-GPU distributed trainings are supported. - Parameter server based distributed training is currently not supported. --- -## How to Use -### Debugger with AWS Deep Learning Containers and zero script change +## How to Use Debugger + +### Debugger with AWS Deep Learning Containers + +The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow). 
-The Debugger features are all integrated into the AWS Deep Learning Containers, and you can run your training script with zero script change. To find a high-level SageMaker TensorFlow estimator with Debugger example code, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow).
+The following code sample is how to set a SageMaker TensorFlow estimator with Debugger.
+
+```python
+from sagemaker.tensorflow import TensorFlow
+from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs
+
+tf_estimator = TensorFlow(
+    entry_point = "tf-train.py",
+    role = "SageMakerRole",
+    instance_count = 1,
+    instance_type = "ml.p2.xlarge",
+    framework_version = "2.2",
+    py_version = "py37",
+
+    # Debugger-specific Parameters
+    rules = [
+        Rule.sagemaker(rule_configs.vanishing_gradient()),
+        Rule.sagemaker(rule_configs.loss_not_decreasing()),
+        ...
+    ],
+    debugger_hook_config = DebuggerHookConfig(
+        collection_configs = [
+            CollectionConfig(name="inputs"),
+            CollectionConfig(name="outputs"),
+            CollectionConfig(name="layers"),
+            CollectionConfig(name="gradients"),
+            ...
+        ]
+    )
+)
+tf_estimator.fit("s3://bucket/path/to/training/data")
+```
+
+Available tensor collections that you can retrieve from TensorFlow training jobs for zero script change are as follows:
+
+| Name | Description|
+| --- | --- |
+| all | Matches all tensors. |
+| default | Includes "metrics", "losses", and "sm_metrics". |
+| metrics | For KerasHook, saves the metrics computed by Keras for the model. |
+| losses | Saves all losses of the model. |
+| sm_metrics | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metric. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. |
+| inputs | Matches all input to the model. |
+| outputs | Matches all outputs of the model, such as predictions (logits) and labels. |
+| layers | Matches all inputs and outputs of intermediate layers. |
+| gradients | Matches all gradients of the model. In TensorFlow when not using zero script change environments, must use hook.wrap_optimizer() or hook.wrap_tape(). |
+| weights | Matches all weights of the model. |
+| biases | Matches all biases of the model. |
+| optimizer_variables | Matches all optimizer variables, currently only supported for Keras. |
+
+>**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1.

-### Debugger with AWS training containers and script mode
+### Debugger with other AWS training containers and custom containers
+
+If you want to run your own training script or custom container, there are two available options. One option is to use the SageMaker TensorFlow with script change on other AWS training containers (the SageMaker TensorFlow estimator is in script mode by default from TensorFlow 2.1, so you do not need to specify `script_mode` parameter). Another option is to use your custom container with your training script and push the container to Amazon ECR. In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow models and API operations in your script, you need to pick the right hook class as introduced in the following steps.
+
+1. [Create a hook](#create-a-hook)
+   * [KerasHook](#kerashook)
+   * [SessionHook](#sessionhook)
+   * [EstimatorHook](#estimatorhook)
+2. 
[Wrap the optimizer and the gradient tape with the hook to retrieve gradient tensors](#wrap-opt-with-hook)
+3. [Register the hook to model.fit()](#register-a-hook)

-In case you want to run your own training script and debug using the SageMaker TensorFlow framework with script mode and Debugger, the smdebug client library provides the hook constructor that you can add to the training script and retrieve tensors.

 #### 1. Create a hook

- To create the hook constructor, add the following code.
+ To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow hook object.

 ```python
 import smdebug.tensorflow as smd
 hook = smd.{hook_class}.create_from_json_file()
 ```

-Depending on the TensorFlow versions for your model, you need to choose a hook class. There are three hook constructor classes that you can pick and replace `{hook_class}`: `KerasHook`, `SessionHook`, and `EstimatorHook`.
+Depending on the TensorFlow version and the Keras API used in your training script, you need to choose the right hook class. There are three hook constructors for TensorFlow that you can choose: `KerasHook`, `SessionHook`, and `EstimatorHook`.

 #### KerasHook

-Use if you use the Keras `model.fit()` API. This is available for all frameworks and versions of Keras and TensorFlow. `KerasHook` covers the eager execution modes and the gradient tape feature that are introduced from the TensorFlow framework version 2.0. For example, you can set the Keras hook constructor by adding the following code into your training script.
+Use `KerasHook` if you use the Keras model zoo and the Keras `model.fit()` API. This is available for Keras with the TensorFlow backend. `KerasHook` covers the eager execution modes and the gradient tape features introduced in TensorFlow 2.0. You can set the smdebug Keras hook constructor by adding the following code into your training script. Place this code line before `model.compile()`.
+
 ```python
 hook = smd.KerasHook.create_from_json_file()
 ```
+
 To learn how to fully implement the hook to your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts).

-> **Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects the most of tensors regardless the eager execution modes, through its high-level API.
+>**Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects most of the tensors through its high-level API, regardless of the eager execution mode.

 #### SessionHook

@@ -63,7 +128,7 @@ hook = smd.SessionHook.create_from_json_file()
 ```

 To learn how to fully implement the hook into your training script, see the [TensorFlow monitored training session with the smdebug hook example script](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py).

-> **Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TF 2.0 and above. You can use `SessionHook` for `tf.function()` in TF 2.0 and above.
+>**Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TF 2.0 and above. You can use `SessionHook` for `tf.function()` in TF 2.0 and above.
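As a minimal sketch of how the registered `SessionHook` is consumed, the following example passes the hook through the `hooks` list of a monitored session. It mirrors the monitored-session example in the [Examples](#examples) section below and is written in TF 1.x style; the tiny quadratic loss is only a placeholder, not part of the smdebug API:

```python
import tensorflow as tf
import smdebug.tensorflow as smd

hook = smd.SessionHook.create_from_json_file()

# Placeholder graph: a trivial quadratic loss over a single variable
w = tf.Variable([1.0], name="w")
loss = tf.reduce_mean(tf.square(w), name="loss")

optimizer = tf.train.AdamOptimizer(0.1)
optimizer = hook.wrap_optimizer(optimizer)  # lets the hook capture gradients
train_op = optimizer.minimize(loss)

# Register the smdebug hook through the hooks list of the monitored session
with tf.train.MonitoredSession(hooks=[hook]) as sess:
    for _ in range(10):
        sess.run([loss, train_op])
```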
#### EstimatorHook

@@ -75,34 +140,47 @@ hook = smd.EstimatorHook.create_from_json_file()
 ```

 To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the TensorFlow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py).

-#### 2. Register the hook to your model
-
-To collect the tensors from the hooks that you implemented, add `callbacks=[hook]` to the Keras `model.fit()` API and `hooks=[hook]` for the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs.
-
-#### 3. Wrap the optimizer and the gradient tape
+#### 2. Wrap the optimizer and the gradient tape to retrieve gradient tensors

 The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework.

-If you want to save `gradients` from the optimizer of your model, wrap it with the hook as follows:
+If you want to save `gradients`, for example, from the Keras Adam optimizer, wrap it with the hook as follows:

 ```python
+optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr)
 optimizer = hook.wrap_optimizer(optimizer)
 ```

-If you want to save `gradients` from the TensorFlow gradient tape feature, wrap it as follows:
+If you want to save `gradients` from the TensorFlow gradient tape feature, wrap `tf.GradientTape` with the `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example:

 ```python
 with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
+    logits = model(data, training=True)
+    loss_value = cce(labels, logits)
+hook.save_tensor("y_labels", labels, "outputs")
+grads = tape.gradient(loss_value, model.variables)
 ```

-These wrappers capture the gradient tensors, not affecting your optimization logic at all.
+These smdebug hook wrapper functions capture the gradient tensors, not affecting your optimization logic at all.

 For examples of code structure to apply the hook wrappers, see the [Examples](#examples) section.

+#### 3. Register the hook to model.fit()
+
+To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example:
+
+```python
+model.fit(X_train, Y_train,
+          batch_size=batch_size,
+          epochs=epoch,
+          validation_data=(X_valid, Y_valid),
+          shuffle=True,
+          # smdebug modification: Pass the hook as a Keras callback
+          callbacks=[hook])
+```
+
 #### 4. Take actions using the hook APIs

 For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api).

->**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1.
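As one concrete illustration of those hook APIs, the common `save_scalar()` method (its `name`, `value`, and `sm_metric` arguments are listed in the Common hook API table) can record a custom scalar from anywhere in your training loop. A minimal sketch, where the accuracy value is a hypothetical placeholder from your own loop:

```python
# Hypothetical snippet: record a custom scalar through the smdebug hook.
# sm_metric=True additionally surfaces the value as a SageMaker Metric.
accuracy = 0.92  # placeholder for a value computed in your training loop
hook.save_scalar("train_accuracy", accuracy, sm_metric=True)
```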
- --- ## Examples From df74588878da974b5a4ebbe5b32cac2d1a74f0e7 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 00:57:40 -0700 Subject: [PATCH 02/39] add details --- docs/tensorflow.md | 69 ++++++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 30 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index ee4cc1ac9..6b768effd 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -1,25 +1,21 @@ # Tensorflow ## Contents -- [Support](#support) +- [What SageMaker Debugger Supports](#support) - [How to Use Debugger with TensorFlow](#how-to-use) - [Debugger with AWS Deep Learning Containers](#debugger-dlc) - [Debugger with other AWS training containers and custom containers](#debugger-script-change) -- [Code Structure Samples](#examples) +- [Code Samples](#examples) - [References](#references) --- -## Support +## What SageMaker Debugger Supports -### Supported TensorFlow Versions +The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release (v0.9.1). Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. +You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras APIs or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. -The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can retrieve tensors from your TensorFlow models with either eager or non-eager mode, with Keras API or the pure TensorFlow framework. -For a full list of TensorFlow framework versions to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). - -**Zero script change experience** — No modification is needed to your training script to enable the Debugger features while using the [official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). - -**Script mode experience** — The smdebug library supports training jobs with the TensorFlow framework and script mode through its API operations. This option requires minimal changes to your training script to register Debugger hooks, and the smdebug library provides you hook features to help implement Debugger and analyze saved tensors. +Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). ### Distributed training supported by Debugger - Horovod and Mirrored Strategy multi-GPU distributed trainings are supported. 
@@ -29,7 +25,7 @@ For a full list of TensorFlow framework versions to use Debugger, see [AWS Deep ## How to Use Debugger -### Debugger with AWS Deep Learning Containers +### Debugger on AWS Deep Learning Containers with TensorFlow The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow). @@ -63,8 +59,16 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` +>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command line. +```bash +pip install -U sagemaker +pip install -U smdebug +``` +If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. + +#### Available Tensor Collections for TensorFlow -Available tensor collections that you can retrieve from TensorFlow training jobs for zero script change are as follows: +The following table lists the pre-configured tensor collections for TensorFlow models. | Name | Description| | --- | --- | @@ -83,19 +87,22 @@ Available tensor collections that you can retrieve from TensorFlow training jobs >**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1. -### Debugger with other AWS training containers and custom containers +### Debugger on SageMaker TensorFlow training containers or custom containers -If you want to run your own training script or custom container, there are two available options. One option is to use the SageMaker TensorFlow with script change on other AWS training containers (the SageMaker TensorFlow estimator is in script mode by default from TensorFlow 2.1, so you do not need to specify `script_mode` parameter). Another option is to use your custom container with your training script and push the container to Amazon ECR. In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow models and API operations in your script, you need to pick the right hook class as introduced in the following steps. +If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives. +- Alternative 1: Use the SageMaker TensorFlow training containers with training script modification +- Alternative 2: Use your custom container with modified training script and push the container to Amazon ECR. +In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and save tensors. 1. [Create a hook](#create-a-hook) - * [KerasHook](#kerashook) - * [SessionHook](#sessionhook) - * [EstimatorHook](#estimatorhook) + - [KerasHook](#kerashook) + - [SessionHook](#sessionhook) + - [EstimatorHook](#estimatorhook) 2. 
[Wrap the optimizer and the gradient tape with the hook to retrieve gradient tensors](#wrap-opt-with-hook) 3. [Register the hook to model.fit()](#register-a-hook) -#### 1. Create a hook +#### 1. Create a hook To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow hook object. @@ -150,13 +157,15 @@ optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr) optimizer = hook.wrap_optimizer(optimizer) ``` -If you want to save `gradients` from the TensorFlow gradient tape feature, wrap `tf.GradientTape` with the `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example: +If you want to save gradients and outputs tensors from the TensorFlow `GradientTape` feature, wrap `tf.GradientTape` with the smdebug `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example: ```python with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape: logits = model(data, training=True) loss_value = cce(labels, logits) hook.save_tensor("y_labels", labels, "outputs") +hook.save_tensor("predictions", logits, "outputs") grads = tape.gradient(loss_value, model.variables) +hook.save_tensor("grads", grads, "gradients") ``` These smdebug hook wrapper functions capture the gradient tensors, not affecting your optimization logic at all. @@ -169,12 +178,12 @@ To collect the tensors from the hooks that you registered, add `callbacks=[hook] ```python model.fit(X_train, Y_train, - batch_size=batch_size, - epochs=epoch, - validation_data=(X_valid, Y_valid), - shuffle=True, - # smdebug modification: Pass the hook as a Keras callback - callbacks=[hook]) + batch_size=batch_size, + epochs=epoch, + validation_data=(X_valid, Y_valid), + shuffle=True, + # smdebug modification: Pass the hook as a Keras callback + callbacks=[hook]) ``` #### 4. Take actions using the hook APIs @@ -191,7 +200,7 @@ The following examples show the three different hook constructions of TensorFlow ```python import smdebug.tensorflow as smd -hook = smd.KerasHook(out_dir=args.out_dir) +hook = smd.KerasHook.create_from_json_file() model = tf.keras.models.Sequential([ ... ]) model.compile( @@ -207,7 +216,7 @@ model.evaluate(x_test, y_test, callbacks=[hook]) ```python import smdebug.tensorflow as smd -hook = smd.KerasHook(out_dir=args.out_dir) +hook = smd.KerasHook.create_from_json_file() model = tf.keras.models.Sequential([ ... ]) for epoch in range(n_epochs): @@ -221,14 +230,14 @@ model = tf.keras.models.Sequential([ ... ]) opt.apply_gradients(zip(grads, model.variables)) acc = train_acc_metric(dataset_labels, logits) # manually save metric values - hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc) + hook.save_tensor(tensor_name="accuracy", tensor_value=acc, collections_to_write="default") ``` ### Monitored Session (tf.train.MonitoredSession) ```python import smdebug.tensorflow as smd -hook = smd.SessionHook(out_dir=args.out_dir) +hook = smd.SessionHook.create_from_json_file() loss = tf.reduce_mean(tf.matmul(...), name="loss") optimizer = tf.train.AdamOptimizer(args.lr) @@ -246,7 +255,7 @@ sess.run([loss, ...]) ```python import smdebug.tensorflow as smd -hook = smd.EstimatorHook(out_dir=args.out_dir) +hook = smd.EstimatorHook.create_from_json_file() train_input_fn, eval_input_fn = ... 
estimator = tf.estimator.Estimator(...) From 2fa0fdbfa25b0d2fee739f42ca40823604d54d11 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 01:55:24 -0700 Subject: [PATCH 03/39] Update code samples/notes for new pySDK and smdebug/add and fix links --- docs/tensorflow.md | 35 +++++++++++++++++++++-------------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 6b768effd..dac815d2b 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -3,8 +3,8 @@ ## Contents - [What SageMaker Debugger Supports](#support) - [How to Use Debugger with TensorFlow](#how-to-use) - - [Debugger with AWS Deep Learning Containers](#debugger-dlc) - - [Debugger with other AWS training containers and custom containers](#debugger-script-change) + - [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) + - [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) - [Code Samples](#examples) - [References](#references) @@ -12,10 +12,10 @@ ## What SageMaker Debugger Supports -The SageMaker Debugger python SDK and `smdebug` library now fully support TensorFlow 2.2 with the latest version release (v0.9.1). Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. -You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras APIs or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. +SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. +You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. -Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [AWS Deep Learning Containers and SageMaker training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). +Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). 
For a complete guide of using custom containers, go to [Use Debugger in Custom Training Containers ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). ### Distributed training supported by Debugger - Horovod and Mirrored Strategy multi-GPU distributed trainings are supported. @@ -23,13 +23,13 @@ Debugger and its client library `smdebug` support debugging your training job on --- -## How to Use Debugger +## How to Use Debugger ### Debugger on AWS Deep Learning Containers with TensorFlow -The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Debugger in TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow). +The Debugger built-in rules and hook features are fully integrated into the AWS Deep Learning Containers, and you can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, go to the [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) developer guide. -The following code sample is how to set a SageMaker TensorFlow estimator with Debugger. +The following code sample is the base structure of a SageMaker TensorFlow estimator with Debugger. ```python from sagemaker.tensorflow import TensorFlow @@ -59,16 +59,16 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command line. +>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. ```bash pip install -U sagemaker pip install -U smdebug ``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. +If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). #### Available Tensor Collections for TensorFlow -The following table lists the pre-configured tensor collections for TensorFlow models. +The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample. 
| Name | Description| | --- | --- | @@ -85,9 +85,13 @@ The following table lists the pre-configured tensor collections for TensorFlow m | biases | Matches all biases of the model. | | optimizer_variables | Matches all optimizer variables, currently only supported for Keras. | +For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). + +For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). + >**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1. -### Debugger on SageMaker TensorFlow training containers or custom containers +### Debugger on SageMaker Training Containers and Custom Containers If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives. - Alternative 1: Use the SageMaker TensorFlow training containers with training script modification @@ -192,9 +196,9 @@ For a full list of actions that the hook APIs offer to construct hooks and save --- -## Examples +## Code Samples -The following examples show the three different hook constructions of TensorFlow. The following examples show what minimal changes have to be made to enable SageMaker Debugger while using the AWS containers with script mode. To learn how to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). +The following examples show the base structures of hook registration in various TensorFlow training scripts. If you want to take the benefit of the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). ### Keras API (tf.keras) ```python @@ -208,7 +212,10 @@ model.compile( loss='sparse_categorical_crossentropy', ) # Add the hook as a callback +hook.set_mode(mode=smd.modes.TRAIN) model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook]) + +hook.set_mode(mode=smd.modes.EVAL) model.evaluate(x_test, y_test, callbacks=[hook]) ``` From 6857d6caecfc0c079896e4a64f11db275798559e Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:37:50 -0700 Subject: [PATCH 04/39] add 'New features' note --- docs/tensorflow.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index dac815d2b..c3aa1a6c0 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -12,11 +12,18 @@ ## What SageMaker Debugger Supports -SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors from any kind of TensorFlow models, from the Keras model zoo to your custom model. +SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. 
Using Debugger, you can access tensors of any kind of TensorFlow models, from the Keras model zoo to your custom model, and save them using Debugger built-in or custom tensor collections. You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers. Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide of using custom containers, go to [Use Debugger in Custom Training Containers ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). +### New features +- The latest TensorFlow version fully covered by Debugger is `2.2.0`. +- Debug training jobs with the TensorFlow framework or Keras TensorFlow. +- Debug training jobs with the TensorFlow framework in eager or non-eager model. +- New built-in tensor collections: model `inputs`, `outputs`, `layers`, `gradients`. +- New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar`. + ### Distributed training supported by Debugger - Horovod and Mirrored Strategy multi-GPU distributed trainings are supported. - Parameter server based distributed training is currently not supported. @@ -40,7 +47,7 @@ tf_estimator = TensorFlow( role = "SageMakerRole", instance_count = 1, instance_type = "ml.p2.xlarge", - framework_version = "2.2", + framework_version = "2.2.0", py_version = "py37" # Debugger-specific Parameters From 8be632a8da6a498a0539bb5d69f625b775eb76e2 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:49:24 -0700 Subject: [PATCH 05/39] minor fix --- docs/api.md | 2 +- docs/tensorflow.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/api.md b/docs/api.md index 778cf3e46..6edb47540 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,6 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
+| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | Manually saves tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|

 ### TensorFlow specific Hook API

@@ -178,7 +179,6 @@ The following hook APIs are specific to training scripts using the TF 2.x Gradie

 | Method | Arguments | Returns | Behavior |
 | --- | --- | --- | --- |
 | `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed.
-| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors while using TF 2.x GradientTape. Note: `record_tensor_value()` is deprecated.|

 ### MXNet specific Hook API

diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index c3aa1a6c0..4cf860ac7 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -20,8 +20,8 @@ Debugger and its client library `smdebug` support debugging your training job on

 ### New features
 - The latest TensorFlow version fully covered by Debugger is `2.2.0`.
 - Debug training jobs with the TensorFlow framework or Keras TensorFlow.
-- Debug training jobs with the TensorFlow framework in eager or non-eager model.
-- New built-in tensor collections: model `inputs`, `outputs`, `layers`, `gradients`.
+- Debug training jobs with the TensorFlow eager or non-eager mode.
+- New built-in tensor collections: `inputs`, `outputs`, `layers`, `gradients`.
 - New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar`.

@@ -67,11 +67,11 @@ tf_estimator = TensorFlow(
 tf_estimator.fit("s3://bucket/path/to/training/data")
 ```
 >**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines.
->```bash
pip install -U sagemaker
pip install -U smdebug
```
->If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).
#### Available Tensor Collections for TensorFlow From d787f4b923a4d8bd7199a4d9b3cbd14471956236 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:50:31 -0700 Subject: [PATCH 06/39] minor fix --- docs/tensorflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 4cf860ac7..314e0f550 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -67,11 +67,11 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` >**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. ->```bash +``` pip install -U sagemaker pip install -U smdebug ``` ->If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). +If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). #### Available Tensor Collections for TensorFlow From 6c00d2a0a5e9b5087d33656c031cec84cf1409ba Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 09:52:27 -0700 Subject: [PATCH 07/39] fix formatting --- docs/tensorflow.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 314e0f550..1dcdd502b 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -66,12 +66,13 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. -``` -pip install -U sagemaker -pip install -U smdebug -``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). + +>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. + ``` + pip install -U sagemaker + pip install -U smdebug + ``` +>If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 
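For example, in a notebook cell the same upgrade commands would look like the following (a sketch assuming a standard Jupyter kernel):

```
!pip install -U sagemaker
!pip install -U smdebug
```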
#### Available Tensor Collections for TensorFlow From 4b6e0deab62d81004c4a7e9993798cbeb33bf424 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 12:14:21 -0700 Subject: [PATCH 08/39] minor fix --- docs/tensorflow.md | 31 ++++++++++++++++++------------- 1 file changed, 18 insertions(+), 13 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 1dcdd502b..fde9d4aea 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -67,16 +67,16 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. - ``` - pip install -U sagemaker - pip install -U smdebug - ``` ->If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). +**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. +``` +pip install -U sagemaker +pip install -U smdebug +``` +If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). #### Available Tensor Collections for TensorFlow -The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample. +The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample. SageMaker Debugger will save these tensors to the default out_dir of the hook. | Name | Description| | --- | --- | @@ -84,11 +84,11 @@ The following table lists the pre-configured tensor collections for TensorFlow m | default | Includes "metrics", "losses", and "sm_metrics". | | metrics | For KerasHook, saves the metrics computed by Keras for the model. | | losses | Saves all losses of the model. | -| sm_metrics | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metric. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. | -| inputs | Matches all input to the model. | -| outputs | Matches all outputs of the model, such as predictions (logits) and labels. | +| sm_metrics | Saves scalars that you want to include in the SageMaker metrics collection. | +| inputs | Matches all model inputs to the model. | +| outputs | Matches all model outputs of the model, such as predictions (logits) and labels. | | layers | Matches all inputs and outputs of intermediate layers. | -| gradients | Matches all gradients of the model. 
In TensorFlow when not using zero script change environments, must use hook.wrap_optimizer() or hook.wrap_tape(). | +| gradients | Matches all gradients of the model. | | weights | Matches all weights of the model. | | biases | Matches all biases of the model. | | optimizer_variables | Matches all optimizer variables, currently only supported for Keras. | @@ -97,13 +97,15 @@ For more information about adjusting the tensor collection parameters, see [Save For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). ->**Note**: The `inputs`, `outputs`, and `layers` collections are not currently available for TensorFlow 2.1. +>**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1. ### Debugger on SageMaker Training Containers and Custom Containers If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives. + - Alternative 1: Use the SageMaker TensorFlow training containers with training script modification - Alternative 2: Use your custom container with modified training script and push the container to Amazon ECR. + In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and save tensors. 1. [Create a hook](#create-a-hook) @@ -116,11 +118,14 @@ In both cases, you need to manually register the Debugger hook to your training #### 1. Create a hook - To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow hook object. + To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow `hook` object. When executing the fit() API for training, specify the smdebug `hook` as callbacks. ```python import smdebug.tensorflow as smd hook = smd.{hook_class}.create_from_json_file() +... +model.fit(... + callbacks=[hook]) ``` Depending on TensorFlow versions and Keras API that was used in your training script, you need to choose the right hook class. There are three hook constructors for TensorFlow that you can choose: `KerasHook`, `SessionHook`, and `EstimatorHook`. From 54c12ce6c0874e7535c285319784a142b5cabab4 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 10 Aug 2020 12:31:55 -0700 Subject: [PATCH 09/39] lint --- docs/tensorflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index fde9d4aea..4f9528569 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -13,7 +13,7 @@ ## What SageMaker Debugger Supports SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind of TensorFlow models, from the Keras model zoo to your custom model, and save them using Debugger built-in or custom tensor collections. 
-You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers.
+You can simply run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. No matter what your TensorFlow models use Keras API or pure TensorFlow API, in eager mode or non-eager mode, you can directly run them on the AWS Deep Learning Containers.

 Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide of using custom containers, go to [Use Debugger in Custom Training Containers ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).

From 9e079dd9213c224c6e0a1bf8c30841b7df7f9a74 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Wed, 12 Aug 2020 17:33:46 -0700
Subject: [PATCH 10/39] lint

---
 docs/tensorflow.md | 118 ++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 61 deletions(-)

diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index 4f9528569..7138ccff4 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -10,33 +10,28 @@

---

## Amazon SageMaker Debugger Support for TensorFlow

Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. Whether your TensorFlow models use the Keras API or the pure TensorFlow API (in eager mode or non-eager mode), you can run them directly on the AWS Deep Learning Containers.
-Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers).
-For a complete guide of using custom containers, go to [Use Debugger in Custom Training Containers ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).
+Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).

### New Features supported by Debugger
- The latest TensorFlow version fully covered by Debugger is 2.2.0
- Debug training jobs with the TensorFlow framework or Keras with the TensorFlow backend
- Debug training jobs with the TensorFlow eager or non-eager mode
- New built-in tensor collections: `inputs`, `outputs`, `layers`, `gradients`
- New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar`

---

## Using Debugger

### Using Debugger on AWS Deep Learning Containers with TensorFlow

The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors.
To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. -The following code sample is the base structure of a SageMaker TensorFlow estimator with Debugger. +The following code example provides the base structure for a SageMaker TensorFlow estimator with Debugger. ```python from sagemaker.tensorflow import TensorFlow @@ -45,8 +40,8 @@ from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_ tf_estimator = TensorFlow( entry_point = "tf-train.py", role = "SageMakerRole", - instance_count = 1, - instance_type = "ml.p2.xlarge", + train_instance_count = 1, + train_instance_type = "ml.p2.xlarge", framework_version = "2.2.0", py_version = "py37" @@ -67,46 +62,46 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` -**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest SageMaker Python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following command lines. +**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library v0.9.1. We highly recommend that you upgrade the packages by running the following commands at the command line: ``` pip install -U sagemaker pip install -U smdebug ``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. For more information about breaking changes of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). +If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -#### Available Tensor Collections for TensorFlow +#### Using Tensor Collections with TensorFlow -The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the base code sample. SageMaker Debugger will save these tensors to the default out_dir of the hook. +The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default out_dir of the hook. | Name | Description| | --- | --- | -| all | Matches all tensors. | -| default | Includes "metrics", "losses", and "sm_metrics". | -| metrics | For KerasHook, saves the metrics computed by Keras for the model. | -| losses | Saves all losses of the model. | -| sm_metrics | Saves scalars that you want to include in the SageMaker metrics collection. | -| inputs | Matches all model inputs to the model. | -| outputs | Matches all model outputs of the model, such as predictions (logits) and labels. | -| layers | Matches all inputs and outputs of intermediate layers. | -| gradients | Matches all gradients of the model. | -| weights | Matches all weights of the model. 
| -| biases | Matches all biases of the model. | -| optimizer_variables | Matches all optimizer variables, currently only supported for Keras. | - -For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections ](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). +| `all` | Matches all tensors. | +| `default` | Includes `metrics`, `losses`, and `sm_metrics`. | +| `metrics` | For KerasHook, saves the metrics computed by Keras for the model. | +| `losses` | Saves all losses of the model. | +| `sm_metrics` | Saves scalars that you want to include in the SageMaker metrics collection. | +| `inputs` | Matches all model inputs to the model. | +| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. | +| `layers` | Matches all inputs and outputs of intermediate layers. | +| `gradients` | Matches all gradients of the model. | +| `weights` | Matches all weights of the model. | +| `biases` | Matches all biases of the model. | +| `optimizer_variables` | Matches all optimizer variables, currently only supported for Keras. | + +For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). ->**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1. +>**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1.0. -### Debugger on SageMaker Training Containers and Custom Containers +### Using Debugger on SageMaker Training Containers and Custom Containers -If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, there are two alternatives. +If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options: -- Alternative 1: Use the SageMaker TensorFlow training containers with training script modification -- Alternative 2: Use your custom container with modified training script and push the container to Amazon ECR. +- Option 1 - Use the SageMaker TensorFlow training containers with training script modification +- Option 2 - Use your custom container with modified training script and push the container to Amazon ECR. -In both cases, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and save tensors. +For both options, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and then save the tensors. 1. [Create a hook](#create-a-hook) - [KerasHook](#kerashook) @@ -116,9 +111,9 @@ In both cases, you need to manually register the Debugger hook to your training 3. 
[Register the hook to model.fit()](#register-a-hook) -#### 1. Create a hook +#### Step 1: Create a hook - To create the hook constructor, add the following code to your training script. This will enable the `smdebug` tools for TensorFlow and create a TensorFlow `hook` object. When executing the fit() API for training, specify the smdebug `hook` as callbacks. + To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown following: ```python import smdebug.tensorflow as smd @@ -128,23 +123,23 @@ model.fit(... callbacks=[hook]) ``` -Depending on TensorFlow versions and Keras API that was used in your training script, you need to choose the right hook class. There are three hook constructors for TensorFlow that you can choose: `KerasHook`, `SessionHook`, and `EstimatorHook`. +Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `KerasHook`, `SessionHook`, and `EstimatorHook`. #### KerasHook -Use `KerasHook` if you use the Keras model zoo and a Keras `model.fit()` API. This is available for the Keras with TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced from the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code into your training script. Place this code line before `model.compile()`. +If you use the Keras model zoo and a Keras `model.fit()` API, use `KerasHook`. `KerasHook` is available for the Keras model with the TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced in the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code to your training script. Place this code line before `model.compile()`: ```python hook = smd.KerasHook.create_from_json_file() ``` -To learn how to fully implement the hook to your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts). +To learn how to fully implement the hook in your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts). ->**Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects the most of tensors regardless the eager execution modes, through its high-level API. +>**Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects most of the tensors through its high-level API, regardless of the eager execution modes. #### SessionHook -Use if your model is created in TensorFlow version 1.x with the low-level approach, not using the Keras API. This is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`. +If your model is created in TensorFlow version 1.x with the low-level approach (not using the Keras API), use `SessionHook`. 
`SessionHook` is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`, as shown following: ```python hook = smd.SessionHook.create_from_json_file() ``` To learn how to fully implement the hook into your training script, see the [TensorFlow monitored training session with the smdebug hook example script](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). ->**Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TF 2.0 and above. You can use `SessionHook` for `tf.function()` in TF 2.0 and above. +>**Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TensorFlow 2.0 and later. You can use `SessionHook` for `tf.function()` in TensorFlow 2.0 and later. #### EstimatorHook -Use if you have a model using the `tf.estimator()` API. Available for any TensorFlow framework versions that supports the `tf.estimator()` API. +If you have a model using the `tf.estimator()` API, use `EstimatorHook`. `EstimatorHook` is available for any TensorFlow framework version that supports the `tf.estimator()` API, as shown following: ```python hook = smd.EstimatorHook.create_from_json_file() ``` To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the TensorFlow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). -#### 2. Wrap the optimizer and the gradient tape to retrieve gradient tensors +#### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors + +The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework. -The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific for the TensorFlow framework. +If you want to save `gradients` (for example, from the Keras Adam optimizer), wrap it with the hook as shown following: -If you want to save `gradients`, for example, from the Keras Adam optimizer, wrap it with the hook as follows: ```python optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr) optimizer = hook.wrap_optimizer(optimizer) @@ -187,9 +183,9 @@ hook.save_tensor("grads", grads, "gradients") ``` These smdebug hook wrapper functions capture the gradient tensors without affecting your optimization logic at all. -For examples of code structure to apply the hook wrappers, see the [Examples](#examples) section. +For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section. -#### 3. Register the hook to model.fit() +#### Step 3: Register the hook to model.fit() To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example: @@ -203,7 +199,7 @@ model.fit(X_train, Y_train, callbacks=[hook]) ``` -#### 4.
Take actions using the hook APIs +#### Step 4: Perform actions using the hook APIs For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api). @@ -211,7 +207,7 @@ For a full list of actions that the hook APIs offer to construct hooks and save ## Code Samples -The following examples show the base structures of hook registration in various TensorFlow training scripts. If you want to take the benefit of the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). +The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). ### Keras API (tf.keras) ```python @@ -232,7 +228,7 @@ hook.set_mode(mode=smd.modes.EVAL) model.evaluate(x_test, y_test, callbacks=[hook]) ``` -### Keras GradientTape example for TF 2.0 and above +### Keras GradientTape example for TensorFlow 2.0 and later ```python import smdebug.tensorflow as smd From 4afb5fc3c43e577cca82772424ae5ab02896a123 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 12 Aug 2020 17:44:49 -0700 Subject: [PATCH 11/39] minor structure change --- docs/tensorflow.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 7138ccff4..b7c9a3ece 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -2,9 +2,9 @@ ## Contents - [What SageMaker Debugger Supports](#support) -- [How to Use Debugger with TensorFlow](#how-to-use) - - [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) - - [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) +- [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) + - [Debugger Built-in Tensor Collections for TensorFlow](#tf-built-in-collection) +- [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) - [Code Samples](#examples) - [References](#references) @@ -25,9 +25,7 @@ Debugger and its client library `smdebug` support debugging your training job on --- -## Using Debugger - -### Using Debugger on AWS Deep Learning Containers with TensorFlow +## Using Debugger on AWS Deep Learning Containers with TensorFlow The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. 
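After a training job with Debugger has finished, the tensors that the automatically registered hook saved can be read back with the `smdebug` trial API. The following is a minimal sketch, assuming the job wrote its debug output to the S3 path shown; the path and the tensor name `loss` are placeholders.

```python
from smdebug.trials import create_trial

# Placeholder path; use the debug output location of your own training job.
trial = create_trial("s3://bucket/path/to/debug-output")

print(trial.tensor_names())   # names of all tensors the hook saved
print(trial.steps())          # steps at which tensors were saved

# Assumes a tensor named "loss" was saved, e.g. by the "losses" collection.
loss = trial.tensor("loss")
print(loss.value(loss.steps()[-1]))   # value at the last saved step
```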
@@ -69,7 +67,7 @@ pip install -U smdebug ``` If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -#### Using Tensor Collections with TensorFlow +### Debugger Built-in Tensor Collections for TensorFlow **Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1.0. -### Using Debugger on SageMaker Training Containers and Custom Containers +--- + +## Using Debugger on SageMaker Training Containers and Custom Containers If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options: -- Option 1 - Use the SageMaker TensorFlow training containers with training script modification -- Option 2 - Use your custom container with modified training script and push the container to Amazon ECR. +- **Option 1** - Use the SageMaker TensorFlow training containers with training script modification +- **Option 2** - Use your custom container with modified training script and push the container to Amazon ECR. For both options, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and then save the tensors. @@ -111,7 +111,7 @@ For both options, you need to manually register the Debugger hook to your traini 3. [Register the hook to model.fit()](#register-a-hook) -#### Step 1: Create a hook +### Step 1: Create a hook To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown following: @@ -159,7 +159,7 @@ hook = smd.EstimatorHook.create_from_json_file() To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the Tensorflow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). -#### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors +### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework. @@ -185,7 +185,7 @@ These smdebug hook wrapper functions capture the gradient tensors, not affecting For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section. -#### Step 3: Register the hook to model.fit() +### Step 3: Register the hook to model.fit() To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. 
For example: @@ -199,7 +199,7 @@ model.fit(X_train, Y_train, callbacks=[hook]) ``` -#### Step 4: Perform actions using the hook APIs +### Step 4: Perform actions using the hook APIs For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api). From 9c20ef2a509d57b9d00a82c2caa988338ba78839 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 12 Aug 2020 17:46:22 -0700 Subject: [PATCH 12/39] minor fix --- docs/tensorflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index b7c9a3ece..89dec7e41 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -67,7 +67,7 @@ pip install -U smdebug ``` If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -### Debugger Built-in Tensor Collections for TensorFlow The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default out_dir of the hook. From 293f7704910d30e730240114c9e0566feed3a603 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 12 Aug 2020 20:58:52 -0700 Subject: [PATCH 13/39] minor fix --- docs/tensorflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 89dec7e41..92e08d673 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -5,7 +5,7 @@ - [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) - [Debugger Built-in Tensor Collections for TensorFlow](#tf-built-in-collection) - [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) -- [Code Samples](#examples) +- [Code Examples](#examples) - [References](#references) --- @@ -205,7 +205,7 @@ For a full list of actions that the hook APIs offer to construct hooks and save --- -## Code Samples +## Code Examples The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). From 4996feb6c02b73a763349ff309c37c6459b83621 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 09:55:06 -0700 Subject: [PATCH 14/39] incorporate comments --- docs/tensorflow.md | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 92e08d673..efac54234 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -12,7 +12,7 @@ ## Amazon SageMaker Debugger Support for TensorFlow -Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.1) now fully support TensorFlow 2.2 with the latest version release. 
Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. +Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.2) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). @@ -60,7 +60,7 @@ tf_estimator = TensorFlow( tf_estimator.fit("s3://bucket/path/to/training/data") ``` -**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library v0.9.1. We highly recommend that you upgrade the packages by running the following commands at the command line: +**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library. We highly recommend that you upgrade the packages by running the following commands at the command line: ``` pip install -U sagemaker pip install -U smdebug @@ -123,7 +123,7 @@ model.fit(... callbacks=[hook]) ``` -Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `KerasHook`, `SessionHook`, and `EstimatorHook`. +Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `smd.KerasHook`, `smd.SessionHook`, and `smd.EstimatorHook`. #### KerasHook @@ -210,6 +210,9 @@ For a full list of actions that the hook APIs offer to construct hooks and save The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. 
If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). ### Keras API (tf.keras) + +The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes). + ```python import smdebug.tensorflow as smd @@ -221,6 +224,7 @@ model.compile( loss='sparse_categorical_crossentropy', ) # Add the hook as a callback +# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL hook.set_mode(mode=smd.modes.TRAIN) model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook]) @@ -229,6 +233,9 @@ model.evaluate(x_test, y_test, callbacks=[hook]) ``` ### Keras GradientTape example for TensorFlow 2.0 and later + +The following code example shows how to register the smdebug `KerasHook` by wrapping the TensorFlow `GradientTape()` with the smdebug `hook.wrap_tape()` API. + ```python import smdebug.tensorflow as smd @@ -250,6 +257,9 @@ model = tf.keras.models.Sequential([ ... ]) ``` ### Monitored Session (tf.train.MonitoredSession) + +The following code example shows how to register the smdebug `SessionHook`. + ```python import smdebug.tensorflow as smd @@ -268,6 +278,9 @@ sess.run([loss, ...]) ``` ### Estimator (tf.estimator.Estimator) + +The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes). + ```python import smdebug.tensorflow as smd @@ -276,7 +289,7 @@ hook = smd.EstimatorHook.create_from_json_file() train_input_fn, eval_input_fn = ... estimator = tf.estimator.Estimator(...) -# Set the mode and pass the hook as callback +# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL. hook.set_mode(mode=smd.modes.TRAIN) estimator.train(input_fn=train_input_fn, steps=args.steps, hooks=[hook]) From 782e8c6b957a77d4f89be923a76cc209d7d75e1f Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 10:33:45 -0700 Subject: [PATCH 15/39] incorporate comments / lift limitation note --- README.md | 2 -- docs/tensorflow.md | 19 +++++++------------ 2 files changed, 7 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 62fe15e61..665f2b213 100644 --- a/README.md +++ b/README.md @@ -68,8 +68,6 @@ The following frameworks are available AWS Deep Learning Containers with the dee | [PyTorch](docs/pytorch.md) | 1.4, 1.5 | | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| ->**Note**: Limited support of the zero script change experience for TensorFlow 2.2. The tensor collections `layers`, `inputs`, `outputs`, and `gradients` are currently not available. - ### AWS training containers with script mode The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script. 
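To make the custom container option concrete, the following is a hedged sketch of launching such a job with the generic SageMaker `Estimator`. The image URI, role, and S3 paths are placeholders, and the sketch assumes the container image has `smdebug` installed and the Debugger hook registered in its training script.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, DebuggerHookConfig, rule_configs

estimator = Estimator(
    # Placeholder ECR image; it must bundle smdebug and a hook-registered script
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-tf-training:latest",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p2.xlarge",
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path="s3://bucket/path/to/debug-output"
    ),
)
estimator.fit("s3://bucket/path/to/training/data")
```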
diff --git a/docs/tensorflow.md b/docs/tensorflow.md index efac54234..551f8243f 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -113,15 +113,7 @@ For both options, you need to manually register the Debugger hook to your traini ### Step 1: Create a hook - To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown following: - -```python -import smdebug.tensorflow as smd -hook = smd.{hook_class}.create_from_json_file() -... -model.fit(... - callbacks=[hook]) -``` +To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown in the following subsections. Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `smd.KerasHook`, `smd.SessionHook`, and `smd.EstimatorHook`. @@ -130,6 +122,7 @@ Depending on the TensorFlow versions and the Keras API that you use in your trai If you use the Keras model zoo and a Keras `model.fit()` API, use `KerasHook`. `KerasHook` is available for the Keras model with the TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced in the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code to your training script. Place this code line before `model.compile()`: ```python +import smdebug.tensorflow as smd hook = smd.KerasHook.create_from_json_file() ``` @@ -142,6 +135,7 @@ To learn how to fully implement the hook in your training script, see the [Keras If your model is created in TensorFlow version 1.x with the low-level approach (not using the Keras API), use `SessionHook`. `SessionHook` is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`, as shown following: ```python +import smdebug.tensorflow as smd hook = smd.SessionHook.create_from_json_file() ``` @@ -154,6 +148,7 @@ To learn how to fully implement the hook into your training script, see the [Ten If you have a model using the `tf.estimator()` API, use `EstimatorHook`. `EstimatorHook` is available for any TensorFlow framework versions that support the `tf.estimator()` API, as shown following: ```python +import smdebug.tensorflow as smd hook = smd.EstimatorHook.create_from_json_file() ``` @@ -211,7 +206,7 @@ The following code examples show the base structures that you can use for hook r ### Keras API (tf.keras) -The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes). +The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](api.md#modes). 
```python import smdebug.tensorflow as smd @@ -279,7 +274,7 @@ sess.run([loss, ...]) ### Estimator (tf.estimator.Estimator) -The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](#api.md#modes). +The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](api.md#modes). ```python import smdebug.tensorflow as smd @@ -289,7 +284,7 @@ hook = smd.EstimatorHook.create_from_json_file() train_input_fn, eval_input_fn = ... estimator = tf.estimator.Estimator(...) -# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL. +# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL hook.set_mode(mode=smd.modes.TRAIN) estimator.train(input_fn=train_input_fn, steps=args.steps, hooks=[hook]) From aa7fcc540592863b315080d08ac1e4a54e948c7f Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 11:11:48 -0700 Subject: [PATCH 16/39] incorporate comments --- docs/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/api.md b/docs/api.md index 6edb47540..bfb2f549a 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook, and creates a hook from that configuration. This is an optional parameter.
If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker, you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. | -| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str) | - | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| +| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str or list[str]) | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| ### TensorFlow specific Hook API From 83ad97019d23666b02defbd063092b962f763621 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 11:25:13 -0700 Subject: [PATCH 17/39] include pypi links --- docs/tensorflow.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 551f8243f..f3abde5f2 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -12,7 +12,12 @@ ## Amazon SageMaker Debugger Support for TensorFlow -Amazon SageMaker Debugger python SDK (v2.0) and its client library `smdebug` library (v0.9.2) now fully support TensorFlow 2.2 with the latest version release. Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. +Amazon SageMaker Debugger python SDK and its client library `smdebug` now fully support TensorFlow 2.2 with the latest version release. + +- [Amazon SageMaker SDK PyPI](https://pypi.org/project/sagemaker/) +- [The latest smdebug PyPI release](https://pypi.org/project/smdebug/) + +Using Debugger, you can access tensors of any kind of TensorFlow model, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. Whether your TensorFlow models use the Keras API or the pure TensorFlow API, in eager or non-eager mode, you can run them directly on the AWS Deep Learning Containers. Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers).
For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). From 3f2beff7472e1b248d215e181384559e3a86a687 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 12:19:01 -0700 Subject: [PATCH 18/39] minor fix --- docs/api.md | 2 +- docs/tensorflow.md | 18 +++++++++--------- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/api.md b/docs/api.md index bfb2f549a..a01b33750 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook, and creates a hook from that configuration. This is an optional parameter.
If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker, you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. | -| `save_tensor()`| tensor_name (str), tensor_value (float), collections_to_write (str or list[str]) | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| +| `save_tensor()`| `tensor_name (str)`, `tensor_value (float)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| ### TensorFlow specific Hook API diff --git a/docs/tensorflow.md b/docs/tensorflow.md index f3abde5f2..272530b82 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -10,11 +10,11 @@ --- -## Amazon SageMaker Debugger Support for TensorFlow +## Amazon SageMaker Debugger Support for TensorFlow Amazon SageMaker Debugger python SDK and its client library `smdebug` now fully support TensorFlow 2.2 with the latest version release. -- [Amazon SageMaker SDK PyPI](https://pypi.org/project/sagemaker/) +- [Amazon SageMaker Python SDK PyPI](https://pypi.org/project/sagemaker/) - [The latest smdebug PyPI release](https://pypi.org/project/smdebug/) Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. @@ -30,7 +30,7 @@ Debugger and its client library `smdebug` support debugging your training job on --- -## Using Debugger on AWS Deep Learning Containers with TensorFlow +## Using Debugger on AWS Deep Learning Containers with TensorFlow The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. @@ -72,7 +72,7 @@ pip install -U smdebug ``` If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -### Debugger Built-in Tensor Collections for TensorFlow +### Debugger Built-in Tensor Collections for TensorFlow The following table lists the pre-configured tensor collections for TensorFlow models. 
You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default out_dir of the hook. @@ -99,7 +99,7 @@ For a full list of available tensor collection parameters, see [Configuring Coll --- -## Using Debugger on SageMaker Training Containers and Custom Containers +## Using Debugger on SageMaker Training Containers and Custom Containers If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options: @@ -116,7 +116,7 @@ For both options, you need to manually register the Debugger hook to your traini 3. [Register the hook to model.fit()](#register-a-hook) -### Step 1: Create a hook +### Step 1: Create a hook To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown in the following subsections. @@ -159,7 +159,7 @@ hook = smd.EstimatorHook.create_from_json_file() To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the Tensorflow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). -### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors +### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework. @@ -185,7 +185,7 @@ These smdebug hook wrapper functions capture the gradient tensors, not affecting For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section. -### Step 3: Register the hook to model.fit() +### Step 3: Register the hook to model.fit() To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example: @@ -205,7 +205,7 @@ For a full list of actions that the hook APIs offer to construct hooks and save --- -## Code Examples +## Code Examples The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). From fd1b1c2296b36c065da2d4fbf1a0ba4265a7f811 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:08:44 -0700 Subject: [PATCH 19/39] incorporate comments --- docs/api.md | 2 +- docs/tensorflow.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/api.md b/docs/api.md index a01b33750..bbbd0fa52 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook, and creates a hook from that configuration. This is an optional parameter.
If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker, you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. | -| `save_tensor()`| `tensor_name (str)`, `tensor_value (float)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| +| `save_tensor()`| `tensor_name (str)`, `tensor_value (np.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.| ### TensorFlow specific Hook API diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 272530b82..9332aab4c 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -95,7 +95,7 @@ For more information about adjusting the tensor collection parameters, see [Save For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). ->**Note**: The `inputs`, `outputs`, and `layers` collections are currently not available for TensorFlow 2.1.0. +>**Note**: The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently available for TensorFlow 2.2.0. --- From 463f0b4fd5d21d1afb51a7a91b46f613c14e301f Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:12:20 -0700 Subject: [PATCH 20/39] incorporate comments --- docs/tensorflow.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 9332aab4c..0827cd9c8 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -83,20 +83,18 @@ The following table lists the pre-configured tensor collections for TensorFlow m | `metrics` | For KerasHook, saves the metrics computed by Keras for the model. | | `losses` | Saves all losses of the model. | | `sm_metrics` | Saves scalars that you want to include in the SageMaker metrics collection. | -| `inputs` | Matches all model inputs to the model. | -| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. | -| `layers` | Matches all inputs and outputs of intermediate layers. | -| `gradients` | Matches all gradients of the model. | | `weights` | Matches all weights of the model. | | `biases` | Matches all biases of the model. | | `optimizer_variables` | Matches all optimizer variables, currently only supported for Keras. | +| `inputs` | Matches all model inputs to the model. (Available only for TensorFlow 2.2.0)| +| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. (Available only for TensorFlow 2.2.0)| +| `layers` | Matches all inputs and outputs of intermediate layers. (Available only for TensorFlow 2.2.0)| +| `gradients` | Matches all gradients of the model. (Available only for TensorFlow 2.2.0)| For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). 
For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). ->**Note**: The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently available for TensorFlow 2.2.0. - --- ## Using Debugger on SageMaker Training Containers and Custom Containers From 72e48dfd39fc6dc3b6b98e81984b2864b029c08d Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:40:25 -0700 Subject: [PATCH 21/39] incorporate comments --- docs/tensorflow.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/tensorflow.md b/docs/tensorflow.md index 0827cd9c8..c3f076905 100644 --- a/docs/tensorflow.md +++ b/docs/tensorflow.md @@ -83,18 +83,20 @@ The following table lists the pre-configured tensor collections for TensorFlow m | `metrics` | For KerasHook, saves the metrics computed by Keras for the model. | | `losses` | Saves all losses of the model. | | `sm_metrics` | Saves scalars that you want to include in the SageMaker metrics collection. | +| `inputs` | Matches all model inputs to the model. | +| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. | +| `layers` | Matches all inputs and outputs of intermediate layers. | +| `gradients` | Matches all gradients of the model. | | `weights` | Matches all weights of the model. | | `biases` | Matches all biases of the model. | | `optimizer_variables` | Matches all optimizer variables, currently only supported for Keras. | -| `inputs` | Matches all model inputs to the model. (Available only for TensorFlow 2.2.0)| -| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. (Available only for TensorFlow 2.2.0)| -| `layers` | Matches all inputs and outputs of intermediate layers. (Available only for TensorFlow 2.2.0)| -| `gradients` | Matches all gradients of the model. (Available only for TensorFlow 2.2.0)| For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). +>**Note**: The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently available for TensorFlow versions <2.0 and ==2.2.0. 
+ --- ## Using Debugger on SageMaker Training Containers and Custom Containers From 557eae10968ffe5cb31b1fa8277aff5f5a51350e Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 14:57:01 -0700 Subject: [PATCH 22/39] version addition --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 665f2b213..ac87bd0cb 100644 --- a/README.md +++ b/README.md @@ -63,9 +63,9 @@ The following frameworks are available AWS Deep Learning Containers with the dee | Framework | Version | | --- | --- | -| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1, 2.2 | +| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0 | | [MXNet](docs/mxnet.md) | 1.6 | -| [PyTorch](docs/pytorch.md) | 1.4, 1.5 | +| [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 | | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| ### AWS training containers with script mode @@ -74,10 +74,10 @@ The `smdebug` library supports frameworks other than the ones listed above while | Framework | Versions | | --- | --- | -| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1, 2.2 | +| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0 | | Keras (with TensorFlow backend) | 2.3 | | [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6 | -| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5 | +| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 | | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)| ### Debugger on custom containers or local machines From 1eee9c626e20ca7c3d77880c6c29d432ea298981 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 13 Aug 2020 15:01:50 -0700 Subject: [PATCH 23/39] version addition --- docs/api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/api.md b/docs/api.md index bbbd0fa52..92ac9ecc0 100644 --- a/docs/api.md +++ b/docs/api.md @@ -163,7 +163,7 @@ Note that `smd` import below translates to `import smdebug.{framework} as smd`. |`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook, and creates a hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. |`close()` | - | Closes all files that are currently open by the hook | | `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
-| `save_tensor()`| `tensor_name (str)`, `tensor_value (np.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.|
+| `save_tensor()`| `tensor_name (str)`, `tensor_value (numpy.array or numpy.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|

### TensorFlow specific Hook API

From 19754a1a63e645b20d0fcfc873397ef31894be07 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Mon, 31 Aug 2020 15:46:32 -0700
Subject: [PATCH 24/39] add details

---
 docs/tensorflow.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index c3f076905..498959b00 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -64,6 +64,14 @@ tf_estimator = TensorFlow(
 )
 tf_estimator.fit("s3://bucket/path/to/training/data")
 ```
+>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the latest SageMaker python SDK v2.0 and `smdebug` v0.9.1. It is highly recommended to upgrade the packages by executing the following commands.
+```bash
+pip install -U sagemaker
+pip install -U smdebug
+```
+If you are using a Jupyter notebook, prepend an exclamation mark (`!`) to these commands and restart your kernel.
+
+#### Available Tensor Collections for TensorFlow

 **Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library. We highly recommend that you upgrade the packages by running the following commands at the command line:
 ```

From dd13c6cb41fd598e5f5391e1301cd3cdb1e9f8db Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Mon, 31 Aug 2020 15:58:45 -0700
Subject: [PATCH 25/39] add footnote

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index ac87bd0cb..c2cfd7d58 100644
--- a/README.md
+++ b/README.md
@@ -68,6 +68,8 @@ The following frameworks are available AWS Deep Learning Containers with the dee
 | [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 |
 | [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))|

+**Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0 and v2.3.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions.
+
 ### AWS training containers with script mode

 The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script.
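+
+A minimal sketch of such a script change for a Keras training script follows; here `x_train` and `y_train` stand in for your dataset, and the hook configuration JSON is assumed to be provided by SageMaker:
+
+```python
+import tensorflow as tf
+import smdebug.tensorflow as smd
+
+# On SageMaker, this reads the hook configuration that SageMaker writes
+# to /opt/ml/input/config/debughookconfig.json by default
+hook = smd.KerasHook.create_from_json_file()
+
+model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
+model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
+
+# Register the hook as a Keras callback; optionally save a custom scalar
+model.fit(x_train, y_train, epochs=2, callbacks=[hook])
+hook.save_scalar("custom_scalar", 0.99, sm_metric=False)
+```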
From 9edd714a60a0159c3acb6d31ac2ae3bae68ebcd6 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 23 Jun 2021 13:37:29 -0700 Subject: [PATCH 26/39] pre-launch smdebug readthedocs website --- README.md | 40 +- docs/Makefile | 20 + docs/README.rst | 413 +++++++ docs/_static/aws-ux-shortbread/index.js | 3 + docs/_static/aws-ux-shortbread/init.js | 34 + docs/_static/css/custom.css | 4 + docs/_static/js/analytics.js | 2 + docs/analysis.md | 595 ---------- docs/api.md | 472 -------- docs/api.rst.bkup | 1024 +++++++++++++++++ docs/collections.rst | 328 ++++++ docs/conf.py | 105 ++ docs/environment.yml | 14 + docs/glossary.rst | 69 ++ docs/hook-api.rst | 42 + docs/hook-config-in-smestimator.rst | 111 ++ docs/hook-constructor.rst | 64 ++ docs/hook-methods.rst | 276 +++++ docs/index.rst | 61 + docs/make.bat | 35 + docs/make_html.sh | 3 + docs/mxnet.md | 102 -- docs/mxnet.rst | 143 +++ docs/other-configs.rst | 177 +++ docs/pytorch.md | 125 -- docs/pytorch.rst | 173 +++ docs/release-notes.rst | 80 ++ docs/sagemaker.md | 267 ----- docs/sagemaker.rst | 357 ++++++ docs/smdebug.analysis.rst | 46 + docs/smdebug.exceptions.rst | 19 + docs/smdebug.modes.rst | 26 + ...debug.profiler.analysis.notebook_utils.rst | 32 + docs/smdebug.profiler.analysis.rst | 15 + docs/smdebug.profiler.analysis.utils.rst | 27 + docs/smdebug.profiler.rst | 18 + docs/smdebug.rules.rst | 164 +++ docs/smdebug.trials.rst | 21 + docs/tensor-api.rst | 286 +++++ docs/tensorflow.md | 323 ------ docs/tensorflow.rst | 498 ++++++++ docs/trial-api.rst | 14 + docs/xgboost.md | 100 -- docs/xgboost.rst | 138 +++ setup.py | 24 +- smdebug/analysis/utils.py | 5 +- smdebug/core/hook.py | 1 + smdebug/exceptions.py | 47 + .../analysis/notebook_utils/heatmap.py | 9 + .../analysis/python_profile_analysis.py | 6 + .../analysis/utils/profiler_data_to_pandas.py | 11 +- smdebug/rules/rule.py | 29 +- smdebug/rules/rule_invoker.py | 22 + smdebug/trials/trial.py | 277 ++++- smdebug/trials/utils.py | 62 + 55 files changed, 5319 insertions(+), 2040 deletions(-) create mode 100644 docs/Makefile create mode 100644 docs/README.rst create mode 100644 docs/_static/aws-ux-shortbread/index.js create mode 100644 docs/_static/aws-ux-shortbread/init.js create mode 100644 docs/_static/css/custom.css create mode 100644 docs/_static/js/analytics.js delete mode 100644 docs/analysis.md delete mode 100644 docs/api.md create mode 100644 docs/api.rst.bkup create mode 100644 docs/collections.rst create mode 100644 docs/conf.py create mode 100644 docs/environment.yml create mode 100644 docs/glossary.rst create mode 100644 docs/hook-api.rst create mode 100644 docs/hook-config-in-smestimator.rst create mode 100644 docs/hook-constructor.rst create mode 100644 docs/hook-methods.rst create mode 100644 docs/index.rst create mode 100644 docs/make.bat create mode 100755 docs/make_html.sh delete mode 100644 docs/mxnet.md create mode 100644 docs/mxnet.rst create mode 100644 docs/other-configs.rst delete mode 100644 docs/pytorch.md create mode 100644 docs/pytorch.rst create mode 100644 docs/release-notes.rst delete mode 100644 docs/sagemaker.md create mode 100644 docs/sagemaker.rst create mode 100644 docs/smdebug.analysis.rst create mode 100644 docs/smdebug.exceptions.rst create mode 100644 docs/smdebug.modes.rst create mode 100644 docs/smdebug.profiler.analysis.notebook_utils.rst create mode 100644 docs/smdebug.profiler.analysis.rst create mode 100644 docs/smdebug.profiler.analysis.utils.rst create mode 100644 docs/smdebug.profiler.rst create mode 100644 docs/smdebug.rules.rst create mode 
100644 docs/smdebug.trials.rst create mode 100644 docs/tensor-api.rst delete mode 100644 docs/tensorflow.md create mode 100644 docs/tensorflow.rst create mode 100644 docs/trial-api.rst delete mode 100644 docs/xgboost.md create mode 100644 docs/xgboost.rst diff --git a/README.md b/README.md index d9cc437ab..950ce374e 100644 --- a/README.md +++ b/README.md @@ -63,10 +63,10 @@ The following frameworks are available AWS Deep Learning Containers with the dee | Framework | Version | | --- | --- | -| [TensorFlow](docs/tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 | -| [MXNet](docs/mxnet.md) | 1.6, 1.7 | -| [PyTorch](docs/pytorch.md) | 1.4, 1.5, 1.6 | -| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| +| [TensorFlow](tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 | +| [MXNet](mxnet.md) | 1.6, 1.7 | +| [PyTorch](pytorch.md) | 1.4, 1.5, 1.6 | +| [XGBoost](xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| **Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions. @@ -76,11 +76,11 @@ The `smdebug` library supports frameworks other than the ones listed above while | Framework | Versions | | --- | --- | -| [TensorFlow](docs/tensorflow.md) | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 | +| [TensorFlow](tensorflow.md) | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 | | Keras (with TensorFlow backend) | 2.3 | -| [MXNet](docs/mxnet.md) | 1.4, 1.5, 1.6, 1.7 | -| [PyTorch](docs/pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 | -| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)| +| [MXNet](mxnet.md) | 1.4, 1.5, 1.6, 1.7 | +| [PyTorch](pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 | +| [XGBoost](xgboost.md) | 0.90-2, 1.0-1 (As a framework)| ### Debugger on custom containers or local machines You can also fully use the Debugger features in custom containers with the SageMaker Python SDK. Furthermore, `smdebug` is an open source library, so you can install it on your local machine for any advanced use cases that cannot be run in the SageMaker environment and for constructing `smdebug` custom hooks and rules. @@ -110,10 +110,10 @@ To see a complete list of built-in rules and their functionalities, see [List of You can use Debugger with your training script on your own container making only a minimal modification to your training script to add Debugger's `Hook`. For an example template of code to use Debugger on your own container in TensorFlow 2.x frameworks, see [Run Debugger in custom container](#Run-Debugger-in-custom-container). See the following instruction pages to set up Debugger in your preferred framework. - - [TensorFlow](docs/tensorflow.md) - - [MXNet](docs/mxnet.md) - - [PyTorch](docs/pytorch.md) - - [XGBoost](docs/xgboost.md) + - [TensorFlow](tensorflow.md) + - [MXNet](mxnet.md) + - [PyTorch](pytorch.md) + - [XGBoost](xgboost.md) #### Using SageMaker Debugger on custom containers @@ -177,7 +177,7 @@ When you run the `sagemaker_simple_estimator.fit()` API, SageMaker will automatically monitor your training job for you with the Rules specified and create a `CloudWatch` event that tracks the status of the Rule, so you can take any action based on them. -If you want additional configuration and control, see [Running SageMaker jobs with Debugger](docs/sagemaker.md) for more information. 
+If you want additional configuration and control, see [Running SageMaker jobs with Debugger](sagemaker.md) for more information. #### Run Debugger in custom container @@ -235,12 +235,12 @@ print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').v ## SageMaker Debugger in Action - Through the model pruning process using Debugger and `smdebug`, you can iteratively identify the importance of weights and cut neurons below a threshold you define. This process allows you to train the model with significantly fewer neurons, which means a lighter, more efficient, faster, and cheaper model without compromising accuracy. The following accuracy versus the number of parameters graph is produced in Studio. It shows that the model accuracy started from about 0.9 with 12 million parameters (the data point moves from right to left along with the pruning process), improved during the first few pruning iterations, kept the quality of accuracy until it cut the number of parameters down to 6 million, and start sacrificing the accuracy afterwards. -![Debugger Iterative Model Pruning using ResNet](docs/resources/results_resnet.png?raw=true) +![Debugger Iterative Model Pruning using ResNet](resources/results_resnet.png?raw=true) Debugger provides you tools to access such training process and have a complete control over your model. See [Using SageMaker Debugger and SageMaker Experiments for iterative model pruning](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_iterative_model_pruning/iterative_model_pruning_resnet.ipynb) notebook for the full example and more information. -- Use Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. ![Debugger XGBoost Visualization Example](docs/resources/xgboost_feature_importance.png?raw=true) +- Use Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training. ![Debugger XGBoost Visualization Example](resources/xgboost_feature_importance.png?raw=true) -- Use Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. ![Debugger TensorFlow Visualization Example](docs/resources/tensorflow_rules_loss.png?raw=true) +- Use Debugger with TensorFlow in SageMaker Studio to run built-in rules and visualize the loss. ![Debugger TensorFlow Visualization Example](resources/tensorflow_rules_loss.png?raw=true) --- @@ -248,10 +248,10 @@ Debugger provides you tools to access such training process and have a complete | Section | Description | | --- | --- | -| [SageMaker Training](docs/sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger | -| Frameworks
  • [TensorFlow](docs/tensorflow.md)
  • [PyTorch](docs/pytorch.md)
  • [MXNet](docs/mxnet.md)
  • [XGBoost](docs/xgboost.md)
| See the frameworks pages for details on what's supported and how to modify your training script if applicable | -| [APIs for Saving Tensors](docs/api.md) | Full description of our APIs on saving tensors | -| [Programming Model for Analysis](docs/analysis.md) | For description of the programming model provided by the APIs that enable you to perform interactive exploration of tensors saved, as well as to write your own Rules monitoring your training jobs. | +| [SageMaker Training](sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger | +| Frameworks
  • [TensorFlow](tensorflow.md)
  • [PyTorch](pytorch.md)
  • [MXNet](mxnet.md)
  • [XGBoost](xgboost.md)
| See the frameworks pages for details on what's supported and how to modify your training script if applicable |
+| [APIs for Saving Tensors](api.md) | Full description of our APIs on saving tensors |
+| [Programming Model for Analysis](analysis.md) | For a description of the programming model provided by the APIs that enable you to perform interactive exploration of tensors saved, as well as to write your own Rules monitoring your training jobs. |

 ## License

diff --git a/docs/Makefile b/docs/Makefile
new file mode 100644
index 000000000..d4bb2cbb9
--- /dev/null
+++ b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS ?=
+SPHINXBUILD ?= sphinx-build
+SOURCEDIR = .
+BUILDDIR = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/README.rst b/docs/README.rst
new file mode 100644
index 000000000..3e5a93029
--- /dev/null
+++ b/docs/README.rst
@@ -0,0 +1,413 @@
+SMDebug: Amazon SageMaker Debugger Client Library
+=================================================
+
+Table of Contents
+-----------------
+
+- `Overview <#overview>`__
+- `Install the smdebug library <#install-the-smdebug-library>`__
+- `Debugger-supported Frameworks <#debugger-supported-frameworks>`__
+- `How It Works <#how-it-works>`__
+- `Examples <#examples>`__
+- `SageMaker Debugger in action <#sagemaker-debugger-in-action>`__
+- `Further Documentation and
+  References <#further-documentation-and-references>`__
+
+Overview
+--------
+
+`Amazon SageMaker
+Debugger `__
+automates the debugging and profiling process of machine learning training.
+SageMaker Debugger provides built-in features to capture output tensors,
+monitor system utilization, and profile ML frameworks from training jobs
+that run your own training script and Docker containers.
+
+You have the flexibility to build customized *hooks*
+for saving training data either locally or in a cloud storage service such as `Amazon
+S3 `__. You can also use Debugger *rules* - built-in or custom -
+to analyze the saved data, identify training issues, receive alerts,
+and take actions to stop training.
+
+The ``smdebug`` client library is an open source library that powers SageMaker Debugger
+by reading the training data saved from training jobs. ``smdebug`` retrieves and
+filters the tensors generated by Debugger, such as gradients, weights,
+and biases.
+
+SageMaker Debugger helps you develop and optimize model performance and computation. It also allows you to run
+training jobs while minimally modifying estimators,
+tracing training data (output tensors, system metrics, and framework metrics),
+catching anomalies while training models, and iteratively pruning large models.
+
+Debugger supports TensorFlow, PyTorch, MXNet, and XGBoost frameworks.
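+
+For example, a minimal sketch of retrieving and filtering saved tensors
+with ``smdebug`` (the S3 path below is a placeholder for the output
+location of a Debugger hook):
+
+.. code:: python
+
+   from smdebug.trials import create_trial
+
+   # Point a trial at the saved Debugger output (a local path or s3://bucket/prefix)
+   trial = create_trial("s3://my-bucket/debug-output")
+   # Filter the saved tensors by built-in collection
+   print(trial.tensor_names(collection="gradients"))
+   print(trial.tensor_names(collection="weights"))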
+
+The following list is a summary of the main functionalities of Debugger:
+
+- Run, debug, and profile training jobs of your model on SageMaker when using
+  `AWS managed containers <#debugger-supported-frameworks>`__
+- No changes needed to your training script if using AWS Deep Learning
+  Containers with Debugger fully integrated
+- Minimal changes to your training script if using AWS containers with
+  script mode or custom containers
+- Full visibility into any tensor retrieved from targeted parts of the
+  training jobs
+- Real-time training job monitoring through Rules
+- Automated anomaly detection and state assertions through built-in and
+  custom Rules on SageMaker
+- Actions on your training jobs based on the status of Rules
+- Interactive exploration of saved tensors
+- Distributed training support
+- TensorBoard support
+
+See `How it works <#how-it-works>`__ for more details.
+
+--------------
+
+Install the smdebug library
+---------------------------
+
+The ``smdebug`` library runs on Python 3. Install using the following
+command:
+
+.. code:: bash
+
+   pip install smdebug
+
+--------------
+
+Debugger-supported Frameworks
+-----------------------------
+
+For a complete overview of Amazon SageMaker Debugger and to learn how it
+works, go to the `Use Debugger in AWS
+Containers `__
+developer guide.
+
+AWS Deep Learning Containers with zero code change
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Debugger is installed by default in AWS Deep Learning Containers with
+TensorFlow, PyTorch, MXNet, and XGBoost. The following framework
+containers enable you to use Debugger with no changes to your training
+script, by automatically adding `SageMaker Debugger’s
+Hook `__.
+
+The following framework versions are available in AWS Deep Learning
+Containers for the zero script change experience.
+
++--------------------------------+------------------------------------+
+| Framework | Version |
++================================+====================================+
+| `TensorFlow `__ | 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 |
++--------------------------------+------------------------------------+
+| `MXNet `__ | 1.6, 1.7 |
++--------------------------------+------------------------------------+
+| `PyTorch `__ | 1.4, 1.5, 1.6 |
++--------------------------------+------------------------------------+
+| `XGBoost `__ | 0.90-2, 1.0-1 (`As a built-in |
+| | algorithm `__) |
++--------------------------------+------------------------------------+
+
+**Note**: Debugger with zero script change is partially available for
+TensorFlow v2.1.0. The ``inputs``, ``outputs``, ``gradients``, and
+``layers`` built-in collections are currently not available for this
+TensorFlow version.
+
+AWS training containers with script mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``smdebug`` library supports frameworks other than the ones listed
+above while using AWS containers with script mode. If you want to use
+SageMaker Debugger with one of the following framework versions, you
+need to make minimal changes to your training script.
+
++---------------------------------+-----------------------------------+
+| Framework | Versions |
++=================================+===================================+
+| `TensorFlow `__ | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, |
+| | 2.3.0, 2.3.1 |
++---------------------------------+-----------------------------------+
+| Keras (with TensorFlow backend) | 2.3 |
++---------------------------------+-----------------------------------+
+| `MXNet `__ | 1.4, 1.5, 1.6, 1.7 |
++---------------------------------+-----------------------------------+
+| `PyTorch `__ | 1.2, 1.3, 1.4, 1.5, 1.6 |
++---------------------------------+-----------------------------------+
+| `XGBoost `__ | 0.90-2, 1.0-1 (As a framework) |
++---------------------------------+-----------------------------------+
+
+Debugger on custom containers or local machines
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can also fully use the Debugger features in custom containers with
+the SageMaker Python SDK. Furthermore, ``smdebug`` is an open source
+library, so you can install it on your local machine for any advanced
+use cases that cannot be run in the SageMaker environment and for
+constructing ``smdebug`` custom hooks and rules.
+
+--------------
+
+How It Works
+------------
+
+Amazon SageMaker Debugger uses the construct of a ``Hook`` to save the
+values of requested tensors throughout the training process. You can
+then set up a ``Rule`` job that simultaneously monitors and validates
+these tensors to ensure that training is progressing as expected.
+
+A ``Rule`` checks for vanishing gradients, exploding tensor values, or
+poor weight initialization. Rules are attached to Amazon CloudWatch
+events, so that when a rule is triggered it changes the state of the
+CloudWatch event. You can configure any action on the CloudWatch event,
+such as stopping the training job, saving you time and money.
+
+Debugger can be used inside or outside of SageMaker. However, the
+built-in rules that AWS provides are only available for SageMaker
+training. Scenarios of usage can be classified into the following
+cases.
+
+Using SageMaker Debugger on AWS Deep Learning Containers with zero training script change
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use the Debugger built-in hook configurations and rules while setting up
+the estimator, and monitor your training job.
+
+For a full guide and examples of using the built-in rules, see `Running
+a Rule with zero script change on AWS Deep Learning
+Containers `__.
+
+To see a complete list of built-in rules and their functionalities, see
+`List of Debugger Built-in
+Rules `__.
+
+Using SageMaker Debugger on AWS training containers with script mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can use Debugger with your training script on your own container,
+making only a minimal modification to your training script to add
+Debugger’s ``Hook``. For an example template of code to use Debugger on
+your own container in TensorFlow 2.x frameworks, see `Run Debugger in
+custom container <#Run-Debugger-in-custom-container>`__. See the
+following instruction pages to set up Debugger in your preferred
+framework.
+
+- `TensorFlow `__
+- `MXNet `__
+- `PyTorch `__
+- `XGBoost `__
+
+Using SageMaker Debugger on custom containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Debugger is available for any deep learning model that you bring to
+Amazon SageMaker.
The AWS CLI, the SageMaker Estimator API, and the
+Debugger APIs enable you to use any Docker base images to build and
+customize containers to train and debug your models. To use Debugger
+with customized containers, go to `Use Debugger in Custom Training
+Containers `__.
+
+Using SageMaker Debugger on a non-SageMaker environment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using the smdebug library, you can create custom hooks and rules (or
+manually analyze the tensors) and modify your training script to enable
+tensor analysis on a non-SageMaker environment, such as your local
+machine. For an example of this, see `Run Debugger
+locally <#run-debugger-locally>`__.
+
+--------------
+
+Examples
+--------
+
+SageMaker Notebook Examples
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To find a collection of demonstrations using Debugger, see `SageMaker
+Debugger Example
+Notebooks `__.
+
+Run Debugger rules with zero script change
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This example shows how to use Debugger with zero script change of your
+training script on a SageMaker DLC.
+
+.. code:: python
+
+   import sagemaker as sm
+   from sagemaker.debugger import rule_configs, Rule, CollectionConfig
+
+   # Choose a built-in rule to monitor your training job
+   rule = Rule.sagemaker(
+       rule_configs.exploding_tensor(),
+       # configure your rule if applicable
+       rule_parameters={"tensor_regex": ".*"},
+       # specify collections to save for processing your rule
+       collections_to_save=[
+           CollectionConfig(name="weights"),
+           CollectionConfig(name="losses"),
+       ],
+   )
+
+   # Pass the rule to the estimator
+   sagemaker_simple_estimator = sm.tensorflow.TensorFlow(
+       entry_point="script.py", # replace script.py with your own training script
+       role=sm.get_execution_role(),
+       framework_version="1.15",
+       py_version="py3",
+       # argument for smdebug below
+       rules=[rule],
+   )
+
+   sagemaker_simple_estimator.fit()
+   tensors_path = sagemaker_simple_estimator.latest_job_debugger_artifacts_path()
+
+   import smdebug.trials as smd
+   trial = smd.create_trial(path=tensors_path)
+   print(f"Saved these tensors: {trial.tensor_names()}")
+   print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
+
+That’s it! When you configure the ``sagemaker_simple_estimator``, you
+simply specify ``entry_point`` as your training script Python file.
+When you run the ``sagemaker_simple_estimator.fit()`` API, SageMaker
+will automatically monitor your training job for you with the Rules
+specified and create a ``CloudWatch`` event that tracks the status of
+the Rules, so you can take any action based on them.
+
+If you want additional configuration and control, see `Running SageMaker
+jobs with Debugger `__ for more information.
+
+Run Debugger in custom container
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following example shows how to set up the ``hook`` to debug a
+training model with Debugger in your own container. This example is for
+containers using the TensorFlow 2.x framework, where GradientTape is
+used to configure the ``hook``.
+
+.. code:: python
+
+   import smdebug.tensorflow as smd
+   hook = smd.KerasHook(out_dir=args.out_dir)
+
+   model = tf.keras.models.Sequential([ ...
])
+   for epoch in range(n_epochs):
+       for data, labels in dataset:
+           dataset_labels = labels
+           # wrap the tape to capture tensors
+           with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
+               logits = model(data, training=True)  # (32,10)
+               loss_value = cce(labels, logits)
+           grads = tape.gradient(loss_value, model.variables)
+           opt.apply_gradients(zip(grads, model.variables))
+           acc = train_acc_metric(dataset_labels, logits)
+           # manually save metric values
+           hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc)
+
+To see a full script of this, refer to the
+`tf_keras_gradienttape.py `__
+example script. For a notebook example of using BYOC in PyTorch, see
+`Using Amazon SageMaker Debugger with Your Own PyTorch
+Container `__.
+
+Run Debugger locally
+^^^^^^^^^^^^^^^^^^^^
+
+This example shows how to use Debugger for the Keras ``model.fit()``
+API.
+
+To use Debugger, simply add a callback ``hook``:
+
+.. code:: python

+   import smdebug.tensorflow as smd
+   hook = smd.KerasHook(out_dir='~/smd_outputs/')
+
+   model = tf.keras.models.Sequential([ ... ])
+   model.compile(
+       optimizer='adam',
+       loss='sparse_categorical_crossentropy',
+   )
+
+   # Add the hook as a callback
+   model.fit(x_train, y_train, epochs=2, callbacks=[hook])
+   model.evaluate(x_test, y_test, callbacks=[hook])
+
+   # Create a trial to inspect the saved tensors
+   from smdebug.trials import create_trial
+   trial = create_trial(path='~/smd_outputs/')
+   print(f"Saved these tensors: {trial.tensor_names()}")
+   print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').values(mode=smd.modes.EVAL)}")
+
+--------------
+
+SageMaker Debugger in Action
+----------------------------
+
+- Through the model pruning process using Debugger and ``smdebug``, you
+  can iteratively identify the importance of weights and cut neurons
+  below a threshold you define. This process allows you to train the
+  model with significantly fewer neurons, which means a lighter, more
+  efficient, faster, and cheaper model without compromising accuracy.
+  The following accuracy versus the number of parameters graph is
+  produced in Studio. It shows that the model accuracy started from
+  about 0.9 with 12 million parameters (the data point moves from right
+  to left along with the pruning process), improved during the first
+  few pruning iterations, kept the quality of accuracy until it cut the
+  number of parameters down to 6 million, and started sacrificing
+  accuracy afterwards.
+
+  |Debugger Iterative Model Pruning using ResNet| Debugger provides you
+  with tools to access this training process and have complete control
+  over your model. See the `Using SageMaker Debugger and SageMaker
+  Experiments for iterative model
+  pruning `__
+  notebook for the full example and more information.
+
+- Use Debugger with XGBoost in SageMaker Studio to save feature
+  importance values and plot them in a notebook during training.
+  |Debugger XGBoost Visualization Example|
+
+- Use Debugger with TensorFlow in SageMaker Studio to run built-in
+  rules and visualize the loss. |Debugger TensorFlow Visualization
+  Example|
+
+--------------
+
+Further Documentation and References
+------------------------------------
+
++-----------------------------------+-----------------------------------+
+| Section | Description |
++===================================+===================================+
+| :doc:`sagemaker` | SageMaker users, we recommend you |
+| | start with this page on how to |
+| | run SageMaker training jobs with |
+| | SageMaker Debugger.
| ++-----------------------------------+-----------------------------------+ +| Frameworks | See the frameworks pages for | +| - :doc:`tensorflow` | details on what’s supported and | +| - :doc:`pytorch` | how to modify your training | +| - :doc:`mxnet` | script if applicable. | +| - :doc:`xgboost` | | ++-----------------------------------+-----------------------------------+ +| :doc:`api` | Full description of our APIs | +| | on saving tensors. | ++-----------------------------------+-----------------------------------+ +| :doc:`analysis` | For description of the programming| +| | model provided by the APIs that | +| | enable you to perform interactive | +| | exploration of tensors saved, | +| | as well as to write your own | +| | rules monitoring your training | +| | jobs. | ++-----------------------------------+-----------------------------------+ + +License +------- + +This library is licensed under the Apache 2.0 License. + +.. |codecov| image:: https://codecov.io/gh/awslabs/sagemaker-debugger/branch/master/graph/badge.svg + :target: https://codecov.io/gh/awslabs/sagemaker-debugger +.. |PyPI| image:: https://badge.fury.io/py/smdebug.svg + :target: https://badge.fury.io/py/smdebug +.. |Debugger Iterative Model Pruning using ResNet| image:: resources/results_resnet.png +.. |Debugger XGBoost Visualization Example| image:: resources/xgboost_feature_importance.png +.. |Debugger TensorFlow Visualization Example| image:: resources/tensorflow_rules_loss.png diff --git a/docs/_static/aws-ux-shortbread/index.js b/docs/_static/aws-ux-shortbread/index.js new file mode 100644 index 000000000..eda166d58 --- /dev/null +++ b/docs/_static/aws-ux-shortbread/index.js @@ -0,0 +1,3 @@ +/*! Version: 1.0.13 */ +!function(e,c){if("object"==typeof exports&&"object"==typeof module)module.exports=c();else if("function"==typeof define&&define.amd)define([],c);else{var a=c();for(var t in a)("object"==typeof exports?exports:e)[t]=a[t]}}(window,(function(){return function(e){var c={};function a(t){if(c[t])return c[t].exports;var n=c[t]={i:t,l:!1,exports:{}};return e[t].call(n.exports,n,n.exports,a),n.l=!0,n.exports}return a.m=e,a.c=c,a.d=function(e,c,t){a.o(e,c)||Object.defineProperty(e,c,{enumerable:!0,get:t})},a.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},a.t=function(e,c){if(1&c&&(e=a(e)),8&c)return e;if(4&c&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(a.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&c&&"string"!=typeof e)for(var n in e)a.d(t,n,function(c){return e[c]}.bind(null,n));return t},a.n=function(e){var c=e&&e.__esModule?function(){return e.default}:function(){return e};return a.d(c,"a",c),c},a.o=function(e,c){return Object.prototype.hasOwnProperty.call(e,c)},a.p="",a(a.s=13)}([function(e,c,a){"use strict";var t=this&&this.__assign||function(){return(t=Object.assign||function(e){for(var c,a=1,t=arguments.length;a0&&r.forEach((function(c){if("string"==typeof c)l.appendChild(t.createTextNode(c));else if("number"==typeof c)l.appendChild(t.createTextNode(r.toString()));else{if(null===c)throw Error("Unsupported child type "+c);e(l,c,t,!0)}})),i?c.appendChild(l):c.insertBefore(l,c.firstChild)}},function(e,c,a){"use strict";var t,n=function(){return void 0===t&&(t=Boolean(window&&document&&document.all&&!window.atob)),t},i=function(){var e={};return function(c){if(void 0===e[c]){var 
a=document.querySelector(c);if(window.HTMLIFrameElement&&a instanceof window.HTMLIFrameElement)try{a=a.contentDocument.head}catch(e){a=null}e[c]=a}return e[c]}}(),o=[];function s(e){for(var c=-1,a=0;a-1?"awsccc-Rtl":"";function b(){return document.querySelector("div[data-id="+s.BANNER_ID+"]")}function f(){return document.querySelector("div[data-id="+s.CUSTOMIZE_ID+"]")}function h(e,c){var a=document.querySelector("label[data-id=awsccc-u-cb-"+e+"-label]"),t=a.classList,n=a.querySelector("input");c?(n.setAttribute("checked",""),t.add("awsccc-u-cb-checkbox-active")):(t.remove("awsccc-u-cb-checkbox-active"),n.removeAttribute("checked")),n.setAttribute("aria-checked",""+c)}var g=function(e){var c=e.event,a=e.category;"checkbox"!==c.target.getAttribute("type")&&"awsccc-cs-s-title"!==c.target.getAttribute("class")||h(a,!p(a))},m=function(c){return function(a,t){var n=b().querySelector("div[data-id=awsccc-cb-tabstart]");document.querySelector("div[data-id="+s.CUSTOMIZE_ID+"]").style.display="none",b().style.display="none",n.setAttribute("tabindex","-1"),e.onSaveConsent(a),document.body.classList.remove("awsccc-cs-modal-open"),e.log("info")(c,{detail:"Save Consent Clicked",source:t,cookie:e.getConsentCookie()})}},k=function(e){"Escape"===e.key&&x()},v=function(){return e.getConsentCookie()||u.DEFAULT_COOKIE},w=function(c){var a;a=v(),o.COOKIE_CATEGORIES.filter((function(e){return e!==o.ESSENTIAL})).forEach((function(e){h(e,a[e])})),f().addEventListener("keydown",k),f().style.display="block",document.body.classList.add("awsccc-cs-modal-open");var t=document.querySelectorAll("div[data-id="+s.TABTRAP_ID+"]");l.convertToArray(t).forEach((function(e,c){0===c&&e.focus({preventScroll:!0}),e.setAttribute("tabindex","0")})),e.log("info")("customizeCookies",{detail:"Customize Consent Clicked",source:c,cookie:e.getConsentCookie()})},x=function(){f().removeEventListener("keydown",k),f().style.display="none",document.body.classList.remove("awsccc-cs-modal-open");var c=f().querySelectorAll("div[data-id="+s.TABTRAP_ID+"]");(l.convertToArray(c).forEach((function(e){e.setAttribute("tabindex","-1")})),"block"===b().style.display)&&b().querySelector("div[data-id=awsccc-cb-tabstart]").focus({preventScroll:!0});e.onModalClose&&e.onModalClose()};return d.default((function(){document.querySelector("#"+s.CONTAINER_ID)||t.render(e.parent||document.body,t.act("div",{id:s.CONTAINER_ID},t.act("div",{id:s.APP_ID,class:a},t.act(n.default,{showConsentSelector:w,handleSaveClick:m("acceptAll"),localizedText:c.consentBanner,hasConsoleNavFooter:e.hasConsoleNavFooter}),t.act(i.default,{consentState:v(),handleSaveClick:m("customize"),handleCheckboxToggle:g,localizedText:c.consentSelector,closeConsentSelector:x,darkModeEnabled:e.hasConsoleNavFooter}))))})),{showConsentSelector:function(e){d.default((function(){w(e)}))},showBanner:function(e){d.default((function(){var c;c=b().querySelector("div[data-id=awsccc-cb-tabstart]"),b().style.display="block",c.setAttribute("tabindex","0"),c.focus({preventScroll:!0}),e()}))}}}c.isChecked=p,c.default={createShortbreadUi:function(e){return b(e)}}},function(e,c,a){"use strict";var t=this&&this.__assign||function(){return(t=Object.assign||function(e){for(var c,a=1,t=arguments.length;a0)try{var o=JSON.parse(atob(n[n.length-1]));return 1===(t=o).e&&"number"==typeof t.p&&"number"==typeof t.f&&"number"==typeof t.a&&"string"==typeof t.i&&"string"==typeof t.v?{essential:1===(a=o).e,performance:1===a.p,functional:1===a.f,advertising:1===a.a,id:a.i,version:a.v}:void i("getCookie",{detail:"Cookie format is 
not valid",cookie:o})}catch(e){return void i("getCookie",{detail:"Error parsing cookie",cookie:n[n.length-1]})}}function s(e){document.cookie=e}c.getConsentCookie=function(e,c){void 0===e&&(e=function(){return document.cookie});var a=o(e(),c);if(a)return{essential:a.essential,performance:a.performance,functional:a.functional,advertising:a.advertising}},c.setConsentCookie=function(e,c,a,r,l,u,d,p){void 0===c&&(c=".aws.amazon.com"),void 0===a&&(a=i.DEFAULT_COOKIE_AGE),void 0===r&&(r=n.default),void 0===l&&(l=s);var b,f=function(e){void 0===e&&(e=function(){return document.cookie});var c=o(e());if(c&&c.id)return c.id}()||r(u,d,p),h=t(t({},e),{id:f,version:i.COOKIE_VERSION}),g={e:(b=h).essential?1:0,p:b.performance?1:0,f:b.functional?1:0,a:b.advertising?1:0,i:b.id,v:b.version};return l("awsccc="+btoa(JSON.stringify(g))+"; domain="+c+"; path=/; max-age="+a+"; secure=true; SameSite=Lax"),h}},function(e,c,a){"use strict";Object.defineProperty(c,"__esModule",{value:!0});var t=a(46);c.default=function(e,c,a){void 0===c&&(c=t.v4),void 0===a&&(a=function(){return"ts-"+Date.now().toString()});var n=e?e("error"):function(){};try{return c()}catch(e){return n("uuid",{detail:"Error generating UUID",errorMessage:e.message||""}),a()}}},function(e,c,a){"use strict";Object.defineProperty(c,"__esModule",{value:!0}),c.queryGeolocationByHttpGetRequest=c.timestampUrl=c.QUERY_PARAM_KEY=void 0;var t=a(12);c.QUERY_PARAM_KEY="awsccc",c.timestampUrl=function(e){if(-1!==e.indexOf("?")){var a=e.split("?");e=a[0]+"?"+c.QUERY_PARAM_KEY+"="+Date.now()+"&"+a[1]}else{if(-1===e.indexOf("#"))return e+"?"+c.QUERY_PARAM_KEY+"="+Date.now();a=e.split("#");e=a[0]+"?"+c.QUERY_PARAM_KEY+"="+Date.now()+"#"+a[1]}return e},c.queryGeolocationByHttpGetRequest=function(e,a,n){function i(c,a,t,n,i){c("info")("geolocationLatency",{metric:a,region:t,detail:n,url:e,status:i.status})}return void 0===e&&(e="https://prod.tools.shortbread.aws.dev/1x1.png"),void 0===a&&(a=5e3),void 0===n&&(n=t.DEFAULT_LOGGER),function(o,s){void 0===s&&(s=n||t.DEFAULT_LOGGER);var r=Date.now(),l=new XMLHttpRequest;l.addEventListener("load",(function(){var e=403===l.status?"NON-EU":"EU";i(s,Date.now()-r,e,"Geolocation Response Received",l),o(e)})),l.addEventListener("timeout",(function(){o("EU");var c="Geolocation Request Timed out";i(s,a,"EU",c,l),s("error")("geolocationRequestTimeout",{url:e,timeoutSetting:a,detail:c})})),l.open("GET",c.timestampUrl(e)),l.timeout=a,l.send()}},c.default=c.queryGeolocationByHttpGetRequest},function(e,c,a){"use strict";var t=this&&this.__assign||function(){return(t=Object.assign||function(e){for(var c,a=1,t=arguments.length;aspan{color:#687078}",""]),e.exports=c},function(e,c,a){"use strict";Object.defineProperty(c,"__esModule",{value:!0});var t=a(0),n=a(6),i=a(3);a(35);var o=a(5);c.default=function(e){var c=e.handleSaveClick,a=e.handleCancelClick,s=e.localizedText;return t.act("div",{id:"awsccc-cs-f-c"},t.act(n.default,{dataId:i.CUSTOMIZE_CANCEL_BTN_ID,variant:"secondary",events:{onclick:a},text:s["button-cancel"],props:{"aria-label":s["button-cancel-aria-label"]}}),t.act(n.default,{dataId:i.CUSTOMIZE_SAVE_BTN_ID,variant:"primary",events:{onclick:function(){c({essential:!0,performance:o.isChecked("performance"),functional:o.isChecked("functional"),advertising:o.isChecked("advertising")},"preferencesModal")}},text:s["button-save"],props:{"aria-label":s["button-save-aria-label"]}}))}},function(e,c,a){var t=a(1),n=a(36);"string"==typeof(n=n.__esModule?n.default:n)&&(n=[[e.i,n,""]]);var 
i={insert:"head",singleton:!1};t(n,i);e.exports=n.locals||{}},function(e,c,a){(c=a(2)(!1)).push([e.i,"#awsccc-sb-ux-c #awsccc-sb-a.awsccc-Rtl #awsccc-cs-f-c{text-align:left}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-f-c{text-decoration:none;padding:10px 20px;text-align:right;border-top:1px solid #eaeded;display:flex;justify-content:center;flex-wrap:wrap}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-f-c .awsccc-u-btn{margin-left:10px}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-f-c .awsccc-u-btn.awsccc-u-btn-secondary{background-color:#fff;border-color:#fff;color:#545b64;margin-bottom:6px}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-f-c .awsccc-u-btn.awsccc-u-btn-secondary:hover{color:#000;background-color:#fafafa}@media screen and (min-width: 700px){#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-f-c{display:block}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-f-c .awsccc-u-btn.awsccc-u-btn-secondary{margin-bottom:0}}",""]),e.exports=c},function(e,c,a){var t=a(1),n=a(38);"string"==typeof(n=n.__esModule?n.default:n)&&(n=[[e.i,n,""]]);var i={insert:"head",singleton:!1};t(n,i);e.exports=n.locals||{}},function(e,c,a){(c=a(2)(!1)).push([e.i,"#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-container{display:flex;align-items:center;justify-items:center;bottom:0;left:0;right:0;top:0;position:fixed;z-index:10002;outline:0;overflow:hidden}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-container-inner{max-width:820px;box-sizing:border-box;outline:none;margin:10px auto;width:calc(100vw - 20px)}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-content{background-color:#fff;border-radius:0;box-sizing:border-box;margin-bottom:0;word-wrap:break-word;box-shadow:0 1px 1px 0 rgba(0,28,36,.3),1px 1px 1px 0 rgba(0,28,36,.15),-1px 1px 1px 0 rgba(0,28,36,.15)}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-header{background-color:#fafafa;padding:19px 20px;border-bottom:1px solid #eaeded}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-title{min-width:0;word-break:break-word;color:#16191f;flex:auto}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-title h2{font-size:18px;font-weight:700;margin:0}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-modalBody{overflow-y:auto;max-height:calc(100vh - 200px);padding:19px 20px}@media screen and (max-width: 480px){#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-modalBody{max-height:calc(100vh - 275px)}}#awsccc-sb-ux-c #awsccc-sb-a #awsccc-cs-modalOverlay{background-color:rgba(242,243,243,.9);position:fixed;z-index:10001;right:0;top:0;bottom:0;left:0}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled{background-color:#2a2e33}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalOverlay{background-color:rgba(22,25,31,.8)}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-header{background-color:#21252c;border-bottom:1px solid #414750}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-title h2{color:#eaeded}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody{background-color:#2a2e33}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody #awsccc-cs-i-container{border-bottom:1px solid #414750}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody #awsccc-cs-i-container span{color:#eaeded}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container{border-bottom:1px solid #414750}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container h3{color:#eaeded}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container p{color:#eaeded}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container 
span{color:#eaeded}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container rect{fill:#1a2029}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container .awsccc-cs-s-text{border-top:0}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container .awscc-u-cb-checkbox-poly-line{display:none}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container .awsccc-u-cb-checkbox-active .awscc-u-cb-checkbox-poly-line{display:inline-block}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-modalBody .awsccc-cs-s-container .awsccc-u-cb-checkbox-active .awscc-u-cb-checkbox-rect{fill:#00a1c9;stroke:#00a1c9}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-f-c{border-top:1px solid #414750;background-color:#2a2e33}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-f-c .awsccc-u-btn-secondary{background-color:#2a2e33}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-f-c .awsccc-u-btn-secondary span{color:#d5dbdb}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-f-c .awsccc-u-btn-secondary:hover{background-color:#21252c}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-f-c .awsccc-u-btn-primary span{color:#16191f}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-l-container span{color:#eaeded}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-l-container span a{display:inline-block}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-l-container span a span{color:#44b9d6}#awsccc-sb-ux-c #awsccc-sb-a .dark-mode-enabled #awsccc-cs-l-container path{color:#44b9d6}div[data-id=awsccc-cs]{display:none}",""]),e.exports=c},function(e,c,a){"use strict";Object.defineProperty(c,"__esModule",{value:!0});var t=a(0);a(40),c.default=function(){return t.act("div",{class:"awsccc-u-i-open-c"},t.act("svg",{class:"awsccc-u-i-open",viewBox:"0 0 16 16",focusable:"false","aria-hidden":"true"},t.act("path",{class:"awsccc-stroke-linecap-square",d:"M10 2h4v4"}),t.act("path",{d:"M6 10l8-8"}),t.act("path",{class:"awsccc-stroke-linejoin-round",d:"M14 9.048V14H2V2h5"})))}},function(e,c,a){var t=a(1),n=a(41);"string"==typeof(n=n.__esModule?n.default:n)&&(n=[[e.i,n,""]]);var i={insert:"head",singleton:!1};t(n,i);e.exports=n.locals||{}},function(e,c,a){(c=a(2)(!1)).push([e.i,"#awsccc-sb-ux-c #awsccc-sb-a .awsccc-u-i-open-c{display:inline-block;vertical-align:middle;line-height:1em;padding-left:.3em}#awsccc-sb-ux-c #awsccc-sb-a .awsccc-u-i-open-c svg{stroke-width:2px;pointer-events:none;fill:none;padding-bottom:1px;height:10px;width:10px}#awsccc-sb-ux-c #awsccc-sb-a .awsccc-u-i-open-c svg .awsccc-stroke-linecap-square{stroke-linecap:square}#awsccc-sb-ux-c #awsccc-sb-a .awsccc-u-i-open-c svg .awsccc-stroke-linejoin-round{stroke-linejoin:round}#awsccc-sb-ux-c #awsccc-sb-a .awsccc-u-i-open-c svg path{stroke:currentColor}",""]),e.exports=c},function(e,c,a){var t=a(1),n=a(43);"string"==typeof(n=n.__esModule?n.default:n)&&(n=[[e.i,n,""]]);var i={insert:"head",singleton:!1};t(n,i);e.exports=n.locals||{}},function(e,c,a){(c=a(2)(!1)).push([e.i,'#awsccc-sb-ux-c #awsccc-sb-a *{font-family:"Amazon Ember","HelveticaNeue","Helvetica Neue","Amazon Ember",Roboto,"Roboto-Regular","Amazon Ember",Helvetica,Arial,sans-serif;font-size:14px;line-height:21px;color:#16191f;text-align:left;background:none;border:0}#awsccc-sb-ux-c #awsccc-sb-a.awsccc-Rtl *{direction:rtl;text-align:right}#awsccc-sb-ux-c #awsccc-sb-a.awsccc-Rtl .awsccc-cs-s-container 
.awsccc-cs-s-action{right:auto;left:20px}@media screen and (min-width: 1020px){#awsccc-sb-ux-c #awsccc-sb-a.awsccc-Rtl #awsccc-cb-c #awsccc-cb-title{padding-right:40px}}#awsccc-sb-ux-c #awsccc-sb-a a,#awsccc-sb-ux-c #awsccc-sb-a a>span,#awsccc-sb-ux-c #awsccc-sb-a a svg path{color:#0073bb;text-decoration:none}#awsccc-sb-ux-c #awsccc-sb-a a:hover,#awsccc-sb-ux-c #awsccc-sb-a a>span:hover,#awsccc-sb-ux-c #awsccc-sb-a a svg path:hover{color:#0073bb;text-decoration:underline}#awsccc-sb-ux-c #awsccc-sb-a .awsccc-tab-helper{outline:0;text-decoration:none}.awsccc-cs-modal-open{overflow:hidden;-webkit-box-sizing:border-box;box-sizing:border-box}',""]),e.exports=c},function(e,c,a){"use strict";Object.defineProperty(c,"__esModule",{value:!0}),c.convertToArray=c.update=void 0,c.update=function(e,c){return Object.keys(c).forEach((function(a){e[a]=c[a]})),e},c.convertToArray=function(e){return Array.prototype.slice.call(e)}},function(e,c,a){"use strict";Object.defineProperty(c,"__esModule",{value:!0});c.default=function(e,c,a){function t(){c.removeEventListener("DOMContentLoaded",t),a.removeEventListener("load",t),e()}void 0===c&&(c=document),void 0===a&&(a=window),"loading"!==c.readyState?a.setTimeout(e):(c.addEventListener("DOMContentLoaded",t),a.addEventListener("load",t))}},function(e,c,a){"use strict";a.r(c),a.d(c,"v1",(function(){return b})),a.d(c,"v3",(function(){return y})),a.d(c,"v4",(function(){return C})),a.d(c,"v5",(function(){return A}));var t="undefined"!=typeof crypto&&crypto.getRandomValues&&crypto.getRandomValues.bind(crypto)||"undefined"!=typeof msCrypto&&"function"==typeof msCrypto.getRandomValues&&msCrypto.getRandomValues.bind(msCrypto),n=new Uint8Array(16);function i(){if(!t)throw new Error("crypto.getRandomValues() not supported. See https://github.com/uuidjs/uuid#getrandomvalues-not-supported");return t(n)}for(var o=[],s=0;s<256;++s)o.push((s+256).toString(16).substr(1));var r,l,u=function(e,c){var a=c||0,t=o;return(t[e[a+0]]+t[e[a+1]]+t[e[a+2]]+t[e[a+3]]+"-"+t[e[a+4]]+t[e[a+5]]+"-"+t[e[a+6]]+t[e[a+7]]+"-"+t[e[a+8]]+t[e[a+9]]+"-"+t[e[a+10]]+t[e[a+11]]+t[e[a+12]]+t[e[a+13]]+t[e[a+14]]+t[e[a+15]]).toLowerCase()},d=0,p=0;var b=function(e,c,a){var t=c&&a||0,n=c||[],o=(e=e||{}).node||r,s=void 0!==e.clockseq?e.clockseq:l;if(null==o||null==s){var b=e.random||(e.rng||i)();null==o&&(o=r=[1|b[0],b[1],b[2],b[3],b[4],b[5]]),null==s&&(s=l=16383&(b[6]<<8|b[7]))}var f=void 0!==e.msecs?e.msecs:Date.now(),h=void 0!==e.nsecs?e.nsecs:p+1,g=f-d+(h-p)/1e4;if(g<0&&void 0===e.clockseq&&(s=s+1&16383),(g<0||f>d)&&void 0===e.nsecs&&(h=0),h>=1e4)throw new Error("uuid.v1(): Can't create more than 10M uuids/sec");d=f,p=h,l=s;var m=(1e4*(268435455&(f+=122192928e5))+h)%4294967296;n[t++]=m>>>24&255,n[t++]=m>>>16&255,n[t++]=m>>>8&255,n[t++]=255&m;var k=f/4294967296*1e4&268435455;n[t++]=k>>>8&255,n[t++]=255&k,n[t++]=k>>>24&15|16,n[t++]=k>>>16&255,n[t++]=s>>>8|128,n[t++]=255&s;for(var v=0;v<6;++v)n[t+v]=o[v];return c||u(n)};var f=function(e,c,a){function t(e,t,n,i){var o=n&&i||0;if("string"==typeof e&&(e=function(e){e=unescape(encodeURIComponent(e));for(var c=[],a=0;a>>9<<4)+1}function g(e,c){var a=(65535&e)+(65535&c);return(e>>16)+(c>>16)+(a>>16)<<16|65535&a}function m(e,c,a,t,n,i){return g((o=g(g(c,e),g(t,i)))<<(s=n)|o>>>32-s,a);var o,s}function k(e,c,a,t,n,i,o){return m(c&a|~c&t,e,c,n,i,o)}function v(e,c,a,t,n,i,o){return m(c&t|a&~t,e,c,n,i,o)}function w(e,c,a,t,n,i,o){return m(c^a^t,e,c,n,i,o)}function x(e,c,a,t,n,i,o){return m(a^(c|~t),e,c,n,i,o)}var y=f("v3",48,(function(e){if("string"==typeof e){var 
c=unescape(encodeURIComponent(e));e=new Uint8Array(c.length);for(var a=0;a>5]>>>t%32&255,i=parseInt("0123456789abcdef".charAt(n>>>4&15)+"0123456789abcdef".charAt(15&n),16);c.push(i)}return c}(function(e,c){e[c>>5]|=128<>5]|=(255&e[t/8])<>>32-c}var A=f("v5",80,(function(e){var c=[1518500249,1859775393,2400959708,3395469782],a=[1732584193,4023233417,2562383102,271733878,3285377520];if("string"==typeof e){var t=unescape(encodeURIComponent(e));e=[];for(var n=0;n>>0;v=k,k=m,m=_(g,30)>>>0,g=h,h=y}a[0]=a[0]+h>>>0,a[1]=a[1]+g>>>0,a[2]=a[2]+m>>>0,a[3]=a[3]+k>>>0,a[4]=a[4]+v>>>0}return[a[0]>>24&255,a[0]>>16&255,a[0]>>8&255,255&a[0],a[1]>>24&255,a[1]>>16&255,a[1]>>8&255,255&a[1],a[2]>>24&255,a[2]>>16&255,a[2]>>8&255,255&a[2],a[3]>>24&255,a[3]>>16&255,a[3]>>8&255,255&a[3],a[4]>>24&255,a[4]>>16&255,a[4]>>8&255,255&a[4]]}))}])})); +//# sourceMappingURL=index.js.map diff --git a/docs/_static/aws-ux-shortbread/init.js b/docs/_static/aws-ux-shortbread/init.js new file mode 100644 index 000000000..d2f7f3aed --- /dev/null +++ b/docs/_static/aws-ux-shortbread/init.js @@ -0,0 +1,34 @@ +(function (w) { +w.URLSearchParams = w.URLSearchParams || function (searchString) { + var self = this; + self.searchString = searchString; + self.get = function (name) { + var results = new RegExp('[\?&]' + name + '=([^&#]*)').exec(self.searchString); + if (results === null) { + return null; + } + else { + return decodeURI(results[1]) || 0; + } + }; +} +})(window); + +const queryString = window.location.search; +const urlParams = new URLSearchParams(queryString); +const lang = urlParams.get('lang') +window.onload = function () { +var domainName = window.location.hostname; + +// remove an instance of shortbread if already exists +var existingShortbreadEl = document.getElementById("awsccc-sb-ux-c"); +existingShortbreadEl && existingShortbreadEl.remove(); + +var shortbread = AWSCShortbread({ + domain: domainName, + language: lang, + //queryGeolocation: function (geolocatedIn) { geolocatedIn("EU") }, +}); + +shortbread.checkForCookieConsent(); +} diff --git a/docs/_static/css/custom.css b/docs/_static/css/custom.css new file mode 100644 index 000000000..c621c7b64 --- /dev/null +++ b/docs/_static/css/custom.css @@ -0,0 +1,4 @@ +/* override table no-wrap */ +.wy-table-responsive table td, .wy-table-responsive table th { + white-space: normal; +} diff --git a/docs/_static/js/analytics.js b/docs/_static/js/analytics.js new file mode 100644 index 000000000..216095695 --- /dev/null +++ b/docs/_static/js/analytics.js @@ -0,0 +1,2 @@ +console.log("Starting analytics..."); +var s_code=s.t();if(s_code)document.write(s_code) diff --git a/docs/analysis.md b/docs/analysis.md deleted file mode 100644 index d00a13c64..000000000 --- a/docs/analysis.md +++ /dev/null @@ -1,595 +0,0 @@ -# Programming Model for Analysis - -This page describes the programming model that SageMaker Debugger provides for your analysis, and introduces you to the constructs of Trial, Tensor and Rule. 
- -## Table of Contents -* [Trial](#Trial) - * [Path of trial](#Path-of-trial) - * [SageMaker training job](#SageMaker-training-job) - * [Non SageMaker training jobs](#Non-SageMaker-training-jobs) - * [Creating a trial object](#Creating-a-trial-object) - * [Creating S3 trial](#Creating-S3-trial) - * [Creating local trial](#Creating-local-trial) - * [Restricting analysis to a range of steps](#Restricting-analysis-to-a-range-of-steps) - * [Trial API](#Trial-API) - * [tensor_names](#tensor_names) - * [tensor](#tensor) - * [has_tensor](#has_tensor) - * [steps](#steps) - * [modes](#modes) - * [mode](#mode) - * [mode_step](#mode_step) - * [global_step](#global_step) - * [workers](#workers) - * [collections](#collections) - * [collection](#collection) - * [wait\_for\_steps](#wait\_for\_steps) - * [has\_passed\_step](#has\_passed\_step) -* [Tensor](#Tensor-1) - * [Tensor API](#Tensor-API) - * [steps](#steps-1) - * [value](#value) - * [reduction_value](#reduction_value) - * [shape](#shape) - * [values](#values) - * [reduction_values](#reduction_values) - * [shapes](#shapes) - * [workers](#workers-1) - * [prev_steps](#prev_steps) -* [Rules](#Rules) - * [Built In Rules](#Built-In-Rules) - * [Writing a custom rule](#Writing-a-custom-rule) - * [Constructor](#Constructor) - * [Function to invoke at a given step](#Function-to-invoke-at-a-given-step) - * [Invoking a rule](#Invoking-a-rule) - * [invoke_rule](#invoke_rule) -* [Exceptions](#Exceptions) -* [Utils](#Utils) - * [Enable or disable refresh of tensors in a trial](#Enable-or-disable-refresh-of-tensors-in-a-trial) - -## Trial -Trial is an object which lets you query for tensors for a given training job, specified by the path where smdebug's artifacts are saved. -Trial is capable of loading new tensors as and when they become available at the given path, allowing you to do both offline as well as realtime analysis. - -### Path of trial -#### SageMaker training job -When running a SageMaker job this path is on S3. SageMaker saves data from your training job locally on the training instance first and uploads them to an S3 location in your account. When you start a SageMaker training job with the python SDK, you can control this path using the parameter `s3_output_path` in the `DebuggerHookConfig` object. This is an optional parameter, if you do not pass this the python SDK will populate a default location for you. If you do pass this, make sure the bucket is in the same region as where the training job is running. If you're not using the python SDK, set this path for the parameter `S3OutputPath` in the `DebugHookConfig` section of `CreateTrainingJob` API. SageMaker takes this path and appends training_job_name and "debug-output" to it to ensure we have a unique path for each training job. - -#### Non SageMaker training jobs -If you are not running a SageMaker training job, this is the path you pass as `out_dir` when you create a smdebug [`Hook`](api.md#hook). Just like when creating the hook, you can pass either a local path or an S3 path (as `s3://bucket/prefix`). - -### Creating a trial object -There are two types of trials you can create: LocalTrial or S3Trial depending on the path. We provide a wrapper method to create the appropriate trial. - -The parameters you have to provide are: -- `path`: path can be a local path or an S3 path of the form `s3://bucket/prefix`. You should see directories such as `collections`, `events` and `index` at this path once the training job starts. -- `name`: name can be any string. 
It is to help you manage different trials. This is an optional parameter, which defaults to the basename of the path if not passed. Please make sure to give it a unique name to prevent confusion.
-
-#### Creating S3 trial
-```python
-from smdebug.trials import create_trial
-trial = create_trial(path='s3://smdebug-testing-bucket/outputs/resnet', name='resnet_training_run')
-```
-
-#### Creating local trial
-```python
-from smdebug.trials import create_trial
-trial = create_trial(path='/home/ubuntu/smdebug_outputs/resnet', name='resnet_training_run')
-```
-
-#### Restricting analysis to a range of steps
-You can optionally pass `range_steps` to restrict your analysis to a certain range of steps.
-Note that if you do so, Trial will not load data from other steps.
-
-*Examples*
-- `range_steps=(100, None)`: This will load all steps after 100
-- `range_steps=(None, 100)`: This will load all steps before 100
-- `range_steps=(100, 200)`: This will load steps between 100 and 200
-- `range_steps=None`: This will load all steps
-
-```python
-from smdebug.trials import create_trial
-tr = create_trial(path='s3://smdebug-testing-bucket/outputs/resnet', name='resnet_training',
-                  range_steps=(100, 200))
-```
-
-### Trial API
-
-Here's a list of methods that the Trial API provides to help you load data for analysis. Please click on a method to see all the parameters it takes and a detailed description. If you are not familiar with smdebug constructs, you might want to review [this doc](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) before going through this page.
-
-| Method | Description |
-| ------------- |-------------|
-| [trial.tensor_names()](#tensor_names) | See names of all tensors available |
-| [trial.tensor(name)](#tensor) | Retrieve smdebug Tensor object |
-| [trial.has_tensor(name)](#has_tensor) | Query for whether tensor was saved |
-| [trial.steps()](#steps) | Query steps for which data was saved |
-| [trial.modes()](#modes) | Query modes for which data was saved |
-| [trial.mode(step)](#mode) | Query the mode for a given global step |
-| [trial.global_step(mode, step)](#global_step) | Query global step for a given step and mode |
-| [trial.mode_step(step)](#mode_step) | Query the mode step for a given global step |
-| [trial.workers()](#workers) | Query list of workers from the data saved |
-| [trial.collections()](#collections) | Query list of collections saved from the training job |
-| [trial.collection(name)](#collection) | Retrieve a single collection saved from the training job |
-| [trial.wait\_for\_steps(steps)](#wait\_for\_steps) | Wait till the requested steps are available |
-| [trial.has\_passed\_step(step)](#has\_passed\_step) | Query whether the requested step is available |
-
-
-#### tensor_names
-Retrieves names of tensors saved
-```python
-trial.tensor_names(step=None,
-                   mode=modes.GLOBAL,
-                   regex=None,
-                   collection=None)
-```
-
-###### Arguments
-All arguments to this method are optional. You are not required to pass any of these arguments as keyword arguments.
-
-- `step (int)` If you want to retrieve the list of tensors saved at a particular step, pass the step number as an integer. This step number will be treated as the step number corresponding to the mode passed below. By default it is treated as the global step.
-- `mode (smdebug.modes enum value)` If you want to retrieve the list of tensors saved for a particular mode, pass the mode here as `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT`, or `smd.modes.GLOBAL`.
-- `regex (str or list[str])` You can filter tensors matching regex expressions by passing a regex pattern as a string, or a list of such patterns. You can only pass one of the `regex` or `collection` parameters.
-- `collection (Collection or str)` You can filter tensors belonging to a collection by either passing a collection object or the name of a collection as a string. You can only pass one of the `regex` or `collection` parameters.
-
-###### Returns
-`list[str]`: List of strings representing names of tensors matching the given arguments. The arguments are combined as an intersection: the method returns the names of tensors saved for the given step and mode which also match each of the other parameters passed.
-
-###### Examples
-- `trial.tensor_names()` Returns all tensors saved for any step or mode.
-- `trial.tensor_names(step=10, mode=modes.TRAIN)` Returns tensors saved for training step 10
-- `trial.tensor_names(regex='relu')` Returns all tensors matching the regex pattern `relu` saved for any step or mode.
-- `trial.tensor_names(collection='gradients')` Returns tensors from collection "gradients"
-- `trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax')` Returns tensors saved for the 10th training step which match the regex `softmax`
-
-
-#### tensor
-Retrieve the `smdebug.core.tensor.Tensor` object by the given name `tname`. You can review all the methods that this Tensor object provides [here](#Tensor-1).
-```python
-trial.tensor(tname)
-```
-###### Arguments
-- `tname (str)` Takes the name of the tensor
-
-###### Returns
-`smdebug.core.tensor.Tensor` object which has [this API](#Tensor-1)
-
-#### has_tensor
-Query whether the trial has a tensor by the given name
-```python
-trial.has_tensor(tname)
-```
-
-###### Arguments
-- `tname (str)` Takes the name of the tensor
-
-###### Returns
-`bool`: `True` if the tensor has been seen by the trial so far, else `False`.
-
-#### steps
-Retrieve a list of steps seen by the trial
-```python
-trial.steps(mode=None)
-```
-
-###### Arguments
-- `mode (smdebug.modes enum value)` Pass a mode here if you want to retrieve the list of steps seen by the trial for that mode.
-If this is not passed, returns steps for all modes.
-
-###### Returns
-`list[int]` List of integers representing step numbers. If a mode was passed, this returns steps within that mode, i.e. mode steps.
-Each of these mode steps has a global step number associated with it. The global step represents
-the sequence of steps across all modes executed by the job.
-
-#### modes
-Retrieve a list of modes seen by the trial
-```python
-trial.modes()
-```
-
-###### Returns
-`list[smdebug.modes enum value]` List of modes for which data was saved from the training job across all steps seen.
-
-#### mode
-Given a global step number you can identify the mode for that step using this method.
-```python
-trial.mode(global_step=100)
-```
-
-###### Arguments
-- `global_step (int)` Takes the global step as an integer
-
-###### Returns
-`smdebug.modes enum value` of the given global step
-
-#### mode_step
-Given a global step number you can identify the `mode_step` for that step using this method.
-```python
-trial.mode_step(global_step=100)
-```
-
-###### Arguments
-- `global_step (int)` Takes the global step as an integer
-
-###### Returns
-`int`: An integer representing `mode_step` of the given global step. Typically used in conjunction with the `mode` method.
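-
-For example, a small sketch combining the two methods above to map a global step back to its mode and mode step (the step number here is illustrative):
-
-```python
-step = 1000                              # a global step number, for illustration
-m = trial.mode(global_step=step)         # e.g. modes.TRAIN
-ms = trial.mode_step(global_step=step)   # the step number within that mode
-```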
-
-#### global_step
-Given a mode and a mode_step number you can retrieve its global step using this method.
-```python
-trial.global_step(mode=modes.GLOBAL, mode_step=100)
-```
-
-###### Arguments
-- `mode (smdebug.modes enum value)` Takes the mode as an enum value
-- `mode_step (int)` Takes the mode step as an integer
-
-###### Returns
-`int` An integer representing the `global_step` of the given mode and mode_step.
-
-#### workers
-Query for all the worker processes from which data was saved by smdebug during multi-worker training.
-```python
-trial.workers()
-```
-
-###### Returns
-`list[str]` A sorted list of names of worker processes from which data was saved. If using TensorFlow Mirrored Strategy for multi-worker training, these represent names of different devices in the process. For Horovod, torch.distributed and similar distributed training approaches, these represent names of the form `worker_0` where 0 is the rank of the process.
-
-
-#### collections
-
-List the collections from the trial. Note that tensors part of these collections may not necessarily have been saved from the training job. Whether a collection was saved or not depends on the configuration of the Hook during training.
-
-```python
-trial.collections()
-```
-
-###### Returns
-`dict[str -> Collection]` A dictionary indexed by the name of the collection, with the Collection object as the value. Please refer to the [Collection API](api.md#Collection) for more details.
-
-#### collection
-
-Get a specific collection from the trial. Note that tensors which are part of this collection may not necessarily have been saved from the training job. Whether this collection was saved or not depends on the configuration of the Hook during training.
-
-```python
-trial.collection(coll_name)
-```
-###### Arguments
-- `coll_name (str)` Name of the collection
-
-###### Returns
-`Collection` The requested Collection object. Please refer to the [Collection API](api.md#Collection) for more details.
-
-
-#### wait\_for\_steps
-This method allows you to wait for steps before proceeding. You might want to use this method if you want to wait for smdebug to see the required steps so you can then query and analyze the tensors saved by that step. This method blocks till all data from the steps are seen by smdebug.
-```python
-trial.wait_for_steps(required_steps, mode=modes.GLOBAL)
-```
-
-###### Arguments
-- `required_steps (list[int])` Step numbers to wait for
-- `mode (smdebug.modes enum value)` The mode to which the given step numbers correspond. This defaults to modes.GLOBAL.
-
-###### Returns
-`None`, but it only returns once smdebug knows definitively whether the requested steps have been seen.
-
-###### Exceptions raised
-`StepUnavailable` and `NoMoreData`. See the [Exceptions](#exceptions) section for more details.
-
-#### has\_passed\_step
-```python
-trial.has_passed_step(step, mode=modes.GLOBAL)
-```
-
-###### Arguments
-- `step (int)` The step number to check if the trial has passed it
-- `mode (smdebug.modes enum value)` The mode to which the given step number corresponds. This defaults to modes.GLOBAL.
-
-###### Returns
-`smdebug.core.tensor.StepState enum value` which can take one of the three values `UNAVAILABLE`, `AVAILABLE` and `NOT_YET_AVAILABLE`.
-
-## Tensor
-An smdebug `Tensor` object can be retrieved through the `trial.tensor(name)` API. It is uniquely identified by its name string.
-It provides the following methods.
-
-| Method | Description|
-| ---- | ----- |
-| [steps()](#steps-1) | Query steps for which tensor was saved |
-| [value(step)](#value) | Get the value of the tensor at a given step as a numpy array |
-| [reduction_value(step)](#reduction_value) | Get the reduction value of the chosen tensor at a particular step |
-| [reduction_values(step)](#reduction_values) | Get all reduction values saved for the chosen tensor at a particular step |
-| [shape(step)](#shape) | Get the shape of the chosen tensor at a particular step |
-| [values(mode)](#values) | Get the values of the tensor for all steps of a given mode |
-| [shapes(mode)](#shapes) | Get the shapes of the tensor for all steps of a given mode |
-| [workers(step)](#workers-1) | Get all the workers for which this tensor was saved at a given step |
-| [prev\_steps(step, n)](#prev_steps) | Get the last n step numbers of a given mode from a given step |
-
-### Tensor API
-#### steps
-Query for the steps at which the given tensor was saved
-```python
-trial.tensor(name).steps(mode=ModeKeys.GLOBAL, show_incomplete_steps=False)
-```
-
-###### Arguments
-- `mode (smdebug.modes enum value)` The mode whose steps to return for the given tensor. Defaults to `modes.GLOBAL`
-- `show_incomplete_steps (bool)` This parameter is relevant only for distributed training. By default this method only returns the steps which have been received from all workers. But if this parameter is set to True, this method will return steps received from at least one worker.
-
-###### Returns
-`list[int]` A list of steps at which the given tensor was saved
-
-#### value
-Get the value of the tensor at a given step as a numpy array
-```python
-trial.tensor(name).value(step_num, mode=ModeKeys.GLOBAL, worker=None)
-```
-
-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`
-
-###### Returns
-`numpy.ndarray` The value of the tensor at the given step and worker (if the training job saved data from multiple workers)
-
-#### reduction_value
-Get the reduction value of the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. The different reductions you can query for are the same as those allowed in [ReductionConfig](api.md#reductionconfig) when saving tensors.
-This API thus allows you to access the reduction you might have saved instead of the full tensor. If you had saved the full tensor, it will calculate the requested reduction at the time of this call.
-
-Reduction names allowed are `min`, `max`, `mean`, `prod`, `std`, `sum`, `variance` and `l1`, `l2` representing the norms.
-
-Each of these can be retrieved for either the original tensor or its absolute value through the `abs` argument: passing `abs=True` returns the reduction of the absolute value of the tensor, while `abs=False` (the default) returns the reduction of the actual tensor.
-
-If you had saved the tensor without any reduction, then you can retrieve the actual tensor as a numpy array and compute any reduction you might be interested in. In such a case you do not need this method.
-
-```python
-trial.tensor(name).reduction_value(step_num, reduction_name,
-                                   mode=modes.GLOBAL, worker=None, abs=False)
-```
-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `reduction_name (str)` The name of the reduction to query for. This can be one of `min`, `max`, `mean`, `std`, `variance`, `sum`, `prod` and the norms `l1`, `l2`.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`
-- `abs (bool)` If abs is True, this method tries to return the reduction passed through `reduction_name` after taking the absolute value of the tensor. It defaults to `False`.
-
-###### Returns
-`numpy.ndarray` The reduction value of the tensor at the given step and worker (if the training job saved data from multiple workers) as a 1x1 numpy array. If this reduction was saved for the tensor during training through the reduction config, it will be loaded and returned. If the given reduction was not saved but the full tensor was saved, the reduction will be computed on the fly and returned. If neither the chosen reduction nor the full tensor is available, this method raises the `TensorUnavailableForStep` exception.
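-
-For example, a short sketch retrieving the mean of the absolute value of a gradient tensor at training step 10 (assuming `modes` refers to `smdebug.modes` as elsewhere on this page; the tensor is picked from the saved gradients collection):
-
-```python
-name = trial.tensor_names(collection='gradients')[0]   # pick some saved gradient tensor
-abs_mean = trial.tensor(name).reduction_value(10, 'mean', mode=modes.TRAIN, abs=True)
-```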
-
-#### shape
-Get the shape of the chosen tensor at a particular step.
-
-```python
-trial.tensor(name).shape(step_num, mode=modes.GLOBAL, worker=None)
-```
-###### Arguments
-- `step_num (int)` The step number whose shape is to be returned for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shape of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`
-
-###### Returns
-`tuple(int)` If only the shape of this tensor was saved through the `save_shape` configuration in ReductionConfig, it will be returned. If the full tensor was saved, then the shape will be computed and returned. If neither the shape nor the full tensor is available, this method raises the `TensorUnavailableForStep` exception.
-
-#### values
-Get the values of the tensor for all steps of a given mode.
-
-```python
-trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
-```
-
-###### Arguments
-- `mode (smdebug.modes enum value)` The mode whose steps to return the values for. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the values of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method.
You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`
-
-###### Returns
-`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
-
-#### reduction_values
-Get all reduction values saved for the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. Please go through the description of the method `reduction_value` for more details.
-
-```python
-trial.tensor(name).reduction_values(step_num, mode=modes.GLOBAL, worker=None)
-```
-
-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`
-
-###### Returns
-`dict[(str, bool) -> numpy.ndarray]` A dictionary with keys being tuples of the form `(reduction_name, abs)` mapping to a 1x1 numpy ndarray value. `abs` here is a boolean that denotes whether the reduction was performed on the absolute value of the tensor or not. Note that this method only returns the reductions which were saved from the training job. It does not compute all known reductions and return them if only the raw tensor was saved.
-
-#### shapes
-Get the shapes of the tensor for all steps of a given mode.
-
-```python
-trial.tensor(name).shapes(mode=modes.GLOBAL, worker=None)
-```
-
-###### Arguments
-- `mode (smdebug.modes enum value)` The mode whose steps to return the shapes for. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the shapes of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step; this is possible with the method `trial.tensor(name).workers(step, mode)`
-
-###### Returns
-`dict[int -> tuple(int)]` A dictionary with step numbers as keys and tuples of ints representing the shapes of the tensor as values.
-
-#### workers
-Get all the workers for which this tensor was saved at a given step
-
-```python
-trial.tensor(name).workers(step_num, mode=modes.GLOBAL)
-```
-
-###### Arguments
-- `step_num (int)` The step number for which to query the workers, for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-
-###### Returns
-`list[str]` A list of worker names for which the tensor was saved at the given step.
-
-#### prev_steps
-Get the last n step numbers of a given mode from a given step.
-
-```python
-trial.tensor(name).prev_steps(step, n, mode=modes.GLOBAL)
-```
-###### Arguments
-- `step (int)` The step number from which to look back, for the mode passed below.
-- `n (int)` Number of previous steps to return
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-
-###### Returns
-`list[int]` A list of size at most n representing the previous steps for the given step and mode. Note that this list can be of size smaller than n if fewer than n steps were saved before the given step in this trial.
-
-## Rules
-Rules are the medium by which SageMaker Debugger executes a certain piece of code regularly on different steps of a training job. A rule is assigned to a trial and can be invoked at each new step of the trial. It can also access other trials for its evaluation. You can evaluate a rule using tensors from the current step or any step before the current step. Please ensure your logic respects these semantics, else you will get a `TensorUnavailableForStep` exception as the data would not yet be available for future steps.
-
-### Built In Rules
-Please refer to the built-in rules that SageMaker provides [here](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#built-in-rules).
-
-### Writing a custom rule
-Writing a rule involves implementing the [Rule interface](../smdebug/rules/rule.py). Below, let us look at a simplified version of a VanishingGradient rule.
-
-##### Constructor
-Creating a rule involves first inheriting from the base `Rule` class provided by smdebug.
-For this example rule here, we do not need to look at any other trials, so we set `other_trials` to None.
-
-```python
-from smdebug.rules import Rule
-
-class VanishingGradientRule(Rule):
-    def __init__(self, base_trial, threshold=0.0000001):
-        super().__init__(base_trial, other_trials=None)
-        self.threshold = float(threshold)
-```
-
-Please note that apart from `base_trial` and `other_trials` (if required), we require all
-arguments of the rule constructor to take a string as value. You can parse them to the type
-that you want from the string. This means that if you want to pass a list of strings, you might want to pass them as a comma-separated string. This restriction is enforced so as to let you create and invoke rules from JSON using SageMaker's APIs.
-
-##### Function to invoke at a given step
-In this function you can implement the core logic of what you want to do with these tensors.
-It should return a boolean value `True` or `False`, where `True` means the rule evaluation condition has been met. When you invoke these rules through SageMaker, the rule evaluation ends when the rule evaluation condition is met. SageMaker creates a CloudWatch event for every rule evaluation job, which can be used to define actions that you might want to take based on the state of the rule.
-
-A simplified version of the actual invoke function for `VanishingGradientRule` is below. Note that it checks every gradient tensor at the given step before concluding that none of them have vanished:
-
-```python
-    def invoke_at_step(self, step):
-        for tensorname in self.base_trial.tensor_names(collection='gradients'):
-            tensor = self.base_trial.tensor(tensorname)
-            abs_mean = tensor.reduction_value(step, 'mean', abs=True)
-            if abs_mean < self.threshold:
-                return True
-        return False
-```
-
-That's it, writing a rule is as simple as that.
-
-### Invoking a rule through SageMaker
-After you've written your rule, you can ask SageMaker to evaluate the rule against your training job, either by using the SageMaker Python SDK as
-```
-estimator = Estimator(
-    ...
-    rules = Rules.custom(
-        name='VGRule',
-        image_uri='864354269164.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
-        instance_type='ml.t3.medium', # instance type to run the rule evaluation on
-        source='rules/vanishing_gradient_rule.py', # path to the rule source file
-        rule_to_invoke='VanishingGradientRule', # name of the class to invoke in the rule source file
-        volume_size_in_gb=30, # EBS volume size required to be attached to the rule evaluation instance
-        collections_to_save=[CollectionConfig("gradients")], # collections to be analyzed by the rule
-        rule_parameters={
-            "threshold": "20.0" # this will be used to initialize the 'threshold' param in your rule constructor
-        }
-)
-```
-If you're using the SageMaker API directly to evaluate the rule, then you can specify the rule configuration [`DebugRuleConfigurations`](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DebugRuleConfiguration.html) in the CreateTrainingJob API request as:
-```
-"DebugRuleConfigurations": [
-    {
-        "RuleConfigurationName": "VGRule",
-        "InstanceType": "ml.t3.medium",
-        "VolumeSizeInGB": 30,
-        "RuleEvaluatorImage": "864354269164.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
-        "RuleParameters": {
-            "source_s3_uri": "s3://path/to/vanishing_gradient_rule.py",
-            "rule_to_invoke": "VanishingGradientRule",
-            "threshold": "20.0"
-        }
-    }
-]
-```
-
-#### Invoking a rule outside of SageMaker through `invoke_rule`
-You might want to invoke the rule locally during development. We provide a function to invoke rules easily. Refer to [smdebug/rules/rule_invoker.py](../smdebug/rules/rule_invoker.py). The invoke function has the following syntax. It takes an instance of a Rule and invokes it for a series of steps one after the other.
-
-```python
-from smdebug.rules import invoke_rule
-from smdebug.trials import create_trial
-
-trial = create_trial('s3://smdebug-dev-test/mnist-job/')
-rule_obj = VanishingGradientRule(trial, threshold=0.0001)
-invoke_rule(rule_obj, start_step=0, end_step=None)
-```
-
-###### Arguments
-- `rule_obj (Rule)` An instance of a subclass of `smdebug.rules.Rule` that you want to invoke.
-- `start_step (int)` Global step number to start invoking the rule from. Note that this refers to a global step. This defaults to 0.
-- `end_step (int or None)`: Global step number to end the invocation of the rule before. To clarify, `end_step` is an exclusive bound; the rule is not invoked at `end_step`. This defaults to `None`, which means run till the end of the job.
-- `raise_eval_cond (bool)` This parameter controls whether to raise the exception `RuleEvaluationConditionMet` when raised by the rule, or to catch it, log the message and move to the next step. Defaults to `False`, which implies that it catches the exception, logs that the evaluation condition was met for a step and moves on to evaluate the next step.
-
-
-## Exceptions
-smdebug is designed to be aware that tensors required to evaluate a rule may not be available at every step. Hence, it raises a few exceptions which allow us to control what happens when a tensor is missing. These are available in the `smdebug.exceptions` module. You can import them as follows:
-
-```python
-from smdebug.exceptions import *
-```
-
-Here are the exceptions and their meanings:
-
-- `TensorUnavailableForStep`: This means that the tensor requested is not available for the step. It may have been or will be saved for a different step number.
You can check which steps the tensor is saved for with `trial.tensor('tname').steps()` [api](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#steps-1). Note that this exception implies that the requested tensor will never become available for this step in the future.
-
-- `TensorUnavailable`: This means that this tensor has not been saved from the training job. Note that if you have a `SaveConfig` which saves a certain tensor only after the time you queried for the tensor, you might get a `TensorUnavailable` exception even though the tensor may become available later for some step.
-
-- `StepUnavailable`: This means that the step was not saved from the training job. No tensor will be available for this step.
-
-- `StepNotYetAvailable`: This means that the step has not yet been seen from the training job. It may be available in the future if the training is still going on. We automatically load new data as and when it becomes available. This step may either become available in the future, or the exception might change to `StepUnavailable`.
-
-- `NoMoreData`: This will be raised when the training ends. Once you see this, you will know that there will be no more steps and no more tensors saved.
-
-- `RuleEvaluationConditionMet`: This is raised when the rule invocation returns `True` for some step.
-
-- `MissingCollectionFiles`: This is raised when no data was saved by the training job. Check that the `Hook` was configured correctly before starting the training job.
-
-## Utils
-
-### Enable or disable refresh of tensors in a trial
-
-By default smdebug refreshes tensors each time you try to query the tensor.
-It checks whether this tensor is saved for new steps and, if so, fetches them.
-If you know the saved data will not change (you have stopped the machine learning job), or
-are not interested in the latest data, you can stop the refreshing of tensors as follows:
-
-`no_refresh` takes a trial or a list of trials that should not be refreshed.
-Anything executed inside the `with no_refresh` block will not be refreshed.
-
-```python
-from smdebug.analysis.utils import no_refresh
-with no_refresh(trials):
-    pass
-```
-
-Similarly, if you want to refresh tensors only within a block, you can do:
-
-```python
-from smdebug.analysis.utils import refresh
-with refresh(trials):
-    pass
-```
-
-During rule invocation smdebug waits till the current step is available and then turns off refresh to ensure that you do not get different results for methods like `trial.tensor(name).steps()` and run into subtle issues.
diff --git a/docs/api.md b/docs/api.md
deleted file mode 100644
index fe01532df..000000000
--- a/docs/api.md
+++ /dev/null
@@ -1,472 +0,0 @@
-
-# Saving Tensors API
-
-- [Glossary](#glossary)
-- [Hook](#hook)
-  - [Creating a Hook](#creating-a-hook)
-    - [Hook when using SageMaker Python SDK](#hook-when-using-sagemaker-python-sdk)
-    - [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk)
-    - [Hook from Python constructor](#hook-from-python-constructor)
-  - [Common Hook API](#common-hook-api)
-  - [TensorFlow specific Hook API](#tensorflow-specific-hook-api)
-  - [MXNet specific Hook API](#mxnet-specific-hook-api)
-  - [PyTorch specific Hook API](#pytorch-specific-hook-api)
-- [Modes](#modes)
-- [Collection](#collection)
-- [SaveConfig](#saveconfig)
-- [ReductionConfig](#reductionconfig)
-
-## Glossary
-
-The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.
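-
-For example, for TensorFlow this convention translates to the following (a trivial sketch; the other frameworks follow the same pattern):
-
-```python
-import smdebug.tensorflow as smd   # or smdebug.pytorch / smdebug.mxnet / smdebug.xgboost
-```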
-
-**Step**: Step means the work done by the training job for one batch (i.e. forward and backward pass). (An exception is with TensorFlow's Session interface, where a step also includes the initialization session run calls.) SageMaker Debugger is designed in terms of steps: when to save data is specified using steps, and the invocation of Rules is on a step-by-step basis.
-
-**Hook**: The main class to pass as a callback object, or to create callback functions. It keeps track of collections and writes output files at each step. The current hook implementation does not support merging tensors from the current job with tensors from previous job(s). Hence, ensure that the 'out_dir' does not exist prior to instantiating the 'Hook' object.
-- `hook = smd.Hook(out_dir="/tmp/mnist_job")`
-
-**Mode**: One of "train", "eval", "predict", or "global". Helpful for segmenting data based on the phase
-you're in. Defaults to "global".
-- `train_mode = smd.modes.TRAIN`
-
-**Collection**: A group of tensors. Each collection contains its configuration for what tensors are part of it, and when to save them.
-- `collection = hook.get_collection("losses")`
-
-**SaveConfig**: A Python dict specifying how often to save losses and tensors.
-- `save_config = smd.SaveConfig(save_interval=10)`
-
-**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor. Reductions are simple floats.
-- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`
-
-**Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors. See the [trials documentation](analysis.md).
-- `trial = smd.create_trial(out_dir="/tmp/mnist_job")`
-
-**Rule**: A condition to monitor the saved data for. It can trigger an exception when the condition is met, for example a vanishing gradient. See the [rules documentation](analysis.md).
-
----
-
-## Hook
-### Creating a Hook
-By using AWS Deep Learning Containers, you can directly run your own training script without any additional effort to make it compatible with the SageMaker Python SDK. For a detailed developer guide for this, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
-
-However, for some advanced use cases where you need access to customized tensors from targeted parts of a training script, you can manually construct the hook object. The smdebug library provides hook classes to make this process simple and compatible with the SageMaker ecosystem and Debugger.
-
-#### Hook when using the SageMaker Python SDK
-If you create a SageMaker job and specify the hook configuration in the SageMaker Estimator API
-as described in the [AWS Docs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html),
-the CreateTrainingJob API operation containing the hook configuration will be automatically written to the training container.
-
-To capture tensors from your training model, paste the following code at the top or in the main function of the training script.
-```python
-import smdebug.Framework as smd
-hook = smd.HookClass.create_from_json_file()
-```
-
-Depending on your choice of framework, `HookClass` needs to be replaced by one of `KerasHook`, `SessionHook` or `EstimatorHook` for TensorFlow, and `Hook` for PyTorch, MXNet, and XGBoost.
-
-The framework in the `smd.Framework` import refers to one of `tensorflow`, `mxnet`, `pytorch`, or `xgboost`.
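-
-For instance, with TensorFlow Keras this pattern becomes the following minimal sketch (`KerasHook` and `create_from_json_file` are documented below; the JSON configuration file is written by SageMaker automatically when the training job starts):
-
-```python
-import smdebug.tensorflow as smd
-
-# Create the hook from the JSON config that SageMaker places in the container
-# (default path: /opt/ml/input/config/debughookconfig.json).
-hook = smd.KerasHook.create_from_json_file()
-```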
- -After choosing a framework and defining the hook object, you need to embed the hooks into target parts of your training script to retrieve tensors and to use with the SageMaker Debugger Python SDK. - -For more information about constructing the hook depending on a framework of your choice and adding the hooks to your model, see the following pages. - -* [TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md) -* [MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md) -* [PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md) -* [XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md) - -#### Configuring Hook using SageMaker Python SDK -After you make the minimal changes to your training script, you can configure the hook with parameters to the SageMaker Debugger API operation, `DebuggerHookConfig`. - -```python -from sagemaker.debugger import DebuggerHookConfig -hook_config = DebuggerHookConfig( - s3_output_path='s3://smdebug-dev-demo-pdx/mnist', - hook_parameters={ - "parameter": "value" - }) -``` - -The available hook parameters are listed in the following. The meaning of these parameters will be clear as you review the sections of documentation below. Note that all parameters below have to be strings. So for any parameter which accepts a list (such as save_steps, reductions, include_regex), the value needs to be given as strings separated by a comma between them. - -``` -dry_run -save_all -include_workers -include_regex -reductions -save_raw_tensor -save_shape -save_interval -save_steps -start_step -end_step -train.save_interval -train.save_steps -train.start_step -train.end_step -eval.save_interval -eval.save_steps -eval.start_step -eval.end_step -predict.save_interval -predict.save_steps -predict.start_step -predict.end_step -global.save_interval -global.save_steps -global.start_step -global.end_step -``` - -#### Hook from Python constructor -See the framework-specific pages for more details. - -HookClass below can be one of `KerasHook`, `SessionHook`, `EstimatorHook` for TensorFlow, or is just `Hook` for MXNet, Pytorch and XGBoost. - -```python -hook = HookClass( - out_dir, - export_tensorboard = False, - tensorboard_dir = None, - dry_run = False, - reduction_config = None, - save_config = None, - include_regex = None, - include_collections = None, - save_all = False, - include_workers="one" -) -``` -##### Arguments -- `out_dir` (str): Path where to save tensors and metadata. This is a required argument. Please ensure that the 'out_dir' does not exist. -- `export_tensorboard` (bool): Whether to export TensorBoard summaries (distributions and histograms for tensors saved, and scalar summaries for scalars saved). Defaults to `False`. Note that when running on SageMaker this parameter will be ignored. You will need to use the TensorBoardOutputConfig section in API to enable TensorBoard summaries. Refer [SageMaker page](sagemaker.md) for an example. -- `tensorboard_dir` (str): Path where to save TensorBoard artifacts. If this is not passed and `export_tensorboard` is True, then TensorBoard artifacts are saved in `out_dir/tensorboard` . Note that when running on SageMaker this parameter will be ignored. You will need to use the TensorBoardOutputConfig section in API to enable TensorBoard summaries. Refer [SageMaker page](sagemaker.md) for an example. 
-- `dry_run` (bool): If true, don't write any files -- `reduction_config`: ([ReductionConfig](#reductionconfig) object) Specifies the reductions to be applied as default for tensors saved. A collection can have its own `ReductionConfig` object which overrides this for the tensors which belong to that collection. -- `save_config`: ([SaveConfig](#saveconfig) object) Specifies when to save tensors. A collection can have its own `SaveConfig` object which overrides this for the tensors which belong to that collection. -- `include_regex` (list[str]): list of regex patterns which specify the tensors to save. Tensors whose names match these patterns will be saved -- `include_collections` (list[str]): List of which collections to save specified by name -- `save_all` (bool): Saves all tensors and collections. Increases the amount of disk space used, and can reduce the performance of the training job significantly, depending on the size of the model. -- `include_workers` (str): Used for distributed training. It can take the values `one` or `all`. `one` means only the tensors from one chosen worker will be saved. This is the default behavior. `all` means tensors from all workers will be saved. - -### Common Hook API -These methods are common for all hooks in any framework. - -Note that `smd` import below translates to `import smdebug.{framework} as smd`. - -| Method | Arguments | Behavior | -| --- | --- | --- | -|`add_collection(collection)` | `collection (smd.Collection)` | Takes a Collection object and adds it to the CollectionManager that the Hook holds. Note that you should only pass in a Collection object for the same framework as the hook | -|`get_collection(name)`| `name (str)` | Returns collection identified by the given name | -|`get_collections()` | - | Returns all collection objects held by the hook | -|`set_mode(mode)`| value of the enum `smd.modes` | Sets mode of the job, can be one of `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT` or `smd.modes.GLOBAL`. Refer [Modes](#modes) for more on that. | -|`create_from_json_file(`
` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter.
If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to. -|`close()` | - | Closes all files that are currently open by the hook | -| `save_scalar()` | `name (str)`
`value (float)`
`sm_metric (bool)`| Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
-| `save_tensor()`| `tensor_name (str)`, `tensor_value (numpy.array or numpy.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|
-
-
-### TensorFlow specific Hook API
-Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHook and KerasHook, based on the TensorFlow interface being used for training. [This page](tensorflow.md) shows examples of each of these.
-
-| Method | Arguments | Returns | Behavior |
-| --- | --- | --- | --- |
-| `wrap_optimizer(optimizer)` | `optimizer` (tf.train.Optimizer or tf.keras.Optimizer) | Returns the same optimizer object passed with a couple of identifying markers to help `smdebug`. This returned optimizer should be used for training. | When not using Zero Script Change environments, calling this method on your optimizer is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same optimizer object passed and does not change your optimization logic. If the hook is of type `KerasHook`, you can pass in either an object of type `tf.train.Optimizer` or `tf.keras.Optimizer`. If the hook is of type `SessionHook` or `EstimatorHook`, the optimizer can only be of type `tf.train.Optimizer`. |
-| `add_to_collection(`
`collection_name, variable)` | `collection_name (str)` : name of the collection to add to.
`variable` parameter to pass to the collection's `add` method. | `None` | Calls the `add` method of a collection object. See [this section](#collection) for more. |
-
-The following hook APIs are specific to training scripts using the TF 2.x GradientTape ([Example](tensorflow.md#TF 2.x GradientTape example)):
-
-| Method | Arguments | Returns | Behavior |
-| --- | --- | --- | --- |
-| `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed. |
-
-### MXNet specific Hook API
-
-| Method | Arguments | Behavior |
-| --- | --- | --- |
-| `register_block(block)` | `block (mx.gluon.Block)` | Calling this method applies the hook to the Gluon block representing the model, so SageMaker Debugger gets called by MXNet and can save the tensors required. |
-
-### PyTorch specific Hook API
-
-
-| Method | Arguments | Behavior |
-| --- | --- | --- |
-| `register_module(module)` | `module (torch.nn.Module)` | Calling this method applies the hook to the Torch Module representing the model, so SageMaker Debugger gets called by PyTorch and can save the tensors required. |
-| `register_loss(loss_module)` | `loss_module (torch.nn.modules.loss._Loss)` | Calling this method applies the hook to the Torch Module representing the loss, so SageMaker Debugger can save losses. |
-
----
-
-## Modes
-Used to signify which part of training you're in, similar to Keras modes. `GLOBAL` mode is used as
-a default when no mode was set. Choose from
-```python
-smdebug.modes.TRAIN
-smdebug.modes.EVAL
-smdebug.modes.PREDICT
-smdebug.modes.GLOBAL
-```
-
-The modes enum is also available under the alias `smdebug.{framework}.modes`.
-
----
-
-## Collection
-
-The construct of a Collection groups tensors together. A Collection is identified by a string representing the name of the collection. It can be used to group tensors of a particular kind such as "losses", "weights", "biases", or "gradients". A Collection has its own list of tensors specified by include regex patterns, and other parameters determining how these tensors should be saved and when. Using collections enables you to save different types of tensors at different frequencies and in different forms. These collections are then also available during analysis so you can query a group of tensors at once.
-
-There are a number of built-in collections that SageMaker Debugger manages by default. This means that the library takes care of identifying what tensors should be saved as part of each of those collections. You can also define custom collections, and there are a couple of different ways to do so.
-
-You can specify which of these collections to save in the hook's `include_collections` parameter, or through the `collection_configs` parameter to the `DebuggerHookConfig` in the SageMaker Python SDK.
-
-### Built in Collections
-Below is a comprehensive list of the built-in collections that are managed by SageMaker Debugger. The Hook identifies the tensors that should be saved as part of each collection for the framework in use, and saves them if they were requested.
-
-The names of these collections are all lowercase strings.
-
-| Name | Supported by frameworks/hooks | Description |
-|---|---|---|
-| `all` | all | Matches all tensors |
-| `default` | all | It's a default collection created, which matches the regex patterns passed as `include_regex` to the Hook |
-| `weights` | TensorFlow, PyTorch, MXNet | Matches all weights of the model |
-| `biases` | TensorFlow, PyTorch, MXNet | Matches all biases of the model |
-| `gradients` | TensorFlow, PyTorch, MXNet | Matches all gradients of the model. In TensorFlow when not using Zero Script Change environments, must use `hook.wrap_optimizer()`. |
-| `losses` | TensorFlow, PyTorch, MXNet | Saves the loss for the model |
-| `metrics` | TensorFlow's KerasHook, XGBoost | For KerasHook, saves the metrics computed by Keras for the model. For XGBoost, the evaluation metrics computed by the algorithm. |
-| `outputs` | TensorFlow's KerasHook | Matches the outputs of the model |
-| `layers` | TensorFlow's KerasHook | Input and output of intermediate convolutional layers |
-| `sm_metrics` | TensorFlow | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metric. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. |
-| `optimizer_variables` | TensorFlow's KerasHook | Matches all optimizer variables, currently only supported in Keras. |
-| `hyperparameters` | XGBoost | [Booster parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) |
-| `predictions` | XGBoost | Predictions on validation set (if provided) |
-| `labels` | XGBoost | Labels on validation set (if provided) |
-| `feature_importance` | XGBoost | Feature importance given by [get_score()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score) |
-| `full_shap` | XGBoost | A matrix of (nsample, nfeatures + 1) with each record indicating the feature contributions ([SHAP values](https://github.com/slundberg/shap)) for that prediction. Computed on training data with [predict()](https://github.com/slundberg/shap) |
-| `average_shap` | XGBoost | The sum of SHAP value magnitudes over all samples. Represents the impact each feature has on the model output. |
-| `trees` | XGBoost | Boosted tree model given by [trees_to_dataframe()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.trees_to_dataframe) |
-
-### Default collections saved
-The following collections are saved regardless of the hook configuration.
-
-| Framework | Default collections saved |
-|---|---|
-| `TensorFlow` | METRICS, LOSSES, SM_METRICS |
-| `PyTorch` | LOSSES |
-| `MXNet` | LOSSES |
-| `XGBoost` | METRICS |
-
- If, for some reason, you want to disable the saving of these collections, you can do so by setting end_step to 0 in the collection's SaveConfig.
- When using the SageMaker Python SDK, this would look like:
-```python
-from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
-hook_config = DebuggerHookConfig(
-    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
-    collection_configs=[
-        CollectionConfig(name="metrics", parameters={"end_step": 0})
-    ]
-)
-```
- When configuring the Collection in your Python script, it would be as follows:
-```python
-hook.get_collection("metrics").save_config.end_step = 0
-```
-
-### Creating or retrieving a Collection
-
-| Function | Behavior |
-|---|---|
-| ```hook.get_collection(collection_name)``` | Returns the collection with the given name. Creates the collection with default configuration if it doesn't already exist. A new collection created by default does not match any tensor and is configured to save histograms and distributions along with the tensor if tensorboard support is enabled, and uses the reduction configuration and save configuration passed to the hook. |
-
-### Properties of a Collection
-| Property | Description |
-|---|---|
-| `tensor_names` | Get or set list of tensor names as strings |
-| `include_regex` | Get or set list of regexes to include. Tensors whose names match these regex patterns will be included in the collection |
-| `reduction_config` | Get or set the ReductionConfig object to be used for tensors part of this collection |
-| `save_config` | Get or set the SaveConfig object to be used for tensors part of this collection |
-| `save_histogram` | Get or set the boolean flag which determines whether to write histograms to enable histograms and distributions in TensorBoard, for tensors part of this collection. Only applicable if TensorBoard support is enabled.|
-
-
-### Methods on a Collection
-
-| Method | Behavior |
-|---|---|
-| ```coll.include(regex)``` | Takes a regex string or a list of regex strings to match tensors to include in the collection. |
-| ```coll.add(tensor)``` | **(TensorFlow only)** Takes an instance or list or set of tf.Tensor/tf.Variable/tf.MirroredVariable/tf.Operation to add to the collection. |
-| ```coll.add_keras_layer(layer, inputs=False, outputs=True)``` | **(tf.keras only)** Takes an instance of a tf.keras layer and logs input/output tensors for that layer. By default, only outputs are saved. |
-| ```coll.add_module_tensors(module, inputs=False, outputs=True)``` | **(PyTorch only)** Takes an instance of a PyTorch module and logs input/output tensors for that module. By default, only outputs are saved. |
-| ```coll.add_block_tensors(block, inputs=False, outputs=True)``` | **(MXNet only)** Takes an instance of a Gluon block, and logs input/output tensors for that block. By default, only outputs are saved. |
-
-### Configuring Collection using SageMaker Python SDK
-Parameters to configure Collection are passed as below when using the SageMaker Python SDK.
-```python
-from sagemaker.debugger import CollectionConfig
-coll_config = CollectionConfig(
-    name="weights",
-    parameters={ "parameter": "value" })
-```
-The parameters can be one of the following. The meaning of these parameters will be clear as you review the sections of documentation below. Note that all parameters below have to be strings. So any parameter which accepts a list (such as save_steps, reductions, include_regex) needs to be given as strings separated by a comma between them.
-
-```
-include_regex
-save_histogram
-reductions
-save_raw_tensor
-save_interval
-save_steps
-start_step
-end_step
-train.save_interval
-train.save_steps
-train.start_step
-train.end_step
-eval.save_interval
-eval.save_steps
-eval.start_step
-eval.end_step
-predict.save_interval
-predict.save_steps
-predict.start_step
-predict.end_step
-global.save_interval
-global.save_steps
-global.start_step
-global.end_step
-```
-
-
----
-
-## SaveConfig
-The SaveConfig class customizes the frequency of saving tensors.
-The hook takes a SaveConfig object which is applied as default to all tensors included.
-A collection can also have a SaveConfig object which is applied to the collection's tensors.
-You can also choose to have a different configuration for when to save tensors based on the mode of the job.
-
-This class is available in the following namespaces: `smdebug` and `smdebug.{framework}`.
-
-```python
-import smdebug as smd
-save_config = smd.SaveConfig(
-    mode_save_configs = None,
-    save_interval = 500,
-    start_step = 0,
-    end_step = None,
-    save_steps = None,
-)
-```
-##### Arguments
-- `mode_save_configs` (dict): Used for advanced cases; see details below.
-- `save_interval` (int): How often, in steps, to save tensors. Defaults to 500. A step is saved if `step % save_interval == 0`
-- `start_step` (int): When to start saving tensors.
-- `end_step` (int): When to stop saving tensors, exclusive.
-- `save_steps` (list[int]): Specific steps to save tensors at. Union with save_interval.
-
-##### Examples
-
-- `SaveConfig()` will save at steps 0, 500, ...
-- `SaveConfig(save_interval=1)` will save at steps 0, 1, ...
-- `SaveConfig(save_interval=100, end_step=200)` will save at steps 0, 100
-- `SaveConfig(save_interval=100, end_step=201)` will save at steps 0, 100, 200
-- `SaveConfig(save_interval=100, start_step=150)` will save at steps 200, 300, ...
-- `SaveConfig(save_steps=[3, 7])` will save at steps 0, 3, 7, 500, ...
-
-##### Specifying different configuration based on mode
-There is also a more advanced use case, where you specify a different SaveConfig for each mode.
-It is best understood through an example:
-```python
-import smdebug as smd
-smd.SaveConfig(mode_save_configs={
-    smd.modes.TRAIN: smd.SaveConfigMode(save_interval=1),
-    smd.modes.EVAL: smd.SaveConfigMode(save_interval=2),
-    smd.modes.PREDICT: smd.SaveConfigMode(save_interval=3),
-    smd.modes.GLOBAL: smd.SaveConfigMode(save_interval=4)
-})
-```
-Essentially, create a dictionary mapping modes to SaveConfigMode objects. The SaveConfigMode objects
-take the same four parameters (save_interval, start_step, end_step, save_steps) as the main object.
-Any mode not specified will default to the default configuration. If a mode is provided but not all
-params are specified, we use the default values for non-specified parameters.
-
-#### Configuration using SageMaker Python SDK
-Refer to [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring Collection using SageMaker Python SDK](#configuring-collection-using-sagemaker-python-sdk)
-
----
-
-## ReductionConfig
-ReductionConfig allows the saving of certain reductions of tensors instead
-of saving the full tensor. The motivation here is to reduce the amount of data
-saved, and increase the speed in cases where you don't need the full
-tensor. The reduction operations are computed in the training process
-and then saved.
-
-During analysis, these are available as reductions of the original tensor.
-Please note that using reduction config means that you will not have
-the full tensor available during analysis, so this can restrict what you can do with the saved tensor. You can choose to also save the raw tensor along with the reductions if you so desire.
-
-The hook takes a ReductionConfig object which is applied as default to all tensors included.
-A collection can also have its own ReductionConfig object which is applied
-to the tensors belonging to that collection.
-
-```python
-import smdebug as smd
-reduction_config = smd.ReductionConfig(
-    reductions = None,
-    abs_reductions = None,
-    norms = None,
-    abs_norms = None,
-    save_raw_tensor = False,
-)
-```
-
-##### Arguments
-- `reductions` (list[str]): Takes names of reductions, choosing from "min", "max", "median", "mean", "std", "variance", "sum", "prod"
-- `abs_reductions` (list[str]): Same as reductions, except the reduction will be computed on the absolute value of the tensor
-- `norms` (list[str]): Takes names of norms to compute, choosing from "l1", "l2"
-- `abs_norms` (list[str]): Same as norms, except the norm will be computed on the absolute value of the tensor
-- `save_raw_tensor` (bool): Saves the tensor directly, in addition to other desired reductions
-
-For example,
-
-`ReductionConfig(reductions=['std', 'variance'], abs_reductions=['mean'], norms=['l1'])`
-
-will save the standard deviation and variance, the mean of the absolute value, and the l1 norm.
-
-#### Configuration using SageMaker Python SDK
-The reductions are passed as part of the "reductions" parameter to HookParameters or Collection Parameters.
-Refer to [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring Collection using SageMaker Python SDK](#configuring-collection-using-sagemaker-python-sdk) for more on that.
-
-The parameter "reductions" can take a comma-separated string consisting of the following values:
-```
-min
-max
-median
-mean
-std
-variance
-sum
-prod
-l1
-l2
-abs_min
-abs_max
-abs_median
-abs_mean
-abs_std
-abs_variance
-abs_sum
-abs_prod
-abs_l1
-abs_l2
-```
-
----
-
-## Frameworks
-
-For details on what's supported for the different frameworks, go here:
-* [TensorFlow](tensorflow.md)
-* [PyTorch](pytorch.md)
-* [MXNet](mxnet.md)
-* [XGBoost](xgboost.md)
diff --git a/docs/api.rst.bkup b/docs/api.rst.bkup
new file mode 100644
index 000000000..9b239951e
--- /dev/null
+++ b/docs/api.rst.bkup
@@ -0,0 +1,1024 @@
+Hook API
+========
+
+- `Glossary <#glossary>`__
+- `Hook <#hook>`__
+
+  - `Creating a Hook <#creating-a-hook>`__
+
+    - `Hook when using SageMaker Python
+      SDK <#hook-when-using-sagemaker-python-sdk>`__
+    - `Configuring Hook using SageMaker Python
+      SDK <#configuring-hook-using-sagemaker-python-sdk>`__
+    - `Hook from Python
+      constructor <#hook-from-python-constructor>`__
+
+  - `Common Hook API <#common-hook-api>`__
+  - `TensorFlow specific Hook API <#tensorflow-specific-hook-api>`__
+  - `MXNet specific Hook API <#mxnet-specific-hook-api>`__
+  - `PyTorch specific Hook API <#pytorch-specific-hook-api>`__
+
+- `Modes <#modes>`__
+- `Collection <#collection>`__
+- `SaveConfig <#saveconfig>`__
+- `ReductionConfig <#reductionconfig>`__
+
+Glossary
+--------
+
+The following glossary items assume you import the SMDebug framework modules as follows:
+
+- **TensorFlow**
+
+  .. code:: python
+
+     import smdebug.tensorflow as smd
+
+- **PyTorch**
+
+  .. code:: python
+
+     import smdebug.pytorch as smd
+
+- **MXNet**
+
+  .. code:: python
+
+     import smdebug.mxnet as smd
+
+- **XGBoost**
+
+  .. code:: python
code:: python
+
+      import smdebug.xgboost as smd
+
+Terminologies
+~~~~~~~~~~~~~
+
+- **Step**: A step is the work done for one batch by a training job
+  (i.e., the forward and backward pass). (An exception is TensorFlow’s
+  Session interface, where a step also includes the initialization session
+  run calls.) SageMaker Debugger is designed in terms of steps: when to
+  save data is specified using steps, and Rules are invoked on a
+  step-by-step basis.
+
+- **Hook**: The main class to pass as a callback object or to create
+  callback functions. It keeps track of collections and writes output
+  files at each step. The current hook implementation does not support
+  merging tensors from the current job with tensors from previous job(s).
+  Therefore, ensure that the ``out_dir`` does not exist prior to instantiating
+  the ``Hook`` object. - ``hook = smd.Hook(out_dir="/tmp/mnist_job")``
+
+- **Mode**: One of “train”, “eval”, “predict”, or “global”. Helpful for
+  segmenting data based on the phase you’re in. Defaults to “global”. -
+  ``train_mode = smd.modes.TRAIN``
+
+- **Collection**: A group of tensors. Each collection contains its
+  configuration for what tensors are part of it, and when to save them. -
+  ``collection = hook.get_collection("losses")``
+
+- **SaveConfig**: A Python dict specifying how often to save losses and
+  tensors. - ``save_config = smd.SaveConfig(save_interval=10)``
+
+- **ReductionConfig**: Allows you to save a reduction, such as ‘mean’ or
+  ‘l1 norm’, instead of the full tensor. Reductions are simple floats. -
+  ``reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])``
+
+- **Trial**: The main interface to use when analyzing a completed training
+  job. Use it to access collections and tensors. See the trials
+  documentation. -
+  ``trial = smd.create_trial(out_dir="/tmp/mnist_job")``
+
+- **Rule**: A condition to monitor the saved data for. It can trigger an
+  exception when the condition is met, for example a vanishing gradient.
+  See the rules documentation.
+
+--------------
+
+Hook
+----
+
+Creating a Hook
+~~~~~~~~~~~~~~~
+
+By using AWS Deep Learning Containers, you can directly run your own
+training scripts without any additional effort to make them compatible with
+the SageMaker Python SDK. For a detailed developer guide for this, see
+*Use Debugger in AWS Containers* in the *Amazon SageMaker Developer Guide*.
+
+However, for some advanced use cases where you need access to customized
+tensors from targeted parts of a training script, you can manually
+construct the hook object. The SMDebug library provides hook classes to
+make this process simple and compatible with the SageMaker ecosystem and
+Debugger.
+
+To capture tensors from your training model, paste the following code at
+the top of your training script or into its main function.
+
+.. code:: python
+
+   import smdebug.Framework as smd
+   hook = smd.HookClass.create_from_json_file()
+
+Depending on your choice of framework, ``HookClass`` needs to be replaced
+by one of ``KerasHook``, ``SessionHook`` or ``EstimatorHook`` for
+TensorFlow, and by ``Hook`` for PyTorch, MXNet, and XGBoost.
+
+The ``Framework`` in the ``smd.Framework`` import refers to one of
+``tensorflow``, ``mxnet``, ``pytorch``, or ``xgboost``.
+
+After choosing a framework and defining the hook object, you need to
+embed the hooks into target parts of your training script to retrieve
+tensors and to use with the SageMaker Debugger Python SDK.
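+
+For example, a minimal sketch for TensorFlow Keras, assuming a compiled
+``tf.keras`` model named ``model`` and training data ``x_train`` and
+``y_train`` (all three names are illustrative):
+
+.. code:: python
+
+   import smdebug.tensorflow as smd
+
+   # Create the hook from the JSON configuration file that SageMaker
+   # writes (or from the path in SMDEBUG_CONFIG_FILE_PATH).
+   hook = smd.KerasHook.create_from_json_file()
+
+   # Pass the hook as a Keras callback so it can save tensors at each step.
+   model.fit(x_train, y_train, epochs=1, callbacks=[hook])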
+
+For more information about constructing the hook depending on a
+framework of your choice and adding the hooks to your model, see the
+following pages.
+
+- :doc:`TensorFlow hook <tensorflow>`
+- :doc:`MXNet hook <mxnet>`
+- :doc:`PyTorch hook <pytorch>`
+- :doc:`XGBoost hook <xgboost>`
+
+.. toctree::
+   :maxdepth: 2
+
+   tensorflow
+   pytorch
+   mxnet
+   xgboost
+
+Configuring Hook using SageMaker Python SDK
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After you make the minimal changes to your training script, you can
+configure the hook with parameters to the SageMaker Debugger API
+operation, ``DebuggerHookConfig``.
+
+.. code:: python
+
+   from sagemaker.debugger import DebuggerHookConfig
+   hook_config = DebuggerHookConfig(
+       s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
+       hook_parameters={
+           "parameter": "value"
+       })
+
+The available hook parameters are listed below. The meaning of these
+parameters will become clear as you review the following sections of the
+documentation. Note that all parameter values have to be strings, so any
+parameter which accepts a list (such as ``save_steps``, ``reductions``,
+or ``include_regex``) takes its value as a single comma-separated string.
+
+::
+
+    dry_run
+    save_all
+    include_workers
+    include_regex
+    reductions
+    save_raw_tensor
+    save_shape
+    save_interval
+    save_steps
+    start_step
+    end_step
+    train.save_interval
+    train.save_steps
+    train.start_step
+    train.end_step
+    eval.save_interval
+    eval.save_steps
+    eval.start_step
+    eval.end_step
+    predict.save_interval
+    predict.save_steps
+    predict.start_step
+    predict.end_step
+    global.save_interval
+    global.save_steps
+    global.start_step
+    global.end_step
+
+Hook from Python constructor
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+See the framework-specific pages for more details.
+
+``HookClass`` below can be one of ``KerasHook``, ``SessionHook``, or
+``EstimatorHook`` for TensorFlow, and is ``Hook`` for MXNet, PyTorch,
+and XGBoost.
+
+.. code:: python
+
+   hook = HookClass(
+       out_dir,
+       export_tensorboard = False,
+       tensorboard_dir = None,
+       dry_run = False,
+       reduction_config = None,
+       save_config = None,
+       include_regex = None,
+       include_collections = None,
+       save_all = False,
+       include_workers="one"
+   )
+
+Arguments
+'''''''''
+
+- ``out_dir`` (str): Path where tensors and metadata are saved. This is a
+  required argument. Please ensure that ``out_dir`` does not already exist.
+- ``export_tensorboard`` (bool): Whether to export TensorBoard
+  summaries (distributions and histograms for tensors saved, and scalar
+  summaries for scalars saved). Defaults to ``False``. Note that when
+  running on SageMaker this parameter is ignored; you need to use the
+  ``TensorBoardOutputConfig`` section in the API to enable TensorBoard
+  summaries. Refer to the SageMaker documentation for an example.
+- ``tensorboard_dir`` (str): Path where TensorBoard artifacts are saved.
+  If this is not passed and ``export_tensorboard`` is True, then
+  TensorBoard artifacts are saved in ``out_dir/tensorboard``. Note
+  that when running on SageMaker this parameter is ignored; you need to
+  use the ``TensorBoardOutputConfig`` section in the API to enable
+  TensorBoard summaries.
+- ``dry_run`` (bool): If true, don’t write any files.
+- ``reduction_config``: (`ReductionConfig <#reductionconfig>`__ object)
+  Specifies the reductions to be applied as default for tensors saved.
+  A collection can have its own ``ReductionConfig`` object which
+  overrides this for the tensors which belong to that collection.
+- ``save_config``: (`SaveConfig <#saveconfig>`__ object) Specifies when + to save tensors. A collection can have its own ``SaveConfig`` object + which overrides this for the tensors which belong to that collection. +- ``include_regex`` (list[str]): list of regex patterns which specify + the tensors to save. Tensors whose names match these patterns will be + saved +- ``include_collections`` (list[str]): List of which collections to + save specified by name +- ``save_all`` (bool): Saves all tensors and collections. Increases the + amount of disk space used, and can reduce the performance of the + training job significantly, depending on the size of the model. +- ``include_workers`` (str): Used for distributed training. It can take + the values ``one`` or ``all``. ``one`` means only the tensors from + one chosen worker will be saved. This is the default behavior. + ``all`` means tensors from all workers will be saved. + +Common Hook API +~~~~~~~~~~~~~~~ + +These methods are common for all hooks in any framework. + +Note that ``smd`` import below translates to +``import smdebug.{framework} as smd``. + ++-----------------------+-----------------------+-----------------------+ +| Method | Arguments | Behavior | ++=======================+=======================+=======================+ +| ``add_col | ``collecti | Takes a Collection | +| lection(collection)`` | on (smd.Collection)`` | object and adds it to | +| | | the CollectionManager | +| | | that the Hook holds. | +| | | Note that you should | +| | | only pass in a | +| | | Collection object for | +| | | the same framework as | +| | | the hook | ++-----------------------+-----------------------+-----------------------+ +| ``g | ``name (str)`` | Returns collection | +| et_collection(name)`` | | identified by the | +| | | given name | ++-----------------------+-----------------------+-----------------------+ +| ``get_collections()`` | - | Returns all | +| | | collection objects | +| | | held by the hook | ++-----------------------+-----------------------+-----------------------+ +| ``set_mode(mode)`` | value of the enum | Sets mode of the job, | +| | ``smd.modes`` | can be one of | +| | | ``smd.modes.TRAIN``, | +| | | ``smd.modes.EVAL``, | +| | | ``smd.modes.PREDICT`` | +| | | or | +| | | ``smd.modes.GLOBAL``. | +| | | Refer | +| | | `Modes <#modes>`__ | +| | | for more on that. | ++-----------------------+-----------------------+-----------------------+ +| ``create_fro | ``j | Takes the path of a | +| m_json_file(``\ \ ``j | son_file_path (str)`` | file which holds the | +| son_file_path=None)`` | | json configuration of | +| | | the hook, and creates | +| | | hook from that | +| | | configuration. This | +| | | is an optional | +| | | parameter. If this is | +| | | not passed it tries | +| | | to get the file path | +| | | from the value of the | +| | | environment variable | +| | | ``SMDEB | +| | | UG_CONFIG_FILE_PATH`` | +| | | and defaults to | +| | | ``/o | +| | | pt/ml/input/config/de | +| | | bughookconfig.json``. | +| | | When training on | +| | | SageMaker you do not | +| | | have to specify any | +| | | path because this is | +| | | the default path that | +| | | SageMaker writes the | +| | | hook configuration | +| | | to. 
| ++-----------------------+-----------------------+-----------------------+ +| ``close()`` | - | Closes all files that | +| | | are currently open by | +| | | the hook | ++-----------------------+-----------------------+-----------------------+ +| ``save_scalar()`` | ``name (str)`` | Saves a scalar value | +| | ``value (float)`` | by the given name. | +| | ``sm_metric (bool)`` | Passing | +| | | ``sm_metric=True`` | +| | | flag also makes this | +| | | scalar available as a | +| | | SageMaker Metric to | +| | | show up in SageMaker | +| | | Studio. Note that | +| | | when ``sm_metric`` is | +| | | False, this scalar | +| | | always resides only | +| | | in your AWS account, | +| | | but setting it to | +| | | True saves the scalar | +| | | also on AWS servers. | +| | | The default value of | +| | | ``sm_metric`` for | +| | | this method is False. | ++-----------------------+-----------------------+-----------------------+ +| ``save_tensor()`` | ` | Manually save metrics | +| | `tensor_name (str)``, | tensors. The | +| | ``tens | ``re | +| | or_value (numpy.array | cord_tensor_value()`` | +| | or numpy.ndarray)``, | API is deprecated in | +| | ` | favor or | +| | `collections_to_write | ``save_tensor()``. | +| | (str or list[str])`` | | ++-----------------------+-----------------------+-----------------------+ + +TensorFlow specific Hook API +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Note that there are three types of Hooks in TensorFlow: SessionHook, +EstimatorHook and KerasHook based on the TensorFlow interface being used +for training. `This page `__ shows examples of each of +these. + ++-----------------+-----------------+-----------------+-----------------+ +| Method | Arguments | Returns | Behavior | ++=================+=================+=================+=================+ +| ``wrap_optimiz | ``optimizer`` | Returns the | When not using | +| er(optimizer)`` | (tf. | same optimizer | Zero Script | +| | train.Optimizer | object passed | Change | +| | or | with a couple | environments, | +| | tf.k | of identifying | calling this | +| | eras.Optimizer) | markers to help | method on your | +| | | ``smdebug``. | optimizer is | +| | | This returned | necessary for | +| | | optimizer | SageMaker | +| | | should be used | Debugger to | +| | | for training. | identify and | +| | | | save gradient | +| | | | tensors. Note | +| | | | that this | +| | | | method returns | +| | | | the same | +| | | | optimizer | +| | | | object passed | +| | | | and does not | +| | | | change your | +| | | | optimization | +| | | | logic. If the | +| | | | hook is of type | +| | | | ``KerasHook``, | +| | | | you can pass in | +| | | | either an | +| | | | object of type | +| | | | ``tf.tr | +| | | | ain.Optimizer`` | +| | | | or | +| | | | ``tf.ker | +| | | | as.Optimizer``. | +| | | | If the hook is | +| | | | of type | +| | | | ``SessionHook`` | +| | | | or | +| | | | ``E | +| | | | stimatorHook``, | +| | | | the optimizer | +| | | | can only be of | +| | | | type | +| | | | ``tf.tra | +| | | | in.Optimizer``. | +| | | | This new | ++-----------------+-----------------+-----------------+-----------------+ +| ``add_to_ | ``collecti | ``None`` | Calls the | +| collection(``\ | on_name (str)`` | | ``add`` method | +| ``collection_na | : name of the | | of a collection | +| me, variable)`` | collection to | | object. See | +| | add to. | | `this | +| | ``variable`` | | section < | +| | parameter to | | #collection>`__ | +| | pass to the | | for more. | +| | collection’s | | | +| | ``add`` method. 
| | | ++-----------------+-----------------+-----------------+-----------------+ + +The following hook APIs are specific to training scripts using the TF +2.x GradientTape +(`Example `__): + ++-----------------+-----------------+-----------------+-----------------+ +| Method | Arguments | Returns | Behavior | ++=================+=================+=================+=================+ +| ``wr | ``tape`` | Returns a tape | When not using | +| ap_tape(tape)`` | (t | object with | Zero Script | +| | ensorflow.pytho | three | Change | +| | n.eager.backpro | identifying | environments, | +| | p.GradientTape) | markers to help | calling this | +| | | ``smdebug``. | method on your | +| | | This returned | tape is | +| | | tape should be | necessary for | +| | | used for | SageMaker | +| | | training. | Debugger to | +| | | | identify and | +| | | | save gradient | +| | | | tensors. Note | +| | | | that this | +| | | | method returns | +| | | | the same tape | +| | | | object passed. | ++-----------------+-----------------+-----------------+-----------------+ + +MXNet specific Hook API +~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------+-----------------------+-----------------------+ +| Method | Arguments | Behavior | ++=======================+=======================+=======================+ +| ``re | ``blo | Calling this method | +| gister_block(block)`` | ck (mx.gluon.Block)`` | applies the hook to | +| | | the Gluon block | +| | | representing the | +| | | model, so SageMaker | +| | | Debugger gets called | +| | | by MXNet and can save | +| | | the tensors required. | ++-----------------------+-----------------------+-----------------------+ + +PyTorch specific Hook API +~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------+-----------------------+-----------------------+ +| Method | Arguments | Behavior | ++=======================+=======================+=======================+ +| ``regi | ``modul | Calling this method | +| ster_module(module)`` | e (torch.nn.Module)`` | applies the hook to | +| | | the Torch Module | +| | | representing the | +| | | model, so SageMaker | +| | | Debugger gets called | +| | | by PyTorch and can | +| | | save the tensors | +| | | required. | ++-----------------------+-----------------------+-----------------------+ +| ``registe | ``l | Calling this method | +| r_loss(loss_module)`` | oss_module (torch.nn. | applies the hook to | +| | modules.loss._Loss)`` | the Torch Module | +| | | representing the | +| | | loss, so SageMaker | +| | | Debugger can save | +| | | losses | ++-----------------------+-----------------------+-----------------------+ + +-------------- + +Modes +----- + +Used to signify which part of training you’re in, similar to Keras +modes. ``GLOBAL`` mode is used as a default when no mode was set. Choose +from + +.. code:: python + + smdebug.modes.TRAIN + smdebug.modes.EVAL + smdebug.modes.PREDICT + smdebug.modes.GLOBAL + +The modes enum is also available under the alias +``smdebug.{framework}.modes``. + +-------------- + +Collection +---------- + +The construct of a Collection groups tensors together. A Collection is +identified by a string representing the name of the collection. It can +be used to group tensors of a particular kind such as “losses”, +“weights”, “biases”, or “gradients”. A Collection has its own list of +tensors specified by include regex patterns, and other parameters +determining how these tensors should be saved and when. 
Using +collections enables you to save different types of tensors at different +frequencies and in different forms. These collections are then also +available during analysis so you can query a group of tensors at once. + +There are a number of built-in collections that SageMaker Debugger +manages by default. This means that the library takes care of +identifying what tensors should be saved as part of that collection. You +can also define custom collections, to do which there are couple of +different ways. + +You can specify which of these collections to save in the hook’s +``include_collections`` parameter, or through the ``collection_configs`` +parameter to the ``DebuggerHookConfig`` in the SageMaker Python SDK. + +Built in Collections +~~~~~~~~~~~~~~~~~~~~ + +Below is a comprehensive list of the built-in collections that are +managed by SageMaker Debugger. The Hook identifes the tensors that +should be saved as part of that collection for that framework and saves +them if they were requested. + +The names of these collections are all lower case strings. + ++-----------------------+-----------------------+-----------------------+ +| Name | Supported by | Description | +| | frameworks/hooks | | ++=======================+=======================+=======================+ +| ``all`` | all | Matches all tensors | ++-----------------------+-----------------------+-----------------------+ +| ``default`` | all | It’s a default | +| | | collection created, | +| | | which matches the | +| | | regex patterns passed | +| | | as ``include_regex`` | +| | | to the Hook | ++-----------------------+-----------------------+-----------------------+ +| ``weights`` | TensorFlow, PyTorch, | Matches all weights | +| | MXNet | of the model | ++-----------------------+-----------------------+-----------------------+ +| ``biases`` | TensorFlow, PyTorch, | Matches all biases of | +| | MXNet | the model | ++-----------------------+-----------------------+-----------------------+ +| ``gradients`` | TensorFlow, PyTorch, | Matches all gradients | +| | MXNet | of the model. In | +| | | TensorFlow when not | +| | | using Zero Script | +| | | Change environments, | +| | | must use | +| | | ``hoo | +| | | k.wrap_optimizer()``. | ++-----------------------+-----------------------+-----------------------+ +| ``losses`` | TensorFlow, PyTorch, | Saves the loss for | +| | MXNet | the model | ++-----------------------+-----------------------+-----------------------+ +| ``metrics`` | TensorFlow’s | For KerasHook, saves | +| | KerasHook, XGBoost | the metrics computed | +| | | by Keras for the | +| | | model. For XGBoost, | +| | | the evaluation | +| | | metrics computed by | +| | | the algorithm. | ++-----------------------+-----------------------+-----------------------+ +| ``outputs`` | TensorFlow’s | Matches the outputs | +| | KerasHook | of the model | ++-----------------------+-----------------------+-----------------------+ +| ``layers`` | TensorFlow’s | Input and output of | +| | KerasHook | intermediate | +| | | convolutional layers | ++-----------------------+-----------------------+-----------------------+ +| ``sm_metrics`` | TensorFlow | You can add scalars | +| | | that you want to show | +| | | up in SageMaker | +| | | Metrics to this | +| | | collection. SageMaker | +| | | Debugger will save | +| | | these scalars both to | +| | | the out_dir of the | +| | | hook, as well as to | +| | | SageMaker Metric. 
| +| | | Note that the scalars | +| | | passed here will be | +| | | saved on AWS servers | +| | | outside of your AWS | +| | | account. | ++-----------------------+-----------------------+-----------------------+ +| `` | TensorFlow’s | Matches all optimizer | +| optimizer_variables`` | KerasHook | variables, currently | +| | | only supported in | +| | | Keras. | ++-----------------------+-----------------------+-----------------------+ +| ``hyperparameters`` | XGBoost | `Booster | +| | | paramamete | +| | | rs `__ | ++-----------------------+-----------------------+-----------------------+ +| ``predictions`` | XGBoost | Predictions on | +| | | validation set (if | +| | | provided) | ++-----------------------+-----------------------+-----------------------+ +| ``labels`` | XGBoost | Labels on validation | +| | | set (if provided) | ++-----------------------+-----------------------+-----------------------+ +| ` | XGBoost | Feature importance | +| `feature_importance`` | | given by | +| | | `g | +| | | et_score() `__ | ++-----------------------+-----------------------+-----------------------+ +| ``full_shap`` | XGBoost | A matrix of (nsmaple, | +| | | nfeatures + 1) with | +| | | each record | +| | | indicating the | +| | | feature contributions | +| | | (`SHAP | +| | | valu | +| | | es `__) | +| | | for that prediction. | +| | | Computed on training | +| | | data with | +| | | `predic | +| | | t() `__ | ++-----------------------+-----------------------+-----------------------+ +| ``average_shap`` | XGBoost | The sum of SHAP value | +| | | magnitudes over all | +| | | samples. Represents | +| | | the impact each | +| | | feature has on the | +| | | model output. | ++-----------------------+-----------------------+-----------------------+ +| ``trees`` | XGBoost | Boosted tree model | +| | | given by | +| | | `trees_to_dataframe( | +| | | ) `__ | ++-----------------------+-----------------------+-----------------------+ + +Default collections saved +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following collections are saved regardless of the hook +configuration. + +============== =========================== +Framework Default collections saved +============== =========================== +``TensorFlow`` METRICS, LOSSES, SM_METRICS +``PyTorch`` LOSSES +``MXNet`` LOSSES +``XGBoost`` METRICS +============== =========================== + +If for some reason, you want to disable the saving of these collections, +you can do so by setting end_step to 0 in the collection’s SaveConfig. +When using the SageMaker Python SDK this would look like +``python from sagemaker.debugger import DebuggerHookConfig, CollectionConfig hook_config = DebuggerHookConfig( s3_output_path='s3://smdebug-dev-demo-pdx/mnist', collection_configs=[ CollectionConfig(name="metrics", parameters={"end_step": 0}) ] )`` +When configuring the Collection in your Python script, it would be as +follows: +``python hook.get_collection("metrics").save_config.end_step = 0`` + +Creating or retrieving a Collection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------------+-----------------------------------+ +| Function | Behavior | ++===================================+===================================+ +| ``hook. | Returns the collection with the | +| get_collection(collection_name)`` | given name. Creates the | +| | collection with default | +| | configuration if it doesn’t | +| | already exist. 
A new collection | +| | created by default does not match | +| | any tensor and is configured to | +| | save histograms and distributions | +| | along with the tensor if | +| | tensorboard support is enabled, | +| | and uses the reduction | +| | configuration and save | +| | configuration passed to the hook. | ++-----------------------------------+-----------------------------------+ + +Properties of a Collection +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------------+-----------------------------------+ +| Property | Description | ++===================================+===================================+ +| ``tensor_names`` | Get or set list of tensor names | +| | as strings | ++-----------------------------------+-----------------------------------+ +| ``include_regex`` | Get or set list of regexes to | +| | include. Tensors whose names | +| | match these regex patterns will | +| | be included in the collection | ++-----------------------------------+-----------------------------------+ +| ``reduction_config`` | Get or set the ReductionConfig | +| | object to be used for tensors | +| | part of this collection | ++-----------------------------------+-----------------------------------+ +| ``save_config`` | Get or set the SaveConfig object | +| | to be used for tensors part of | +| | this collection | ++-----------------------------------+-----------------------------------+ +| ``save_histogram`` | Get or set the boolean flag which | +| | determines whether to write | +| | histograms to enable histograms | +| | and distributions in TensorBoard, | +| | for tensors part of this | +| | collection. Only applicable if | +| | TensorBoard support is enabled. | ++-----------------------------------+-----------------------------------+ + +Methods on a Collection +~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------------+-----------------------------------+ +| Method | Behavior | ++===================================+===================================+ +| ``coll.include(regex)`` | Takes a regex string or a list of | +| | regex strings to match tensors to | +| | include in the collection. | ++-----------------------------------+-----------------------------------+ +| ``coll.add(tensor)`` | **(TensorFlow only)** Takes an | +| | instance or list or set of | +| | tf.Tensor/tf.Variable | +| | /tf.MirroredVariable/tf.Operation | +| | to add to the collection. | ++-----------------------------------+-----------------------------------+ +| ``coll.add_keras_layer(lay | **(tf.keras only)** Takes an | +| er, inputs=False, outputs=True)`` | instance of a tf.keras layer and | +| | logs input/output tensors for | +| | that module. By default, only | +| | outputs are saved. | ++-----------------------------------+-----------------------------------+ +| ``coll.add_module_tensors(modu | **(PyTorch only)** Takes an | +| le, inputs=False, outputs=True)`` | instance of a PyTorch module and | +| | logs input/output tensors for | +| | that module. By default, only | +| | outputs are saved. | ++-----------------------------------+-----------------------------------+ +| ``coll.add_block_tensors(blo | **(MXNet only)** Takes an | +| ck, inputs=False, outputs=True)`` | instance of a Gluon block,and | +| | logs input/output tensors for | +| | that module. By default, only | +| | outputs are saved. 
| ++-----------------------------------+-----------------------------------+ + +Configuring Collection using SageMaker Python SDK +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Parameters to configure Collection are passed as below when using the +SageMaker Python SDK. + +.. code:: python + + from sagemaker.debugger import CollectionConfig + coll_config = CollectionConfig( + name="weights", + parameters={ "parameter": "value" }) + +The parameters can be one of the following. The meaning of these +parameters will be clear as you review the sections of documentation +below. Note that all parameters below have to be strings. So any +parameter which accepts a list (such as save_steps, reductions, +include_regex), needs to be given as strings separated by a comma +between them. + +:: + + include_regex + save_histogram + reductions + save_raw_tensor + save_interval + save_steps + start_step + end_step + train.save_interval + train.save_steps + train.start_step + train.end_step + eval.save_interval + eval.save_steps + eval.start_step + eval.end_step + predict.save_interval + predict.save_steps + predict.start_step + predict.end_step + global.save_interval + global.save_steps + global.start_step + global.end_step + +-------------- + +SaveConfig +---------- + +The SaveConfig class customizes the frequency of saving tensors. The +hook takes a SaveConfig object which is applied as default to all +tensors included. A collection can also have a SaveConfig object which +is applied to the collection’s tensors. You can also choose to have +different configuration for when to save tensors based on the mode of +the job. + +This class is available in the following namespaces ``smdebug`` and +``smdebug.{framework}``. + +.. code:: python + + import smdebug as smd + save_config = smd.SaveConfig( + mode_save_configs = None, + save_interval = 100, + start_step = 0, + end_step = None, + save_steps = None, + ) + +.. _arguments-1: + +Arguments +~~~~~~~~~ + +- ``mode_save_configs`` (dict): Used for advanced cases; see details + below. +- ``save_interval`` (int): How often, in steps, to save tensors. + Defaults to 500. A step is saved if ``step % save_interval == 0`` +- ``start_step`` (int): When to start saving tensors. +- ``end_step`` (int): When to stop saving tensors, exclusive. +- ``save_steps`` (list[int]): Specific steps to save tensors at. Union + with save_interval. + +Examples +~~~~~~~~ + +- ``SaveConfig()`` will save at steps 0, 500, … +- ``SaveConfig(save_interval=1)`` will save at steps 0, 1, … +- ``SaveConfig(save_interval=100, end_step=200)`` will save at steps 0, + 100 +- ``SaveConfig(save_interval=100, end_step=201)`` will save at steps 0, + 100, 200 +- ``SaveConfig(save_interval=100, start_step=150)`` will save at steps + 200, 300, … +- ``SaveConfig(save_steps=[3, 7])`` will save at steps 0, 3, 7, 500, … + +Specifying different configuration based on mode +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There is also a more advanced use case, where you specify a different +SaveConfig for each mode. It is best understood through an example: + +.. code:: python + + import smdebug as smd + smd.SaveConfig(mode_save_configs={ + smd.modes.TRAIN: smd.SaveConfigMode(save_interval=1), + smd.modes.EVAL: smd.SaveConfigMode(save_interval=2), + smd.modes.PREDICT: smd.SaveConfigMode(save_interval=3), + smd.modes.GLOBAL: smd.SaveConfigMode(save_interval=4) + }) + +Essentially, create a dictionary mapping modes to SaveConfigMode +objects. 
The SaveConfigMode objects take the same four parameters +(save_interval, start_step, end_step, save_steps) as the main object. +Any mode not specified will default to the default configuration. If a +mode is provided but not all params are specified, we use the default +values for non-specified parameters. + +Configuration using SageMaker Python SDK +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Refer `Configuring Hook using SageMaker Python +SDK <#configuring-hook-using-sagemaker-python-sdk>`__ and `Configuring +Collection using SageMaker Python +SDK <#configuring-collection-using-sagemaker-python-sdk>`__ + +-------------- + +ReductionConfig +--------------- + +ReductionConfig allows the saving of certain reductions of tensors +instead of saving the full tensor. The motivation here is to reduce the +amount of data saved, and increase the speed in cases where you don’t +need the full tensor. The reduction operations which are computed in the +training process and then saved. + +During analysis, these are available as reductions of the original +tensor. Please note that using reduction config means that you will not +have the full tensor available during analysis, so this can restrict +what you can do with the tensor saved. You can choose to also save the +raw tensor along with the reductions if you so desire. + +The hook takes a ReductionConfig object which is applied as default to +all tensors included. A collection can also have its own ReductionConfig +object which is applied to the tensors belonging to that collection. + +.. code:: python + + import smdebug as smd + reduction_config = smd.ReductionConfig( + reductions = None, + abs_reductions = None, + norms = None, + abs_norms = None, + save_raw_tensor = False, + ) + +.. _arguments-2: + +Arguments +~~~~~~~~~ + +- ``reductions`` (list[str]): Takes names of reductions, choosing from + “min”, “max”, “median”, “mean”, “std”, “variance”, “sum”, “prod” +- ``abs_reductions`` (list[str]): Same as reductions, except the + reduction will be computed on the absolute value of the tensor +- ``norms`` (list[str]): Takes names of norms to compute, choosing from + “l1”, “l2” +- ``abs_norms`` (list[str]): Same as norms, except the norm will be + computed on the absolute value of the tensor +- ``save_raw_tensor`` (bool): Saves the tensor directly, in addition to + other desired reductions + +For example, + +``ReductionConfig(reductions=['std', 'variance'], abs_reductions=['mean'], norms=['l1'])`` + +will save the standard deviation and variance, the mean of the absolute +value, and the l1 norm. + +.. _configuration-using-sagemaker-python-sdk-1: + +Configuration using SageMaker Python SDK +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The reductions are passed as part of the “reductions” parameter to +HookParameters or Collection Parameters. Refer `Configuring Hook using +SageMaker Python SDK <#configuring-hook-using-sagemaker-python-sdk>`__ +and `Configuring Collection using SageMaker Python +SDK <#configuring-collection-using-sagemaker-python-sdk>`__ for more on +that. 
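+
+For example, a sketch of passing reductions through a collection's
+parameters (the collection name is illustrative, and the reduction names
+are drawn from the list below):
+
+.. code:: python
+
+   from sagemaker.debugger import CollectionConfig
+
+   # "reductions" takes a single comma-separated string of reduction names.
+   coll_config = CollectionConfig(
+       name="weights",
+       parameters={"reductions": "mean,variance,abs_mean,l1"})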
+ +The parameter “reductions” can take a comma separated string consisting +of the following values: + +:: + + min + max + median + mean + std + variance + sum + prod + l1 + l2 + abs_min + abs_max + abs_median + abs_mean + abs_std + abs_variance + abs_sum + abs_prod + abs_l1 + abs_l2 + +-------------- + +Frameworks +---------- + +For details on what’s supported for different frameworks, see: + - :doc:`TensorFlow ` + - :doc:`PyTorch ` + - :doc:`MXNet ` + - :doc:`XGBoost ` diff --git a/docs/collections.rst b/docs/collections.rst new file mode 100644 index 000000000..9f32e51cd --- /dev/null +++ b/docs/collections.rst @@ -0,0 +1,328 @@ +Tensor Collections +------------------ + +The construct of a Collection groups tensors together. A Collection is +identified by a string representing the name of the collection. It can +be used to group tensors of a particular kind such as “losses”, +“weights”, “biases”, or “gradients”. A Collection has its own list of +tensors specified by include regex patterns, and other parameters +determining how these tensors should be saved and when. Using +collections enables you to save different types of tensors at different +frequencies and in different forms. These collections are then also +available during analysis so you can query a group of tensors at once. + +There are a number of built-in collections that SageMaker Debugger +manages by default. This means that the library takes care of +identifying what tensors should be saved as part of that collection. You +can also define custom collections, to do which there are couple of +different ways. + +You can specify which of these collections to save in the hook’s +``include_collections`` parameter, or through the ``collection_configs`` +parameter to the ``DebuggerHookConfig`` in the SageMaker Python SDK. + +Built in Collections +~~~~~~~~~~~~~~~~~~~~ + +Below is a comprehensive list of the built-in collections that are +managed by SageMaker Debugger. The Hook identifes the tensors that +should be saved as part of that collection for that framework and saves +them if they were requested. + +The names of these collections are all lower case strings. + ++------------------------+-----------------------+-----------------------+ +| Name | Supported by | Description | +| | frameworks/hooks | | ++========================+=======================+=======================+ +| ``all`` | all | Matches all tensors | ++------------------------+-----------------------+-----------------------+ +| ``default`` | all | It’s a default | +| | | collection created, | +| | | which matches the | +| | | regex patterns passed | +| | | as ``include_regex`` | +| | | to the Hook | ++------------------------+-----------------------+-----------------------+ +| ``weights`` | TensorFlow, PyTorch, | Matches all weights | +| | MXNet | of the model | ++------------------------+-----------------------+-----------------------+ +| ``biases`` | TensorFlow, PyTorch, | Matches all biases of | +| | MXNet | the model | ++------------------------+-----------------------+-----------------------+ +| ``gradients`` | TensorFlow, PyTorch, | Matches all gradients | +| | MXNet | of the model. In | +| | | TensorFlow when not | +| | | using Zero Script | +| | | Change environments, | +| | | must use | +| | | ``hoo | +| | | k.wrap_optimizer()``. 
| ++------------------------+-----------------------+-----------------------+ +| ``losses`` | TensorFlow, PyTorch, | Saves the loss for | +| | MXNet | the model | ++------------------------+-----------------------+-----------------------+ +| ``metrics`` | TensorFlow’s | For KerasHook, saves | +| | KerasHook, XGBoost | the metrics computed | +| | | by Keras for the | +| | | model. For XGBoost, | +| | | the evaluation | +| | | metrics computed by | +| | | the algorithm. | ++------------------------+-----------------------+-----------------------+ +| ``outputs`` | TensorFlow’s | Matches the outputs | +| | KerasHook | of the model | ++------------------------+-----------------------+-----------------------+ +| ``layers`` | TensorFlow’s | Input and output of | +| | KerasHook | intermediate | +| | | convolutional layers | ++------------------------+-----------------------+-----------------------+ +| ``sm_metrics`` | TensorFlow | You can add scalars | +| | | that you want to show | +| | | up in SageMaker | +| | | Metrics to this | +| | | collection. SageMaker | +| | | Debugger will save | +| | | these scalars both to | +| | | the out_dir of the | +| | | hook, as well as to | +| | | SageMaker Metric. | +| | | Note that the scalars | +| | | passed here will be | +| | | saved on AWS servers | +| | | outside of your AWS | +| | | account. | ++------------------------+-----------------------+-----------------------+ +| ``optimizer_variables``| TensorFlow’s | Matches all optimizer | +| | KerasHook | variables, currently | +| | | only supported in | +| | | Keras. | ++------------------------+-----------------------+-----------------------+ +| ``hyperparameters`` | XGBoost | `Booster | +| | | paramamete | +| | | rs `__ | ++------------------------+-----------------------+-----------------------+ +| ``predictions`` | XGBoost | Predictions on | +| | | validation set (if | +| | | provided) | ++------------------------+-----------------------+-----------------------+ +| ``labels`` | XGBoost | Labels on validation | +| | | set (if provided) | ++------------------------+-----------------------+-----------------------+ +| ``feature_importance`` | XGBoost | Feature importance | +| | | given by | +| | | `g | +| | | et_score() `__ | ++------------------------+-----------------------+-----------------------+ +| ``full_shap`` | XGBoost | A matrix of (nsmaple, | +| | | nfeatures + 1) with | +| | | each record | +| | | indicating the | +| | | feature contributions | +| | | (`SHAP | +| | | valu | +| | | es `__) | +| | | for that prediction. | +| | | Computed on training | +| | | data with | +| | | `predic | +| | | t() `__ | ++------------------------+-----------------------+-----------------------+ +| ``average_shap`` | XGBoost | The sum of SHAP value | +| | | magnitudes over all | +| | | samples. Represents | +| | | the impact each | +| | | feature has on the | +| | | model output. | ++------------------------+-----------------------+-----------------------+ +| ``trees`` | XGBoost | Boosted tree model | +| | | given by | +| | | `trees_to_dataframe( | +| | | ) `__ | ++------------------------+-----------------------+-----------------------+ + +Default collections saved +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following collections are saved regardless of the hook +configuration. 
+ +============== =========================== +Framework Default collections saved +============== =========================== +``TensorFlow`` METRICS, LOSSES, SM_METRICS +``PyTorch`` LOSSES +``MXNet`` LOSSES +``XGBoost`` METRICS +============== =========================== + +If for some reason, you want to disable the saving of these collections, +you can do so by setting end_step to 0 in the collection’s SaveConfig. +When using the SageMaker Python SDK this would look like + +.. code:: python + + from sagemaker.debugger import DebuggerHookConfig, CollectionConfig + + hook_config = DebuggerHookConfig( + s3_output_path='s3://smdebug-dev-demo-pdx/mnist', + collection_configs=[ + CollectionConfig(name="metrics", parameters={"end_step": 0}) + ] + ) + +When configuring the Collection in your Python script, it would be as +follows: + +.. code:: python + + hook.get_collection("metrics").save_config.end_step = 0 + +Creating or retrieving a Collection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------------+-----------------------------------+ +| Function | Behavior | ++===================================+===================================+ +| ``hook. | Returns the collection with the | +| get_collection(collection_name)`` | given name. Creates the | +| | collection with default | +| | configuration if it doesn’t | +| | already exist. A new collection | +| | created by default does not match | +| | any tensor and is configured to | +| | save histograms and distributions | +| | along with the tensor if | +| | tensorboard support is enabled, | +| | and uses the reduction | +| | configuration and save | +| | configuration passed to the hook. | ++-----------------------------------+-----------------------------------+ + +Properties of a Collection +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------------+-----------------------------------+ +| Property | Description | ++===================================+===================================+ +| ``tensor_names`` | Get or set list of tensor names | +| | as strings | ++-----------------------------------+-----------------------------------+ +| ``include_regex`` | Get or set list of regexes to | +| | include. Tensors whose names | +| | match these regex patterns will | +| | be included in the collection | ++-----------------------------------+-----------------------------------+ +| ``reduction_config`` | Get or set the ReductionConfig | +| | object to be used for tensors | +| | part of this collection | ++-----------------------------------+-----------------------------------+ +| ``save_config`` | Get or set the SaveConfig object | +| | to be used for tensors part of | +| | this collection | ++-----------------------------------+-----------------------------------+ +| ``save_histogram`` | Get or set the boolean flag which | +| | determines whether to write | +| | histograms to enable histograms | +| | and distributions in TensorBoard, | +| | for tensors part of this | +| | collection. Only applicable if | +| | TensorBoard support is enabled. | ++-----------------------------------+-----------------------------------+ + +Methods on a Collection +~~~~~~~~~~~~~~~~~~~~~~~ + ++-----------------------------------+-----------------------------------+ +| Method | Behavior | ++===================================+===================================+ +| ``coll.include(regex)`` | Takes a regex string or a list of | +| | regex strings to match tensors to | +| | include in the collection. 
| ++-----------------------------------+-----------------------------------+ +| ``coll.add(tensor)`` | **(TensorFlow only)** Takes an | +| | instance or list or set of | +| | tf.Tensor/tf.Variable | +| | /tf.MirroredVariable/tf.Operation | +| | to add to the collection. | ++-----------------------------------+-----------------------------------+ +| ``coll.add_keras_layer(lay | **(tf.keras only)** Takes an | +| er, inputs=False, outputs=True)`` | instance of a tf.keras layer and | +| | logs input/output tensors for | +| | that module. By default, only | +| | outputs are saved. | ++-----------------------------------+-----------------------------------+ +| ``coll.add_module_tensors(modu | **(PyTorch only)** Takes an | +| le, inputs=False, outputs=True)`` | instance of a PyTorch module and | +| | logs input/output tensors for | +| | that module. By default, only | +| | outputs are saved. | ++-----------------------------------+-----------------------------------+ +| ``coll.add_block_tensors(blo | **(MXNet only)** Takes an | +| ck, inputs=False, outputs=True)`` | instance of a Gluon block,and | +| | logs input/output tensors for | +| | that module. By default, only | +| | outputs are saved. | ++-----------------------------------+-----------------------------------+ + +Configuring Collection using SageMaker Python SDK +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Parameters to configure Collection are passed as below when using the +SageMaker Python SDK. + +.. code:: python + + from sagemaker.debugger import CollectionConfig + coll_config = CollectionConfig( + name="weights", + parameters={ "parameter": "value" }) + +The parameters can be one of the following. The meaning of these +parameters will be clear as you review the sections of documentation +below. Note that all parameters below have to be strings. So any +parameter which accepts a list (such as save_steps, reductions, +include_regex), needs to be given as strings separated by a comma +between them. + +:: + + include_regex + save_histogram + reductions + save_raw_tensor + save_interval + save_steps + start_step + end_step + train.save_interval + train.save_steps + train.start_step + train.end_step + eval.save_interval + eval.save_steps + eval.start_step + eval.end_step + predict.save_interval + predict.save_steps + predict.start_step + predict.end_step + global.save_interval + global.save_steps + global.start_step + global.end_step + +-------------- diff --git a/docs/conf.py b/docs/conf.py new file mode 100644 index 000000000..a05c8b9d1 --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,105 @@ +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. 
+ +# Standard Library +import os +import sys + +# sys.path.insert(0, os.path.abspath('../')) +# curr_path = os.path.dirname(os.path.abspath(os.path.expanduser(__file__))) +# sys.path.insert(0, os.path.join(curr_path, '..')) +sys.path.insert(0, os.path.abspath("../smdebug")) + +# -- Project information ----------------------------------------------------- + +project = "SMDebug Documentation" +copyright = "2021, AWS Deep Engine Team" +author = "AWS Deep Engine Team" + +autodoc_mock_imports = ["mxnet", "tensorflow", "torch", "xgboost"] + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + "sphinx.ext.autodoc", + "sphinx.ext.doctest", + "sphinx.ext.intersphinx", + "sphinx.ext.todo", + "sphinx.ext.coverage", + "sphinx.ext.autosummary", + "sphinx.ext.napoleon", + "sphinx.ext.autosectionlabel", +] + +# Markdown support +# The suffix of source filenames. +source_suffix = {".rst": "restructuredtext"} +master_doc = "index" # The master toctree document. + +# pygments_style = "default" + +# autoclass_content = "both" +# autodoc_default_flags = ["show-inheritance", "members", "undoc-members"] +# autodoc_member_order = "bysource" + +# Add any paths that contain templates here, relative to this directory. +templates_path = ["_templates"] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] + +# List of directories, relative to source directory, that shouldn't be searched +# for source files. +exclude_trees = ["_build"] + +pygments_style = "default" + +autoclass_content = "both" +autodoc_default_flags = ["show-inheritance", "members", "undoc-members"] +autodoc_member_order = "bysource" + + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = "sphinx_rtd_theme" + +html_theme_options = { + "collapse_navigation": True, + "sticky_navigation": True, + "navigation_depth": 6, + "includehidden": True, + "titles_only": False, +} + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". +html_static_path = ["_static"] + +html_css_files = ["css/custom.css"] + +suppress_warnings = ["autosectionlabel.*"] + +# For Adobe Analytics +html_js_files = [ + "https://a0.awsstatic.com/s_code/js/3.0/awshome_s_code.js", + "aws-ux-shortbread/index.js", + "aws-ux-shortbread/init.js", +] diff --git a/docs/environment.yml b/docs/environment.yml new file mode 100644 index 000000000..f4df9325f --- /dev/null +++ b/docs/environment.yml @@ -0,0 +1,14 @@ +channels: + - conda-forge + +dependencies: + - sphinx==3.1.1 + - Jinja2==2.11 + - docutils==0.15.2 + - bokeh + - ipython + - pandas + - pip + - pip: + - protobuf3 + - smdebug diff --git a/docs/glossary.rst b/docs/glossary.rst new file mode 100644 index 000000000..e99948dbb --- /dev/null +++ b/docs/glossary.rst @@ -0,0 +1,69 @@ +Glossary +-------- + +The following glossary items assume you import the SMDebug framework modules as follows: + +- **TensorFlow** + + .. 
code:: python + + import smdebug.tensorflow as smd + +- **PyTorch** + + .. code:: python + + import smdebug.pytorch as smd + +- **MXNet** + + .. code:: python + + import smdebug.mxnet as smd + +- **XGBoost** + + .. code:: python + + import smdebug.xgboost as smd + +Terminologies +~~~~~~~~~~~~~ + +- **Step**: Step means the work done for one batch by a training job + (i.e. forward and backward pass). (An exception is with TensorFlow’s + Session interface, where a step also includes the initialization session + run calls). SageMaker Debugger is designed in terms of steps. When to + save data is specified using steps. Also, invocation of Rules is + on a step-by-step basis. + +- **Hook**: The main class to pass as a callback object or to create + callback functions. It keeps track of collections and writes output + files at each step. The current hook implementation does not support + merging tensors from current job with tensors from previous job(s). + Therefore, ensure that the ``out_dir`` does not exist prior to instantiating + the ‘Hook’ object. - ``hook = smd.Hook(out_dir="/tmp/mnist_job")`` + +- **Mode**: One of “train”, “eval”, “predict”, or “global”. Helpful for + segmenting data based on the phase you’re in. Defaults to “global”. - + ``train_mode = smd.modes.TRAIN`` + +- **Collection**: A group of tensors. Each collection contains its + configuration for what tensors are part of it, and when to save them. - + ``collection = hook.get_collection("losses")`` + +- **SaveConfig**: A Python dict specifying how often to save losses and + tensors. - ``save_config = smd.SaveConfig(save_interval=10)`` + +- **ReductionConfig**: Allows you to save a reduction, such as ‘mean’ or + ‘l1 norm’, instead of the full tensor. Reductions are simple floats. - + ``reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`` + +- **Trial**: The main interface to use when analyzing a completed training + job. Access collections and tensors. See `trials + documentation `__. - + ``trial = smd.create_trial(out_dir="/tmp/mnist_job")`` + +- **Rule**: A condition to monitor the saved data for. It can trigger an + exception when the condition is met, for example a vanishing gradient. + See `rules documentation `__. diff --git a/docs/hook-api.rst b/docs/hook-api.rst new file mode 100644 index 000000000..ff293dc65 --- /dev/null +++ b/docs/hook-api.rst @@ -0,0 +1,42 @@ +Hook API +======== + +Create a Hook +------------- + +By using AWS Deep Learning Containers, you can directly run your own +training scripts without any additional effort to make it compatible with +the SageMaker Python SDK. For a detailed developer guide for this, see +`Use Debugger in AWS +Containers `__. + +However, for some advanced use cases where you need access to customized +tensors from targeted parts of a training script, you can manually +construct the hook object. The SMDebug library provides hook classes to +make this process simple and compatible with the SageMaker ecosystem and +Debugger. The high-level workflow is as simple as a 2-step process: + +1. Register SMDebug hook to your training script. +2. Run a training job within or outside SageMaker. + + - SageMaker APIs for Debugger are available through the SageMaker Python SDK or API. + - Run it locally + +To capture output tensors from your training model, register SMDebug hooks +to your training scripts. 
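+
+As a minimal sketch for PyTorch, assuming your script defines a
+``torch.nn.Module`` named ``net`` and a loss module named ``criterion``
+(both names are illustrative):
+
+.. code:: python
+
+   import smdebug.pytorch as smd
+
+   hook = smd.Hook.create_from_json_file()
+   hook.register_module(net)      # let Debugger save tensors from the model
+   hook.register_loss(criterion)  # let Debugger save the loss values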
+
+After choosing a framework and defining the hook object, you need to
+embed the hooks into target parts of your training script to retrieve
+tensors and to use with the SageMaker Debugger Python SDK.
+
+To learn more about registering the hook to your model for the
+framework of your choice, see the following pages.
+
+.. toctree::
+   :maxdepth: 1
+
+   tensorflow
+   pytorch
+   mxnet
+   xgboost
+   hook-constructor
diff --git a/docs/hook-config-in-smestimator.rst b/docs/hook-config-in-smestimator.rst
new file mode 100644
index 000000000..0135f8007
--- /dev/null
+++ b/docs/hook-config-in-smestimator.rst
@@ -0,0 +1,111 @@
+Configure Hook using SageMaker Python SDK
+=========================================
+
+After you make the changes to your training script, you can
+configure the hook with parameters to the SageMaker Debugger API
+operation, ``DebuggerHookConfig``.
+
+.. code:: python
+
+    from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
+
+    collection_configs=[
+        CollectionConfig(name="tensor_collection_1"),
+        CollectionConfig(name="tensor_collection_2"),
+        ...
+        CollectionConfig(name="tensor_collection_n")
+    ]
+
+    hook_config = DebuggerHookConfig(
+        s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
+        collection_configs=collection_configs,
+        hook_parameters={
+            "parameter": "value"
+        }
+    )
+
+Path to SMDebug artifacts
+-------------------------
+
+To create an SMDebug trial object, you need to know where the SMDebug artifacts are saved.
+
+1. For SageMaker training jobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When running a SageMaker job, SMDebug artifacts are saved to Amazon S3.
+SageMaker saves data
+from your training job to a local path of the training container and
+uploads it to an S3 bucket in your account. When you start a
+SageMaker training job with the Python SDK, you can set the path
+using the parameter ``s3_output_path`` of the ``DebuggerHookConfig``
+object. If you don't specify the path, SageMaker automatically sets the
+output path to your default S3 bucket.
+
+**Example**
+
+.. code:: python
+
+    from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
+
+    collection_configs=[
+        CollectionConfig(name="weights"),
+        CollectionConfig(name="gradients")
+    ]
+
+    debugger_hook_config=DebuggerHookConfig(
+        s3_output_path="specify-your-s3-bucket-uri",  # Optional
+        collection_configs=collection_configs
+    )
+
+For more information, see *Configure Debugger Hook to Save Tensors*
+in the *Amazon SageMaker Developer Guide*.
+
+2. For non-SageMaker training jobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you are running a training job outside SageMaker, this is the path you
+pass as ``out_dir`` when you create an SMDebug Hook.
+When creating the hook, you can
+pass either a local path (for example, ``/home/ubuntu/smdebug_outputs/``)
+or an S3 bucket path (for example, ``s3://bucket/prefix``).
+
+Hook Configuration Parameter Keys
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The available ``hook_parameters`` keys are listed below. The meaning of
+these parameters will become clear as you review the following sections of
+the documentation. Note that all parameter values have to be strings, so
+for any parameter which accepts a list (such as ``save_steps``,
+``reductions``, or ``include_regex``), the value needs to be given as a
+single comma-separated string.
diff --git a/docs/hook-constructor.rst b/docs/hook-constructor.rst
new file mode 100644
index 000000000..3bf168887
--- /dev/null
+++ b/docs/hook-constructor.rst
@@ -0,0 +1,64 @@
+Hook from Python constructor
+============================
+
+You can also construct a hook directly from its Python constructor. The
+constructor arguments below are common to the hook classes of all
+frameworks.
+
+.. autoclass:: smdebug.core.hook.BaseHook
+   :members:
+   :undoc-members:
+   :show-inheritance:
+   :inherited-members:
+
+
+.. code:: python
+
+   hook = HookClass(
+       out_dir,
+       export_tensorboard = False,
+       tensorboard_dir = None,
+       dry_run = False,
+       reduction_config = None,
+       save_config = None,
+       include_regex = None,
+       include_collections = None,
+       save_all = False,
+       include_workers = "one"
+   )
+
+**Parameters:**
+
+  - ``out_dir`` (str): Path where to save tensors and metadata. This is a
+    required argument. Please ensure that ``out_dir`` does not already
+    exist.
+  - ``export_tensorboard`` (bool): Whether to export TensorBoard
+    summaries (distributions and histograms for tensors saved, and scalar
+    summaries for scalars saved). Defaults to ``False``. Note that when
+    running on SageMaker this parameter is ignored; you need to use the
+    ``TensorBoardOutputConfig`` section of the API to enable TensorBoard
+    summaries. See the *Running SageMaker jobs with Amazon SageMaker
+    Debugger* page for an example.
+  - ``tensorboard_dir`` (str): Path where to save TensorBoard artifacts.
+    If this is not passed and ``export_tensorboard`` is ``True``, then
+    TensorBoard artifacts are saved in ``out_dir/tensorboard``. Note that
+    when running on SageMaker this parameter is ignored; you need to use
+    the ``TensorBoardOutputConfig`` section of the API to enable
+    TensorBoard summaries.
+  - ``dry_run`` (bool): If ``True``, don't write any files.
+  - ``reduction_config`` (``ReductionConfig`` object): Specifies the
+    reductions to be applied as defaults for tensors saved. A collection
+    can have its own ``ReductionConfig`` object which overrides this for
+    the tensors which belong to that collection. See the Additional Hook
+    Config APIs page for details.
+  - ``save_config`` (``SaveConfig`` object): Specifies when to save
+    tensors. A collection can have its own ``SaveConfig`` object which
+    overrides this for the tensors which belong to that collection.
+  - ``include_regex`` (list[str]): List of regex patterns which specify
+    the tensors to save. Tensors whose names match these patterns will be
+    saved.
+  - ``include_collections`` (list[str]): List of collections to save,
+    specified by name.
+  - ``save_all`` (bool): Saves all tensors and collections. This increases
+    the amount of disk space used, and can reduce the performance of the
+    training job significantly, depending on the size of the model.
+  - ``include_workers`` (str): Used for distributed training. It can take
+    the values ``one`` or ``all``. ``one`` means only the tensors from
+    one chosen worker will be saved. This is the default behavior.
+    ``all`` means tensors from all workers will be saved.
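+
+To make the arguments concrete, the following is a sketch of constructing
+a hook by hand (PyTorch is chosen arbitrarily here, and the ``out_dir``
+and regex are illustrative only):
+
+.. code:: python
+
+   import smdebug.pytorch as smd
+
+   hook = smd.Hook(
+       out_dir="/tmp/smdebug_run",                    # must not already exist
+       save_config=smd.SaveConfig(save_interval=10),  # save every 10 steps
+       include_collections=["losses", "gradients"],
+       include_regex=["relu"],                        # also save tensors matching this pattern
+       include_workers="one"
+   )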
diff --git a/docs/hook-methods.rst b/docs/hook-methods.rst
new file mode 100644
index 000000000..a34057d8c
--- /dev/null
+++ b/docs/hook-methods.rst
@@ -0,0 +1,276 @@
+Hook Methods
+============
+
+Common Hook Methods
+-------------------
+
+These methods are common to the hooks of all frameworks.
+
+.. note::
+
+   The methods in this page are available after you create a hook object.
+
+   - **TensorFlow**
+
+     .. code:: python
+
+        import smdebug.tensorflow as smd
+        hook = smd.KerasHook.create_from_json_file()
+        hook = smd.SessionHook.create_from_json_file()
+        hook = smd.EstimatorHook.create_from_json_file()
+
+     In the case of TensorFlow, you need to pick the appropriate
+     ``HookClass`` among ``KerasHook``, ``SessionHook``, and
+     ``EstimatorHook``, depending on how your training script is composed.
+     For more information, see :doc:`TensorFlow Hook <tensorflow>`.
+
+   - **PyTorch**
+
+     .. code:: python
+
+        import smdebug.pytorch as smd
+        hook = smd.Hook.create_from_json_file()
+
+   - **MXNet**
+
+     .. code:: python
+
+        import smdebug.mxnet as smd
+        hook = smd.Hook.create_from_json_file()
+
+   - **XGBoost**
+
+     .. code:: python
+
+        import smdebug.xgboost as smd
+        hook = smd.Hook.create_from_json_file()
+
+
+.. method:: add_collection(collection)
+
+   Takes a Collection object and adds it to the CollectionManager that the
+   Hook holds. Note that you should only pass in a Collection object for
+   the same framework as the hook.
+
+   **Parameters:**
+
+   - collection (smd.Collection)
+
+
+.. method:: get_collection(name)
+
+   Returns the collection identified by the given name.
+
+   **Parameters:**
+
+   - name (str)
+
+.. method:: get_collections()
+
+   Returns all collection objects held by the hook.
+
+
+.. method:: set_mode(mode)
+
+   Sets the mode of the job: ``smd.modes.TRAIN``, ``smd.modes.EVAL``,
+   ``smd.modes.PREDICT``, or ``smd.modes.GLOBAL``. For more information,
+   see :doc:`Modes <smdebug.modes>`.
+
+   **Parameters:**
+
+   - value of the enum ``smd.modes``
+
+.. method:: create_from_json_file(json_file_path=None)
+
+   Takes the path of a file which holds the JSON configuration of the
+   hook, and creates the hook from that configuration.
+   ``json_file_path`` is an optional parameter. If it is not passed, the
+   method tries to get the file path from the value of the environment
+   variable SMDEBUG_CONFIG_FILE_PATH and defaults to
+   ``/opt/ml/input/config/debughookconfig.json``.
+   When training on SageMaker you do not have to specify any path because
+   this is the default path that SageMaker writes the hook configuration
+   to.
+
+   **Parameters:**
+
+   - json_file_path (str)
+
+
+.. method:: close()
+
+   Closes all files that are currently open by the hook.
+
+
+.. method:: save_scalar(name, value, sm_metric=False)
+
+   Saves a scalar value by the given name. Passing the ``sm_metric=True``
+   flag also makes this scalar available as a SageMaker Metric to show up
+   in SageMaker Studio. Note that when ``sm_metric`` is ``False``, this
+   scalar always resides only in your AWS account, but setting it to
+   ``True`` saves the scalar also on AWS servers. The default value of
+   ``sm_metric`` for this method is ``False``.
+
+   **Parameters:**
+
+   - ``name (str)``, ``value (float)``, ``sm_metric (bool)``
+
+
+.. method:: save_tensor(tensor_name, tensor_value, collections_to_write)
+
+   Manually saves metrics tensors. The ``record_tensor_value()`` API is
+   deprecated in favor of ``save_tensor()``.
+
+   **Parameters:**
+
+   - ``tensor_name (str)``, ``tensor_value (numpy.array or numpy.ndarray)``,
+     ``collections_to_write (str or list[str])``
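+
+As a short sketch of the two save methods above (the ``hook`` object is
+assumed to have been created as shown earlier, and the names and values
+are hypothetical):
+
+.. code:: python
+
+   import numpy as np
+
+   # Save a scalar and also surface it as a SageMaker Metric.
+   hook.save_scalar("my_learning_rate", 0.001, sm_metric=True)
+
+   # Save an arbitrary numpy value into a custom collection.
+   hook.save_tensor("inputs_snapshot", np.zeros((4, 4)),
+                    collections_to_write="custom_collection")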
+
+
+TensorFlow specific Hook API
+----------------------------
+
+Note that there are three types of hooks in TensorFlow: ``SessionHook``,
+``EstimatorHook``, and ``KerasHook``, based on the TensorFlow interface
+being used for training. :doc:`TensorFlow <tensorflow>` shows examples of
+each of these.
+
+``wrap_optimizer(optimizer)``
+   **Arguments**: ``optimizer`` (``tf.train.Optimizer`` or
+   ``tf.keras.Optimizer``)
+
+   **Returns**: the same optimizer object passed, with a couple of
+   identifying markers to help ``smdebug``. This returned optimizer should
+   be used for training.
+
+   **Behavior**: When not using Zero Script Change environments, calling
+   this method on your optimizer is necessary for SageMaker Debugger to
+   identify and save gradient tensors. Note that this method returns the
+   same optimizer object passed and does not change your optimization
+   logic. If the hook is of type ``KerasHook``, you can pass in either an
+   object of type ``tf.train.Optimizer`` or ``tf.keras.Optimizer``. If the
+   hook is of type ``SessionHook`` or ``EstimatorHook``, the optimizer can
+   only be of type ``tf.train.Optimizer``.
+
+``add_to_collection(collection_name, variable)``
+   **Arguments**: ``collection_name (str)``, the name of the collection to
+   add to, and ``variable``, the parameter to pass to the collection's
+   ``add`` method.
+
+   **Returns**: ``None``
+
+   **Behavior**: Calls the ``add`` method of a collection object. See
+   `this section <#collection>`__ for more.
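+
+The following is a sketch of how ``wrap_optimizer`` is typically used in
+script mode with a ``KerasHook`` (the model and training data are
+placeholders):
+
+.. code:: python
+
+   import tensorflow as tf
+   import smdebug.tensorflow as smd
+
+   hook = smd.KerasHook.create_from_json_file()
+
+   # Wrap the optimizer so Debugger can identify and save gradients;
+   # the returned object is the same optimizer and is used as usual.
+   optimizer = hook.wrap_optimizer(tf.keras.optimizers.Adam())
+
+   model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
+   model.fit(x_train, y_train, epochs=1, callbacks=[hook])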
+
+The following hook API is specific to training scripts using the TF 2.x
+``GradientTape``:
+
+``wrap_tape(tape)``
+   **Arguments**: ``tape`` (``tensorflow.python.eager.backprop.GradientTape``)
+
+   **Returns**: a tape object with three identifying markers to help
+   ``smdebug``. This returned tape should be used for training.
+
+   **Behavior**: When not using Zero Script Change environments, calling
+   this method on your tape is necessary for SageMaker Debugger to
+   identify and save gradient tensors. Note that this method returns the
+   same tape object passed.
+
+
+MXNet specific Hook API
+-----------------------
+
+``register_block(block)``
+   **Arguments**: ``block (mx.gluon.Block)``
+
+   **Behavior**: Calling this method applies the hook to the Gluon block
+   representing the model, so SageMaker Debugger gets called by MXNet and
+   can save the tensors required.
+
+
+PyTorch specific Hook API
+-------------------------
+
+``register_module(module)``
+   **Arguments**: ``module (torch.nn.Module)``
+
+   **Behavior**: Calling this method applies the hook to the Torch Module
+   representing the model, so SageMaker Debugger gets called by PyTorch
+   and can save the tensors required.
+
+``register_loss(loss_module)``
+   **Arguments**: ``loss_module (torch.nn.modules.loss._Loss)``
+
+   **Behavior**: Calling this method applies the hook to the Torch Module
+   representing the loss, so SageMaker Debugger can save losses.
+
+--------------
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 000000000..1f6d4da09
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,61 @@
+Welcome to the SMDebug documentation!
+=====================================
+
+|codecov| |PyPI|
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Introduction to SMDebug
+
+   README
+   release-notes
+
+.. toctree::
+   :maxdepth: 2
+   :caption: SMDebug Hooks
+
+   hook-api
+   hook-methods
+   collections
+   hook-config-in-smestimator
+   other-configs
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Rules
+
+   smdebug.rules
+
+.. toctree::
+   :maxdepth: 2
+   :caption: SMDebug Debugger Modules
+
+   smdebug.trials
+   trial-api
+   tensor-api
+   smdebug.modes
+
+.. toctree::
+   :maxdepth: 2
+   :caption: SMDebug Profiler Modules
+
+   smdebug.profiler.analysis
+   smdebug.profiler.analysis.utils
+   smdebug.profiler.analysis.notebook_utils
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Handling Exceptions
+
+   smdebug.exceptions
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Glossary
+
+   glossary
+
+.. |codecov| image:: https://codecov.io/gh/awslabs/sagemaker-debugger/branch/master/graph/badge.svg
+   :target: https://codecov.io/gh/awslabs/sagemaker-debugger
+.. |PyPI| image:: https://badge.fury.io/py/smdebug.svg
+   :target: https://badge.fury.io/py/smdebug
diff --git a/docs/make.bat b/docs/make.bat
new file mode 100644
index 000000000..2119f5109
--- /dev/null
+++ b/docs/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+    set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+    echo.
+    echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+    echo.installed, then set the SPHINXBUILD environment variable to point
+    echo.to the full path of the 'sphinx-build' executable.
Alternatively you + echo.may add the Sphinx directory to PATH. + echo. + echo.If you don't have Sphinx installed, grab it from + echo.http://sphinx-doc.org/ + exit /b 1 +) + +%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% +goto end + +:help +%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% + +:end +popd diff --git a/docs/make_html.sh b/docs/make_html.sh new file mode 100755 index 000000000..d8157a19c --- /dev/null +++ b/docs/make_html.sh @@ -0,0 +1,3 @@ +make clean +make html +open _build/html/index.html diff --git a/docs/mxnet.md b/docs/mxnet.md deleted file mode 100644 index d418647d2..000000000 --- a/docs/mxnet.md +++ /dev/null @@ -1,102 +0,0 @@ -# MXNet - -## Contents -- [Support](#support) -- [How to Use](#how-to-use) -- [Example](#example) -- [Full API](#full-api) - ---- - -## Support - -- Zero Script Change experience where you need no modifications to your training script is supported in the official [AWS Deep Learning Container for MXNet](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers). -- This library itself supports the following versions when you use our API which requires a few minimal changes to your training script: MXNet 1.4, 1.5, 1.6, and 1.7. -- Only Gluon models are supported -- When the Gluon model is hybridized, inputs and outputs of intermediate layers can not be saved -- Parameter server based distributed training is not yet supported - ---- - -## How to Use -### Using Zero Script Change containers -In this case, you don't need to do anything to get the hook running. You are encouraged to configure the hook from the SageMaker python SDK so you can run different jobs with different configurations without having to modify your script. If you want access to the hook to configure certain things which can not be configured through the SageMaker SDK, you can retrieve the hook as follows. -``` -import smdebug.mxnet as smd -hook = smd.Hook.create_from_json_file() -``` -Note that you can create the hook from smdebug's python API as is being done in the next section even in such containers. - -### Bring your own container experience -#### 1. Create a hook -If using SageMaker, you will configure the hook in SageMaker's python SDK using the Estimator class. Instantiate it with -`smd.Hook.create_from_json_file()`. Otherwise, call the hook class constructor, `smd.Hook()`. - -#### 2. Register the model to the hook -Call `hook.register_block(net)`. - -#### 3. Take actions using the hook APIs - -For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [MXNet specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#mxnet-specific-hook-api). - ---- - -## Example -```python -####################################### -# Creating a hook. 
Refer `API for Saving Tensors` page for more on this -import smdebug.mxnet as smd -hook = smd.Hook(out_dir=args.out_dir) -####################################### - -import mxnet as mx -from mxnet import gluon -from mxnet import autograd as ag -from mxnet.gluon import nn -net = nn.HybridSequential() -net.add( - nn.Dense(128, activation='relu'), - nn.Dense(64, activation='relu'), - nn.Dense(10, activation="relu"), -) -net.initialize(init=init.Xavier(), ctx=mx.cpu()) -softmax_cross_entropy_loss = gluon.loss.SoftmaxCrossEntropyLoss() -trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': args.lr}) - - -####################################### -# Here we register the block to smdebug -hook.register_block(net) -####################################### - -batch_size = 100 -mnist = mx.test_utils.get_mnist() -train_data = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True) -val_data = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size) - -for i in range(args.epochs): - # Reset the train data iterator. - train_data.reset() - # Loop over the train data iterator. - for batch in train_data: - data = gluon.utils.split_and_load(batch.data[0], ctx_list=ctx, batch_axis=0) - label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0) - outputs = [] - with ag.record(): - for x, y in zip(data, label): - z = net(x) - loss = softmax_cross_entropy_loss(z, y) - loss.backward() - outputs.append(z) - metric.update(label, outputs) - trainer.step(batch.data[0].shape[0]) - name, acc = metric.get() - metric.reset() -``` - ---- - -## Full API -See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig - -See the [Analysis](analysis) page for details about analyzing a training job. diff --git a/docs/mxnet.rst b/docs/mxnet.rst new file mode 100644 index 000000000..5316f13ac --- /dev/null +++ b/docs/mxnet.rst @@ -0,0 +1,143 @@ +MXNet +===== + +Contents +-------- + +- `Support <#support>`__ +- `How to Use <#how-to-use>`__ +- `Example <#example>`__ +- `Full API <#full-api>`__ + +-------------- + +Support +------- + +- Zero Script Change experience where you need no modifications to your + training script is supported in the official `AWS Deep Learning + Container for + MXNet `__. +- This library itself supports the following versions when you use our + API which requires a few minimal changes to your training script: + MXNet 1.4, 1.5, 1.6, and 1.7. +- Only Gluon models are supported +- When the Gluon model is hybridized, inputs and outputs of + intermediate layers can not be saved +- Parameter server based distributed training is not yet supported + +-------------- + +How to Use +---------- + +Using Zero Script Change containers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this case, you don’t need to do anything to get the hook running. You +are encouraged to configure the hook from the SageMaker python SDK so +you can run different jobs with different configurations without having +to modify your script. If you want access to the hook to configure +certain things which can not be configured through the SageMaker SDK, +you can retrieve the hook as follows. + +:: + + import smdebug.mxnet as smd + hook = smd.Hook.create_from_json_file() + +Note that you can create the hook from smdebug’s python API as is being +done in the next section even in such containers. + +Bring your own container experience +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +1. 
Create a hook
+^^^^^^^^^^^^^^^^
+
+If using SageMaker, you configure the hook in SageMaker's Python SDK
+using the Estimator class, and instantiate it in your script with
+``smd.Hook.create_from_json_file()``. Otherwise, call the hook class
+constructor, ``smd.Hook()``.
+
+2. Register the model to the hook
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Call ``hook.register_block(net)``.
+
+3. Take actions using the hook APIs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For a full list of actions that the hook APIs offer to construct hooks
+and save tensors, see `Common hook
+API <https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api>`__
+and `MXNet specific hook
+API <https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#mxnet-specific-hook-api>`__.
+
+--------------
+
+Example
+-------
+
+.. code:: python
+
+   #######################################
+   # Creating a hook. Refer to the `API for Saving Tensors` page for more on this
+   import smdebug.mxnet as smd
+   hook = smd.Hook(out_dir=args.out_dir)
+   #######################################
+
+   import mxnet as mx
+   from mxnet import autograd as ag
+   from mxnet import gluon, init
+   from mxnet.gluon import nn
+
+   net = nn.HybridSequential()
+   net.add(
+       nn.Dense(128, activation='relu'),
+       nn.Dense(64, activation='relu'),
+       nn.Dense(10, activation="relu"),
+   )
+   net.initialize(init=init.Xavier(), ctx=mx.cpu())
+   softmax_cross_entropy_loss = gluon.loss.SoftmaxCrossEntropyLoss()
+   trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': args.lr})
+
+   # Context list and metric used in the training loop below
+   ctx = [mx.cpu()]
+   metric = mx.metric.Accuracy()
+
+   #######################################
+   # Here we register the block to smdebug
+   hook.register_block(net)
+   #######################################
+
+   batch_size = 100
+   mnist = mx.test_utils.get_mnist()
+   train_data = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
+   val_data = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
+
+   for i in range(args.epochs):
+       # Reset the train data iterator.
+       train_data.reset()
+       # Loop over the train data iterator.
+       for batch in train_data:
+           data = gluon.utils.split_and_load(batch.data[0], ctx_list=ctx, batch_axis=0)
+           label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
+           outputs = []
+           with ag.record():
+               for x, y in zip(data, label):
+                   z = net(x)
+                   loss = softmax_cross_entropy_loss(z, y)
+                   loss.backward()
+                   outputs.append(z)
+           metric.update(label, outputs)
+           trainer.step(batch.data[0].shape[0])
+       name, acc = metric.get()
+       metric.reset()
+
+--------------
+
+Full API
+--------
+
+See the `API for Saving Tensors <api.md>`__ page for details about Hook,
+Collection, SaveConfig, and ReductionConfig.
+
+See the `Analysis <analysis.md>`__ page for details about analyzing a
+training job.
diff --git a/docs/other-configs.rst b/docs/other-configs.rst
new file mode 100644
index 000000000..53b71a584
--- /dev/null
+++ b/docs/other-configs.rst
@@ -0,0 +1,177 @@
+Additional Hook Config APIs
+===========================
+
+SaveConfig
+----------
+
+The SaveConfig class customizes the frequency of saving tensors. The
+hook takes a SaveConfig object which is applied as default to all
+tensors included. A collection can also have a SaveConfig object which
+is applied to the collection's tensors. You can also choose to have
+different configurations for when to save tensors based on the mode of
+the job.
+
+This class is available in the following namespaces: ``smdebug`` and
+``smdebug.{framework}``.
+
+.. code:: python
+
+    import smdebug as smd
+    save_config = smd.SaveConfig(
+        mode_save_configs = None,
+        save_interval = 500,
+        start_step = 0,
+        end_step = None,
+        save_steps = None,
+    )
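+
+For instance, the following sketch attaches a ``SaveConfig`` to a hook as
+its default (the MXNet hook is used here, but any framework hook takes
+the same parameter, and the ``out_dir`` is illustrative):
+
+.. code:: python
+
+    import smdebug.mxnet as smd
+
+    # Save every 10 steps by default; out_dir must not already exist.
+    hook = smd.Hook(
+        out_dir="/tmp/smdebug_saveconfig_demo",
+        save_config=smd.SaveConfig(save_interval=10)
+    )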
+.. _arguments-1:
+
+Arguments
+~~~~~~~~~
+
+- ``mode_save_configs`` (dict): Used for advanced cases; see details
+  below.
+- ``save_interval`` (int): How often, in steps, to save tensors.
+  Defaults to 500. A step is saved if ``step % save_interval == 0``.
+- ``start_step`` (int): When to start saving tensors.
+- ``end_step`` (int): When to stop saving tensors, exclusive.
+- ``save_steps`` (list[int]): Specific steps to save tensors at. Union
+  with ``save_interval``.
+
+Examples
+~~~~~~~~
+
+- ``SaveConfig()`` will save at steps 0, 500, …
+- ``SaveConfig(save_interval=1)`` will save at steps 0, 1, …
+- ``SaveConfig(save_interval=100, end_step=200)`` will save at steps 0,
+  100
+- ``SaveConfig(save_interval=100, end_step=201)`` will save at steps 0,
+  100, 200
+- ``SaveConfig(save_interval=100, start_step=150)`` will save at steps
+  200, 300, …
+- ``SaveConfig(save_steps=[3, 7])`` will save at steps 0, 3, 7, 500, …
+
+Specifying different configuration based on mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There is also a more advanced use case, where you specify a different
+SaveConfig for each mode. It is best understood through an example:
+
+.. code:: python
+
+    import smdebug as smd
+    smd.SaveConfig(mode_save_configs={
+        smd.modes.TRAIN: smd.SaveConfigMode(save_interval=1),
+        smd.modes.EVAL: smd.SaveConfigMode(save_interval=2),
+        smd.modes.PREDICT: smd.SaveConfigMode(save_interval=3),
+        smd.modes.GLOBAL: smd.SaveConfigMode(save_interval=4)
+    })
+
+Essentially, you create a dictionary mapping modes to SaveConfigMode
+objects. The SaveConfigMode objects take the same four parameters
+(``save_interval``, ``start_step``, ``end_step``, and ``save_steps``) as
+the main object. Any mode not specified will default to the default
+configuration. If a mode is provided but not all parameters are
+specified, the default values are used for the non-specified parameters.
+
+Configuration using SageMaker Python SDK
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Refer to `Configuring Hook using SageMaker Python
+SDK <#configuring-hook-using-sagemaker-python-sdk>`__ and `Configuring
+Collection using SageMaker Python
+SDK <#configuring-collection-using-sagemaker-python-sdk>`__.
+
+--------------
+
+ReductionConfig
+---------------
+
+ReductionConfig allows the saving of certain reductions of tensors
+instead of saving the full tensor. The motivation here is to reduce the
+amount of data saved, and increase the speed in cases where you don't
+need the full tensor. The reduction operations are computed during the
+training process and then saved.
+
+During analysis, these are available as reductions of the original
+tensor. Please note that using reduction config means that you will not
+have the full tensor available during analysis, so this can restrict
+what you can do with the tensor saved. You can choose to also save the
+raw tensor along with the reductions if you so desire.
+
+The hook takes a ReductionConfig object which is applied as default to
+all tensors included. A collection can also have its own ReductionConfig
+object which is applied to the tensors belonging to that collection.
+
+.. code:: python
+
+    import smdebug as smd
+    reduction_config = smd.ReductionConfig(
+        reductions = None,
+        abs_reductions = None,
+        norms = None,
+        abs_norms = None,
+        save_raw_tensor = False,
+    )
+
+.. 
_arguments-2: + +Arguments +~~~~~~~~~ + +- ``reductions`` (list[str]): Takes names of reductions, choosing from + “min”, “max”, “median”, “mean”, “std”, “variance”, “sum”, “prod” +- ``abs_reductions`` (list[str]): Same as reductions, except the + reduction will be computed on the absolute value of the tensor +- ``norms`` (list[str]): Takes names of norms to compute, choosing from + “l1”, “l2” +- ``abs_norms`` (list[str]): Same as norms, except the norm will be + computed on the absolute value of the tensor +- ``save_raw_tensor`` (bool): Saves the tensor directly, in addition to + other desired reductions + +For example, + +``ReductionConfig(reductions=['std', 'variance'], abs_reductions=['mean'], norms=['l1'])`` + +will save the standard deviation and variance, the mean of the absolute +value, and the l1 norm. + +.. _configuration-using-sagemaker-python-sdk-1: + +Configuration using SageMaker Python SDK +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The reductions are passed as part of the “reductions” parameter to +HookParameters or Collection Parameters. Refer `Configuring Hook using +SageMaker Python SDK <#configuring-hook-using-sagemaker-python-sdk>`__ +and `Configuring Collection using SageMaker Python +SDK <#configuring-collection-using-sagemaker-python-sdk>`__ for more on +that. + +The parameter “reductions” can take a comma separated string consisting +of the following values: + +:: + + min + max + median + mean + std + variance + sum + prod + l1 + l2 + abs_min + abs_max + abs_median + abs_mean + abs_std + abs_variance + abs_sum + abs_prod + abs_l1 + abs_l2 diff --git a/docs/pytorch.md b/docs/pytorch.md deleted file mode 100644 index f312d19e1..000000000 --- a/docs/pytorch.md +++ /dev/null @@ -1,125 +0,0 @@ -# PyTorch - -## Contents -- [Support](#support) -- [How to Use](#how-to-use) -- [Module Loss Example](#module-loss-example) -- [Functional Loss Example](#functional-loss-example) -- [Full API](#full-api) - -## Support -### Versions -- Zero Script Change experience where you need no modifications to your training script is supported in the official [AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers). - -- The library itself supports the following versions when using changes to the training script: PyTorch 1.2, 1.3, 1.4, 1.5, and 1.6. - ---- - -## How to Use -### Using Zero Script Change containers -In this case, you don't need to do anything to get the hook running. You are encouraged to configure the hook from the SageMaker python SDK so you can run different jobs with different configurations without having to modify your script. If you want access to the hook to configure certain things which can not be configured through the SageMaker SDK, you can retrieve the hook as follows. -``` -import smdebug.pytorch as smd -hook = smd.Hook.create_from_json_file() -``` -Note that you can create the hook from smdebug's python API as is being done in the next section even in such containers. - -### Bring your own container experience -#### 1. Create a hook -If using SageMaker, you will configure the hook in SageMaker's python SDK using the Estimator class. Instantiate it with -`smd.Hook.create_from_json_file()`. Otherwise, call the hook class constructor, `smd.Hook()`. - -#### 2. Register the model to the hook -Call `hook.register_module(net)`. - -#### 3. 
Register your loss function to the hook -If using a loss which is a subclass of `nn.Module`, call `hook.register_loss(loss_criterion)` once before starting training.\ -If using a loss which is a subclass of `nn.functional`, call `hook.record_tensor_value(loss)` after each training step. - -#### 4. Take actions using the hook APIs - -For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [PyTorch specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#pytorch-specific-hook-api). - ---- - -## Module Loss Example -```python -####################################### -# Creating a hook. Refer `API for Saving Tensors` page for more on this -import smdebug.pytorch as smd -hook = smd.Hook(out_dir=args.out_dir) -####################################### - -class Model(nn.Module) - def __init__(self): - super().__init__() - self.fc = nn.Linear(784, 10) - - def forward(self, x): - return F.relu(self.fc(x)) - -net = Model() -criterion = nn.CrossEntropyLoss() -optimizer = optim.Adam(net.parameters(), lr=args.lr) - -####################################### -# Register the hook and the loss -hook.register_module(net) -hook.register_loss(criterion) -####################################### - -# Training loop as usual -for (inputs, labels) in trainloader: - optimizer.zero_grad() - outputs = net(inputs) - loss = criterion(outputs, labels) - loss.backward() - optimizer.step() -``` - ---- - -## Functional Loss Example -```python -####################################### -# Register the hook and the loss -import smdebug.pytorch as smd -hook = smd.Hook(out_dir=args.out_dir) -####################################### - -class Model(nn.Module) - def __init__(self): - super().__init__() - self.fc = nn.Linear(784, 10) - - def forward(self, x): - return F.relu(self.fc(x)) - -net = Model() -optimizer = optim.Adam(net.parameters(), lr=args.lr) - -####################################### -# Register the hook -hook.register_module(net) -####################################### - -# Training loop, recording the loss at each iteration -for (inputs, labels) in trainloader: - optimizer.zero_grad() - outputs = net(inputs) - loss = F.cross_entropy(outputs, labels) - - ####################################### - # Manually record the loss - hook.record_tensor_value(tensor_name="loss", tensor_value=loss) - ####################################### - - loss.backward() - optimizer.step() -``` - ---- - -## Full API -See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig. -See the [Analysis](analysis.md) page for details about analyzing a training job. diff --git a/docs/pytorch.rst b/docs/pytorch.rst new file mode 100644 index 000000000..a8d873851 --- /dev/null +++ b/docs/pytorch.rst @@ -0,0 +1,173 @@ +PyTorch +======= + +Contents +-------- + +- `Support <#support>`__ +- `How to Use <#how-to-use>`__ +- `Module Loss Example <#module-loss-example>`__ +- `Functional Loss Example <#functional-loss-example>`__ +- `Full API <#full-api>`__ + +Support +------- + +Versions +~~~~~~~~ + +- Zero Script Change experience where you need no modifications to your + training script is supported in the official `AWS Deep Learning + Container for + PyTorch `__. + +- The library itself supports the following versions when using changes + to the training script: PyTorch 1.2, 1.3, 1.4, 1.5, and 1.6. 
+
+--------------
+
+How to Use
+----------
+
+Using Zero Script Change containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this case, you don't need to do anything to get the hook running. You
+are encouraged to configure the hook from the SageMaker Python SDK so
+you can run different jobs with different configurations without having
+to modify your script. If you want access to the hook to configure
+certain things which cannot be configured through the SageMaker SDK,
+you can retrieve the hook as follows.
+
+::
+
+    import smdebug.pytorch as smd
+    hook = smd.Hook.create_from_json_file()
+
+Note that you can create the hook from smdebug's Python API, as is done
+in the next section, even in such containers.
+
+Bring your own container experience
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Create a hook
+^^^^^^^^^^^^^^^^
+
+If using SageMaker, you configure the hook in SageMaker's Python SDK
+using the Estimator class, and instantiate it in your script with
+``smd.Hook.create_from_json_file()``. Otherwise, call the hook class
+constructor, ``smd.Hook()``.
+
+2. Register the model to the hook
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Call ``hook.register_module(net)``.
+
+3. Register your loss function to the hook
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+| If using a loss which is a subclass of ``nn.Module``, call
+  ``hook.register_loss(loss_criterion)`` once before starting training.
+| If using a loss which is a subclass of ``nn.functional``, call
+  ``hook.record_tensor_value(loss)`` after each training step.
+
+4. Take actions using the hook APIs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For a full list of actions that the hook APIs offer to construct hooks
+and save tensors, see `Common hook
+API <https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api>`__
+and `PyTorch specific hook
+API <https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#pytorch-specific-hook-api>`__.
+
+--------------
+
+Module Loss Example
+-------------------
+
+.. code:: python
+
+   #######################################
+   # Creating a hook. Refer to the `API for Saving Tensors` page for more on this
+   import smdebug.pytorch as smd
+   hook = smd.Hook(out_dir=args.out_dir)
+   #######################################
+
+   import torch.nn as nn
+   import torch.nn.functional as F
+   import torch.optim as optim
+
+   class Model(nn.Module):
+       def __init__(self):
+           super().__init__()
+           self.fc = nn.Linear(784, 10)
+
+       def forward(self, x):
+           return F.relu(self.fc(x))
+
+   net = Model()
+   criterion = nn.CrossEntropyLoss()
+   optimizer = optim.Adam(net.parameters(), lr=args.lr)
+
+   #######################################
+   # Register the hook and the loss
+   hook.register_module(net)
+   hook.register_loss(criterion)
+   #######################################
+
+   # Training loop as usual
+   for (inputs, labels) in trainloader:
+       optimizer.zero_grad()
+       outputs = net(inputs)
+       loss = criterion(outputs, labels)
+       loss.backward()
+       optimizer.step()
+
+--------------
+
+Functional Loss Example
+-----------------------
+
+.. code:: python
+
+   #######################################
+   # Creating a hook; the loss is recorded manually below
+   import smdebug.pytorch as smd
+   hook = smd.Hook(out_dir=args.out_dir)
+   #######################################
+
+   import torch.nn as nn
+   import torch.nn.functional as F
+   import torch.optim as optim
+
+   class Model(nn.Module):
+       def __init__(self):
+           super().__init__()
+           self.fc = nn.Linear(784, 10)
+
+       def forward(self, x):
+           return F.relu(self.fc(x))
+
+   net = Model()
+   optimizer = optim.Adam(net.parameters(), lr=args.lr)
+
+   #######################################
+   # Register the hook
+   hook.register_module(net)
+   #######################################
+
+   # Training loop, recording the loss at each iteration
+   for (inputs, labels) in trainloader:
+       optimizer.zero_grad()
+       outputs = net(inputs)
+       loss = F.cross_entropy(outputs, labels)
+
+       #######################################
+       # Manually record the loss
+       hook.record_tensor_value(tensor_name="loss", tensor_value=loss)
+       #######################################
+
+       loss.backward()
+       optimizer.step()
+
+--------------
+
+Full API
+--------
+
+See the `API for Saving Tensors <api.md>`__ page for details about Hook,
+Collection, SaveConfig, and ReductionConfig. See the
+`Analysis <analysis.md>`__ page for details about analyzing a training
+job.
diff --git a/docs/release-notes.rst b/docs/release-notes.rst
new file mode 100644
index 000000000..24ce67586
--- /dev/null
+++ b/docs/release-notes.rst
@@ -0,0 +1,80 @@
+Release Notes
+=============
+
+----
+
+SMDebug Library 1.0.10 Release Notes
+------------------------------------
+
+Date: June 10, 2021
+
+
+New Features
+~~~~~~~~~~~~
+
+- PyTorch
+
+  - Support for PyTorch 1.9.0 (#501)
+
+
+Bug Fixes
+~~~~~~~~~
+
+- None
+
+
+Improvements
+~~~~~~~~~~~~
+
+- TensorFlow
+
+  - Add error handling for TensorFlow v1.x (#498)
+  - Add safety checks for TensorFlow v2.x error handling (#497)
+
+- MXNet
+
+  - Add error handling for MXNet (#499)
+
+- XGBoost
+
+  - Add error handling for XGBoost (#496)
+
+- Other
+
+  - Error handling updates so that SMDebug does not disrupt training jobs
+    that use the default configurations. This improvement keeps training
+    jobs from failing due to an error in SMDebug or its dependencies.
+
+
+Known Issues
+~~~~~~~~~~~~
+
+- PyTorch
+
+  - The autograd based detailed profiling is not supported for PyTorch 1.9.0.
+
+- SMDebug has a fixed range of framework versions that it supports for
+  TensorFlow and PyTorch.
+
+- Detailed profiling is not supported for training jobs with SageMaker
+  distributed model parallel.
+
+
+Migration to Deep Learning Containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TBD
+
+For previous migrations, see *Release Notes for Deep Learning Containers*
+in the AWS documentation.
+
+
+----
+
+
+SMDebug Library Release Notes
+-----------------------------
+
+Date: June 10, 2021
+
+The SMDebug client library started tracking releases with this release.
+
+For previous release notes, see `Releases
+<https://github.com/awslabs/sagemaker-debugger/releases>`__
+in the SMDebug GitHub repository.
diff --git a/docs/sagemaker.md b/docs/sagemaker.md deleted file mode 100644 index e8b4db36c..000000000 --- a/docs/sagemaker.md +++ /dev/null @@ -1,267 +0,0 @@ -## Running SageMaker jobs with Amazon SageMaker Debugger - -### Outline -- [Configuring SageMaker Debugger](#configuring-sagemaker-debugger) - - [Saving data](#saving-data) - - [Saving built-in collections that we manage](#saving-built-in-collections-that-we-manage) - - [Saving reductions for a custom collection](#saving-reductions-for-a-custom-collection) - - [Enabling TensorBoard summaries](#enabling-tensorboard-summaries) - - [Rules](#rules) - - [Built In Rules](#built-in-rules) - - [Custom Rules](#custom-rules) -- [Interactive Exploration](#interactive-exploration) -- [SageMaker Studio](#sagemaker-studio) -- [TensorBoard Visualization](#tensorboard-visualization) -- [Example Notebooks](#example-notebooks) - -## Configuring SageMaker Debugger - -Regardless of which of the two above ways you have enabled SageMaker Debugger, you can configure it using the SageMaker python SDK. There are two aspects to this configuration. -- You can specify what tensors to be saved, when they should be saved and in what form they should be saved. -- You can specify which Rule you want to monitor your training job with. This can be either a built in rule that SageMaker provides, or a custom rule that you can write yourself. - -### Saving Data - -SageMaker Debugger gives you a powerful and flexible API to save the tensors you choose at the frequencies you want. These configurations are made available in the SageMaker Python SDK through the `DebuggerHookConfig` class. - -#### Saving built-in collections that we manage -Learn more about these built in collections [here](api.md). - -```python -from sagemaker.debugger import DebuggerHookConfig, CollectionConfig -hook_config = DebuggerHookConfig( - s3_output_path='s3://smdebug-dev-demo-pdx/mnist', - hook_parameters={ - "save_interval": 100 - }, - collection_configs=[ - CollectionConfig("weights"), - CollectionConfig("gradients"), - CollectionConfig("losses"), - CollectionConfig( - name="biases", - parameters={ - "save_interval": 10, - "end_step": 500 - } - ), - ] -) -import sagemaker as sm -sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - debugger_hook_config=hook_config -) -sagemaker_estimator.fit() -``` - -#### Saving reductions for a custom collection -You can define your collection of tensors. You can also choose to save certain reductions of tensors only instead of saving the full tensor. You may choose to do this to reduce the amount of data saved. Please note that when you save reductions, unless you pass the flag `save_raw_tensor`, only these reductions will be available for analysis. The raw tensor will not be saved. 
- -```python -from sagemaker.debugger import DebuggerHookConfig, CollectionConfig -hook_config = DebuggerHookConfig( - s3_output_path='s3://smdebug-dev-demo-pdx/mnist', - collection_configs=[ - CollectionConfig( - name="activations", - parameters={ - "include_regex": "relu|tanh", - "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max" - }) - ] -) -import sagemaker as sm -sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - debugger_hook_config=hook_config -) -sagemaker_estimator.fit() -``` - -#### Enabling TensorBoard summaries -SageMaker Debugger can automatically generate tensorboard scalar summaries, -distributions and histograms for tensors saved. This can be enabled by -passing a `TensorBoardOutputConfig` object when creating an Estimator as follows. -You can also choose to disable or enable histograms specifically for different collections. -By default a collection has `save_histogram` flag set to True. -Note that scalar summaries are added to TensorBoard for all `ScalarCollections` and any scalar saved through `hook.save_scalar`. -Refer [API](api.md) for more details on scalar collections and `save_scalar` method. - -The below example saves weights and gradients as full tensors, and also saves the gradients as histograms and distributions to visualize in TensorBoard. -These will be saved to the location passed in `TensorBoardOutputConfig` object. -```python -from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig -hook_config = DebuggerHookConfig( - s3_output_path='s3://smdebug-dev-demo-pdx/mnist', - collection_configs=[ - CollectionConfig( - name="weights", - parameters={"save_histogram": False}), - CollectionConfig(name="gradients"), - ] -) - -tb_config = TensorBoardOutputConfig('s3://smdebug-dev-demo-pdx/mnist/tensorboard') - -import sagemaker as sm -sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - debugger_hook_config=hook_config, - tensorboard_output_config=tb_config -) -sagemaker_estimator.fit() -``` - -For more details, refer our [API page](api.md). - -### Rules -Here are some examples on how to run Rules with your training jobs. - -Note that passing a `CollectionConfig` object to the Rule as `collections_to_save` -is equivalent to passing it to the `DebuggerHookConfig` object as `collection_configs`. -This is just a shortcut for your convenience. - -#### Built-in Rules -To find a full list of built-in rules that you can use with the SageMaker Python SDK, see the [List of Debugger Built-in Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) page. - -#### Running built-in SageMaker Rules -You can run a SageMaker built-in Rule as follows using the `Rule.sagemaker` method. -The first argument to this method is the base configuration that is associated with the Rule. -We configure them as much as possible. -You can take a look at the ruleconfigs that we populate for all built-in rules [here](https://github.com/awslabs/sagemaker-debugger-rulesconfig). 
-You can choose to customize these parameters using the other parameters. - -These rules are run on our pre-built Docker images which are listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html). -You are not charged for the instances when running SageMaker built-in rules. - -A list of all our built-in rules are provided [below](#built-in-rules). -```python -from sagemaker.debugger import Rule, CollectionConfig, rule_configs - -exploding_tensor_rule = Rule.sagemaker( - base_config=rule_configs.exploding_tensor(), - rule_parameters={"collection_names": "weights,losses"}, - collections_to_save=[ - CollectionConfig("weights"), - CollectionConfig("losses") - ] -) - -vanishing_gradient_rule = Rule.sagemaker( - base_config=rule_configs.vanishing_gradient() -) - -import sagemaker as sm -sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - rules=[exploding_tensor_rule, vanishing_gradient_rule] -) -sagemaker_estimator.fit() -``` - -#### Custom Rules - -You can write your own rule custom made for your application and provide it, so SageMaker can monitor your training job using your rule. To do so, you need to understand the programming model that `smdebug` provides. Our page on [Programming Model for Analysis](analysis.md) describes the APIs that we provide to help you write your own rule. -Please refer to [this example notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/tensorflow_keras_custom_rule/tf-keras-custom-rule.ipynb) for a demonstration of creating your custom rule and running it on SageMaker. - -#### Running custom Rules -To run a custom rule, you have to provide a few additional parameters. -Key parameters to note are a file which has the implementation of your Rule class `source`, - the name of the Rule class (`rule_to_invoke`), the type of instance to run the Rule job on (`instance_type`), - the size of the volume on that instance (`volume_size_in_gb`), and the docker image to use for running this job (`image_uri`). - -Please refer to the documentation [here](https://github.com/aws/sagemaker-python-sdk/blob/391733efd433c5e26afb56102c76ab7472f94b3d/src/sagemaker/debugger.py#L190) for more details. - -We have pre-built Docker images that you can use to run your custom rules. -These are listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html). -You can also choose to build your own Docker image for custom rule evaluation. -Please refer to the repository [SageMaker Debugger Rules Container](https://github.com/awslabs/sagemaker-debugger-rules-container) for instructions on how to build such an image. 
- -```python -from sagemaker.debugger import Rule, CollectionConfig - -custom_coll = CollectionConfig( - name="relu_activations", - parameters={ - "include_regex": "relu", - "save_interval": 500, - "end_step": 5000 - }) -improper_activation_rule = Rule.custom( - name='improper_activation_job', - image_uri='552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest', - instance_type='ml.c4.xlarge', - volume_size_in_gb=400, - source='rules/custom_rules.py', - rule_to_invoke='ImproperActivation', - rule_parameters={"collection_names": "relu_activations"}, - collections_to_save=[custom_coll] -) - -import sagemaker as sm -sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - rules=[improper_activation_rule], -) -sagemaker_estimator.fit() -``` - -For more details, refer our [Analysis page](analysis.md). - -## Interactive Exploration - -`smdebug` SDK also allows you perform interactive and real-time exploration of the data saved. You can choose to inspect the tensors saved, or visualize them through your custom plots. -You can retrieve these tensors as numpy arrays allowing you to use your favorite analysis libraries right in a SageMaker notebook instance. We have couple of example notebooks demonstrating this. -- [Real-time anaysis in a notebook during training](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/mxnet_realtime_analysis/mxnet-realtime-analysis.ipynb) -- [Interactive tensor analysis in a notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/mnist_tensor_analysis/mnist_tensor_analysis.ipynb) - -## SageMaker Studio - -SageMaker Debugger is on by default for supported training jobs on the official SageMaker Framework containers (or AWS Deep Learning Containers) during SageMaker training jobs. -In this default scenario, SageMaker Debugger takes the losses and metrics from your training job and publishes them to SageMaker Metrics, allowing you to track these metrics in SageMaker Studio. -You can also see the status of Rules you have enabled for your training job right in the Studio. [Here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-visualization.html) are screenshots of that experience. - -## TensorBoard Visualization - -If you have enabled TensorBoard outputs for your training job through SageMaker Debugger, TensorBoard artifacts will automatically be generated for the tensors saved. -You can then point your TensorBoard instance to that S3 location and review the visualizations for the tensors saved. - -## Example Notebooks - -We have a bunch of [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) here demonstrating different aspects of SageMaker Debugger. 
diff --git a/docs/sagemaker.rst b/docs/sagemaker.rst new file mode 100644 index 000000000..5d092c948 --- /dev/null +++ b/docs/sagemaker.rst @@ -0,0 +1,357 @@ +Running SageMaker jobs with Amazon SageMaker Debugger +===================================================== + +Outline +------- + +- `Configuring SageMaker Debugger <#configuring-sagemaker-debugger>`__ + + - `Saving data <#saving-data>`__ + + - `Saving built-in collections that we + manage <#saving-built-in-collections-that-we-manage>`__ + - `Saving reductions for a custom + collection <#saving-reductions-for-a-custom-collection>`__ + - `Enabling TensorBoard + summaries <#enabling-tensorboard-summaries>`__ + + - `Rules <#rules>`__ + + - `Built In Rules <#built-in-rules>`__ + - `Custom Rules <#custom-rules>`__ + +- `Interactive Exploration <#interactive-exploration>`__ +- `SageMaker Studio <#sagemaker-studio>`__ +- `TensorBoard Visualization <#tensorboard-visualization>`__ +- `Example Notebooks <#example-notebooks>`__ + +Configuring SageMaker Debugger +============================== + +Regardless of which of the two above ways you have enabled SageMaker +Debugger, you can configure it using the SageMaker python SDK. There are +two aspects to this configuration. - You can specify what tensors to be +saved, when they should be saved and in what form they should be saved. +- You can specify which Rule you want to monitor your training job with. +This can be either a built in rule that SageMaker provides, or a custom +rule that you can write yourself. + +Saving Data +----------- + +SageMaker Debugger gives you a powerful and flexible API to save the +tensors you choose at the frequencies you want. These configurations are +made available in the SageMaker Python SDK through the +``DebuggerHookConfig`` class. + +Saving built-in collections that we manage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Learn more about these built in collections `here `__. + +.. code:: python + + from sagemaker.debugger import DebuggerHookConfig, CollectionConfig + hook_config = DebuggerHookConfig( + s3_output_path='s3://smdebug-dev-demo-pdx/mnist', + hook_parameters={ + "save_interval": 100 + }, + collection_configs=[ + CollectionConfig("weights"), + CollectionConfig("gradients"), + CollectionConfig("losses"), + CollectionConfig( + name="biases", + parameters={ + "save_interval": 10, + "end_step": 500 + } + ), + ] + ) + import sagemaker as sm + sagemaker_estimator = sm.tensorflow.TensorFlow( + entry_point='src/mnist.py', + role=sm.get_execution_role(), + base_job_name='smdebug-demo-job', + train_instance_count=1, + train_instance_type="ml.m4.xlarge", + framework_version="1.15", + py_version="py3", + # smdebug-specific arguments below + debugger_hook_config=hook_config + ) + sagemaker_estimator.fit() + +Saving reductions for a custom collection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can define your collection of tensors. You can also choose to save +certain reductions of tensors only instead of saving the full tensor. +You may choose to do this to reduce the amount of data saved. Please +note that when you save reductions, unless you pass the flag +``save_raw_tensor``, only these reductions will be available for +analysis. The raw tensor will not be saved. + +.. 
code:: python + + from sagemaker.debugger import DebuggerHookConfig, CollectionConfig + hook_config = DebuggerHookConfig( + s3_output_path='s3://smdebug-dev-demo-pdx/mnist', + collection_configs=[ + CollectionConfig( + name="activations", + parameters={ + "include_regex": "relu|tanh", + "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max" + }) + ] + ) + import sagemaker as sm + sagemaker_estimator = sm.tensorflow.TensorFlow( + entry_point='src/mnist.py', + role=sm.get_execution_role(), + base_job_name='smdebug-demo-job', + train_instance_count=1, + train_instance_type="ml.m4.xlarge", + framework_version="1.15", + py_version="py3", + # smdebug-specific arguments below + debugger_hook_config=hook_config + ) + sagemaker_estimator.fit() + +Enabling TensorBoard summaries +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +SageMaker Debugger can automatically generate tensorboard scalar +summaries, distributions and histograms for tensors saved. This can be +enabled by passing a ``TensorBoardOutputConfig`` object when creating an +Estimator as follows. You can also choose to disable or enable +histograms specifically for different collections. By default a +collection has ``save_histogram`` flag set to True. Note that scalar +summaries are added to TensorBoard for all ``ScalarCollections`` and any +scalar saved through ``hook.save_scalar``. Refer `API `__ for +more details on scalar collections and ``save_scalar`` method. + +The below example saves weights and gradients as full tensors, and also +saves the gradients as histograms and distributions to visualize in +TensorBoard. These will be saved to the location passed in +``TensorBoardOutputConfig`` object. + +.. code:: python + + from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig + hook_config = DebuggerHookConfig( + s3_output_path='s3://smdebug-dev-demo-pdx/mnist', + collection_configs=[ + CollectionConfig( + name="weights", + parameters={"save_histogram": False}), + CollectionConfig(name="gradients"), + ] + ) + + tb_config = TensorBoardOutputConfig('s3://smdebug-dev-demo-pdx/mnist/tensorboard') + + import sagemaker as sm + sagemaker_estimator = sm.tensorflow.TensorFlow( + entry_point='src/mnist.py', + role=sm.get_execution_role(), + base_job_name='smdebug-demo-job', + train_instance_count=1, + train_instance_type="ml.m4.xlarge", + framework_version="1.15", + py_version="py3", + # smdebug-specific arguments below + debugger_hook_config=hook_config, + tensorboard_output_config=tb_config + ) + sagemaker_estimator.fit() + +For more details, refer our `API page `__. + +Rules +----- + +Here are some examples on how to run Rules with your training jobs. + +Note that passing a ``CollectionConfig`` object to the Rule as +``collections_to_save`` is equivalent to passing it to the +``DebuggerHookConfig`` object as ``collection_configs``. This is just a +shortcut for your convenience. + +Built-in Rules +~~~~~~~~~~~~~~ + +To find a full list of built-in rules that you can use with the +SageMaker Python SDK, see the `List of Debugger Built-in +Rules `__ +page. + +Running built-in SageMaker Rules +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can run a SageMaker built-in Rule as follows using the +``Rule.sagemaker`` method. The first argument to this method is the base +configuration that is associated with the Rule. We configure them as +much as possible. You can take a look at the ruleconfigs that we +populate for all built-in rules +`here `__. +You can choose to customize these parameters using the other parameters. 
+
+These rules are run on our pre-built Docker images which are listed
+`here `__.
+You are not charged for the instances when running SageMaker built-in
+rules.
+
+A list of all our built-in rules is provided
+`below <#built-in-rules>`__.
+
+.. code:: python
+
+   from sagemaker.debugger import Rule, CollectionConfig, rule_configs
+
+   exploding_tensor_rule = Rule.sagemaker(
+       base_config=rule_configs.exploding_tensor(),
+       rule_parameters={"collection_names": "weights,losses"},
+       collections_to_save=[
+           CollectionConfig("weights"),
+           CollectionConfig("losses")
+       ]
+   )
+
+   vanishing_gradient_rule = Rule.sagemaker(
+       base_config=rule_configs.vanishing_gradient()
+   )
+
+   import sagemaker as sm
+   sagemaker_estimator = sm.tensorflow.TensorFlow(
+       entry_point='src/mnist.py',
+       role=sm.get_execution_role(),
+       base_job_name='smdebug-demo-job',
+       train_instance_count=1,
+       train_instance_type="ml.m4.xlarge",
+       framework_version="1.15",
+       py_version="py3",
+       # smdebug-specific arguments below
+       rules=[exploding_tensor_rule, vanishing_gradient_rule]
+   )
+   sagemaker_estimator.fit()
+
+Custom Rules
+~~~~~~~~~~~~
+
+You can write your own rule, custom-made for your application, and
+provide it so that SageMaker can monitor your training job using your
+rule. To do so, you need to understand the programming model that
+``smdebug`` provides. Our page on `Programming Model for Analysis `__
+describes the APIs that we provide to help you write your own rule.
+Please refer to `this example
+notebook `__
+for a demonstration of creating your custom rule and running it on
+SageMaker.
+
+Running custom Rules
+~~~~~~~~~~~~~~~~~~~~
+
+To run a custom rule, you have to provide a few additional parameters.
+Key parameters to note are the file that contains the implementation of
+your Rule class (``source``), the name of the Rule class
+(``rule_to_invoke``), the type of instance to run the Rule job on
+(``instance_type``), the size of the volume on that instance
+(``volume_size_in_gb``), and the Docker image to use for running this
+job (``image_uri``).
+
+Please refer to the documentation
+`here `__
+for more details.
+
+We have pre-built Docker images that you can use to run your custom
+rules. These are listed
+`here `__.
+You can also choose to build your own Docker image for custom rule
+evaluation. Please refer to the repository `SageMaker Debugger Rules
+Container `__
+for instructions on how to build such an image.
+
+.. code:: python
+
+   from sagemaker.debugger import Rule, CollectionConfig
+
+   custom_coll = CollectionConfig(
+       name="relu_activations",
+       parameters={
+           "include_regex": "relu",
+           "save_interval": 500,
+           "end_step": 5000
+       })
+   improper_activation_rule = Rule.custom(
+       name='improper_activation_job',
+       image_uri='552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
+       instance_type='ml.c4.xlarge',
+       volume_size_in_gb=400,
+       source='rules/custom_rules.py',
+       rule_to_invoke='ImproperActivation',
+       rule_parameters={"collection_names": "relu_activations"},
+       collections_to_save=[custom_coll]
+   )
+
+   import sagemaker as sm
+   sagemaker_estimator = sm.tensorflow.TensorFlow(
+       entry_point='src/mnist.py',
+       role=sm.get_execution_role(),
+       base_job_name='smdebug-demo-job',
+       train_instance_count=1,
+       train_instance_type="ml.m4.xlarge",
+       framework_version="1.15",
+       py_version="py3",
+       # smdebug-specific arguments below
+       rules=[improper_activation_rule],
+   )
+   sagemaker_estimator.fit()
+
+For more details, refer to our `Analysis page `__.
+
+Interactive Exploration
+=======================
+
+The ``smdebug`` SDK also allows you to perform interactive and real-time
+exploration of the data saved. You can choose to inspect the tensors
+saved, or visualize them through your custom plots. You can retrieve
+these tensors as numpy arrays, allowing you to use your favorite
+analysis libraries right in a SageMaker notebook instance. We have a
+couple of example notebooks demonstrating this:
+
+- `Real-time analysis in a notebook during
+  training `__
+- `Interactive tensor analysis in a
+  notebook `__
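+
+For a quick sketch of what this looks like in code (the S3 path and the
+tensor name below are placeholders), you can create a trial over the
+saved data and pull any tensor out as a numpy array:
+
+.. code:: python
+
+   from smdebug.trials import create_trial
+
+   # Point the trial at the out_dir / s3_output_path used by the hook.
+   trial = create_trial('s3://bucket/path/to/debug-output')
+
+   print(trial.tensor_names())           # list the tensors that were saved
+   steps = trial.tensor('loss').steps()  # steps at which 'loss' was saved
+   loss = trial.tensor('loss').value(steps[-1])  # numpy array at the last saved step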
+
+SageMaker Studio
+================
+
+SageMaker Debugger is on by default for supported training jobs on the
+official SageMaker Framework containers (or AWS Deep Learning
+Containers) during SageMaker training jobs. In this default scenario,
+SageMaker Debugger takes the losses and metrics from your training job
+and publishes them to SageMaker Metrics, allowing you to track these
+metrics in SageMaker Studio. You can also see the status of Rules you
+have enabled for your training job right in the Studio.
+`Here `__
+are screenshots of that experience.
+
+TensorBoard Visualization
+=========================
+
+If you have enabled TensorBoard outputs for your training job through
+SageMaker Debugger, TensorBoard artifacts will automatically be
+generated for the tensors saved. You can then point your TensorBoard
+instance to that S3 location and review the visualizations for the
+tensors saved.
+
+Example Notebooks
+=================
+
+We have a number of `example
+notebooks `__
+demonstrating different aspects of SageMaker Debugger.
diff --git a/docs/smdebug.analysis.rst b/docs/smdebug.analysis.rst
new file mode 100644
index 000000000..df2ce6f24
--- /dev/null
+++ b/docs/smdebug.analysis.rst
@@ -0,0 +1,46 @@
+smdebug.analysis
+================
+
+smdebug.analysis.utils module
+-----------------------------
+
+.. automodule:: smdebug.analysis.utils
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+Utils
+-----
+
+Enable or disable refresh of tensors in a trial
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default smdebug refreshes tensors each time you query a tensor: it
+checks whether the tensor has been saved for new steps and, if so,
+fetches them. If you know the saved data will not change (for example,
+the training job has stopped) or you are not interested in the latest
+data, you can stop the refreshing of tensors as follows.
+
+``no_refresh`` takes a trial or a list of trials, which should not be
+refreshed. Anything executed inside the ``no_refresh`` block will not
+be refreshed:
+
+.. code:: python
+
+   from smdebug.analysis.utils import no_refresh
+   with no_refresh(trials):
+       pass  # tensors queried here are not refreshed with new steps
+
+Similarly, if you want to refresh tensors only within a block, you can
+do:
+
+.. code:: python
+
+   from smdebug.analysis.utils import refresh
+   with refresh(trials):
+       pass  # tensors queried here are refreshed with newly available steps
+
+During rule invocation smdebug waits until the current step is available
+and then turns off refresh to ensure that you do not get different
+results for methods like ``trial.tensor(name).steps()`` and run into
+subtle issues.
diff --git a/docs/smdebug.exceptions.rst b/docs/smdebug.exceptions.rst
new file mode 100644
index 000000000..1b0737c88
--- /dev/null
+++ b/docs/smdebug.exceptions.rst
@@ -0,0 +1,19 @@
+Exceptions
+----------
+
+SMDebug is designed to be aware that tensors required to evaluate a rule
+may not be available at every step. Hence, it raises a few exceptions
+which allow you to control what happens when a tensor is missing. These
+are available in the ``smdebug.exceptions`` module. You can import them
+as follows:
+
+.. code:: python
+
+   from smdebug.exceptions import *
+
+The following are the exceptions (along with others) and their meanings:
+
+.. automodule:: smdebug.exceptions
+   :members:
+   :undoc-members:
+   :show-inheritance:
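+
+For instance, an analysis script can catch ``TensorUnavailableForStep``
+when it queries a step for which a tensor was not saved. This is a
+minimal sketch; the trial path, tensor name, and step number are
+placeholders:
+
+.. code:: python
+
+   from smdebug.trials import create_trial
+   from smdebug.exceptions import TensorUnavailableForStep
+
+   trial = create_trial('s3://bucket/path/to/debug-output')
+   try:
+       value = trial.tensor('loss').value(100)
+   except TensorUnavailableForStep:
+       # 'loss' was not saved (or is not yet available) for step 100;
+       # fall back, skip the step, or wait for more data.
+       value = None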
diff --git a/docs/smdebug.modes.rst b/docs/smdebug.modes.rst
new file mode 100644
index 000000000..9692ec1c3
--- /dev/null
+++ b/docs/smdebug.modes.rst
@@ -0,0 +1,26 @@
+Modes for Tensors
+-----------------
+
+Used to enumerate different training phases: ``TRAIN``, ``EVAL``, ``PREDICT``,
+and ``GLOBAL``. SMDebug APIs use the ``GLOBAL`` mode by default when no mode
+is explicitly set. You can use modes when you register an SMDebug hook in a
+training script and when you retrieve output tensors from specific
+training phases.
+
+.. autoclass:: smdebug.modes
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+There are four mode enums as shown below:
+
+.. code:: python
+
+   import smdebug
+
+   smdebug.modes.TRAIN
+   smdebug.modes.EVAL
+   smdebug.modes.PREDICT
+   smdebug.modes.GLOBAL
+
+The modes enum is also available under the framework hook class,
+``smdebug.{framework}.modes``.
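+
+As an illustration (the trial path and tensor name are placeholders),
+you can restrict a trial query to one phase of training by passing a
+mode:
+
+.. code:: python
+
+   from smdebug import modes
+   from smdebug.trials import create_trial
+
+   trial = create_trial('s3://bucket/path/to/debug-output')
+   # Only the steps saved during evaluation, instead of GLOBAL steps.
+   eval_steps = trial.tensor('loss').steps(mode=modes.EVAL)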
diff --git a/docs/smdebug.profiler.analysis.notebook_utils.rst b/docs/smdebug.profiler.analysis.notebook_utils.rst
new file mode 100644
index 000000000..a666f161e
--- /dev/null
+++ b/docs/smdebug.profiler.analysis.notebook_utils.rst
@@ -0,0 +1,32 @@
+SMDebug Profiler Visualization API
+==================================
+
+.. automodule:: smdebug.profiler.analysis.notebook_utils.heatmap
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.notebook_utils.metrics_histogram
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.notebook_utils.step_histogram
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.notebook_utils.step_timeline_chart
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.notebook_utils.timeline_charts
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.notebook_utils.training_job
+   :members:
+   :undoc-members:
+   :show-inheritance:
diff --git a/docs/smdebug.profiler.analysis.rst b/docs/smdebug.profiler.analysis.rst
new file mode 100644
index 000000000..f457410cb
--- /dev/null
+++ b/docs/smdebug.profiler.analysis.rst
@@ -0,0 +1,15 @@
+Python Profiler Analysis API
+============================
+
+.. automodule:: smdebug.profiler.analysis.python_profile_analysis
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+Python Stats Reader API
+=======================
+
+.. automodule:: smdebug.profiler.analysis.python_stats_reader
+   :members:
+   :undoc-members:
+   :show-inheritance:
diff --git a/docs/smdebug.profiler.analysis.utils.rst b/docs/smdebug.profiler.analysis.utils.rst
new file mode 100644
index 000000000..9a72a10b3
--- /dev/null
+++ b/docs/smdebug.profiler.analysis.utils.rst
@@ -0,0 +1,27 @@
+SMDebug Profiler Analysis Utils
+===============================
+
+.. automodule:: smdebug.profiler.analysis.utils.merge_timelines
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.utils.pandas_data_analysis
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.utils.profiler_data_to_pandas
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.utils.python_profile_analysis_utils
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+.. automodule:: smdebug.profiler.analysis.utils.pytorch_dataloader_analysis
+   :members:
+   :undoc-members:
+   :show-inheritance:
diff --git a/docs/smdebug.profiler.rst b/docs/smdebug.profiler.rst
new file mode 100644
index 000000000..b94f5c6d1
--- /dev/null
+++ b/docs/smdebug.profiler.rst
@@ -0,0 +1,18 @@
+smdebug.profiler
+================
+
+smdebug.profiler.tf\_profiler\_parser module
+--------------------------------------------
+
+.. automodule:: smdebug.profiler.tf_profiler_parser
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+smdebug.profiler.trace\_event\_file\_parser module
+--------------------------------------------------
+
+.. automodule:: smdebug.profiler.trace_event_file_parser
+   :members:
+   :undoc-members:
+   :show-inheritance:
diff --git a/docs/smdebug.rules.rst b/docs/smdebug.rules.rst
new file mode 100644
index 000000000..7b88a2176
--- /dev/null
+++ b/docs/smdebug.rules.rst
@@ -0,0 +1,164 @@
+SMDebug Rules
+=============
+
+Rules are the medium by which SageMaker Debugger executes a certain
+piece of code regularly on different steps of a training job. A rule is
+assigned to a trial and can be invoked at each new step of the trial. It
+can also access other trials for its evaluation. You can evaluate a rule
+using tensors from the current step or any step before the current step.
+Please ensure your logic respects these semantics; otherwise you will
+get a ``TensorUnavailableForStep`` exception, as the data would not yet
+be available for future steps.
+
+Use Built-in Rules Officially Provided by SageMaker
+---------------------------------------------------
+
+Amazon SageMaker Debugger rules analyze tensors emitted during the training of a model.
+Debugger offers the Rule API operation that monitors training job progress and errors
+that affect the success of training your model. For example, the rules can detect whether
+gradients are getting too large or too small, whether a model is overfitting or
+overtraining, and whether the loss of a training job is not decreasing.
+To see a full list of available built-in rules, see the
+`List of Debugger Built-in Rules `__.
+
+Write Custom Rules Within or Outside SageMaker
+----------------------------------------------
+
+Writing a rule involves implementing the `Rule
+APIs `__. Below, let's start with a
+simplified version of a custom VanishingGradient rule.
+
+Step 1: Construct a Rule Class
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Creating a rule involves first inheriting from the base ``Rule`` class
+provided by smdebug. For this example rule, we do not need to look
+at any other trials, so we set ``other_trials`` to None.
+
+.. code:: python
+
+   from smdebug.rules import Rule
+
+   class VanishingGradientRule(Rule):
+       def __init__(self, base_trial, threshold=0.0000001):
+           super().__init__(base_trial, other_trials=None)
+           self.threshold = float(threshold)
+
+Please note that apart from ``base_trial`` and ``other_trials`` (if
+required), we require all arguments of the rule constructor to take a
+string as their value. You can parse them to the type that you want from
+the string. This means that if you want to pass a list of strings, you
+might want to pass them as one comma-separated string (see the sketch
+below). This restriction is enforced so that you can create and invoke
+rules from JSON using SageMaker's APIs.
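+
+For example, a rule that needs a list of regex patterns could accept
+them as one comma-separated string and split it in the constructor.
+This is an illustrative sketch; the ``tensor_regexes`` argument is a
+hypothetical name, not part of the smdebug API:
+
+.. code:: python
+
+   from smdebug.rules import Rule
+
+   class MyCustomRule(Rule):
+       def __init__(self, base_trial, tensor_regexes="relu,tanh"):
+           super().__init__(base_trial, other_trials=None)
+           # All custom constructor arguments arrive as strings, so
+           # parse the comma-separated string back into a list here.
+           self.tensor_regexes = [r.strip() for r in tensor_regexes.split(",")]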
+
+Step 2: Create a Function to Invoke at a Step
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this function you can implement the core logic of what you want to do
+with these tensors. It should return a boolean value ``True`` or
+``False``, where ``True`` means the rule evaluation condition has been
+met. When you invoke these rules through SageMaker, the rule evaluation
+ends when the rule evaluation condition is met. SageMaker creates a
+CloudWatch event for every rule evaluation job, which can be used to
+define actions that you might want to take based on the state of the
+rule.
+
+A simplified version of the actual invoke function for
+``VanishingGradientRule`` is below:
+
+.. code:: python
+
+   def invoke_at_step(self, step):
+       # Flag a vanishing gradient as soon as any gradient tensor's
+       # mean absolute value falls below the threshold.
+       for tensorname in self.base_trial.tensors(collection='gradients'):
+           tensor = self.base_trial.tensor(tensorname)
+           abs_mean = tensor.reduction_value(step, 'mean', abs=True)
+           if abs_mean < self.threshold:
+               return True
+       return False
+
+That’s it, writing a rule is as simple as that.
+
+Step 3: Invoke the Rule
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Option 1: Invoking a rule through SageMaker
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After you’ve written your rule, you can ask SageMaker to evaluate the
+rule against your training job by using the SageMaker Python SDK:
+
+.. code:: python
+
+   estimator = Estimator(
+       ...
+       rules = [Rule.custom(
+           name='VGRule',
+           image_uri='864354269164.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
+           instance_type='ml.t3.medium', # instance type to run the rule evaluation on
+           source='rules/vanishing_gradient_rule.py', # path to the rule source file
+           rule_to_invoke='VanishingGradientRule', # name of the class to invoke in the rule source file
+           volume_size_in_gb=30, # EBS volume size required to be attached to the rule evaluation instance
+           collections_to_save=[CollectionConfig("gradients")], # collections to be analyzed by the rule
+           rule_parameters={
+               "threshold": "20.0" # this will be used to initialize 'threshold' param in your rule constructor
+           }
+       )]
+   )
+
+If you’re using the SageMaker API directly to evaluate the rule, then
+you can specify the rule configuration
+`DebugRuleConfigurations `__
+in the CreateTrainingJob API request as:
+
+.. code:: json
+
+   "DebugRuleConfigurations": [
+       {
+           "RuleConfigurationName": "VGRule",
+           "InstanceType": "ml.t3.medium",
+           "VolumeSizeInGB": 30,
+           "RuleEvaluatorImage": "864354269164.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
+           "RuleParameters": {
+              "source_s3_uri": "s3://path/to/vanishing_gradient_rule.py",
+              "rule_to_invoke": "VanishingGradientRule",
+              "threshold": "20.0"
+           }
+       }
+   ]
+
+Option 2: Invoking a rule outside SageMaker through ``invoke_rule``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You might want to invoke the rule locally during development. We provide
+a function to invoke rules easily. Refer to
+`smdebug/rules/rule_invoker.py <../smdebug/rules/rule_invoker.py>`__.
+The invoke function has the following syntax. It takes an instance of a
+Rule and invokes it for a series of steps one after the other.
+
+.. code:: python
+
+   from smdebug.rules import invoke_rule
+   from smdebug.trials import create_trial
+
+   trial = create_trial('s3://smdebug-dev-test/mnist-job/')
+   rule_obj = VanishingGradientRule(trial, threshold=0.0001)
+   invoke_rule(rule_obj, start_step=0, end_step=None)
+
+
+Rule API
+========
+
+.. currentmodule:: smdebug.rules
+
+.. autoclass:: Rule
+   :members:
+   :show-inheritance:
+   :inherited-members:
+
+.. autofunction:: invoke_rule
diff --git a/docs/smdebug.trials.rst b/docs/smdebug.trials.rst
new file mode 100644
index 000000000..eb4f8be68
--- /dev/null
+++ b/docs/smdebug.trials.rst
@@ -0,0 +1,21 @@
+SMDebug Trial
+=============
+
+An SMDebug trial is an object which lets you query for tensors from a given training
+job, specified by the path where SMDebug's artifacts are saved. A trial is
+capable of loading new tensors as soon as they become available from the
+given path, allowing you to do both offline as well as real-time
+analysis.
+
+Create an SMDebug trial object
+------------------------------
+
+Depending on the output path, there are two types of trials you can create: LocalTrial or S3Trial.
+The SMDebug library provides the following wrapper method that automatically
+creates the right trial.
+
+.. autofunction:: smdebug.trials.create_trial
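+
+For instance (both paths below are placeholders), the same call returns
+the right trial type for a local directory or an S3 prefix:
+
+.. code:: python
+
+   from smdebug.trials import create_trial
+
+   local_trial = create_trial('/opt/ml/output/tensors')         # LocalTrial
+   s3_trial = create_trial('s3://bucket/path/to/debug-output')  # S3Trial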
diff --git a/docs/tensor-api.rst b/docs/tensor-api.rst
new file mode 100644
index 000000000..81f6b7199
--- /dev/null
+++ b/docs/tensor-api.rst
@@ -0,0 +1,286 @@
+Tensor API
+----------
+
+An smdebug ``Tensor`` object can be retrieved through the
+``trial.tensor(tname)`` API. It is uniquely identified by the string
+representing its name. It provides the following methods.
+
+.. code:: python
+
+   from smdebug.trials import create_trial
+   trial = create_trial(
+       path='s3://smdebug-testing-bucket/outputs/resnet',
+       name='resnet_training_run'
+   )
+   output_tensor = trial.tensor(tname)  # tname is the name of a saved tensor
+
+.. note::
+   To use the following methods, you must create a tensor object as shown
+   in the code example above. The object name ``output_tensor`` is used as
+   an example throughout this page; you can replace ``output_tensor`` with
+   any name you want. For example, if you are logging ``nll_loss_output_0``
+   outputs from a PyTorch training job, you can define the tensor object
+   as follows:
+
+   .. code:: python
+
+      output_loss = trial.tensor("nll_loss_output_0")
+
+   You can then replace ``output_tensor`` with ``output_loss`` to use the
+   following methods.
+
++---------------------------------------------+---------------------------------------+
+| Method                                      | Description                           |
++=============================================+=======================================+
+| `steps() <#output_tensor.steps>`__          | Query steps for which tensor was      |
+|                                             | saved                                 |
++---------------------------------------------+---------------------------------------+
+| `value(step) <#output_tensor.value>`__      | Get the value of the tensor at a      |
+|                                             | given step as a numpy array           |
++---------------------------------------------+---------------------------------------+
+| `reduction_value(step)                      | Get the reduction value of the chosen |
+| <#output_tensor.reduction_value>`__         | tensor at a particular step           |
++---------------------------------------------+---------------------------------------+
+| `reduction_values                           | Get all reduction values saved for    |
+| (step) <#output_tensor.reduction_values>`__ | the chosen tensor at a particular     |
+|                                             | step                                  |
++---------------------------------------------+---------------------------------------+
+| `values(mode) <#output_tensor.values>`__    | Get the values of the tensor for all  |
+|                                             | steps of a given mode                 |
++---------------------------------------------+---------------------------------------+
+| `workers(step)                              | Get all the workers for which this    |
+| <#output_tensor.workers>`__                 | tensor was saved at a given step      |
++---------------------------------------------+---------------------------------------+
+| `prev_steps(step,                           | Get the last n step numbers of a      |
+| n) <#output_tensor.prev_steps>`__           | given mode from a given step          |
++---------------------------------------------+---------------------------------------+
+
+
+
+.. method:: output_tensor.steps(mode=modes.GLOBAL, show_incomplete_steps=False)
+
+   Query for the steps at which the given tensor was saved.
+
+   **Parameters:**
+
+   - ``mode (smdebug.modes enum value)`` The mode whose steps to return
+     for the given tensor. Defaults to ``modes.GLOBAL``
+   - ``show_incomplete_steps (bool)`` This parameter is relevant only for
+     distributed training. By default this method only returns the steps
+     which have been received from all workers. But if this parameter is
+     set to True, this method will return steps received from at least one
+     worker.
+
+   **Returns:**
+
+   ``list[int]`` A list of steps at which the given tensor was saved
+
+.. method:: output_tensor.value(step_num, mode=modes.GLOBAL, worker=None)
+
+   Get the value of the tensor at a given step as a numpy array.
+
+   **Parameters:**
+
+   - ``step_num (int)`` The step number whose value is to be returned for
+     the mode passed through the next parameter.
+   - ``mode (smdebug.modes enum value)`` The mode applicable for the step
+     number passed above. Defaults to ``modes.GLOBAL``
+   - ``worker (str)`` This parameter is only applicable for distributed
+     training. You can retrieve the value of the tensor from a specific
+     worker by passing the worker name. You can query all the workers seen
+     by the trial with the ``trial.workers()`` method. You might also be
+     interested in querying the workers which saved a value for the tensor
+     at a specific step; this is possible with the method
+     ``trial.tensor(tname).workers(step, mode)``
+
+   **Returns:**
+
+   ``numpy.ndarray`` The value of the tensor at the given step and worker
+   (if the training job saved data from multiple workers)
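+
+As a quick illustration of these two methods together (assuming the
+tensor was saved for at least one step):
+
+.. code:: python
+
+   saved_steps = output_tensor.steps()                # e.g. [0, 100, 200]
+   first_value = output_tensor.value(saved_steps[0])  # numpy array at that step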
+
+.. method:: output_tensor.reduction_value(step_num, reduction_name, mode=modes.GLOBAL, worker=None, abs=False)
+
+   Get the reduction value of the chosen tensor at a particular step. A
+   reduction value is a tensor reduced to a single value through reduction
+   or aggregation operations. The different reductions you can query for
+   are the same as what are allowed in
+   `ReductionConfig `__ when saving tensors. This
+   API thus allows you to access the reduction you might have saved instead
+   of the full tensor. If you had saved the full tensor, it will calculate
+   the requested reduction at the time of this call.
+
+   Reduction names allowed are ``min``, ``max``, ``mean``, ``prod``,
+   ``std``, ``sum``, ``variance`` and ``l1``, ``l2`` representing the
+   norms.
+
+   Each of these can be retrieved for the absolute value of the tensor or
+   the original tensor. For example, you can get the mean of the absolute
+   value of the tensor by passing ``reduction_name='mean'`` with
+   ``abs=True``; set ``abs`` to ``False`` if you want the ``mean`` of the
+   actual tensor.
+
+   If you had saved the tensor without any reduction, then you can retrieve
+   the actual tensor as a numpy array and compute any reduction you might
+   be interested in. In such a case you do not need this method.
+
+   **Parameters:**
+
+   - ``step_num (int)`` The step number whose value is to be returned for
+     the mode passed through the next parameter.
+   - ``reduction_name (str)`` The name of the reduction to query for. This
+     can be one of ``min``, ``max``, ``mean``, ``std``, ``variance``,
+     ``sum``, ``prod`` and the norms ``l1``, ``l2``.
+   - ``mode (smdebug.modes enum value)`` The mode applicable for the step
+     number passed above. Defaults to ``modes.GLOBAL``
+   - ``worker (str)`` This parameter is only applicable for distributed
+     training. You can retrieve the value of the tensor from a specific
+     worker by passing the worker name. You can query all the workers seen
+     by the trial with the ``trial.workers()`` method. You might also be
+     interested in querying the workers which saved a value for the tensor
+     at a specific step; this is possible with the method
+     ``trial.tensor(tname).workers(step, mode)``
+   - ``abs (bool)`` If abs is True, this method tries to return the
+     reduction passed through ``reduction_name`` after taking the absolute
+     value of the tensor. It defaults to ``False``.
+
+   **Returns:**
+
+   ``numpy.ndarray`` The reduction value of the tensor at the given step and
+   worker (if the training job saved data from multiple workers) as a 1x1
+   numpy array. If this reduction was saved for the tensor during training
+   as part of the specification through reduction config, it will be loaded
+   and returned. If the given reduction was not saved but the full tensor
+   was saved, the reduction will be computed on the fly and returned. If
+   neither the chosen reduction nor the full tensor is available, this
+   method raises a ``TensorUnavailableForStep`` exception.
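+
+For example (assuming the tensor was saved at step 100), the mean of
+the absolute values of the tensor at that step:
+
+.. code:: python
+
+   abs_mean = output_tensor.reduction_value(100, 'mean', abs=True)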
+
+.. method:: output_tensor.shape(step_num, mode=modes.GLOBAL, worker=None)
+
+   Get the shape of the chosen tensor at a particular step.
+
+   **Parameters:**
+
+   - ``step_num (int)`` The step number whose value is to be returned for
+     the mode passed through the next parameter.
+   - ``mode (smdebug.modes enum value)`` The mode applicable for the step
+     number passed above. Defaults to ``modes.GLOBAL``
+   - ``worker (str)`` This parameter is only applicable for distributed
+     training. You can retrieve the value of the tensor from a specific
+     worker by passing the worker name. You can query all the workers seen
+     by the trial with the ``trial.workers()`` method. You might also be
+     interested in querying the workers which saved a value for the tensor
+     at a specific step; this is possible with the method
+     ``trial.tensor(tname).workers(step, mode)``
+
+   **Returns:**
+
+   ``tuple(int)`` If only the shape of this tensor was saved, through the
+   ``save_shape`` configuration in ReductionConfig, the saved shape will be
+   returned. If the full tensor was saved, then the shape will be computed
+   and returned. If neither the shape nor the full tensor is available,
+   this method raises a ``TensorUnavailableForStep`` exception.
+
+.. method:: output_tensor.values(mode=modes.GLOBAL, worker=None)
+
+   Get the values of the tensor for all steps of a given mode.
+
+   **Parameters:**
+
+   - ``mode (smdebug.modes enum value)`` The mode whose steps to query.
+     Defaults to ``modes.GLOBAL``
+   - ``worker (str)`` This parameter is only applicable for distributed
+     training. You can retrieve the value of the tensor from a specific
+     worker by passing the worker name. You can query all the workers seen
+     by the trial with the ``trial.workers()`` method. You might also be
+     interested in querying the workers which saved a value for the tensor
+     at a specific step; this is possible with the method
+     ``trial.tensor(tname).workers(step, mode)``
+
+   **Returns:**
+
+   ``dict[int -> numpy.ndarray]`` A dictionary with step numbers as keys
+   and numpy arrays representing the value of the tensor as values.
+
+.. method:: output_tensor.reduction_values(step_num, mode=modes.GLOBAL, worker=None)
+
+   Get all reduction values saved for the chosen tensor at a particular
+   step. A reduction value is a tensor reduced to a single value through
+   reduction or aggregation operations. Please go through the description
+   of the method ``reduction_value`` for more details.
+
+   **Parameters:**
+
+   - ``step_num (int)`` The step number whose value is to be returned for
+     the mode passed through the next parameter.
+   - ``mode (smdebug.modes enum value)`` The mode applicable for the step
+     number passed above. Defaults to ``modes.GLOBAL``
+   - ``worker (str)`` This parameter is only applicable for distributed
+     training. You can retrieve the value of the tensor from a specific
+     worker by passing the worker name. You can query all the workers seen
+     by the trial with the ``trial.workers()`` method. You might also be
+     interested in querying the workers which saved a value for the tensor
+     at a specific step; this is possible with the method
+     ``trial.tensor(tname).workers(step, mode)``
+
+   **Returns:**
+
+   ``dict[(str, bool) -> numpy.ndarray]`` A dictionary with keys being
+   tuples of the form ``(reduction_name, abs)`` to a 1x1 numpy ndarray
+   value. ``abs`` here is a boolean that denotes whether the reduction was
+   performed on the absolute value of the tensor or not. Note that this
+   method only returns the reductions which were saved from the training
+   job. It does not compute all known reductions and return them if only
+   the raw tensor was saved.
+
+.. method:: output_tensor.shapes(mode=modes.GLOBAL, worker=None)
+
+   Get the shapes of the tensor for all steps of a given mode.
+
+   **Parameters:**
+
+   - ``mode (smdebug.modes enum value)`` The mode whose steps to query.
+     Defaults to ``modes.GLOBAL``
+   - ``worker (str)`` This parameter is only applicable for distributed
+     training. You can retrieve the value of the tensor from a specific
+     worker by passing the worker name. 
You can query all the workers seen + by the trial with the ``trial.workers()`` method. You might also be + interested in querying the workers which saved a value for the tensor + at a specific step, this is possible with the method: + ``trial.tensor(tname).workers(step, mode)`` + + **Returns:** + + ``dict[int -> tuple(int)]`` A dictionary with step numbers as keys and + tuples of ints representing the shapes of the tensor as values. + +.. method:: output_tensor.workers(step_num, mode=modes.GLOBAL) + + Get all the workers for which this tensor was saved at a given step + + **Parameters:** + + - ``step_num (int)`` The step number whose value is to be returned for + the mode passed through the next parameter. + - ``mode (smdebug.modes enum value)`` The mode applicable for the step + number passed above. Defaults to ``modes.GLOBAL`` + + **Returns:** + + ``list[str]`` A list of worker names for which the tensor was saved at + the given step. + +.. method:: output_tensor.prev_steps(step, n, mode=modes.GLOBAL) + + Get the last n step numbers of a given mode from a given step. + + **Parameters:** + + - ``step (int)`` The step number whose value is to be returned for the + mode passed. + - ``n (int)`` Number of previous steps to return + - ``mode (smdebug.modes enum value)`` The mode applicable for the step + number passed above. Defaults to ``modes.GLOBAL`` + + **Returns:** + + ``list[int]`` A list of size at most n representing the previous steps + for the given step and mode. Note that this list can be of size less + than n if there were only less than n steps saved before the given step + in this trial. diff --git a/docs/tensorflow.md b/docs/tensorflow.md deleted file mode 100644 index 590630b77..000000000 --- a/docs/tensorflow.md +++ /dev/null @@ -1,323 +0,0 @@ -# Tensorflow - -## Contents -- [What SageMaker Debugger Supports](#support) -- [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc) - - [Debugger Built-in Tensor Collections for TensorFlow](#tf-built-in-collection) -- [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change) -- [Code Examples](#examples) -- [References](#references) - ---- - -## Amazon SageMaker Debugger Support for TensorFlow - -Amazon SageMaker Debugger python SDK and its client library `smdebug` now fully support TensorFlow 2.3 with the latest version release. - -Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. - -Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). 
For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). - -### New Features supported by Debugger -- The latest TensorFlow version fully covered by Debugger is 2.3.0 -- Debug training jobs with the TensorFlow framework or Keras TensorFlow -- Debug training jobs with the TensorFlow eager or non-eager mode -- New built-in tensor collections: `inputs`, `outputs`, `layers`, `gradients` -- New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar` - ---- - -## Using Debugger on AWS Deep Learning Containers with TensorFlow - -The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. - -The following code example provides the base structure for a SageMaker TensorFlow estimator with Debugger. - -```python -from sagemaker.tensorflow import TensorFlow -from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs - -tf_estimator = TensorFlow( - entry_point = "tf-train.py", - role = "SageMakerRole", - train_instance_count = 1, - train_instance_type = "ml.p2.xlarge", - framework_version = "2.2.0", - py_version = "py37" - - # Debugger-specific Parameters - rules = [ - Rule.sagemaker(rule_configs.vanishing_gradient()), - Rule.sagemaker(rule_configs.loss_not_decreasing()), - ... - ], - debugger_hook_config = DebuggerHookConfig( - CollectionConfig(name="inputs"), - CollectionConfig(name="outputs"), - CollectionConfig(name="layers"), - CollectionConfig(name="gradients") - ... - ) -) -tf_estimator.fit("s3://bucket/path/to/training/data") -``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the SageMaker python SDK v2 and `smdebug` v0.9.2. It is highly recommended to upgrade the packages by executing the following command line. -```bash -pip install -U sagemaker -pip install -U smdebug -``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. - -#### Available Tensor Collections for TensorFlow - -**Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library. We highly recommend that you upgrade the packages by running the following commands at the command line: -``` -pip install -U sagemaker -pip install -U smdebug -``` -If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). - -### Debugger Built-in Tensor Collections for TensorFlow - -The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. 
SageMaker Debugger will save these tensors to the default out_dir of the hook. - -| Name | Description| -| --- | --- | -| `all` | Matches all tensors. | -| `default` | Includes `metrics`, `losses`, and `sm_metrics`. | -| `metrics` | For KerasHook, saves the metrics computed by Keras for the model. | -| `losses` | Saves all losses of the model. | -| `sm_metrics` | Saves scalars that you want to include in the SageMaker metrics collection. | -| `inputs` | Matches all model inputs to the model. | -| `outputs` | Matches all model outputs of the model, such as predictions (logits) and labels. | -| `layers` | Matches all inputs and outputs of intermediate layers. | -| `gradients` | Matches all gradients of the model. | -| `weights` | Matches all weights of the model. | -| `biases` | Matches all biases of the model. | -| `optimizer_variables` | Matches all optimizer variables, currently only supported for Keras. | - -For more information about adjusting the tensor collection parameters, see [Save Tensors Using Debugger Modified Built-in Collections](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-data.html#debugger-save-modified-built-in-collections). - -For a full list of available tensor collection parameters, see [Configuring Collection using SageMaker Python SDK](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk). - ->**Note**: The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently available for TensorFlow versions <2.0 and ==2.2.0. - ---- - -## Using Debugger on SageMaker Training Containers and Custom Containers - -If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options: - -- **Option 1** - Use the SageMaker TensorFlow training containers with training script modification -- **Option 2** - Use your custom container with modified training script and push the container to Amazon ECR. - -For both options, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and then save the tensors. - -1. [Create a hook](#create-a-hook) - - [KerasHook](#kerashook) - - [SessionHook](#sessionhook) - - [EstimatorHook](#estimatorhook) -2. [Wrap the optimizer and the gradient tape with the hook to retrieve gradient tensors](#wrap-opt-with-hook) -3. [Register the hook to model.fit()](#register-a-hook) - - -### Step 1: Create a hook - -To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown in the following subsections. - -Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `smd.KerasHook`, `smd.SessionHook`, and `smd.EstimatorHook`. - -#### KerasHook - -If you use the Keras model zoo and a Keras `model.fit()` API, use `KerasHook`. `KerasHook` is available for the Keras model with the TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced in the TensorFlow framework version 2.0. 
You can set the smdebug Keras hook constructor by adding the following code to your training script. Place this code line before `model.compile()`: - -```python -import smdebug.tensorflow as smd -hook = smd.KerasHook.create_from_json_file() -``` - -To learn how to fully implement the hook in your training script, see the [Keras with the TensorFlow gradient tape and the smdebug hook example scripts](https://github.com/awslabs/sagemaker-debugger/tree/master/examples/tensorflow2/scripts). - ->**Note**: If you use the AWS Deep Learning Containers for zero script change, Debugger collects most of the tensors through its high-level API, regardless of the eager execution modes. - -#### SessionHook - -If your model is created in TensorFlow version 1.x with the low-level approach (not using the Keras API), use `SessionHook`. `SessionHook` is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`, as shown following: - -```python -import smdebug.tensorflow as smd -hook = smd.SessionHook.create_from_json_file() -``` - -To learn how to fully implement the hook into your training script, see the [TensorFlow monitored training session with the smdebug hook example script](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). - ->**Note**: The official TensorFlow library deprecated the `tf.train.MonitoredSessions()` API in favor of `tf.function()` in TensorFlow 2.0 and later. You can use `SessionHook` for `tf.function()` in TensorFlow 2.0 and later. - -#### EstimatorHook - -If you have a model using the `tf.estimator()` API, use `EstimatorHook`. `EstimatorHook` is available for any TensorFlow framework versions that support the `tf.estimator()` API, as shown following: - -```python -import smdebug.tensorflow as smd -hook = smd.EstimatorHook.create_from_json_file() -``` - -To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the Tensorflow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py). - -### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors - -The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework. - -If you want to save `gradients` (for example, from the Keras Adam optimizer) wrap it with the hook as shown following: - -```python -optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr) -optimizer = hook.wrap_optimizer(optimizer) -``` - -If you want to save gradients and outputs tensors from the TensorFlow `GradientTape` feature, wrap `tf.GradientTape` with the smdebug `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example: -```python -with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape: - logits = model(data, training=True) - loss_value = cce(labels, logits) -hook.save_tensor("y_labels", labels, "outputs") -hook.save_tensor("predictions", logits, "outputs") -grads = tape.gradient(loss_value, model.variables) -hook.save_tensor("grads", grads, "gradients") -``` - -These smdebug hook wrapper functions capture the gradient tensors, not affecting your optimization logic at all. - -For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section. 
- -### Step 3: Register the hook to model.fit() - -To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example: - -```python -model.fit(X_train, Y_train, - batch_size=batch_size, - epochs=epoch, - validation_data=(X_valid, Y_valid), - shuffle=True, - # smdebug modification: Pass the hook as a Keras callback - callbacks=[hook]) -``` - -### Step 4: Perform actions using the hook APIs - -For a full list of actions that the hook APIs offer to construct hooks and save tensors, see [Common hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#common-hook-api) and [TensorFlow specific hook API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#tensorflow-specific-hook-api). - ---- - -## Code Examples - -The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html). - -### Keras API (tf.keras) - -The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](api.md#modes). - -```python -import smdebug.tensorflow as smd - -hook = smd.KerasHook.create_from_json_file() - -model = tf.keras.models.Sequential([ ... ]) -model.compile( - optimizer='adam', - loss='sparse_categorical_crossentropy', -) -# Add the hook as a callback -# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL -hook.set_mode(mode=smd.modes.TRAIN) -model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook]) - -hook.set_mode(mode=smd.modes.EVAL) -model.evaluate(x_test, y_test, callbacks=[hook]) -``` - -### Keras GradientTape example for TensorFlow 2.0 and later - -The following code example shows how to register the smdebug `KerasHook` by wrapping the TensorFlow `GradientTape()` with the smdebug `hook.wrap_tape()` API. - -```python -import smdebug.tensorflow as smd - -hook = smd.KerasHook.create_from_json_file() - -model = tf.keras.models.Sequential([ ... ]) - for epoch in range(n_epochs): - for data, labels in dataset: - dataset_labels = labels - # wrap the tape to capture tensors - with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape: - logits = model(data, training=True) # (32,10) - loss_value = cce(labels, logits) - grads = tape.gradient(loss_value, model.variables) - opt.apply_gradients(zip(grads, model.variables)) - acc = train_acc_metric(dataset_labels, logits) - # manually save metric values - hook.save_tensor(tensor_name="accuracy", tensor_value=acc, collections_to_write="default") -``` - -### Monitored Session (tf.train.MonitoredSession) - -The following code example shows how to register the smdebug `SessionHook`. 
- -```python -import smdebug.tensorflow as smd - -hook = smd.SessionHook.create_from_json_file() - -loss = tf.reduce_mean(tf.matmul(...), name="loss") -optimizer = tf.train.AdamOptimizer(args.lr) - -# Wrap the optimizer -optimizer = hook.wrap_optimizer(optimizer) - -# Add the hook as a callback -sess = tf.train.MonitoredSession(hooks=[hook]) - -sess.run([loss, ...]) -``` - -### Estimator (tf.estimator.Estimator) - -The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](api.md#modes). - -```python -import smdebug.tensorflow as smd - -hook = smd.EstimatorHook.create_from_json_file() - -train_input_fn, eval_input_fn = ... -estimator = tf.estimator.Estimator(...) - -# Set hook.set_mode to set tensors to be stored in different phases of training job, such as TRAIN and EVAL -hook.set_mode(mode=smd.modes.TRAIN) -estimator.train(input_fn=train_input_fn, steps=args.steps, hooks=[hook]) - -hook.set_mode(mode=smd.modes.EVAL) -estimator.evaluate(input_fn=eval_input_fn, steps=args.steps, hooks=[hook]) -``` - ---- - -## References - -### The smdebug API for saving tensors -See the [API for saving tensors](api.md) page for details about the Hooks, Collection, SaveConfig, and ReductionConfig. -See the [Analysis](analysis.md) page for details about analyzing a training job. - -### TensorFlow References -- TF 1.x: - - [tf.estimator](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/estimator) - - [tf.keras](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/keras) - - [tf.train.MonitoredSession](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/train/MonitoredSession?hl=en) -- TF 2.1: - - [tf.estimator](https://www.tensorflow.org/versions/r2.1/api_docs/python/tf/estimator) - - [tf.keras](https://www.tensorflow.org/versions/r2.1/api_docs/python/tf/keras) -- TF 2.2: - - [tf.estimator](https://www.tensorflow.org/api_docs/python/tf/estimator) - - [tf.keras](https://www.tensorflow.org/versions/r2.2/api_docs/python/tf) diff --git a/docs/tensorflow.rst b/docs/tensorflow.rst new file mode 100644 index 000000000..43cb8557c --- /dev/null +++ b/docs/tensorflow.rst @@ -0,0 +1,498 @@ +Tensorflow +========== + +Contents +-------- + +- `SMDebug Support for TensorFlow <#smdebug-support-for-tensorflow>`__ + + - `Features supported by SMDebug <#features-supported-by-smdebug>`__ + +- `Using Debugger on AWS Deep Learning Containers with + TensorFlow <#Using-Debugger-on-AWS-Deep-Learning-Containers-with-TensorFlow>`__ + + - `Debugger Built-in Tensor Collections for + TensorFlow <#tf-built-in-collection>`__ + +- `Debugger on SageMaker Training Containers and Custom + Containers <#debugger-script-change>`__ +- `Code Examples <#examples>`__ +- `References <#references>`__ + +SMDebug for TensorFlow +------------------------------ + +Amazon SageMaker Debugger and the ``smdebug`` client library +fully support TensorFlow framework. + +Using Debugger, you can access tensors of any kind for TensorFlow +models, from the Keras model zoo to your own custom model, and save them +using Debugger built-in or custom tensor collections. You can run your +training script on `the official AWS Deep Learning +Containers `__ +where Debugger can automatically capture tensors from your training job. 
+
+Whether your TensorFlow models use the Keras API or the pure
+TensorFlow API (in eager or non-eager mode), you can run them
+directly on the AWS Deep Learning Containers.
+
+Debugger and its client library ``smdebug`` support debugging your
+training job on other AWS training containers and custom containers. In
+this case, a hook registration process is required to manually add the
+hook features to your training script. For a full list of AWS TensorFlow
+containers to use Debugger, see `SageMaker containers to use Debugger
+with script
+mode `__.
+For a complete guide for using custom containers, see `Use Debugger in
+Custom Training
+Containers `__.
+
+Features supported by SMDebug
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Debug training jobs with the TensorFlow framework or Keras TensorFlow
+- Debug training jobs with the TensorFlow eager or non-eager mode
+- Extended built-in tensor collections: ``inputs``, ``outputs``,
+  ``layers``, and ``gradients``
+- Hook APIs to save model parameters: ``save_tensors``, ``save_scalar``
+
+--------------
+
+Using Debugger on AWS Deep Learning Containers with TensorFlow
+--------------------------------------------------------------
+
+The Debugger built-in rules and hook features are fully integrated with
+the AWS Deep Learning Containers. You can run your training script
+without any script changes. When running training jobs on those Deep
+Learning Containers, Debugger registers its hooks automatically to your
+training script in order to retrieve tensors. To find a comprehensive
+guide of using the high-level SageMaker TensorFlow estimator with
+Debugger, see `Amazon SageMaker Debugger with
+TensorFlow `__
+in the Amazon SageMaker Developer Guide.
+
+The following code example provides the base structure for a SageMaker
+TensorFlow estimator with Debugger.
+
+.. code:: python
+
+   from sagemaker.tensorflow import TensorFlow
+   from sagemaker.debugger import Rule, DebuggerHookConfig, CollectionConfig, rule_configs
+
+   tf_estimator = TensorFlow(
+       entry_point = "tf-train.py",
+       role = "SageMakerRole",
+       instance_count = 1,
+       instance_type = "ml.p2.xlarge",
+       framework_version = "2.2.0",
+       py_version = "py37",
+
+       # Debugger-specific Parameters
+       rules = [
+           Rule.sagemaker(rule_configs.vanishing_gradient()),
+           Rule.sagemaker(rule_configs.loss_not_decreasing()),
+           ...
+       ],
+       debugger_hook_config = DebuggerHookConfig(
+           collection_configs=[
+               CollectionConfig(name="inputs"),
+               CollectionConfig(name="outputs"),
+               CollectionConfig(name="layers"),
+               CollectionConfig(name="gradients"),
+               ...
+           ]
+       )
+   )
+   tf_estimator.fit("s3://bucket/path/to/training/data")
+
+.. note::
+
+   The SageMaker TensorFlow estimator and the Debugger
+   collections in the example are based on the SageMaker python SDK v2
+   and ``smdebug`` v0.9.2. It is highly recommended that you upgrade the
+   packages by executing the following command lines.
+
+   .. code:: bash
+
+      pip install -U sagemaker
+      pip install -U smdebug
+
+   If you are using a Jupyter notebook, put an exclamation mark (!) at the
+   front of the code lines and restart your kernel. For more
+   information about the SageMaker Python SDK, see `Use Version 2.x of the
+   SageMaker Python
+   SDK `__.
+
+
+Debugger Built-in Tensor Collections for TensorFlow
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following table lists the pre-configured tensor collections for
+TensorFlow models. You can pick any tensor collections by specifying the
+``name`` parameter of ``CollectionConfig()`` as shown in the previous
+base code example. 
SageMaker Debugger will save these tensors to the +default out_dir of the hook. + ++-----------------------------------+-----------------------------------+ +| Name | Description | ++===================================+===================================+ +| ``all`` | Matches all tensors. | ++-----------------------------------+-----------------------------------+ +| ``default`` | Includes ``metrics``, ``losses``, | +| | and ``sm_metrics``. | ++-----------------------------------+-----------------------------------+ +| ``metrics`` | For KerasHook, saves the metrics | +| | computed by Keras for the model. | ++-----------------------------------+-----------------------------------+ +| ``losses`` | Saves all losses of the model. | ++-----------------------------------+-----------------------------------+ +| ``sm_metrics`` | Saves scalars that you want to | +| | include in the SageMaker metrics | +| | collection. | ++-----------------------------------+-----------------------------------+ +| ``inputs`` | Matches all model inputs to the | +| | model. | ++-----------------------------------+-----------------------------------+ +| ``outputs`` | Matches all model outputs of the | +| | model, such as predictions | +| | (logits) and labels. | ++-----------------------------------+-----------------------------------+ +| ``layers`` | Matches all inputs and outputs of | +| | intermediate layers. | ++-----------------------------------+-----------------------------------+ +| ``gradients`` | Matches all gradients of the | +| | model. | ++-----------------------------------+-----------------------------------+ +| ``weights`` | Matches all weights of the model. | ++-----------------------------------+-----------------------------------+ +| ``biases`` | Matches all biases of the model. | ++-----------------------------------+-----------------------------------+ +| ``optimizer_variables`` | Matches all optimizer variables, | +| | currently only supported for | +| | Keras. | ++-----------------------------------+-----------------------------------+ + +For more information about adjusting the tensor collection parameters, +see `Save Tensors Using Debugger Modified Built-in +Collections `__. + +For a full list of available tensor collection parameters, see +`Configuring Collection using SageMaker Python +SDK `__. + +.. note:: + The ``inputs``, ``outputs``, ``gradients``, and ``layers`` + built-in collections are currently available for TensorFlow versions + <2.0 and ==2.2.0. + +-------------- + +Using Debugger on SageMaker Training Containers and Custom Containers +--------------------------------------------------------------------- + +If you want to run your own training script or custom containers other +than the AWS Deep Learning Containers in the previous option, you can +use any of the following options: + +- **Option 1** - Use the SageMaker TensorFlow training containers with + training script modification +- **Option 2** - Use your custom container with modified training + script and push the container to Amazon ECR. + +For both options, you need to manually register the Debugger hook to +your training script. Depending on the TensorFlow and Keras API +operations used to construct your model, you need to pick the right +TensorFlow hook class, register the hook, and then save the tensors. + +1. `Create a hook <#create-a-hook>`__ + + - `KerasHook <#kerashook>`__ + - `SessionHook <#sessionhook>`__ + - `EstimatorHook <#estimatorhook>`__ + +2. 
Step 1: Create a hook
~~~~~~~~~~~~~~~~~~~~~

To create a hook, add the following code to your training
script. This enables the ``smdebug`` tools for TensorFlow and creates a
TensorFlow ``hook`` object. When you run the ``fit()`` API for training,
pass the smdebug ``hook`` to ``callbacks``, as shown in the following
subsections.

Depending on the TensorFlow version and the Keras API that you use in
your training script, you need to choose the right hook class. The hook
constructors for TensorFlow that you can choose from are ``smd.KerasHook``,
``smd.SessionHook``, and ``smd.EstimatorHook``.

KerasHook
^^^^^^^^^

If you use a Keras model (for example, from the Keras model zoo) and the
Keras ``model.fit()`` API, use ``KerasHook``. ``KerasHook`` is available
for Keras models with the TensorFlow backend interface. ``KerasHook``
covers the eager execution modes and the gradient tape features that
were introduced in TensorFlow 2.0. You can create the smdebug Keras hook
by adding the following code to your training script. Place
this code line before ``model.compile()``:

.. code:: python

    import smdebug.tensorflow as smd
    hook = smd.KerasHook.create_from_json_file()

To learn how to fully implement the hook in your training script, see
the `Keras with the TensorFlow gradient tape and the smdebug hook
example
scripts `__.

   **Note**: If you use the AWS Deep Learning Containers for zero script
   change, Debugger collects most of the tensors through its high-level
   API, regardless of the eager execution mode.

SessionHook
^^^^^^^^^^^

If your model is created in TensorFlow version 1.x with the low-level
approach (not using the Keras API), use ``SessionHook``. ``SessionHook``
is for the TensorFlow 1.x monitored training session API,
``tf.train.MonitoredSession()``, as shown in the following:

.. code:: python

    import smdebug.tensorflow as smd
    hook = smd.SessionHook.create_from_json_file()

To learn how to fully implement the hook into your training script, see
the `TensorFlow monitored training session with the smdebug hook example
script `__.

   **Note**: The official TensorFlow library deprecated the
   ``tf.train.MonitoredSession()`` API in favor of ``tf.function()`` in
   TensorFlow 2.0 and later. You can use ``SessionHook`` for
   ``tf.function()`` in TensorFlow 2.0 and later.

EstimatorHook
^^^^^^^^^^^^^

If your model uses the ``tf.estimator`` API, use
``EstimatorHook``. ``EstimatorHook`` is available for any TensorFlow
framework version that supports the ``tf.estimator`` API, as shown in
the following:

.. code:: python

    import smdebug.tensorflow as smd
    hook = smd.EstimatorHook.create_from_json_file()

To learn how to fully implement the hook into your training script, see
the `simple MNIST training script with the TensorFlow
estimator `__.
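All three hook classes can also be constructed directly, instead of
through ``create_from_json_file()``, when you run outside SageMaker or
want full control in script mode. The following is a minimal sketch; the
path is a placeholder, and the parameters shown (``out_dir``,
``include_collections``, and ``save_config``) are the common ``smdebug``
hook constructor arguments.

.. code:: python

    import smdebug.tensorflow as smd

    # Construct the hook explicitly; out_dir is where tensors are written.
    hook = smd.KerasHook(
        out_dir="/tmp/smdebug_outputs",  # placeholder local or S3 path
        include_collections=["weights", "gradients", "losses"],
        save_config=smd.SaveConfig(save_interval=100),  # save every 100 steps
    )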
Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The smdebug TensorFlow hook provides tools to manually retrieve
gradient tensors specific to the TensorFlow framework.

If you want to save ``gradients`` from an optimizer (for example, the
Keras Adam optimizer), wrap the optimizer with the hook as follows:

.. code:: python

    optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr)
    optimizer = hook.wrap_optimizer(optimizer)

If you want to save gradient and output tensors from the TensorFlow
``GradientTape`` feature, wrap ``tf.GradientTape`` with the smdebug
``hook.wrap_tape`` method and save the tensors using the
``hook.save_tensor`` function. ``hook.save_tensor`` takes its input in
the form ``(tensor_name, tensor_value, collections_to_write="default")``.
For example:

.. code:: python

    with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
        logits = model(data, training=True)
        loss_value = cce(labels, logits)
    hook.save_tensor("y_labels", labels, "outputs")
    hook.save_tensor("predictions", logits, "outputs")
    grads = tape.gradient(loss_value, model.variables)
    hook.save_tensor("grads", grads, "gradients")

These smdebug hook wrapper functions capture the gradient tensors
without affecting your optimization logic.

For examples of code structures that you can use to apply the hook
wrappers, see the `Code Examples <#examples>`__ section.

Step 3: Register the hook to model.fit()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To collect the tensors from the hooks that you registered, add
``callbacks=[hook]`` to the Keras ``model.fit()`` API. This passes
the SageMaker Debugger hook as a Keras callback. Similarly, add
``hooks=[hook]`` to the ``MonitoredSession()``, ``tf.function()``, and
``tf.estimator()`` APIs. For example:

.. code:: python

    model.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epoch,
              validation_data=(X_valid, Y_valid),
              shuffle=True,
              # smdebug modification: Pass the hook as a Keras callback
              callbacks=[hook])

Step 4: Perform actions using the hook APIs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a full list of actions that the hook APIs offer to construct hooks
and save tensors, see `Common hook
API `__
and `TensorFlow specific hook
API `__.

--------------

Code Examples
-------------

The following code examples show the base structures that you can use
for hook registration in various TensorFlow training scripts. If you
want to use the high-level Debugger features with zero script change on
AWS Deep Learning Containers, see `Use Debugger in AWS
Containers `__.

Keras API (tf.keras)
~~~~~~~~~~~~~~~~~~~~

The following code example shows how to register the smdebug
``KerasHook`` for the Keras ``model.fit()``. You can also set the hook
mode to track stored tensors in different phases of the training job.
For a list of available hook modes, see `smdebug modes `__.

.. code:: python

    import smdebug.tensorflow as smd

    hook = smd.KerasHook.create_from_json_file()

    model = tf.keras.models.Sequential([ ... ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
    )
    # Add the hook as a callback.
    # Use hook.set_mode to indicate which phase of the training job the
    # tensors belong to, such as TRAIN and EVAL.
    hook.set_mode(mode=smd.modes.TRAIN)
    model.fit(x_train, y_train, epochs=args.epochs, callbacks=[hook])

    hook.set_mode(mode=smd.modes.EVAL)
    model.evaluate(x_test, y_test, callbacks=[hook])
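In addition to tensors, you can record custom scalar values from any of
these code structures with the hook's ``save_scalar`` API. The following
is a brief sketch; the accuracy value is a placeholder, and the
``sm_metric`` flag is an assumption based on the common smdebug hook API.

.. code:: python

    # Sketch: save a custom scalar with the smdebug hook. With
    # sm_metric=True, the value is also emitted as a SageMaker metric.
    eval_accuracy = 0.92  # placeholder value from your evaluation loop
    hook.save_scalar("eval_accuracy", eval_accuracy, sm_metric=True)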
Keras GradientTape example for TensorFlow 2.0 and later
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code example shows how to register the smdebug
``KerasHook`` by wrapping the TensorFlow ``GradientTape()`` with the
smdebug ``hook.wrap_tape()`` API.

.. code:: python

    import smdebug.tensorflow as smd

    hook = smd.KerasHook.create_from_json_file()

    model = tf.keras.models.Sequential([ ... ])
    # cce, opt, and train_acc_metric are the loss function, optimizer,
    # and accuracy metric objects defined earlier in the script.
    for epoch in range(n_epochs):
        for data, labels in dataset:
            dataset_labels = labels
            # wrap the tape to capture tensors
            with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
                logits = model(data, training=True)  # (32, 10)
                loss_value = cce(labels, logits)
            grads = tape.gradient(loss_value, model.variables)
            opt.apply_gradients(zip(grads, model.variables))
            acc = train_acc_metric(dataset_labels, logits)
            # manually save metric values
            hook.save_tensor(tensor_name="accuracy", tensor_value=acc, collections_to_write="default")

Monitored Session (tf.train.MonitoredSession)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code example shows how to register the smdebug
``SessionHook``.

.. code:: python

    import smdebug.tensorflow as smd

    hook = smd.SessionHook.create_from_json_file()

    loss = tf.reduce_mean(tf.matmul(...), name="loss")
    optimizer = tf.train.AdamOptimizer(args.lr)

    # Wrap the optimizer
    optimizer = hook.wrap_optimizer(optimizer)

    # Add the hook as a callback
    sess = tf.train.MonitoredSession(hooks=[hook])

    sess.run([loss, ...])

Estimator (tf.estimator.Estimator)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code example shows how to register the smdebug
``EstimatorHook``. You can also set the hook mode to track stored
tensors in different phases of the training job. For a list of available
hook modes, see `smdebug modes `__.

.. code:: python

    import smdebug.tensorflow as smd

    hook = smd.EstimatorHook.create_from_json_file()

    train_input_fn, eval_input_fn = ...
    estimator = tf.estimator.Estimator(...)

    # Use hook.set_mode to indicate which phase of the training job the
    # tensors belong to, such as TRAIN and EVAL.
    hook.set_mode(mode=smd.modes.TRAIN)
    estimator.train(input_fn=train_input_fn, steps=args.steps, hooks=[hook])

    hook.set_mode(mode=smd.modes.EVAL)
    estimator.evaluate(input_fn=eval_input_fn, steps=args.steps, hooks=[hook])

--------------

References
----------

The smdebug API for saving tensors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See the `API for saving tensors `__ page for details about the
Hooks, Collection, SaveConfig, and ReductionConfig classes. See the
`Analysis `__ page for details about analyzing a training
job.

TensorFlow References
~~~~~~~~~~~~~~~~~~~~~

- TF 1.x:

  - `tf.estimator `__
  - `tf.keras `__
  - `tf.train.MonitoredSession `__

- TF 2.1:

  - `tf.estimator `__
  - `tf.keras `__

- TF 2.2:

  - `tf.estimator `__
  - `tf.keras `__
diff --git a/docs/trial-api.rst b/docs/trial-api.rst
new file mode 100644
index 000000000..cd4b6455d
--- /dev/null
+++ b/docs/trial-api.rst
@@ -0,0 +1,14 @@
+Trial API
+---------
+
+Use the Trial methods of an SMDebug trial object to load and analyze
+the output tensors collected by SageMaker Debugger.
+
+.. note::
+   To use the following methods of the :class:`~smdebug.trials.trial.Trial` class,
+   you must create a ``trial`` instance as guided
+   in the previous topic `SMDebug Trial `__.
+
+.. 
autoclass:: smdebug.trials.trial.Trial + :members: + :show-inheritance: diff --git a/docs/xgboost.md b/docs/xgboost.md deleted file mode 100644 index 2ec65157a..000000000 --- a/docs/xgboost.md +++ /dev/null @@ -1,100 +0,0 @@ -# XGBoost - -## Contents - -- [SageMaker Example](#sagemaker-example) -- [Full API](#full-api) - -## SageMaker Example - -### Use XGBoost as a built-in algorithm - -The XGBoost algorithm can be used 1) as a built-in algorithm, or 2) as a framework such as MXNet, PyTorch, or Tensorflow. -If SageMaker XGBoost is used as a built-in algorithm in container version `0.90-2` or later, Amazon SageMaker Debugger will be available by default (i.e., zero code change experience). -See [XGBoost Algorithm AWS docmentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) for more information on how to use XGBoost as a built-in algorithm. -See [Amazon SageMaker Debugger examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) for sample notebooks that demonstrate debugging and monitoring capabilities of Amazon SageMaker Debugger. -See [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) for more information on how to configure the Amazon SageMaker Debugger from the Python SDK. - -### Use XGBoost as a framework - -When SageMaker XGBoost is used as a framework, it is recommended that the hook is configured from the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/). -By using SageMaker Python SDK, you can run different jobs (e.g., Processing jobs) on the SageMaker platform. -You can retrieve the hook as follows. -```python -import xgboost as xgb -from smdebug.xgboost import Hook - -dtrain = xgb.DMatrix("train.libsvm") -dtest = xgb.DMatrix("test.libsmv") - -hook = Hook.create_from_json_file() -hook.train_data = dtrain # required -hook.validation_data = dtest # optional -hook.hyperparameters = params # optional - -bst = xgb.train( - params, - dtrain, - callbacks=[hook], - evals_result=[(dtrain, "train"), (dvalid, "validation")] -) -``` - -Alternatively, you can also create the hook from `smdebug`'s Python API as shown in the next section. - -### Use the Debugger hook - -If you are in a non-SageMaker environment, or even in SageMaker, if you want to configure the hook in a certain way in script mode, you can use the full Debugger hook API as follows. -```python -import xgboost as xgb -from smdebug.xgboost import Hook - -dtrain = xgb.DMatrix("train.libsvm") -dvalid = xgb.DMatrix("validation.libsmv") - -hook = Hook( - out_dir=out_dir, # required - train_data=dtrain, # required - validation_data=dvalid, # optional - hyperparameters=hyperparameters, # optional -) -``` - -## Full API - -```python -def __init__( - self, - out_dir, - export_tensorboard = False, - tensorboard_dir = None, - dry_run = False, - reduction_config = None, - save_config = None, - include_regex = None, - include_collections = None, - save_all = False, - include_workers = "one", - hyperparameters = None, - train_data = None, - validation_data = None, -) -``` - -Initializes the hook. Pass this object as a callback to `xgboost.train()`. -* `out_dir` (str): A path into which tensors and metadata will be written. -* `export_tensorboard` (bool): Whether to use TensorBoard logs. -* `tensorboard_dir` (str): Where to save TensorBoard logs. -* `dry_run` (bool): If true, evaluations are not actually saved to disk. -* `reduction_config` (ReductionConfig object): Not supported in XGBoost and will be ignored. 
-* `save_config` (SaveConfig object): See the [Common API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md). -* `include_regex` (list[str]): List of additional regexes to save. -* `include_collections` (list[str]): List of collections to save. -* `save_all` (bool): Saves all tensors and collections. **WARNING: May be memory-intensive and slow.** -* `include_workers` (str): Used for distributed training, can also be "all". -* `hyperparameters` (dict): Booster params. -* `train_data` (DMatrix object): Data to be trained. -* `validation_data` (DMatrix object): Validation set for which metrics will evaluated during training. - -See the [Common API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) page for details about Collection, SaveConfig, and ReductionConfig.\ -See the [Analysis](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md) page for details about analyzing a training job. diff --git a/docs/xgboost.rst b/docs/xgboost.rst new file mode 100644 index 000000000..84c9cda3e --- /dev/null +++ b/docs/xgboost.rst @@ -0,0 +1,138 @@ +XGBoost +======= + +Contents +-------- + +- `SageMaker Example <#sagemaker-example>`__ +- `Full API <#full-api>`__ + +SageMaker Example +----------------- + +Use XGBoost as a built-in algorithm +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The XGBoost algorithm can be used 1) as a built-in algorithm, or 2) as a +framework such as MXNet, PyTorch, or Tensorflow. If SageMaker XGBoost is +used as a built-in algorithm in container version ``0.90-2`` or later, +Amazon SageMaker Debugger will be available by default (i.e., zero code +change experience). See `XGBoost Algorithm AWS +docmentation `__ +for more information on how to use XGBoost as a built-in algorithm. See +`Amazon SageMaker Debugger +examples `__ +for sample notebooks that demonstrate debugging and monitoring +capabilities of Amazon SageMaker Debugger. See `SageMaker Python +SDK `__ for more +information on how to configure the Amazon SageMaker Debugger from the +Python SDK. + +Use XGBoost as a framework +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When SageMaker XGBoost is used as a framework, it is recommended that +the hook is configured from the `SageMaker Python +SDK `__. By using SageMaker +Python SDK, you can run different jobs (e.g., Processing jobs) on the +SageMaker platform. You can retrieve the hook as follows. + +.. code:: python + + import xgboost as xgb + from smdebug.xgboost import Hook + + dtrain = xgb.DMatrix("train.libsvm") + dtest = xgb.DMatrix("test.libsmv") + + hook = Hook.create_from_json_file() + hook.train_data = dtrain # required + hook.validation_data = dtest # optional + hook.hyperparameters = params # optional + + bst = xgb.train( + params, + dtrain, + callbacks=[hook], + evals_result=[(dtrain, "train"), (dvalid, "validation")] + ) + +Alternatively, you can also create the hook from ``smdebug``\ ’s Python +API as shown in the next section. + +Use the Debugger hook +~~~~~~~~~~~~~~~~~~~~~ + +If you are in a non-SageMaker environment, or even in SageMaker, if you +want to configure the hook in a certain way in script mode, you can use +the full Debugger hook API as follows. + +.. code:: python + + import xgboost as xgb + from smdebug.xgboost import Hook + + dtrain = xgb.DMatrix("train.libsvm") + dvalid = xgb.DMatrix("validation.libsmv") + + hook = Hook( + out_dir=out_dir, # required + train_data=dtrain, # required + validation_data=dvalid, # optional + hyperparameters=hyperparameters, # optional + ) + +Full API +-------- + +.. 
code:: python + + def __init__( + self, + out_dir, + export_tensorboard = False, + tensorboard_dir = None, + dry_run = False, + reduction_config = None, + save_config = None, + include_regex = None, + include_collections = None, + save_all = False, + include_workers = "one", + hyperparameters = None, + train_data = None, + validation_data = None, + ) + +Initializes the hook. Pass this object as a callback to +``xgboost.train()``. + - ``out_dir`` (str): A path into which tensors and + metadata will be written. + - ``export_tensorboard`` (bool): Whether to + use TensorBoard logs. + - ``tensorboard_dir`` (str): Where to save + TensorBoard logs. + - ``dry_run`` (bool): If true, evaluations are not + actually saved to disk. + - ``reduction_config`` (ReductionConfig + object): Not supported in XGBoost and will be ignored. + - ``save_config`` (SaveConfig object): See the `Common + API `__. + - ``include_regex`` (list[str]): List of additional regexes to save. + - ``include_collections`` (list[str]): List of collections to save. + - ``save_all`` (bool): Saves all tensors and collections. **WARNING: May + be memory-intensive and slow.** + - ``include_workers`` (str): Used for + distributed training, can also be “all”. + - ``hyperparameters`` (dict): + Booster params. + - ``train_data`` (DMatrix object): Data to be trained. + - ``validation_data`` (DMatrix object): Validation set for which + metrics will evaluated during training. + +| See the `Common + API `__ + page for details about Collection, SaveConfig, and ReductionConfig. +| See the + `Analysis `__ + page for details about analyzing a training job. diff --git a/setup.py b/setup.py index 0369c8dc9..a9fae46e3 100644 --- a/setup.py +++ b/setup.py @@ -10,7 +10,6 @@ - Interactive exploration of saved tensors - Distributed training support - TensorBoard support - """ # Standard Library @@ -24,6 +23,8 @@ exec(open("smdebug/_version.py").read()) CURRENT_VERSION = __version__ +docs_env = os.environ.get("READTHEDOCS") + DOCLINES = (__doc__ or "").split("\n") FRAMEWORKS = ["tensorflow", "pytorch", "mxnet", "xgboost"] TESTS_PACKAGES = ["pytest", "torchvision", "pandas"] @@ -80,7 +81,19 @@ def build_package(version): "ERROR: Compiling summary protocol buffers failed. You will not be able to use smdebug. " "Please make sure that you have installed protobuf3 compiler and runtime correctly." 
) - sys.exit(1) + if docs_env == "False": + sys.exit(1) + else: + os.system( + "curl -OL https://github.com/google/protobuf/releases/download/v3.7.1/protoc-3.7.1-linux-x86_64.zip" + ) + os.system( + "unzip -o protoc-3.7.1-linux-x86_64.zip -d /home/docs/checkouts/readthedocs.org/user_builds/local bin/protoc" + ) + os.system( + "unzip -o protoc-3.7.1-linux-x86_64.zip -d /home/docs/checkouts/readthedocs.org/user_builds/local include/*" + ) + os.system("rm -f protoc-3.7.1-linux-x86_64.zip") def scan_git_secrets(): @@ -102,10 +115,9 @@ def git(*args): return git("secrets", "--scan", "-r") -if scan_git_secrets() != 0: - import sys - - sys.exit(1) +if docs_env == "False": + if scan_git_secrets() != 0: + sys.exit(1) def detect_smdebug_version(): diff --git a/smdebug/analysis/utils.py b/smdebug/analysis/utils.py index 4174a5a91..8fa7e9b7f 100644 --- a/smdebug/analysis/utils.py +++ b/smdebug/analysis/utils.py @@ -54,10 +54,11 @@ def refresh(trials): def parse_list_from_str(arg, delimiter=","): """ :param arg: string or list of strings - if it is string it is treated as character delimited string + if it is string it is treated as character delimited string :param delimiter: string - if arg is a string, this delimiter is used to split the string + if arg is a string, this delimiter is used to split the string :return: list of strings + """ if arg is None: rval = [] diff --git a/smdebug/core/hook.py b/smdebug/core/hook.py index 38ff7c135..cefee6998 100644 --- a/smdebug/core/hook.py +++ b/smdebug/core/hook.py @@ -79,6 +79,7 @@ def __init__( timestamp: Timestamp at which this object is created. The 'save_scalar()' method creates objects of this class and caches the scalars that users intends to store. These objects will be written to disk in the next available step. + """ self.name = scalar_name self.value = scalar_val diff --git a/smdebug/exceptions.py b/smdebug/exceptions.py index 27d2c821e..ed6eccd93 100644 --- a/smdebug/exceptions.py +++ b/smdebug/exceptions.py @@ -12,6 +12,14 @@ def __str__(self): class StepNotYetAvailable(Exception): + """This means that the step has not yet been + seen from the training job. It may be available in the future if the + training is still going on. We automatically load new data as and + when it becomes available. This step may either become available in + the future, or the exception might change to ``StepUnavailable``. + + """ + def __init__(self, step, mode): self.step = step self.mode = mode @@ -21,6 +29,12 @@ def __str__(self): class MissingCollectionFiles(Exception): + """This is raised when no data was saved by + the training job. Check that the ``Hook`` was configured correctly + before starting the training job. + + """ + def __init__(self): pass @@ -37,6 +51,11 @@ def __str__(self): class StepUnavailable(Exception): + """This means that the step was not saved from the + training job. No tensor will be available for this step. + + """ + def __init__(self, step, mode): self.step = step self.mode = mode @@ -48,6 +67,15 @@ def __str__(self): class TensorUnavailableForStep(Exception): + """This is raised when the tensor requested + is not available for the step. It may have been or will be saved for + a different step number. You can check which steps tensor is saved + for by ``trial.tensor('tname').steps()`` + `api `__. + Note that this exception implies that the requested tensor will never + become available for this step in the future. 
+ """ + def __init__(self, tname, step, mode=modes.GLOBAL, has_reductions=False): self.step = step self.mode = mode @@ -84,6 +112,14 @@ def __str__(self): class TensorUnavailable(Exception): + """This means that this tensor has not been + saved from the training job. Note that if you have a ``SaveConfig`` + which saves a certain tensor only after the time you queried for the + tensor, you might get a ``TensorUnavailable`` exception even if the + tensor may become available later for some step. + + """ + def __init__(self, tname): self.tname = tname @@ -100,6 +136,12 @@ def __str__(self): class NoMoreProfilerData(Exception): + """This will be raised when the training ends. Once you + see this, you will know that there will be no more steps and no more + tensors saved. + + """ + def __init__(self, timestamp): self.timestamp = timestamp self.msg = "Looking for timestamp {} and reached " "end of training.".format(timestamp) @@ -126,6 +168,11 @@ def __str__(self): class RuleEvaluationConditionMet(Exception): + """This is raised when the rule + invocation returns ``True`` for some step. + + """ + def __init__(self, rule_name, step, end_of_rule=False): self.rule_name = rule_name self.step = step diff --git a/smdebug/profiler/analysis/notebook_utils/heatmap.py b/smdebug/profiler/analysis/notebook_utils/heatmap.py index 4a093b326..5a3594be0 100644 --- a/smdebug/profiler/analysis/notebook_utils/heatmap.py +++ b/smdebug/profiler/analysis/notebook_utils/heatmap.py @@ -15,6 +15,11 @@ class Heatmap: + """ + Docstring placeholder. + + """ + def __init__( self, metrics_reader, @@ -26,6 +31,10 @@ def __init__( plot_height=350, show_workers=True, ): + """ + Docstring placeholder. + + """ self.select_dimensions = select_dimensions self.select_events = select_events diff --git a/smdebug/profiler/analysis/python_profile_analysis.py b/smdebug/profiler/analysis/python_profile_analysis.py index 215da7721..d775e5ede 100644 --- a/smdebug/profiler/analysis/python_profile_analysis.py +++ b/smdebug/profiler/analysis/python_profile_analysis.py @@ -48,6 +48,7 @@ def __init__(self, local_profile_dir="/tmp/python_stats", s3_path=None): The reader to use for loading the python stats. python_profile_stats: list of StepPythonProfileStats List of stats for each step profiled. + """ self.python_stats_reader = ( S3PythonStatsReader(local_profile_dir, s3_path) @@ -178,6 +179,7 @@ def list_profile_stats(self, refresh_stats=True): - end_phase The phase at which python profiling was stopped. - end_step: The step at which python profiling was stopped. - stats_path The path to the dumped python stats resulting from profiling this step. + """ self._refresh_python_profile_stats(refresh_stats) if len(self.python_profile_stats) == 0: @@ -207,6 +209,7 @@ class cProfileAnalysis(PythonProfileAnalysis): def _refresh_python_profile_stats(self, refresh_stats): """Helper function to load in the most recent python stats via the python stats reader. Filters out any stats not generated by cProfile. + """ super()._refresh_python_profile_stats(refresh_stats) self.python_profile_stats = list( @@ -239,6 +242,7 @@ def fetch_profile_stats_by_training_phase(self, node_id="any", refresh_stats=Tru - (EVAL, POST_HOOK_CLOSE) All stats files within each of these combinations are aggregated. 
+ """ self._refresh_python_profile_stats(refresh_stats) training_phase_stats = {} @@ -285,6 +289,7 @@ class PyinstrumentAnalysis(PythonProfileAnalysis): def _refresh_python_profile_stats(self, refresh_stats): """Helper function to load in the most recent python stats via the python stats reader. Filters out any stats not generated by pyinstrument. + """ super()._refresh_python_profile_stats(refresh_stats) self.python_profile_stats = list( @@ -293,6 +298,7 @@ def _refresh_python_profile_stats(self, refresh_stats): def _aggregate_stats(self, stats): """Load and return a list of dictionaries corresponding to each step's stats file. + """ if len(stats) == 0: print("No stats were found for the requested interval!") diff --git a/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py b/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py index 62d4201c8..83585460a 100644 --- a/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py +++ b/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py @@ -46,6 +46,7 @@ def get_all_system_metrics(self, selected_system_metrics=[]): Get system metrics :param systemk_metrics_list: list of system metrics.If not empty, function will only return framework events that are part of this list. :return: System metrics DataFrame + """ # get all system metrics from last to current timestamp @@ -68,6 +69,7 @@ def get_all_framework_metrics(self, selected_framework_metrics=[]): Get framework metrics :param selected_framework_metrics: list of framework metrics.If not empty, function will only return framework events that are part of this list. :return: Framework metrics DataFrame + """ # get all framework metrics from last to current timestamp self.framework_metrics_reader.refresh_event_file_list() @@ -92,6 +94,7 @@ def convert_datetime_to_timestamp(self, timestamp): A helper function to convert datetime into timestamp :param timestep: timestamp in datetime :return: timestamp in microseconds + """ timestamp = pd.to_datetime(timestamp, format="%Y-%m-%dT%H:%M:%S:%f", utc=True) return convert_utc_datetime_to_microseconds(timestamp) @@ -102,6 +105,7 @@ def get_framework_metrics_by_timesteps(self, timestep_list=[], selected_framewor :param timestep_list: list of timestamps :param selected_framework_metrics: list of framework metrics which will be stored in the dataframe :return: Framework metrics DataFrame + """ # get min and max search range timestep_list = sorted(timestep_list) @@ -175,6 +179,7 @@ def get_framework_metrics_by_begin_and_end_timesteps( :param end_timestep_list: list of end intervals in datetime :param selected_framework_metrics: list of framework metrics which will be stored in the dataframe :return: Framework metrics DataFrame + """ # Get min and max timestamps from the list of timeranges start_time_us = self.convert_datetime_to_timestamp(min(begin_timestep_list)) @@ -261,8 +266,9 @@ def get_profiler_data_by_time( :param selected_system_metrics: list of system metrics. If not empty, function will only return system events that are part of this list. 
:param selected_framework_metrics: if True, get framework metrics :param get_system_metrics: if True: get system metrics - return current request + return current request :return: System metrics DataFrame, Framework metrics DataFrame + """ # read system metrics system_metrics = [] @@ -406,6 +412,7 @@ def get_profiler_data_by_step(self, start_step, end_step, cache_metrics=False): :param cache_metrics: If True, collect and return all metrics requested so far, else, return current request :return: System metrics DataFrame, Framework metrics DataFrame + """ sys_metrics_df, fw_metrics_df = ( self.get_all_system_metrics(), @@ -431,6 +438,7 @@ def get_all_dataloader_metrics(self, selected_framework_metrics=[]): Get framework metrics :param selected_framework_metrics: list of framework metrics.If not empty, function will only return framework events that are part of this list. :return: Framework metrics DataFrame + """ # get all framework metrics from last to current timestamp self.framework_metrics_reader.refresh_event_file_list() @@ -459,6 +467,7 @@ def _get_dataloader_profiler_data_by_time( :param cache_metrics: If True, collect and return all metrics requested so far, else, :param framework_metrics_list: list of framework metrics. If not empty, function will only return framework events that are part of this list. :return: Framework metrics DataFrame + """ # get framework metrics framework_metrics = [] diff --git a/smdebug/rules/rule.py b/smdebug/rules/rule.py index 6cbf27d9a..64f129f03 100644 --- a/smdebug/rules/rule.py +++ b/smdebug/rules/rule.py @@ -13,6 +13,31 @@ # This is Rule interface class Rule(ABC): + """The Rule class to create an instance of Rule evaluator. You can construct + a rule class and add thresholds and criteria to its ``__init__`` function. + + **Example of a Rule class** + + .. code:: python + + from smdebug.rules import Rule + + class VanishingGradientRule(Rule): + def __init__(self, base_trial, threshold=0.0000001): + super().__init__(base_trial, other_trials=None) + self.threshold = float(threshold) + + def invoke_at_step(self, step): + for tensorname in self.base_trial.tensors(collection='gradients'): + tensor = self.base_trial.tensor(tensorname) + abs_mean = tensor.reduction_value(step, 'mean', abs=True) + if abs_mean < self.threshold: + return True + else: + return False + + """ + def __init__(self, base_trial, action_str, other_trials=None): self.base_trial = base_trial self.other_trials = other_trials @@ -45,7 +70,9 @@ def invoke_at_step(self, step): # --> storage_handler.save("last_processed_tensor",(tensor_name,step)) # check-pointing is needed if execution is longer duration, # so that we don't lose the work done in certain step - pass + """The abstract method to construct a rule invokation logic against output tensors. + + """ # step specific for which global step this rule was invoked # storage_handler is used to save & get states across different invocations diff --git a/smdebug/rules/rule_invoker.py b/smdebug/rules/rule_invoker.py index 8d8472a9c..cf71169c1 100644 --- a/smdebug/rules/rule_invoker.py +++ b/smdebug/rules/rule_invoker.py @@ -12,6 +12,28 @@ def invoke_rule(rule_obj, start_step=0, end_step=None, raise_eval_cond=False): + """The rule invoker function against a defined smdebug rule using :class:`~smdebug.rules.Rule`. + + Args: + rule_obj (Rule): An instance of a subclass of + :class:`~smdebug.rules.Rule` that you want to invoke. + + start_step (int): Global step number to start invoking the rule + from. Note that this refers to a global step. 
The default value is 0. + + end_step (int or None): Global step number to end the invocation + of rule before. To clarify, ``end_step`` is an exclusive bound. The + rule is invoked at ``end_step``. The default value is ``None``, which + means run till the end of the job. + + raise_eval_cond (bool): This parameter controls whether to raise + the exception ``RuleEvaluationConditionMet`` when raised by the rule, + or to catch it and log the message and move to the next step. + The default value is ``False``, which implies that the it catches the + exception, logs that the evaluation condition was met for a step and + moves on to evaluate the next step. + + """ step = start_step if start_step is not None else 0 logger.info("Started execution of rule {} at step {}".format(type(rule_obj).__name__, step)) while (end_step is None) or (step < end_step): diff --git a/smdebug/trials/trial.py b/smdebug/trials/trial.py index 9899f45ea..e09d8e0e3 100644 --- a/smdebug/trials/trial.py +++ b/smdebug/trials/trial.py @@ -35,14 +35,14 @@ class Trial(ABC): """ - Attributes: - _tensors - _index_tensors_dict - - ['name', '_tensors', '_mode_to_global', '_global_to_mode', 'logger', 'parallel', - 'check', 'range_steps', 'collection_manager', 'loaded_all_steps', 'cache', 'path', - 'index_tensors_dict', 'index_mode', 'last_event_token', 'last_index_token', 'index_reader', - 'dynamic_refresh', 'trial_dir'] + The base class for creating an SMDebug trial objects. + A trial creation helper function + :class:`~smdebug.trials.create_trial` was introduced in + the previous topic at :doc:`smdebug.trials`. + + After you create an SMDebug trial object, use the following + Trial class methods for accessing output tensor information. + """ def __init__( @@ -79,11 +79,10 @@ def __init__( self.workers_for_global_step = {} self.last_complete_step = -1 - """ - INCOMPLETE_STEP_WAIT_WINDOW defines the maximum number - of incomplete steps that the trial will wait for before marking - half of them as complete. - """ + # INCOMPLETE_STEP_WAIT_WINDOW defines the maximum number + # of incomplete steps that the trial will wait for before marking + # half of them as complete. + self._incomplete_wait_for_step_window = int( os.getenv(INCOMPLETE_STEP_WAIT_WINDOW_KEY, INCOMPLETE_STEP_WAIT_WINDOW_DEFAULT) ) @@ -220,6 +219,22 @@ def refresh_tensor(self, tname, steps=None): def tensor(self, tname): # will not show tensor if it was not written yet # has tensor will refresh + """Retrieves the ``smdebug.core.tensor.Tensor`` object by the given name + ``tname``. To find available methods that this Tensor object + provides, see :doc:`tensor-api`. + + If output tensor is still not available when you run this method, + it refreshes the method call until the first output tensor becomes available. + + Args: + tname (str): Takes the name of tensor + + Returns: + :class:`~smdebug.core.tensor.Tensor` object: + An output tensor object. + + """ + if self.has_tensor(tname): return self._tensors[tname] else: @@ -227,6 +242,18 @@ def tensor(self, tname): def has_tensor(self, tname): # will return false if tensor was not written yet + """Checks if the trial has a tensor of the given tensor name. + + Args: + tname (str): Takes the name of tensor + + Returns: + bool: + ``True`` if the tensor is found by the trial, else it returns + ``False``. + + """ + if tname not in self._tensors: self.maybe_refresh(tname) return tname in self._tensors @@ -402,6 +429,56 @@ def tensor_names( regex=None, collection=None, ) -> list: + """Retrieves names of tensors saved. 
+
+        All arguments to this method are optional. You are not required to pass
+        any of the following arguments. By default, this method returns all tensor names
+        if you don't pass any arguments.
+
+        Args:
+            step (int): If you want to retrieve the list of tensors saved at a
+            particular step, pass the step number as an integer. This step number
+            is treated as the step number corresponding to the mode passed
+            below. By default, it is treated as a global step.
+
+            mode (smdebug.modes enum value): If you want to retrieve the list
+            of tensors saved for a particular mode, pass the mode here as
+            ``smd.modes.TRAIN``, ``smd.modes.EVAL``, ``smd.modes.PREDICT``, or
+            ``smd.modes.GLOBAL``.
+
+            regex (str or list[str]): You can filter tensors matching regex
+            expressions by passing a regex expression as a string or a list of
+            strings. You can only pass one of the ``regex`` or ``collection``
+            parameters.
+
+            collection (Collection or str): You can filter tensors belonging
+            to a collection by either passing a collection object or the name of
+            a collection as a string. You can only pass one of the ``regex`` or
+            ``collection`` parameters.
+
+        Returns:
+            list[str]:
+            List of strings representing names of tensors matching
+            the given arguments. Arguments are processed as follows: the method
+            returns the tensor names saved for the given step and mode that match
+            all of the given arguments, i.e., the intersection of the tensors
+            matching each parameter.
+
+        **Examples:**
+
+        - ``trial.tensor_names()`` - Returns all tensors saved for any step or
+          mode.
+        - ``trial.tensor_names(step=10, mode=modes.TRAIN)`` - Returns tensors
+          saved for training step 10.
+        - ``trial.tensor_names(regex='relu')`` - Returns all tensors matching the
+          regex pattern ``relu`` saved for any step or mode.
+        - ``trial.tensor_names(collection='gradients')`` - Returns tensors from
+          the collection ``gradients``.
+        - ``trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax')`` -
+          Returns tensors saved for the 10th training step that match the regex
+          ``softmax``.
+
+        """
         self.maybe_refresh()
         ts = set()
         if step is None and mode == ModeKeys.GLOBAL:
@@ -441,16 +518,40 @@ def _tensors_for_step(self, step, mode=ModeKeys.GLOBAL) -> list:
         return []
 
     def workers(self):
+        """Query for all the worker processes from which data was saved by smdebug
+        during multi worker training.
+
+        Returns:
+            list[str]:
+            A sorted list of names of worker processes from which data
+            was saved. If using TensorFlow Mirrored Strategy for multi worker
+            training, these represent names of different devices in the process. For
+            Horovod, torch.distributed and similar distributed training approaches,
+            these represent names of the form ``worker_0`` where 0 is the rank of
+            the process.
+
+        """
+
         self.maybe_refresh()
         return sorted(list(self.worker_set))
 
     def steps(self, mode=ModeKeys.GLOBAL, show_incomplete_steps=False) -> list:
-        """
-        the steps function call returns only completed steps to
-        the user.
-        :param mode: ModeKeys
-        :param show_incomplete_steps: bool
-        :return: list
+        """Retrieves a list of steps collected by SageMaker Debugger.
+
+        Args:
+            mode (smdebug.modes enum value): Passing a mode here allows you
+            to retrieve the list of steps seen by the trial for that mode. If
+            this is not passed, returns steps for all modes.
+            show_incomplete_steps (bool): Set to ``True`` to include incomplete steps.
+
+        Returns:
+            list[int]:
+            List of integers representing step numbers. If a mode was
+            passed, this returns steps within that mode, i.e. mode steps. Each of
+            these mode steps has a global step number associated with it. 
The global + step represents the sequence of steps across all modes executed by the + job. + """ all_steps = self._all_steps(mode) if show_incomplete_steps is True: @@ -488,6 +589,20 @@ def _global_step_currently(self, mode, mode_step): return self._mode_to_global[mode][mode_step] def global_step(self, mode, mode_step): + """Given a mode and a mode_step number you can retrieve its global step + using this method. + + Args: + mode (smdebug.modes enum value): Takes the mode as enum value + mode_step (int): Takes the mode step as an integer + + Returns: + ``int``: + An integer representing ``global_step`` of the given mode and + ``mode_step``. + + """ + s = self._global_step_currently(mode, mode_step) if s is not None: return s @@ -512,27 +627,116 @@ def mode_modestep(self, global_step): def mode_step(self, global_step): # can return global step itself in some cases + """Given a global step number you can identify the ``mode_step`` for that + step using this method. + + Args: + global_step (int): Takes the global step as an integer. + + Returns: + ``int``: + An integer representing ``mode_step`` of the given global step. + Typically used in conjunction with ``mode`` method. + + """ + x = self.mode_modestep(global_step) if x: return x[1] def mode(self, global_step): # can return global mode in some cases + """Given a global step number you can identify the mode for that step using + this method. + + Args: + global_step (int): Takes the global step as an integer. + + Returns: + ``smdebug.modes enum value`` of the given global step. + + """ + x = self.mode_modestep(global_step) if x: return x[0] def modes(self): # will not return global mode + """Retrieve a list of modes seen by the trial. + + Returns: + list[smdebug.modes enum value]: + List of modes for which data was + saved at all steps collected from the training job. + + """ + return self._mode_to_global.keys() def collections(self): + """List the collections from the trial. + + Note that tensors part of these + collections may not necessarily have been saved from the training job. + Whether a collection was saved or not depends on the configuration of + the Hook during training. + + Returns: + dict[str -> Collection]: + A dictionary indexed by the name of the + collection, with the Collection object as the value. Please refer + :doc:`collections` for more details. + + """ + return self.collection_manager.collections def collection(self, coll_name): + """Get a specific collection from the trial. + + Note that tensors which are + part of this collection may not necessarily have been saved from the + training job. Whether this collection was saved or not depends on the + configuration of the Hook during training. + + Args: + coll_name (str): Name of the collection + + Returns: + ``Collection``: + The requested Collection object. Please refer + :doc:`collections` for more details. + + """ return self.collection_manager.get(coll_name) def wait_for_steps(self, required_steps, mode=ModeKeys.GLOBAL): + """This method allows you to wait for steps before proceeding. + + You might + want to use this method if you want to wait for smdebug to see the + required steps so you can then query and analyze the tensors saved by + that step. This method blocks till all data from the steps are seen by + smdebug. + + Args: + required_steps (list[int]): Step numbers to wait for + mode (smdebug.modes enum value): The mode to which given step + numbers correspond to. This defaults to modes.GLOBAL. 
+ + Returns: + None: + Only returns after we know definitely whether we have seen + the steps. + + **Exceptions raised:** + + ``StepUnavailable`` and ``NoMoreData``. See `Exceptions <#exceptions>`__ + section for more details. + + """ + with refresh(self): for step in required_steps: while True: @@ -557,29 +761,49 @@ def has_passed_step(self, step, mode=ModeKeys.GLOBAL) -> StepState: This function indicates whether a step is complete (AVAILABLE), incomplete ( NOT_YET_AVAILABLE ) or absent ( UNAVAILABLE ). - Overview of logic: + **Overview of logic:** + + 1. if the queried step is greater than all the available steps (complete / incomplete): + + .. code:: python - 1. if the queried step is greater than all the available steps (complete / incomplete): if job is not complete: return StepState.NOT_YET_AVAILABLE else: return StepState.UNAVAILABLE - 2. if the queried step is less or equal to a step in available steps (complete / incomplete): + + 2. if the queried step is less or equal to a step in available steps (complete / incomplete): + + .. code:: python + if the queried step is less than all the available steps: if single_worker: return UNAVAILABLE ( step has been skipped or will not written) else: return NOT_YET_AVAILABLE - 3. queried step is available: + + 3. queried step is available: + + .. code:: python + if all workers have written the step or job is complete or last_complete_step > step ( All workers have written a step greater than the step we are checking. Hence, the step will never be complete. ) return AVAILABLE else: return NOT_YET_AVAILABLE - :param step: str - :param mode: ModeKeys.GLOBAL - :return: StepState + + Args: + step (int): The step number to check if the trial has passed it. + mode (smdebug.modes enum value): The mode to which given step + number corresponds to. This defaults to modes.GLOBAL. + + Returns: + smdebug.core.tensor.StepState enum value: + Returns one of the following + values: ``UNAVAILABLE``, ``AVAILABLE``, + and ``NOT_YET_AVAILABLE``. + """ all_steps = self.steps(mode=mode, show_incomplete_steps=True) bisect_idx = bisect_left(all_steps, step) @@ -628,6 +852,7 @@ def _update_last_index_token(self, new_index_token: str) -> None: we update the last_index_token and last_complete_step by (window_size_limit // 2) :param new_index_token: :return:None + """ if self.last_index_token is None: last_index_token_step = 0 diff --git a/smdebug/trials/utils.py b/smdebug/trials/utils.py index 448e5a752..bbd699616 100644 --- a/smdebug/trials/utils.py +++ b/smdebug/trials/utils.py @@ -13,6 +13,68 @@ def create_trial( path, name=None, profiler=False, output_dir="/opt/ml/processing/outputs/", **kwargs ): + """ + Args: + path (str): A local path or an S3 path of the form ``s3://bucket/prefix``. You should see + directories such as ``collections``, ``events`` and ``index`` at this + path once the training job starts. + + name (str): A name for a trial. + It is to help you manage different trials. This is an optional + parameter, which defaults to the basename of the path if not passed. + Make sure to give it a unique name to prevent duplication. + + Returns: + :class:`~smdebug.trials.trial.Trial`: + An SMDebug trial instance + + The following examples show how to create an SMDebug trial object. + + **Example: Creating an S3 trial** + + .. code:: python + + from smdebug.trials import create_trial + trial = create_trial( + path='s3://smdebug-testing-bucket/outputs/resnet', + name='resnet_training_run' + ) + + **Example: Creating a local trial** + + .. 
code:: python + + from smdebug.trials import create_trial + trial = create_trial( + path='/home/ubuntu/smdebug_outputs/resnet', + name='resnet_training_run' + ) + + **Example: Restricting analysis to a range of steps** + + You can optionally pass ``range_steps`` to restrict your analysis to a + certain range of steps. Note that if you do so, Trial will not load data + from other steps. + + - ``range_steps=(100, None)``: This will load all steps after 100 + + - ``range_steps=(None, 100)``: This will load all steps before 100 + + - ``range_steps=(100, 200)`` : This will load steps between 100 and 200 + + - ``range_steps=None``: This will load all steps + + .. code:: python + + from smdebug.trials import create_trial + trial = create_trial( + path='s3://smdebug-testing-bucket/outputs/resnet', + name='resnet_training', + range_steps=(100, 200) + ) + + """ + path = path.strip() # Remove any accidental leading/trailing whitespace input by the user if name is None: name = os.path.basename(path) From 437d9d753f0293f6dfc18caa858e368dfada2810 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 23 Jun 2021 14:02:21 -0700 Subject: [PATCH 27/39] add readthedocs yml --- .readthedocs.yml | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) create mode 100644 .readthedocs.yml diff --git a/.readthedocs.yml b/.readthedocs.yml new file mode 100644 index 000000000..d14ff4241 --- /dev/null +++ b/.readthedocs.yml @@ -0,0 +1,29 @@ +# .readthedocs.yml +# Read the Docs configuration file +# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details + +# Required +version: 2 + +# Build documentation in the docs/ directory with Sphinx +sphinx: + configuration: docs/conf.py + fail_on_warning: false + +# Build documentation with MkDocs +#mkdocs: +# configuration: mkdocs.yml + +# Optionally build your docs in additional formats such as PDF +#formats: +# - pdf + +conda: + environment: docs/environment.yml + +# Optionally set the version of Python and requirements required to build your docs +python: + version: 3.6 + install: + - method: setuptools + path: . From 84edbae91b0b6f004b276b53f75271684589b3ed Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 23 Jun 2021 15:53:58 -0700 Subject: [PATCH 28/39] rm all warnings --- docs/README.rst | 15 +- docs/collections.rst | 2 - docs/hook-methods.rst | 31 +- docs/release-notes.rst | 3 +- docs/sagemaker.rst | 357 ------------------ docs/smdebug.analysis.rst | 46 --- docs/smdebug.profiler.rst | 18 - docs/tensor-api.rst | 28 +- docs/xgboost.rst | 1 + .../analysis/python_profile_analysis.py | 19 +- .../analysis/utils/profiler_data_to_pandas.py | 24 +- .../utils/python_profile_analysis_utils.py | 74 ++-- 12 files changed, 93 insertions(+), 525 deletions(-) delete mode 100644 docs/sagemaker.rst delete mode 100644 docs/smdebug.analysis.rst delete mode 100644 docs/smdebug.profiler.rst diff --git a/docs/README.rst b/docs/README.rst index 3e5a93029..f3e72c9b5 100644 --- a/docs/README.rst +++ b/docs/README.rst @@ -376,28 +376,15 @@ Further Documentation and References +-----------------------------------+-----------------------------------+ | Section | Description | +===================================+===================================+ -| :doc:`sagemaker` | SageMaker users, we recommend you | -| | start with this page on how to | -| | run SageMaker training jobs with | -| | SageMaker Debugger. 
| -+-----------------------------------+-----------------------------------+ | Frameworks | See the frameworks pages for | | - :doc:`tensorflow` | details on what’s supported and | | - :doc:`pytorch` | how to modify your training | | - :doc:`mxnet` | script if applicable. | | - :doc:`xgboost` | | +-----------------------------------+-----------------------------------+ -| :doc:`api` | Full description of our APIs | +| :doc:`hook-api` | Full description of our APIs | | | on saving tensors. | +-----------------------------------+-----------------------------------+ -| :doc:`analysis` | For description of the programming| -| | model provided by the APIs that | -| | enable you to perform interactive | -| | exploration of tensors saved, | -| | as well as to write your own | -| | rules monitoring your training | -| | jobs. | -+-----------------------------------+-----------------------------------+ License ------- diff --git a/docs/collections.rst b/docs/collections.rst index 9f32e51cd..73132a940 100644 --- a/docs/collections.rst +++ b/docs/collections.rst @@ -324,5 +324,3 @@ between them. global.save_steps global.start_step global.end_step - --------------- diff --git a/docs/hook-methods.rst b/docs/hook-methods.rst index a34057d8c..0da7fe161 100644 --- a/docs/hook-methods.rst +++ b/docs/hook-methods.rst @@ -21,7 +21,7 @@ These methods are common for all hooks in any framework. In case of TensorFlow, you need to pick the appropriate ``HookClass`` among ``KerasHook``, ``SessionHook``, and ``EstimatorHook`` depending on how your training script - is composed. For more information, see :doc:`TensorFlow Hook `. + is composed. For more information, see :doc:`tensorflow`. - **PyTorch** @@ -57,17 +57,6 @@ These methods are common for all hooks in any framework. - collection (smd.Collection) -.. method:: add_collection(collection) - - Takes a Collection object and adds it to the CollectionManager that the - Hook holds. Note that you should only pass in a Collection object for the - same framework as the hook - - **Parameters:** - - - collection (smd.Collection) - - .. method:: get_collection(name) Returns collection identified by the given name @@ -85,7 +74,7 @@ These methods are common for all hooks in any framework. Sets mode of the job. ``smd.modes.TRAIN``, ``smd.modes.EVAL``, ``smd.modes.PREDICT``, ``smd.modes.GLOBAL``. - For more information, see :doc:`Modes `__. + For more information, see :doc:`smdebug.modes`. **Parameters:** @@ -141,7 +130,7 @@ TensorFlow specific Hook API Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHook and KerasHook based on the TensorFlow interface being used -for training. :doc:`TensorFlow ` shows examples of each of +for training. :doc:`tensorflow` shows examples of each of these. +-----------------+-----------------+-----------------+-----------------+ @@ -195,18 +184,18 @@ these. | ``add_to_ | ``collecti | ``None`` | Calls the | | collection(``\ | on_name (str)`` | | ``add`` method | | ``collection_na | : name of the | | of a collection | -| me, variable)`` | collection to | | object. See | -| | add to. | | `this | -| | ``variable`` | | section < | -| | parameter to | | #collection>`__ | -| | pass to the | | for more. | +| me, variable)`` | collection to | | object. | +| | add to. | | | +| | ``variable`` | | | +| | parameter to | | | +| | pass to the | | | | | collection’s | | | | | ``add`` method. 
| | | +-----------------+-----------------+-----------------+-----------------+ The following hook APIs are specific to training scripts using the TF 2.x GradientTape -(`Example `__): +(:doc:`tensorflow`): +-----------------+-----------------+-----------------+-----------------+ | Method | Arguments | Returns | Behavior | @@ -272,5 +261,3 @@ PyTorch specific Hook API | | | Debugger can save | | | | losses | +-----------------------+-----------------------+-----------------------+ - --------------- diff --git a/docs/release-notes.rst b/docs/release-notes.rst index 24ce67586..200d730ac 100644 --- a/docs/release-notes.rst +++ b/docs/release-notes.rst @@ -1,10 +1,9 @@ Release Notes ============= ----- SMDebug Library 1.0.10 Release Notes ------------------------------------ +------------------------------------ Date: June 10. 2021 diff --git a/docs/sagemaker.rst b/docs/sagemaker.rst deleted file mode 100644 index 5d092c948..000000000 --- a/docs/sagemaker.rst +++ /dev/null @@ -1,357 +0,0 @@ -Running SageMaker jobs with Amazon SageMaker Debugger -===================================================== - -Outline -------- - -- `Configuring SageMaker Debugger <#configuring-sagemaker-debugger>`__ - - - `Saving data <#saving-data>`__ - - - `Saving built-in collections that we - manage <#saving-built-in-collections-that-we-manage>`__ - - `Saving reductions for a custom - collection <#saving-reductions-for-a-custom-collection>`__ - - `Enabling TensorBoard - summaries <#enabling-tensorboard-summaries>`__ - - - `Rules <#rules>`__ - - - `Built In Rules <#built-in-rules>`__ - - `Custom Rules <#custom-rules>`__ - -- `Interactive Exploration <#interactive-exploration>`__ -- `SageMaker Studio <#sagemaker-studio>`__ -- `TensorBoard Visualization <#tensorboard-visualization>`__ -- `Example Notebooks <#example-notebooks>`__ - -Configuring SageMaker Debugger -============================== - -Regardless of which of the two above ways you have enabled SageMaker -Debugger, you can configure it using the SageMaker python SDK. There are -two aspects to this configuration. - You can specify what tensors to be -saved, when they should be saved and in what form they should be saved. -- You can specify which Rule you want to monitor your training job with. -This can be either a built in rule that SageMaker provides, or a custom -rule that you can write yourself. - -Saving Data ------------ - -SageMaker Debugger gives you a powerful and flexible API to save the -tensors you choose at the frequencies you want. These configurations are -made available in the SageMaker Python SDK through the -``DebuggerHookConfig`` class. - -Saving built-in collections that we manage -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Learn more about these built in collections `here `__. - -.. 
code:: python - - from sagemaker.debugger import DebuggerHookConfig, CollectionConfig - hook_config = DebuggerHookConfig( - s3_output_path='s3://smdebug-dev-demo-pdx/mnist', - hook_parameters={ - "save_interval": 100 - }, - collection_configs=[ - CollectionConfig("weights"), - CollectionConfig("gradients"), - CollectionConfig("losses"), - CollectionConfig( - name="biases", - parameters={ - "save_interval": 10, - "end_step": 500 - } - ), - ] - ) - import sagemaker as sm - sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - debugger_hook_config=hook_config - ) - sagemaker_estimator.fit() - -Saving reductions for a custom collection -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can define your collection of tensors. You can also choose to save -certain reductions of tensors only instead of saving the full tensor. -You may choose to do this to reduce the amount of data saved. Please -note that when you save reductions, unless you pass the flag -``save_raw_tensor``, only these reductions will be available for -analysis. The raw tensor will not be saved. - -.. code:: python - - from sagemaker.debugger import DebuggerHookConfig, CollectionConfig - hook_config = DebuggerHookConfig( - s3_output_path='s3://smdebug-dev-demo-pdx/mnist', - collection_configs=[ - CollectionConfig( - name="activations", - parameters={ - "include_regex": "relu|tanh", - "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max" - }) - ] - ) - import sagemaker as sm - sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - debugger_hook_config=hook_config - ) - sagemaker_estimator.fit() - -Enabling TensorBoard summaries -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -SageMaker Debugger can automatically generate tensorboard scalar -summaries, distributions and histograms for tensors saved. This can be -enabled by passing a ``TensorBoardOutputConfig`` object when creating an -Estimator as follows. You can also choose to disable or enable -histograms specifically for different collections. By default a -collection has ``save_histogram`` flag set to True. Note that scalar -summaries are added to TensorBoard for all ``ScalarCollections`` and any -scalar saved through ``hook.save_scalar``. Refer `API `__ for -more details on scalar collections and ``save_scalar`` method. - -The below example saves weights and gradients as full tensors, and also -saves the gradients as histograms and distributions to visualize in -TensorBoard. These will be saved to the location passed in -``TensorBoardOutputConfig`` object. - -.. 
code:: python - - from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig - hook_config = DebuggerHookConfig( - s3_output_path='s3://smdebug-dev-demo-pdx/mnist', - collection_configs=[ - CollectionConfig( - name="weights", - parameters={"save_histogram": False}), - CollectionConfig(name="gradients"), - ] - ) - - tb_config = TensorBoardOutputConfig('s3://smdebug-dev-demo-pdx/mnist/tensorboard') - - import sagemaker as sm - sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - debugger_hook_config=hook_config, - tensorboard_output_config=tb_config - ) - sagemaker_estimator.fit() - -For more details, refer our `API page `__. - -Rules ------ - -Here are some examples on how to run Rules with your training jobs. - -Note that passing a ``CollectionConfig`` object to the Rule as -``collections_to_save`` is equivalent to passing it to the -``DebuggerHookConfig`` object as ``collection_configs``. This is just a -shortcut for your convenience. - -Built-in Rules -~~~~~~~~~~~~~~ - -To find a full list of built-in rules that you can use with the -SageMaker Python SDK, see the `List of Debugger Built-in -Rules `__ -page. - -Running built-in SageMaker Rules -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can run a SageMaker built-in Rule as follows using the -``Rule.sagemaker`` method. The first argument to this method is the base -configuration that is associated with the Rule. We configure them as -much as possible. You can take a look at the ruleconfigs that we -populate for all built-in rules -`here `__. -You can choose to customize these parameters using the other parameters. - -These rules are run on our pre-built Docker images which are listed -`here `__. -You are not charged for the instances when running SageMaker built-in -rules. - -A list of all our built-in rules are provided -`below <#built-in-rules>`__. - -.. code:: python - - from sagemaker.debugger import Rule, CollectionConfig, rule_configs - - exploding_tensor_rule = Rule.sagemaker( - base_config=rule_configs.exploding_tensor(), - rule_parameters={"collection_names": "weights,losses"}, - collections_to_save=[ - CollectionConfig("weights"), - CollectionConfig("losses") - ] - ) - - vanishing_gradient_rule = Rule.sagemaker( - base_config=rule_configs.vanishing_gradient() - ) - - import sagemaker as sm - sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - rules=[exploding_tensor_rule, vanishing_gradient_rule] - ) - sagemaker_estimator.fit() - -Custom Rules -~~~~~~~~~~~~ - -You can write your own rule custom made for your application and provide -it, so SageMaker can monitor your training job using your rule. To do -so, you need to understand the programming model that ``smdebug`` -provides. Our page on `Programming Model for Analysis `__ -describes the APIs that we provide to help you write your own rule. -Please refer to `this example -notebook `__ -for a demonstration of creating your custom rule and running it on -SageMaker. 
- -Running custom Rules -~~~~~~~~~~~~~~~~~~~~ - -To run a custom rule, you have to provide a few additional parameters. -Key parameters to note are a file which has the implementation of your -Rule class ``source``, the name of the Rule class (``rule_to_invoke``), -the type of instance to run the Rule job on (``instance_type``), the -size of the volume on that instance (``volume_size_in_gb``), and the -docker image to use for running this job (``image_uri``). - -Please refer to the documentation -`here `__ -for more details. - -We have pre-built Docker images that you can use to run your custom -rules. These are listed -`here `__. -You can also choose to build your own Docker image for custom rule -evaluation. Please refer to the repository `SageMaker Debugger Rules -Container `__ -for instructions on how to build such an image. - -.. code:: python - - from sagemaker.debugger import Rule, CollectionConfig - - custom_coll = CollectionConfig( - name="relu_activations", - parameters={ - "include_regex": "relu", - "save_interval": 500, - "end_step": 5000 - }) - improper_activation_rule = Rule.custom( - name='improper_activation_job', - image_uri='552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest', - instance_type='ml.c4.xlarge', - volume_size_in_gb=400, - source='rules/custom_rules.py', - rule_to_invoke='ImproperActivation', - rule_parameters={"collection_names": "relu_activations"}, - collections_to_save=[custom_coll] - ) - - import sagemaker as sm - sagemaker_estimator = sm.tensorflow.TensorFlow( - entry_point='src/mnist.py', - role=sm.get_execution_role(), - base_job_name='smdebug-demo-job', - train_instance_count=1, - train_instance_type="ml.m4.xlarge", - framework_version="1.15", - py_version="py3", - # smdebug-specific arguments below - rules=[improper_activation_rule], - ) - sagemaker_estimator.fit() - -For more details, refer our `Analysis page `__. - -Interactive Exploration -======================= - -``smdebug`` SDK also allows you perform interactive and real-time -exploration of the data saved. You can choose to inspect the tensors -saved, or visualize them through your custom plots. You can retrieve -these tensors as numpy arrays allowing you to use your favorite analysis -libraries right in a SageMaker notebook instance. We have couple of -example notebooks demonstrating this. - `Real-time anaysis in a notebook -during -training `__ -- `Interactive tensor analysis in a -notebook `__ - -SageMaker Studio -================ - -SageMaker Debugger is on by default for supported training jobs on the -official SageMaker Framework containers (or AWS Deep Learning -Containers) during SageMaker training jobs. In this default scenario, -SageMaker Debugger takes the losses and metrics from your training job -and publishes them to SageMaker Metrics, allowing you to track these -metrics in SageMaker Studio. You can also see the status of Rules you -have enabled for your training job right in the Studio. -`Here `__ -are screenshots of that experience. - -TensorBoard Visualization -========================= - -If you have enabled TensorBoard outputs for your training job through -SageMaker Debugger, TensorBoard artifacts will automatically be -generated for the tensors saved. You can then point your TensorBoard -instance to that S3 location and review the visualizations for the -tensors saved. - -Example Notebooks -================= - -We have a bunch of `example -notebooks `__ -here demonstrating different aspects of SageMaker Debugger. 
diff --git a/docs/smdebug.analysis.rst b/docs/smdebug.analysis.rst deleted file mode 100644 index df2ce6f24..000000000 --- a/docs/smdebug.analysis.rst +++ /dev/null @@ -1,46 +0,0 @@ -smdebug.analysis -================ - -smdebug.analysis.utils module ------------------------------ - -.. automodule:: smdebug.analysis.utils - :members: - :undoc-members: - :show-inheritance: - -Utils ------ - -Enable or disable refresh of tensors in a trial -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -By default smdebug refreshes tensors each time you try to query the -tensor. It looks for whether this tensor is saved for new steps and if -so fetches them. If you know the saved data will not change (stopped the -machine learning job), or are not interested in the latest data, you can -stop the refreshing of tensors as follows: - -``no_refresh`` takes a trial or a list of trials, which should not be -refreshed. Anything executed inside the with ``no_refresh`` block will -not be refreshed. - -.. code:: python - - from smdebug.analysis.utils import no_refresh - with no_refresh(trials): - pass - -Similarly if you want to refresh tensors only within a block, you can -do: - -.. code:: python - - from smdebug.analysis.utils import refresh - with refresh(trials): - pass - -During rule invocation smdebug waits till the current step is available -and then turns off refresh to ensure that you do not get different -results for methods like ``trial.tensor(name).steps()`` and run into -subtle issues. diff --git a/docs/smdebug.profiler.rst b/docs/smdebug.profiler.rst deleted file mode 100644 index b94f5c6d1..000000000 --- a/docs/smdebug.profiler.rst +++ /dev/null @@ -1,18 +0,0 @@ -smdebug.profiler -================ - -smdebug.profiler.tf\_profiler\_parser module --------------------------------------------- - -.. automodule:: smdebug.profiler.tf_profiler_parser - :members: - :undoc-members: - :show-inheritance: - -smdebug.profiler.trace\_event\_file\_parser module --------------------------------------------------- - -.. automodule:: smdebug.profiler.trace_event_file_parser - :members: - :undoc-members: - :show-inheritance: diff --git a/docs/tensor-api.rst b/docs/tensor-api.rst index 81f6b7199..21467bea4 100644 --- a/docs/tensor-api.rst +++ b/docs/tensor-api.rst @@ -256,15 +256,15 @@ representing name. It provides the following methods. **Parameters:** - - ``step_num (int)`` The step number whose value is to be returned for - the mode passed through the next parameter. - - ``mode (smdebug.modes enum value)`` The mode applicable for the step - number passed above. Defaults to ``modes.GLOBAL`` + - ``step_num (int)`` The step number whose value is to be returned for + the mode passed through the next parameter. + - ``mode (smdebug.modes enum value)`` The mode applicable for the step + number passed above. Defaults to ``modes.GLOBAL`` **Returns:** - ``list[str]`` A list of worker names for which the tensor was saved at - the given step. + ``list[str]`` A list of worker names for which the tensor was saved at + the given step. .. method:: output_tensor.prev_steps(step, n, mode=modes.GLOBAL) @@ -272,15 +272,15 @@ representing name. It provides the following methods. **Parameters:** - - ``step (int)`` The step number whose value is to be returned for the - mode passed. - - ``n (int)`` Number of previous steps to return - - ``mode (smdebug.modes enum value)`` The mode applicable for the step + - ``step (int)`` The step number whose value is to be returned for the + mode passed. 
+ - ``n (int)`` Number of previous steps to return + - ``mode (smdebug.modes enum value)`` The mode applicable for the step number passed above. Defaults to ``modes.GLOBAL`` **Returns:** - ``list[int]`` A list of size at most n representing the previous steps - for the given step and mode. Note that this list can be of size less - than n if there were only less than n steps saved before the given step - in this trial. + ``list[int]`` A list of size at most n representing the previous steps + for the given step and mode. Note that this list can be of size less + than n if there were only less than n steps saved before the given step + in this trial. diff --git a/docs/xgboost.rst b/docs/xgboost.rst index 84c9cda3e..9a6aeccde 100644 --- a/docs/xgboost.rst +++ b/docs/xgboost.rst @@ -106,6 +106,7 @@ Full API Initializes the hook. Pass this object as a callback to ``xgboost.train()``. + - ``out_dir`` (str): A path into which tensors and metadata will be written. - ``export_tensorboard`` (bool): Whether to diff --git a/smdebug/profiler/analysis/python_profile_analysis.py b/smdebug/profiler/analysis/python_profile_analysis.py index d775e5ede..8180f9eac 100644 --- a/smdebug/profiler/analysis/python_profile_analysis.py +++ b/smdebug/profiler/analysis/python_profile_analysis.py @@ -40,14 +40,11 @@ def __init__(self, local_profile_dir="/tmp/python_stats", s3_path=None): Otherwise, LocalPythonStatsReader is used and local_profile_dir represents the path to the stats directory, which already holds the stats. - ... - - Attributes - ---------- - python_stats_reader: PythonStatsReader - The reader to use for loading the python stats. - python_profile_stats: list of StepPythonProfileStats - List of stats for each step profiled. + Args: + python_stats_reader: PythonStatsReader + The reader to use for loading the python stats. + python_profile_stats: list of StepPythonProfileStats + List of stats for each step profiled. """ self.python_stats_reader = ( @@ -94,6 +91,7 @@ def fetch_profile_stats_by_step( refresh_stats=True, ): """API function to fetch stats based on step interval. + """ self._refresh_python_profile_stats(refresh_stats) @@ -116,6 +114,7 @@ def fetch_profile_stats_by_time( refresh_stats=True, ): """API function to fetch stats based on time interval. + """ self._refresh_python_profile_stats(refresh_stats) start_time_since_epoch_in_micros = start_time_since_epoch_in_secs * CONVERT_TO_MICROSECS @@ -133,6 +132,7 @@ def fetch_profile_stats_between_modes( self, start_mode, end_mode, node_id="any", refresh_stats=True ): """API function that fetches stats with the provided start and end mode. + """ self._refresh_python_profile_stats(refresh_stats) requested_stats = [ @@ -169,11 +169,12 @@ def list_profile_stats(self, refresh_stats=True): each instance of profiling and the corresponding stats file (one per step). The columns of this DataFrame include: + - profiler_name: The name of the profiler used to generate this stats file, cProfile or pyinstrument - framework: The machine learning framework used in training. - start_time_since_epoch_in_micros: The UTC time (in microseconds) at which profiling started for this step. - end_time_since_epoch_in_micros: The UTC time (in microseconds) at which profiling finished for this step. - = node_id The node ID of the node used in the session. + - node_id The node ID of the node used in the session. - start_phase The phase at which python profiling was started. - start_step: The step at which python profiling was started. -1 if before step 0. 
- end_phase The phase at which python profiling was stopped. diff --git a/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py b/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py index 83585460a..2c1e33aa7 100644 --- a/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py +++ b/smdebug/profiler/analysis/utils/profiler_data_to_pandas.py @@ -257,17 +257,19 @@ def get_profiler_data_by_time( get_framework_metrics=True, get_system_metrics=True, ): - """ - Get metrics data within a time interval. - :param start_time_us: Start of the interval in microseconds - :param end_time_us: End of the interval in microseconds - :param cache_metrics: If True, collect and return all metrics requested so far, else, - :param framework_metrics_list: list of framework metrics. If not empty, function will only return framework events that are part of this list. - :param selected_system_metrics: list of system metrics. If not empty, function will only return system events that are part of this list. - :param selected_framework_metrics: if True, get framework metrics - :param get_system_metrics: if True: get system metrics - return current request - :return: System metrics DataFrame, Framework metrics DataFrame + """Get metrics data within a time interval. + + Args: + start_time_us: Start of the interval in microseconds + end_time_us: End of the interval in microseconds + cache_metrics: If True, collect and return all metrics requested so far, else, + framework_metrics_list: list of framework metrics. If not empty, function will only return framework events that are part of this list. + selected_system_metrics: list of system metrics. If not empty, function will only return system events that are part of this list. + selected_framework_metrics: if True, get framework metrics + get_system_metrics: if True: get system metrics + + Returns: + System metrics DataFrame, Framework metrics DataFrame """ # read system metrics diff --git a/smdebug/profiler/analysis/utils/python_profile_analysis_utils.py b/smdebug/profiler/analysis/utils/python_profile_analysis_utils.py index f14524202..2a6098503 100644 --- a/smdebug/profiler/analysis/utils/python_profile_analysis_utils.py +++ b/smdebug/profiler/analysis/utils/python_profile_analysis_utils.py @@ -110,13 +110,17 @@ def in_time_interval(self, start_time_since_epoch_in_micros, end_time_since_epoc ) def in_step_interval(self, start_step, end_step, start_phase, end_phase): - """Returns whether this is in the provided step interval. This is defined as: - 1. This start step is greater than the provided start step and the end step is greater than the provided end - step. - 2. If this start step equals the provided start step, verify that this start phase does not occur before the - provided start phase. - 3. If this end step equals the provided end step, verify that this end phase does not occur after the provided - end phase. + """Returns whether this is in the provided step interval. + + This is defined as: + + 1. This start step is greater than the provided start step and the end step is greater than the provided end + step. + 2. If this start step equals the provided start step, verify that this start phase does not occur before the + provided start phase. + 3. If this end step equals the provided end step, verify that this end phase does not occur after the provided + end phase. 
+ """ if start_step < self.start_step and end_step > self.end_step: return True @@ -163,6 +167,7 @@ class cProfileStats: The cProfile stats of Python functions as a list of cProfileFunctionStats objects, which contain specific metrics corresponding to each function profiled. Parsed from the pStats.Stats object. Useful for more in depth analysis as it allows users physical access to the metrics for each function. + """ def __init__(self, ps): @@ -171,13 +176,21 @@ def __init__(self, ps): def print_top_n_functions(self, by, n=10): """Print the stats for the top n functions with respect to the provided metric. - :param by The metric to sort the functions by. Must be one of the following from the Metrics enum: TOTAL_TIME, - CUMULATIVE_TIME, PRIMITIVE_CALLS, TOTAL_CALLS. - :param n The first n functions and stats to print after sorting. - For example, to print the top 20 functions with respect to cumulative time spent in function: - >>> from smdebug.profiler.analysis.utils.python_profile_analysis_utils import Metrics - >>> cprofile_stats.print_top_n_function(self, Metrics.CUMULATIVE_TIME, n=20) + Args: + by (Metrics enum): + The metric to sort the functions by. Must be one of the following from the Metrics enum: TOTAL_TIME, + CUMULATIVE_TIME, PRIMITIVE_CALLS, TOTAL_CALLS. + n (int): + The first n functions and stats to print after sorting. + + For example, to print the top 20 functions with respect to cumulative time spent in function + + .. code :: python + + from smdebug.profiler.analysis.utils.python_profile_analysis_utils import Metrics + cprofile_stats.print_top_n_function(self, Metrics.CUMULATIVE_TIME, n=20) + """ assert isinstance(by, Metrics), "by must be valid metric from Metrics!" assert isinstance(n, int), "n must be an integer!" @@ -185,6 +198,7 @@ def print_top_n_functions(self, by, n=10): def get_function_stats(self): """Return the function stats list as a DataFrame, where each row represents a cProfileFunctionStats object. + """ return pd.DataFrame([repr(function_stats) for function_stats in self.function_stats_list]) @@ -194,23 +208,23 @@ class cProfileFunctionStats: Processes the stats dictionary's (key, value) pair to get the function name and the specific stats. Key is a tuple of (filename, lineno, function). Value is a tuple of (prim_calls, total_calls, total_time, cumulative_time, callers). See below for details. - ... - Attributes - ---------- - function_name: str - The full function name, derived from the key tuple. Defined as filename:lineno(function). - prim_calls: int - The number of primitive (non-recursive) calls to this function. - total_calls: int - The total number of calls to this function. - total_time: int - The total amount of time spent in the scope of this function alone, in seconds. - cumulative_time: int - The total amount of time spent in the scope of this function and in the scope of all other functions - that this function calls, in seconds. - callers: list of str - The list of functions that call this function. Organized as a list of function names, which follow the above - format for function_name: filename:lineno(function) + + Args: + function_name (str): + The full function name, derived from the key tuple. Defined as filename:lineno(function). + prim_calls (int): + The number of primitive (non-recursive) calls to this function. + total_calls (int): + The total number of calls to this function. + total_time (int): + The total amount of time spent in the scope of this function alone, in seconds. 
+ cumulative_time (int): + The total amount of time spent in the scope of this function and in the scope of all other functions + that this function calls, in seconds. + callers (list of str): + The list of functions that call this function. Organized as a list of function names, which follow the above + format for function_name: filename:lineno(function) + """ def __init__(self, key, value): From c6a94ea1a6c9ff84d7d8024a16a8dc008e895eb1 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 23 Jun 2021 16:09:34 -0700 Subject: [PATCH 29/39] rm pip protobuf --- docs/environment.yml | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/environment.yml b/docs/environment.yml index f4df9325f..9428509ef 100644 --- a/docs/environment.yml +++ b/docs/environment.yml @@ -10,5 +10,4 @@ dependencies: - pandas - pip - pip: - - protobuf3 - smdebug From 5d328646d21650b5790b7614627d056f79b84428 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Thu, 24 Jun 2021 17:44:35 -0700 Subject: [PATCH 30/39] test: try /usr/local dir --- setup.py | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/setup.py b/setup.py index a9fae46e3..acc72b9fa 100644 --- a/setup.py +++ b/setup.py @@ -87,12 +87,8 @@ def build_package(version): os.system( "curl -OL https://github.com/google/protobuf/releases/download/v3.7.1/protoc-3.7.1-linux-x86_64.zip" ) - os.system( - "unzip -o protoc-3.7.1-linux-x86_64.zip -d /home/docs/checkouts/readthedocs.org/user_builds/local bin/protoc" - ) - os.system( - "unzip -o protoc-3.7.1-linux-x86_64.zip -d /home/docs/checkouts/readthedocs.org/user_builds/local include/*" - ) + os.system("unzip -o protoc-3.7.1-linux-x86_64.zip -d /usr/local bin/protoc") + os.system("unzip -o protoc-3.7.1-linux-x86_64.zip -d /usr/local include/*") os.system("rm -f protoc-3.7.1-linux-x86_64.zip") From 248de9eaa3091ab8d958c94d7364c388c5a69382 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 30 Jun 2021 16:12:19 -0700 Subject: [PATCH 31/39] incorp comments --- README.md | 50 +++++++++++++++++++------------ docs/README.rst | 71 ++++++++++++++++++++------------------------- docs/tensorflow.rst | 2 +- 3 files changed, 63 insertions(+), 60 deletions(-) diff --git a/README.md b/README.md index 950ce374e..3cea0f6d5 100644 --- a/README.md +++ b/README.md @@ -57,33 +57,45 @@ pip install smdebug For a complete overview of Amazon SageMaker Debugger to learn how it works, go to the [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) developer guide. ### AWS Deep Learning Containers with zero code change -Debugger is installed by default in AWS Deep Learning Containers with TensorFlow, PyTorch, MXNet, and XGBoost. The following framework containers enable you to use Debugger with no changes to your training script, by automatically adding [SageMaker Debugger's Hook](docs/api.md#glossary). -The following frameworks are available AWS Deep Learning Containers with the deep learning frameworks for the zero script change experience. +Debugger is installed by default in AWS Deep Learning Containers +(TensorFlow, PyTorch, MXNet) and the SageMaker XGBoost containers. The +training containers are bundled and tested for integration with the +SMDebug library the entire SageMaker platform. 
-| Framework | Version | -| --- | --- | -| [TensorFlow](tensorflow.md) | 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 | -| [MXNet](mxnet.md) | 1.6, 1.7 | -| [PyTorch](pytorch.md) | 1.4, 1.5, 1.6 | -| [XGBoost](xgboost.md) | 0.90-2, 1.0-1 ([As a built-in algorithm](docs/xgboost.md#use-xgboost-as-a-built-in-algorithm))| +To find a complete list of available Deep Learning Containers, See +[General Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) in the AWS Deep Learning Container +repository. -**Note**: Debugger with zero script change is partially available for TensorFlow v2.1.0. The `inputs`, `outputs`, `gradients`, and `layers` built-in collections are currently not available for these TensorFlow versions. +This enables you to use Debugger with no changes to your training +script, by automatically adding `hook-api`. -### AWS training containers with script mode +The following frameworks are available AWS Deep Learning Containers with +the deep learning frameworks for the zero script change experience. -The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script. +### Frameworks supported by the SMDebug library -| Framework | Versions | -| --- | --- | -| [TensorFlow](tensorflow.md) | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 | -| Keras (with TensorFlow backend) | 2.3 | -| [MXNet](mxnet.md) | 1.4, 1.5, 1.6, 1.7 | -| [PyTorch](pytorch.md) | 1.2, 1.3, 1.4, 1.5, 1.6 | -| [XGBoost](xgboost.md) | 0.90-2, 1.0-1 (As a framework)| +The SMDebug library supports machine learning frameworks for SageMaker +training jobs with script mode and custom training containers. If you +want to use SageMaker Debugger with one of the following framework +versions, you need to make minimal changes to your training script using +the SMDebug library. + +| Framework | Versions | +|---------------------------------|------------------------------------------------------------| +| `tensorflow` | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.4.1, 2.5.0 | +| Keras (with TensorFlow backend) | 2.3 | +| `mxnet` | 1.4, 1.5, 1.6, 1.7, 1.8 | +| `pytorch` | 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 1.9 | +| `xgboost` | 0.90-2, 1.0-1, 1.2-1 (As a framework) | ### Debugger on custom containers or local machines -You can also fully use the Debugger features in custom containers with the SageMaker Python SDK. Furthermore, `smdebug` is an open source library, so you can install it on your local machine for any advanced use cases that cannot be run in the SageMaker environment and for constructing `smdebug` custom hooks and rules. + +You can also fully use the Debugger features in custom containers with +the SageMaker Python SDK. Furthermore, `smdebug` is an open source +library, so you can install it on your local machine for any advanced +use cases that cannot be run in the SageMaker environment and for +constructing `smdebug` custom hooks and rules. --- diff --git a/docs/README.rst b/docs/README.rst index f3e72c9b5..6133aafd5 100644 --- a/docs/README.rst +++ b/docs/README.rst @@ -79,62 +79,49 @@ Debugger-supported Frameworks ----------------------------- For a complete overview of Amazon SageMaker Debugger to learn how it -works, go to the `Use Debugger in AWS -Containers `__ +works, see the `Amazon SageMaker Debugger `__ developer guide. 
AWS Deep Learning Containers with zero code change
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Debugger is installed by default in AWS Deep Learning Containers with
-TensorFlow, PyTorch, MXNet, and XGBoost. The following framework
-containers enable you to use Debugger with no changes to your training
-script, by automatically adding `SageMaker Debugger’s
-Hook `__.
+Debugger is installed by default in AWS Deep Learning Containers
+(TensorFlow, PyTorch, MXNet) and the SageMaker XGBoost containers.
+The training containers are bundled and tested for integration
+with the SMDebug library and the entire SageMaker platform.
+
+To find a complete list of available Deep Learning Containers, see
+`General Framework Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers>`__
+in the AWS Deep Learning Containers repository.
+
+This enables you to use Debugger with no changes to your training
+script, by automatically adding :doc:`hook-api`.
 
 The following frameworks are available AWS Deep Learning Containers
 with the deep learning frameworks for the zero script change
 experience.
 
-+--------------------------------+------------------------------------+
-| Framework | Version |
-+================================+====================================+
-| `TensorFlow `__ | 1.15, 2.1.0, 2.2.0, 2.3.0, 2.3.1 |
-+--------------------------------+------------------------------------+
-| `MXNet `__ | 1.6, 1.7 |
-+--------------------------------+------------------------------------+
-| `PyTorch `__ | 1.4, 1.5, 1.6 |
-+--------------------------------+------------------------------------+
-| `XGBoost `__ | 0.90-2, 1.0-1 (`As a built-in |
-| | algorithm `__) |
-+--------------------------------+------------------------------------+
-
-**Note**: Debugger with zero script change is partially available for
-TensorFlow v2.1.0. The ``inputs``, ``outputs``, ``gradients``, and
-``layers`` built-in collections are currently not available for these
-TensorFlow versions.
-
-AWS training containers with script mode
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The ``smdebug`` library supports frameworks other than the ones listed
-above while using AWS containers with script mode. If you want to use
-SageMaker Debugger with one of the following framework versions, you
-need to make minimal changes to your training script.
+
+Frameworks supported by the SMDebug library
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The SMDebug library supports machine learning frameworks
+for SageMaker training jobs with script mode and custom training containers.
+If you want to use SageMaker Debugger with one of the following framework versions, you
+need to make minimal changes to your training script using the SMDebug library. 
+---------------------------------+-----------------------------------+ | Framework | Versions | +=================================+===================================+ -| `TensorFlow `__ | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, | -| | 2.3.0, 2.3.1 | +| :doc:`tensorflow` | 1.13, 1.14, 1.15, 2.1.0, 2.2.0, | +| | 2.3.0, 2.3.1, 2.4.1, 2.5.0 | +---------------------------------+-----------------------------------+ | Keras (with TensorFlow backend) | 2.3 | +---------------------------------+-----------------------------------+ -| `MXNet `__ | 1.4, 1.5, 1.6, 1.7 | +| :doc:`mxnet` | 1.4, 1.5, 1.6, 1.7, 1.8 | +---------------------------------+-----------------------------------+ -| `PyTorch `__ | 1.2, 1.3, 1.4, 1.5, 1.6 | +| :doc:`pytorch` | 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 1.9 | +---------------------------------+-----------------------------------+ -| `XGBoost `__ | 0.90-2, 1.0-1 (As a framework) | +| :doc:`xgboost` | 0.90-2, 1.0-1 (As a framework) | +---------------------------------+-----------------------------------+ Debugger on custom containers or local machines @@ -190,8 +177,12 @@ Debugger’s ``Hook``. For an example template of code to use Debugger on your own container in TensorFlow 2.x frameworks, see `Run Debugger in custom container <#Run-Debugger-in-custom-container>`__. See the following instruction pages to set up Debugger in your preferred -framework. - `TensorFlow `__ - `MXNet `__ - -`PyTorch `__ - `XGBoost `__ +framework. + +- :doc:`tensorflow` +- :doc:`mxnet` +- :doc:`pytorch` +- :doc:`xgboost` Using SageMaker Debugger on custom containers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/tensorflow.rst b/docs/tensorflow.rst index 43cb8557c..6efe03ea2 100644 --- a/docs/tensorflow.rst +++ b/docs/tensorflow.rst @@ -20,7 +20,7 @@ Contents - `References <#references>`__ SMDebug for TensorFlow ------------------------------- +---------------------- Amazon SageMaker Debugger and the ``smdebug`` client library fully support TensorFlow framework. From d11b76ef48bfe690723db5d60a95bf68a97c30fa Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 30 Jun 2021 21:10:48 -0700 Subject: [PATCH 32/39] Trigger Build for testing RTD PR builder From 2fe16db45207ab08df7df0cf790392fe1be971fa Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Sun, 25 Jul 2021 20:42:09 -0700 Subject: [PATCH 33/39] sync --- .github/PULL_REQUEST_TEMPLATE.md | 2 +- compile_protobuf.py | 91 +++++++++++++++++++ config/buildspec.yml | 1 - config/buildspec_build_wheel.yml | 1 - config/buildspec_tensorflow_2_3.yml | 1 - config/buildspec_tensorflow_2_4.yml | 1 - config/buildspec_vanilla_framework_tests.yml | 1 - config/buildspec_xgboost_1_2_1.yml | 1 - config/buildspec_zero_code_change.yml | 1 - config/profiler/requirements.txt | 20 ++-- .../run_profiler_integration_tests.sh | 3 + setup.py | 37 +------- smdebug/_version.py | 2 +- 13 files changed, 111 insertions(+), 51 deletions(-) create mode 100755 compile_protobuf.py diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 2b36e61c0..ecb0677bf 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -4,7 +4,7 @@ #### Style and formatting: -I have run `pre-commit install` to ensure that auto-formatting happens with every commit. +I have run `pre-commit install && pre-commit run --all-files` to ensure that auto-formatting happens with every commit. 
#### Issue number, if available diff --git a/compile_protobuf.py b/compile_protobuf.py new file mode 100755 index 000000000..2a875e9be --- /dev/null +++ b/compile_protobuf.py @@ -0,0 +1,91 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +# Standard Library +import glob +import logging +import os +import platform +import shutil +import sys +import tempfile +import time +import urllib +import urllib.request +from subprocess import check_call +from zipfile import ZipFile + + +def script_name() -> str: + """:returns: script name with leading paths removed""" + return os.path.split(sys.argv[0])[1] + + +def configure_logging(): + logging.getLogger().setLevel(logging.INFO) + logging.basicConfig(format="{}: %(asctime)sZ %(levelname)s %(message)s".format(script_name())) + logging.Formatter.converter = time.gmtime + + +def _get_system_details(): + return platform.system() + + +def _get_machine_details(): + return platform.machine() + + +def get_protoc_download_url(): + """ + Returns an archive with the binary protoc distro for the platform + """ + + system = _get_system_details() + machine = _get_machine_details() + if system == "Darwin": + archive_url = "https://github.com/protocolbuffers/protobuf/releases/download/v3.11.4/protoc-3.11.4-osx-x86_64.zip" + logging.info("Downloading protoc for Darwin: %s", archive_url) + elif system == "Linux": + archive_url = f"https://github.com/protocolbuffers/protobuf/releases/download/v3.11.4/protoc-3.11.4-linux-{machine}.zip" + logging.info("Downloading protoc for Linux: %s", archive_url) + else: + system = platform.system() + raise RuntimeError( + f"Could not find protoc for System: {system} Machine: {machine}.\ + Please install it manually by running sh protoc_downloader.sh" + ) + return archive_url + + +def get_protoc(): + """make sure protoc is available, otherwise download it and return a tuple with the protoc + binary and a temporary dir if it needed to be downloaded""" + if shutil.which("protoc"): + return shutil.which("protoc"), None + + archive_url = get_protoc_download_url() + (fname, headers) = urllib.request.urlretrieve(archive_url) + tmpdir = tempfile.mkdtemp(prefix="protoc_smdebug") + with ZipFile(fname, "r") as zipf: + zipf.extractall(tmpdir) + protoc_bin = os.path.join(tmpdir, "bin", "protoc") + + # Make the binary executable + os.chmod(protoc_bin, 0o755) + return protoc_bin, tmpdir + + +def compile_protobuf(): + """ + Compile protobuf files for smdebug + """ + logging.info("Compile protobuf") + logging.info("================") + (protoc_bin, tmpdir) = get_protoc() + cmd = [protoc_bin] + proto_files = glob.glob("smdebug/core/tfevent/proto/*.proto") + cmd.extend(proto_files) + cmd.append("--python_out=.") + logging.info("Call to protoc: %s", " ".join(cmd)) + check_call(cmd) + if tmpdir: + shutil.rmtree(tmpdir, ignore_errors=True) diff --git a/config/buildspec.yml b/config/buildspec.yml index 2c3bd3f63..0d32bc0aa 100755 --- a/config/buildspec.yml +++ b/config/buildspec.yml @@ -30,7 +30,6 @@ phases: - apt-get install sudo -qq -o=Dpkg::Use-Pty=0 # silence output: https://askubuntu.com/a/668859/724247 - sudo apt-get update -qq -o=Dpkg::Use-Pty=0 - sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0 - - cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh - pip install --upgrade pip==20.3.3 - pip install -q matplotlib==3.3.1 seaborn==0.10.1 nbconvert==5.6.1 papermill==2.1.2 flaky==3.7.0 beautifulsoup4==4.8.2 jupyter==1.0.0 scipy==1.5.2 scikit-learn==0.23.2 bokeh==2.2.3 simplejson==3.17.2 - if [ 
"$run_pytest_xgboost" = "enable" ]; then pip install --upgrade pyYaml==5.1; else pip install -q pyYaml; fi diff --git a/config/buildspec_build_wheel.yml b/config/buildspec_build_wheel.yml index 437978647..9119aafa1 100644 --- a/config/buildspec_build_wheel.yml +++ b/config/buildspec_build_wheel.yml @@ -14,7 +14,6 @@ phases: - apt-get install sudo -qq -o=Dpkg::Use-Pty=0 - sudo apt-get update -qq -o=Dpkg::Use-Pty=0 - sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0 - - cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh - pip install --upgrade pip==20.3.3 - pip install -q pytest==5.3.3 wheel pyYaml pytest-html pre-commit pytest-cov - pip uninstall -y boto3 && pip uninstall -y aiobotocore && pip uninstall -y botocore diff --git a/config/buildspec_tensorflow_2_3.yml b/config/buildspec_tensorflow_2_3.yml index 6e181d54c..8ef55695d 100644 --- a/config/buildspec_tensorflow_2_3.yml +++ b/config/buildspec_tensorflow_2_3.yml @@ -35,7 +35,6 @@ phases: - apt-get install sudo -qq -o=Dpkg::Use-Pty=0 # silence output: https://askubuntu.com/a/668859/724247 - sudo apt-get update -qq -o=Dpkg::Use-Pty=0 - sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0 - - cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh - pip install --upgrade pip==19.3.1 - pip install -q matplotlib==3.3.1 seaborn==0.10.1 nbconvert==5.6.1 papermill==2.1.2 flaky==3.7.0 beautifulsoup4==4.8.2 jupyter==1.0.0 scipy==1.5.2 scikit-learn==0.23.2 bokeh==2.2.3 simplejson==3.17.2 transformers==4.2.1 - pip install -q pytest wheel pytest-html pre-commit awscli pytest-cov diff --git a/config/buildspec_tensorflow_2_4.yml b/config/buildspec_tensorflow_2_4.yml index 6b1038cbc..58d436ae4 100644 --- a/config/buildspec_tensorflow_2_4.yml +++ b/config/buildspec_tensorflow_2_4.yml @@ -30,7 +30,6 @@ phases: - apt-get install sudo -qq -o=Dpkg::Use-Pty=0 # silence output: https://askubuntu.com/a/668859/724247 - sudo apt-get update -qq -o=Dpkg::Use-Pty=0 - sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0 - - cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh - pip install --upgrade pip==19.3.1 - pip install -q matplotlib==3.3.1 seaborn==0.10.1 nbconvert==5.6.1 papermill==2.1.2 flaky==3.7.0 beautifulsoup4==4.8.2 jupyter==1.0.0 scipy==1.5.2 scikit-learn==0.23.2 bokeh==2.2.3 simplejson==3.17.2 transformers==4.2.1 - if [ "$run_pytest_xgboost" = "enable" ]; then pip install --upgrade pyYaml==5.1; else pip install -q pyYaml; fi diff --git a/config/buildspec_vanilla_framework_tests.yml b/config/buildspec_vanilla_framework_tests.yml index b90b75154..831d741e9 100644 --- a/config/buildspec_vanilla_framework_tests.yml +++ b/config/buildspec_vanilla_framework_tests.yml @@ -18,7 +18,6 @@ phases: - apt-get update - apt-get install sudo -qq -o=Dpkg::Use-Pty=0 - sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0 - - cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh - pip install --upgrade pip==20.3.3 - pip install -q -U pytest pytest-cov wheel pyYaml pytest-html keras==2.3.1 mxnet==1.6.0 torch==1.8.0 xgboost==1.3.3 pre-commit tensorflow_datasets==4.0.1 torchvision - cd $CODEBUILD_SRC_DIR && chmod +x config/install_smdebug.sh && chmod +x config/check_smdebug_install.sh && ./config/install_smdebug.sh; diff --git a/config/buildspec_xgboost_1_2_1.yml b/config/buildspec_xgboost_1_2_1.yml index 3f0e57d4f..79f3f9949 100644 --- a/config/buildspec_xgboost_1_2_1.yml +++ b/config/buildspec_xgboost_1_2_1.yml @@ 
-33,7 +33,6 @@ phases: - apt-get install sudo -qq -o=Dpkg::Use-Pty=0 # silence output: https://askubuntu.com/a/668859/724247 - pip install awscli==1.19.5 - . config/change_branch.sh #EXPORTS BRANCHES FOR OTHER REPOS AND CURRENT REPO. - - cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh - pip install --upgrade pip==19.3.1 - pip install -q pytest==6.1.2 pytest-cov==2.10.1 wheel==0.35.1 pyYaml==5.3.1 pytest-html==3.0.0 sagemaker==2.16.3 pre-commit==2.6.0 flaky==3.7.0 - pip install -q matplotlib==3.3.1 && pip3 install seaborn==0.10.1 nbconvert==5.6.1 papermill==2.1.2 beautifulsoup4==4.8.2 jupyter==1.0.0 scipy==1.5.2 scikit-learn==0.23.2 bokeh==2.2.3 diff --git a/config/buildspec_zero_code_change.yml b/config/buildspec_zero_code_change.yml index 0f8e86975..f9447d634 100644 --- a/config/buildspec_zero_code_change.yml +++ b/config/buildspec_zero_code_change.yml @@ -26,7 +26,6 @@ phases: - apt-get update - apt-get install sudo -qq -o=Dpkg::Use-Pty=0 - sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0 - - cd $CODEBUILD_SRC_DIR && chmod +x config/protoc_downloader.sh && ./config/protoc_downloader.sh - pip install --upgrade pip==20.3.3 - pip install -q pytest wheel pyYaml pytest-html pre-commit awscli pytest-cov - cd $CODEBUILD_SRC_DIR && chmod +x config/install_smdebug.sh && chmod +x config/check_smdebug_install.sh && ./config/install_smdebug.sh; diff --git a/config/profiler/requirements.txt b/config/profiler/requirements.txt index 357419dc0..0900a0739 100644 --- a/config/profiler/requirements.txt +++ b/config/profiler/requirements.txt @@ -1,12 +1,12 @@ pip==20.3.3 -wheel==0.35.1 -pyYaml==5.4 -pytest-html==3.0.0 -sagemaker==2.23.0 -pre-commit==2.6.0 -awscli==1.18.203 -pytest==6.1.2 -pytest-cov==2.10.1 +awscli==1.19.102 +wheel==0.36.2 +pyYaml==5.4.1 +pytest-html==3.1.1 +sagemaker==2.47.1 +pre-commit==2.13.0 +pytest==6.2.4 +pytest-cov==2.12.1 flaky==3.7.0 -pytest-xdist==2.2.0 -pandas==1.1.5 +pytest-xdist==2.3.0 +pandas==1.2.5 diff --git a/config/profiler/run_profiler_integration_tests.sh b/config/profiler/run_profiler_integration_tests.sh index 063983560..0b3457d7f 100755 --- a/config/profiler/run_profiler_integration_tests.sh +++ b/config/profiler/run_profiler_integration_tests.sh @@ -54,6 +54,9 @@ apt-get install sudo -qq -o=Dpkg::Use-Pty=0 >/dev/null 2>/dev/null # mask output sudo apt-get install unzip -qq -o=Dpkg::Use-Pty=0 >/dev/null 2>/dev/null # mask output pip install -q -r config/profiler/requirements.txt >/dev/null 2>/dev/null # mask output +# Check for broken dependencies +pip check || exit 1 + cd $CODEBUILD_SRC_DIR chmod +x config/protoc_downloader.sh ./config/protoc_downloader.sh >/dev/null 2>/dev/null # mask output diff --git a/setup.py b/setup.py index acc72b9fa..1b782807a 100644 --- a/setup.py +++ b/setup.py @@ -1,4 +1,4 @@ -#!/usr/bin/env python +#!/usr/bin/env python3 """ Amazon SageMaker Debugger is an offering from AWS which helps you automate the debugging of machine learning training jobs. This library powers Amazon SageMaker Debugger, and helps you develop better, faster and cheaper models by catching common errors quickly. It allows you to save tensors from training jobs and makes these tensors available for analysis, all through a flexible and powerful API. 
@@ -14,10 +14,13 @@ # Standard Library import os +import shutil +import subprocess import sys from datetime import date # Third Party +import compile_protobuf import setuptools exec(open("smdebug/_version.py").read()) @@ -37,18 +40,8 @@ ] -def compile_summary_protobuf(): - proto_paths = ["smdebug/core/tfevent/proto"] - cmd = "set -ex && protoc " - for proto_path in proto_paths: - proto_files = os.path.join(proto_path, "*.proto") - cmd += proto_files + " " - print("compiling protobuf files in {}".format(proto_path)) - cmd += " --python_out=." - return os.system(cmd) - - def build_package(version): + compile_protobuf.compile_protobuf() packages = setuptools.find_packages(include=["smdebug", "smdebug.*"]) setuptools.setup( name="smdebug", @@ -76,27 +69,7 @@ def build_package(version): ) -if compile_summary_protobuf() != 0: - print( - "ERROR: Compiling summary protocol buffers failed. You will not be able to use smdebug. " - "Please make sure that you have installed protobuf3 compiler and runtime correctly." - ) - if docs_env == "False": - sys.exit(1) - else: - os.system( - "curl -OL https://github.com/google/protobuf/releases/download/v3.7.1/protoc-3.7.1-linux-x86_64.zip" - ) - os.system("unzip -o protoc-3.7.1-linux-x86_64.zip -d /usr/local bin/protoc") - os.system("unzip -o protoc-3.7.1-linux-x86_64.zip -d /usr/local include/*") - os.system("rm -f protoc-3.7.1-linux-x86_64.zip") - - def scan_git_secrets(): - import subprocess - import os - import shutil - def git(*args): return subprocess.call(["git"] + list(args)) diff --git a/smdebug/_version.py b/smdebug/_version.py index 9eb1ebec5..bd538f76e 100644 --- a/smdebug/_version.py +++ b/smdebug/_version.py @@ -1 +1 @@ -__version__ = "1.0.11" +__version__ = "1.0.12" From d847ffc4158dfaa4286b14466014b26541a93da7 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Sun, 25 Jul 2021 20:52:56 -0700 Subject: [PATCH 34/39] rm smdebug from env.yml --- docs/environment.yml | 3 --- 1 file changed, 3 deletions(-) diff --git a/docs/environment.yml b/docs/environment.yml index 9428509ef..190af4681 100644 --- a/docs/environment.yml +++ b/docs/environment.yml @@ -8,6 +8,3 @@ dependencies: - bokeh - ipython - pandas - - pip - - pip: - - smdebug From 9e8fac5c56309695913c9f4d03a50954b0fe76d7 Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Mon, 26 Jul 2021 11:05:50 -0700 Subject: [PATCH 35/39] Trigger Build From 7bed697aad359cca8ddc5006d0769c3775f6ee7c Mon Sep 17 00:00:00 2001 From: Miyoung Choi Date: Wed, 28 Jul 2021 09:08:48 -0700 Subject: [PATCH 36/39] Trigger Build From ccc802c0c19170f7b1142221ec6d5af797ab73f2 Mon Sep 17 00:00:00 2001 From: atqy <95724753+atqy@users.noreply.github.com> Date: Tue, 16 Aug 2022 10:26:28 -0700 Subject: [PATCH 37/39] Add unified RTD search to RTD website (#610) * add unified search to RTD website * configure rtd build environment to make it functional --- docs/_static/kendrasearchtools.js | 690 ++++++++++++++++++++++++++++ docs/_static/pagination.css | 17 + docs/_static/search_accessories.css | 29 ++ docs/_templates/search.html | 56 +++ docs/conf.py | 2 +- docs/environment.yml | 2 + docs/requirements.txt | 9 + 7 files changed, 804 insertions(+), 1 deletion(-) create mode 100644 docs/_static/kendrasearchtools.js create mode 100644 docs/_static/pagination.css create mode 100644 docs/_static/search_accessories.css create mode 100644 docs/_templates/search.html create mode 100644 docs/requirements.txt diff --git a/docs/_static/kendrasearchtools.js b/docs/_static/kendrasearchtools.js new file mode 100644 index 000000000..73d0abb1b --- 
/dev/null +++ b/docs/_static/kendrasearchtools.js @@ -0,0 +1,690 @@ +/* + * searchtools.js + * ~~~~~~~~~~~~~~~~ + * + * Sphinx JavaScript utilities for the full-text search. + * + * :copyright: Copyright 2007-2021 by the Sphinx team, see AUTHORS. + * :license: BSD, see LICENSE for details. + * + */ + +if (!Scorer) { + /** + * Simple result scoring code. + */ + var Scorer = { + // Implement the following function to further tweak the score for each result + // The function takes a result array [filename, title, anchor, descr, score] + // and returns the new score. + /* + score: function(result) { + return result[4]; + }, + */ + + // query matches the full name of an object + objNameMatch: 11, + // or matches in the last dotted part of the object name + objPartialMatch: 6, + // Additive scores depending on the priority of the object + objPrio: {0: 15, // used to be importantResults + 1: 5, // used to be objectResults + 2: -5}, // used to be unimportantResults + // Used when the priority is not in the mapping. + objPrioDefault: 0, + + // query found in title + title: 15, + partialTitle: 7, + // query found in terms + term: 5, + partialTerm: 2 + }; +} + +if (!splitQuery) { + function splitQuery(query) { + return query.split(/\s+/); + } +} + +/** + * default rtd search (used as fallback) + */ +var Search = { + + _index : null, + _queued_query : null, + _pulse_status : -1, + + htmlToText : function(htmlString) { + var virtualDocument = document.implementation.createHTMLDocument('virtual'); + var htmlElement = $(htmlString, virtualDocument); + htmlElement.find('.headerlink').remove(); + docContent = htmlElement.find('[role=main]')[0]; + if(docContent === undefined) { + console.warn("Content block not found. Sphinx search tries to obtain it " + + "via '[role=main]'. Could you check your theme or template."); + return ""; + } + return docContent.textContent || docContent.innerText; + }, + + init : function() { + var params = $.getQueryParameters(); + if (params.q) { + var query = params.q[0]; + $('input[name="q"]')[0].value = query; + // this.performSearch(query); + } + }, + + loadIndex : function(url) { + $.ajax({type: "GET", url: url, data: null, + dataType: "script", cache: true, + complete: function(jqxhr, textstatus) { + if (textstatus != "success") { + document.getElementById("searchindexloader").src = url; + } + }}); + }, + + setIndex : function(index) { + var q; + this._index = index; + if ((q = this._queued_query) !== null) { + this._queued_query = null; + Search.query(q); + } + }, + + hasIndex : function() { + return this._index !== null; + }, + + deferQuery : function(query) { + this._queued_query = query; + }, + + stopPulse : function() { + this._pulse_status = 0; + }, + + startPulse : function() { + if (this._pulse_status >= 0) + return; + function pulse() { + var i; + Search._pulse_status = (Search._pulse_status + 1) % 4; + var dotString = ''; + for (i = 0; i < Search._pulse_status; i++) + dotString += '.'; + Search.dots.text(dotString); + if (Search._pulse_status > -1) + window.setTimeout(pulse, 500); + } + pulse(); + }, + + /** + * perform a search for something (or wait until index is loaded) + */ + performSearch : function(query) { + // create the required interface elements + this.out = $('#search-results'); + this.title = $('#search-results h2:first'); // $('
<h2>' + _('Searching') + '</h2>').appendTo(this.out);
+    this.dots = $('#search-results span:first'); //$('<span></span>').appendTo(this.title);
+    this.status = $('#search-results p:first'); // $('<p class="search-summary">&nbsp;</p>').appendTo(this.out);
+    this.output = $('#search-results ul:first'); //$('<ul class="search"/>').appendTo(this.out);