From ca9efbd835b5407e541b1d3720fb7a4b963669d6 Mon Sep 17 00:00:00 2001
From: jsspric <80001315+jsspric@users.noreply.github.com>
Date: Mon, 23 Aug 2021 15:48:53 -0700
Subject: [PATCH] jsspric copyedit

---
 .github/PULL_REQUEST_TEMPLATE.md |   8 +-
 CODE_OF_CONDUCT.md               |   2 +-
 CONTRIBUTING.md                  |  37 +--
 README.md                        | 102 ++++----
 docs/analysis.md                 | 391 ++++++++++++++++---------------
 docs/api.md                      | 256 ++++++++++----------
 docs/env_var.md                  |  59 +++--
 docs/mxnet.md                    |  30 +--
 docs/pytorch.md                  |  32 +--
 docs/sagemaker.md                | 104 ++++----
 docs/tensorflow.md               |  88 +++---
 docs/xgboost.md                  |  33 ++-
 examples/mxnet/README.md         |  18 +-
 examples/pytorch/README.md       |  18 +-
 examples/tensorflow/README.md    |  18 +-
 examples/tensorflow2/README.md   |  18 +-
 examples/xgboost/README.md       |   4 +-
 17 files changed, 611 insertions(+), 607 deletions(-)

diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index ecb0677bf..4f16554ed 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,11 +1,13 @@
+# Pull request template
+Use this template to submit pull requests.

-### Description of changes:
+## Description of changes:

-#### Style and formatting:
+### Style and formatting:

I have run `pre-commit install && pre-commit run --all-files` to ensure that auto-formatting happens with every commit.

-#### Issue number, if available
+### Issue number, if available:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
index 5b627cfa6..4c9228a93 100644
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -1,4 +1,4 @@
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
-For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
+For more information, see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index d0dfca7dc..a69e65c5f 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,4 +1,4 @@
-# Contributing Guidelines
+# Contributing guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.


-## Reporting Bugs/Feature Requests
+## Reporting bugs and requesting features

-We welcome you to use the GitHub issue tracker to report bugs or suggest features.
+Use the GitHub issue tracker to report bugs or suggest features.

-When filing an issue, please check [existing open](https://github.com/awslabs/sagemaker-debugger/issues), or [recently closed](https://github.com/awslabs/sagemaker-debugger/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
-reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
+When filing an issue, check [existing open](https://github.com/awslabs/sagemaker-debugger/issues) or [recently closed](https://github.com/awslabs/sagemaker-debugger/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20) issues to make sure someone else hasn't already
+reported the issue. Include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
@@ -20,38 +20,38 @@ reported the issue. Please try to include as much information as you can. Detail
* Anything unusual about your environment or deployment


-## Contributing via Pull Requests
-Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
+## Contributing through pull requests
+We greatly appreciate contributions made through pull requests. Before sending us a pull request, ensure that:

1. You are working against the latest source on the *master* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
-3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
+3. You open an issue to discuss any significant work; we would hate for your time to be wasted.

-To send us a pull request, please:
+To send us a pull request:

1. Fork the repository.
-2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
+2. Modify the source; focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
-6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
+6. Pay attention to any automated CI failures reported in the pull request and stay involved in the conversation.

-GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
+GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Developing SageMaker Debugger
-To develop on your machine, here are some tips:
+Here are some tips for developing on your machine:

1. Remove any existing installation:
```
pip uninstall smdebug
```
2. Clone the package from source:
```
git clone https://github.com/awslabs/sagemaker-debugger
cd sagemaker-debugger
```
-3. Installing in `develop` mode:
+3. Install in `develop` mode:
```
python setup.py develop
```
@@ -61,8 +61,8 @@ pip install pre-commit
pre-commit install
```

-## Finding contributions to work on
-Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/awslabs/sagemaker-debugger/labels/help%20wanted) issues is a great place to start.
+## Finding ways to contribute
+As our projects, by default, use the default GitHub issue labels (`enhancement`, `bug`, `duplicate`, `help wanted`, `invalid`, `question`, `wontfix`), looking at any ['help wanted'](https://github.com/awslabs/sagemaker-debugger/labels/help%20wanted) issues is a great place to start.


## Code of Conduct
@@ -72,7 +72,8 @@ opensource-codeofconduct@amazon.com with any additional questions or comments.

## Security issue notifications
-If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
+If you discover a potential security issue in this project, we ask that you notify Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
+

## Licensing

diff --git a/README.md b/README.md
index d9cc437ab..5c31e35e4 100644
--- a/README.md
+++ b/README.md
@@ -2,48 +2,48 @@
[![codecov](https://codecov.io/gh/awslabs/sagemaker-debugger/branch/master/graph/badge.svg)](https://codecov.io/gh/awslabs/sagemaker-debugger)
[![PyPI](https://badge.fury.io/py/smdebug.svg)](https://badge.fury.io/py/smdebug)

-## Table of Contents
+## Table of contents

- [Overview](#overview)
-- [Install the smdebug library](#install-the-smdebug-library)
-- [Debugger-supported Frameworks](#debugger-supported-frameworks)
-- [How It Works](#how-it-works)
+- [Install the `smdebug` library](#install-the-smdebug-library)
+- [Debugger-supported frameworks](#debugger-supported-frameworks)
+- [How it works](#how-it-works)
- [Examples](#examples)
- [SageMaker Debugger in action](#sagemaker-debugger-in-action)
-- [Further Documentation and References](#further-documentation-and-references)
+- [Further documentation and references](#further-documentation-and-references)

## Overview
[Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) automates the debugging process of machine learning training jobs. From training jobs, Debugger allows you to
-run your own training script (Zero Script Change experience) using Debugger built-in features—`Hook` and `Rule`—to capture tensors,
-have flexibility to build customized Hooks and Rules for configuring tensors as you want,
+run your own training script (zero script change experience) using Debugger built-in features—`Hook` and `Rule`—to capture tensors,
+have the flexibility to build customized hooks and rules for configuring tensors as you want,
and make the tensors available for analysis by saving in an [Amazon S3](https://aws.amazon.com/s3/?nc=sn&loc=0) bucket, all through a flexible and powerful API.

The `smdebug` library powers Debugger by calling the saved tensors from the S3 bucket during the training job.
`smdebug` retrieves and filters the tensors generated from Debugger such as gradients, weights, and biases.

-Debugger helps you develop better, faster, and cheaper models by minimally modifying estimator, tracing the tensors, catching anomalies while training models, and iterative model pruning.
+Debugger helps you develop better, faster, and cheaper models by minimally modifying estimators, tracing the tensors, catching anomalies while training models, and iteratively pruning the model.

Debugger supports TensorFlow, PyTorch, MXNet, and XGBoost frameworks.
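The retrieval side of this workflow is intentionally small. The following is a minimal sketch, assuming a completed training job has already written Debugger artifacts to the S3 path shown (a placeholder, not a value from this document):

```python
from smdebug.trials import create_trial

# Open the saved output of a finished training job and list what was captured.
trial = create_trial(path="s3://my-bucket/outputs/my-training-job")  # placeholder path
print(trial.tensor_names())
```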
The following list is a summary of the main functionalities of Debugger:

-- Run and debug training jobs of your model on SageMaker when using [supported containers](#debugger-supported-frameworks)
-- No changes needed to your training script if using AWS Deep Learning Containers with Debugger fully integrated
-- Minimal changes to your training script if using AWS containers with script mode or custom containers
-- Full visibility into any tensor retrieved from targeted parts of the training jobs
-- Real-time training job monitoring through Rules
-- Automated anomaly detection and state assertions through built-in and custom Rules on SageMaker
-- Actions on your training jobs based on the status of Rules
-- Interactive exploration of saved tensors
-- Distributed training support
-- TensorBoard support
+- Run and debug training jobs of your model on SageMaker when using [supported containers](#debugger-supported-frameworks).
+- No changes needed to your training script if you are using AWS Deep Learning Containers with Debugger fully integrated.
+- Minimal changes to your training script if you are using AWS containers with script mode or custom containers.
+- Full visibility into any tensor retrieved from targeted parts of the training jobs.
+- Real-time training job monitoring through rules.
+- Automated anomaly detection and state assertions through built-in and custom rules on SageMaker.
+- Actions on your training jobs based on the status of rules.
+- Interactive exploration of saved tensors.
+- Distributed training support.
+- TensorBoard support.

See [How it works](#how-it-works) for more details.

---

-## Install the smdebug library
+## Install the `smdebug` library

The `smdebug` library runs on Python 3. Install using the following command:

@@ -53,13 +53,13 @@ pip install smdebug
```

---

-## Debugger-supported Frameworks
-For a complete overview of Amazon SageMaker Debugger to learn how it works, go to the [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html) developer guide.
+## Debugger-supported frameworks
+For a complete overview of how Amazon SageMaker Debugger works, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).

### AWS Deep Learning Containers with zero code change
-Debugger is installed by default in AWS Deep Learning Containers with TensorFlow, PyTorch, MXNet, and XGBoost. The following framework containers enable you to use Debugger with no changes to your training script, by automatically adding [SageMaker Debugger's Hook](docs/api.md#glossary).
+Debugger is installed by default in AWS Deep Learning Containers with TensorFlow, PyTorch, MXNet, and XGBoost. The following framework containers enable you to use Debugger with no changes to your training script by automatically adding [SageMaker Debugger's Hook](docs/api.md#glossary).

-The following frameworks are available AWS Deep Learning Containers with the deep learning frameworks for the zero script change experience.
+The following framework versions are available in AWS Deep Learning Containers for the zero script change experience.

| Framework | Version |
| --- | --- |
@@ -72,7 +72,7 @@

### AWS training containers with script mode

-The `smdebug` library supports frameworks other than the ones listed above while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script.
+The `smdebug` library supports frameworks other than the ones listed in the preceding section while using AWS containers with script mode. If you want to use SageMaker Debugger with one of the following framework versions, you need to make minimal changes to your training script.

| Framework | Versions |
| --- | --- |
@@ -83,23 +83,23 @@ The `smdebug` library supports frameworks other than the ones listed while
| [XGBoost](docs/xgboost.md) | 0.90-2, 1.0-1 (As a framework)|

### Debugger on custom containers or local machines
-You can also fully use the Debugger features in custom containers with the SageMaker Python SDK. Furthermore, `smdebug` is an open source library, so you can install it on your local machine for any advanced use cases that cannot be run in the SageMaker environment and for constructing `smdebug` custom hooks and rules.
+You can also use the Debugger features in custom containers with the SageMaker Python SDK. Furthermore, `smdebug` is an open source library, so you can install it on your local machine for any advanced use cases that cannot run in the SageMaker environment and for constructing `smdebug` custom hooks and rules.

---

-## How It Works
+## How it works

-Amazon SageMaker Debugger uses the construct of a `Hook` to save the values of requested tensors throughout the training process. You can then setup a `Rule` job which simultaneously monitors and validates these tensors to ensure
+Amazon SageMaker Debugger uses the construct of a `hook` to save the values of requested tensors throughout the training process. You can then set up a `rule` job which simultaneously monitors and validates these tensors to ensure
that training is progressing as expected.
-A `Rule` checks for vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to Amazon CloudWatch events, so that when a rule is triggered it changes the state of the CloudWatch event.
-You can configure any action on the CloudWatch event, such as to stop the training job saving you time and money.
+A `rule` checks for vanishing gradients, exploding tensor values, or poor weight initialization. Rules are attached to Amazon CloudWatch events, so that when a rule is triggered, it changes the state of the CloudWatch event.
+You can configure any action on the CloudWatch event, such as stopping the training job, which saves you time and money.

Debugger can be used inside or outside of SageMaker. However, the built-in rules that AWS provides are only available for SageMaker training. Scenarios of usage can be classified into the following three cases.

#### Using SageMaker Debugger on AWS Deep Learning Containers with zero training script change

-Use Debugger built-in hook configurations and rules while setting up the estimator and monitor your training job.
+Use Debugger built-in hook configurations and rules while setting up the estimator and monitoring your training job.
For a full guide and examples of using the built-in rules, see [Running a Rule with zero script change on AWS Deep Learning Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html).
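For instance, attaching a built-in rule when you construct the estimator is a one-line addition. The following is a minimal sketch using the SageMaker Python SDK; the script name, role, instance type, and framework versions are placeholders:

```python
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch  # other framework estimators work the same way

estimator = PyTorch(
    entry_point="train.py",                # placeholder training script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5.0",
    py_version="py3",
    # Debugger runs this rule against the tensors the job saves.
    rules=[Rule.sagemaker(rule_configs.vanishing_gradient())],
)
estimator.fit()
```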
@@ -107,7 +107,7 @@ To see a complete list of built-in rules and their functionalities, see [List of

#### Using SageMaker Debugger on AWS training containers with script mode

-You can use Debugger with your training script on your own container making only a minimal modification to your training script to add Debugger's `Hook`.
+You can use Debugger with your training script on your own container, making only a minimal modification to your training script to add Debugger's `hook`.
For an example template of code to use Debugger on your own container in TensorFlow 2.x frameworks, see [Run Debugger in custom container](#Run-Debugger-in-custom-container). See the following instruction pages to set up Debugger in your preferred framework.
- [TensorFlow](docs/tensorflow.md)
@@ -117,23 +117,22 @@ See the following instruction pages to set up Debugger in your preferred framewo

#### Using SageMaker Debugger on custom containers

-Debugger is available for any deep learning models that you bring to Amazon SageMaker. The AWS CLI, the SageMaker Estimator API, and the Debugger APIs enable you to use any Docker base images to build and customize containers to train and debug your models. To use Debugger with customized containers, go to [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).
+Debugger is available for any deep learning models that you bring to Amazon SageMaker. The AWS CLI, the SageMaker Estimator API, and the Debugger APIs enable you to use any Docker base images to build and customize containers to train and debug your models. To use Debugger with customized containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html).

#### Using SageMaker Debugger on a non-SageMaker environment

-Using the smdebug library, you can create custom hooks and rules (or manually analyze the tensors) and modify your training script to enable tensor analysis on a non-SageMaker environment, such as your local machine. For an example of this, see [Run Debugger locally](#run-debugger-locally).
+Using the `smdebug` library, you can create custom hooks and rules (or manually analyze the tensors) and modify your training script to enable tensor analysis on a non-SageMaker environment, such as your local machine. For an example of this, see [Run Debugger locally](#run-debugger-locally).

---

## Examples

### SageMaker Notebook Examples
-
-To find a collection of demonstrations using Debugger, see [SageMaker Debugger Example Notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
+For a collection of demonstrations using Debugger, see [SageMaker Debugger example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).

#### Run Debugger rules with zero script change

-This example shows a how to use Debugger with Zero Script Change of
+This example shows how to use Debugger with zero script change to
your training script on a SageMaker DLC.

```python
@@ -172,17 +171,17 @@ print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').v
```

That's it! When you configure the `sagemaker_simple_estimator`,
-you simply specify the `entry_point` to your training script python file.
-When you run the `sagemaker_simple_estimator.fit()` API,
-SageMaker will automatically monitor your training job for you with the Rules specified and create a `CloudWatch` event that tracks the status of the Rule,
+you simply specify the `entry_point` to your training script Python file.
+When you run the `sagemaker_simple_estimator.fit()` API,
+SageMaker automatically monitors your training job for you with the rules specified and creates an Amazon CloudWatch event that tracks the status of the rule
so you can take any action based on it.

-If you want additional configuration and control, see [Running SageMaker jobs with Debugger](docs/sagemaker.md) for more information.
+To learn about additional configuration and control, see [Running SageMaker jobs with Debugger](docs/sagemaker.md).

-#### Run Debugger in custom container
+#### Run Debugger in a custom container

-The following example shows how to set `hook` to set a training model using Debugger in your own container.
-This example is for containers in TensorFlow 2.x framework using GradientTape to configure the `hook`.
+The following example shows how to set up a `hook` to train a model using Debugger in your own container.
+This example is for containers in the TensorFlow 2.x framework using GradientTape to configure the `hook`.

```python
import smdebug.tensorflow as smd
hook = smd.KerasHook(out_dir=args.out_dir)
@@ -203,8 +202,8 @@ model = tf.keras.models.Sequential([ ... ])
    hook.record_tensor_value(tensor_name="accuracy", tensor_value=acc)
```
-To see a full script of this, refer to the [tf_keras_gradienttape.py](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow2/scripts/tf_keras_gradienttape.py) example script.
-For a notebook example of using BYOC in PyTorch, see [Using Amazon SageMaker Debugger with Your Own PyTorch Container](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_custom_container/pytorch_byoc_smdebug.ipynb)
+To see a full script, see the [tf_keras_gradienttape.py](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow2/scripts/tf_keras_gradienttape.py) example script.
+For a notebook example using BYOC in PyTorch, see [Using Amazon SageMaker Debugger with Your Own PyTorch Container](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_custom_container/pytorch_byoc_smdebug.ipynb).

#### Run Debugger locally
This example shows how to use Debugger for the Keras `model.fit()` API.
@@ -233,10 +232,10 @@ print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss:0').v
```

---

## SageMaker Debugger in Action
-- Through the model pruning process using Debugger and `smdebug`, you can iteratively identify the importance of weights and cut neurons below a threshold you define. This process allows you to train the model with significantly fewer neurons, which means a lighter, more efficient, faster, and cheaper model without compromising accuracy. The following accuracy versus the number of parameters graph is produced in Studio. It shows that the model accuracy started from about 0.9 with 12 million parameters (the data point moves from right to left along with the pruning process), improved during the first few pruning iterations, kept the quality of accuracy until it cut the number of parameters down to 6 million, and start sacrificing the accuracy afterwards.
+The following examples show Debugger in action.

![Debugger Iterative Model Pruning using ResNet](docs/resources/results_resnet.png?raw=true)
Debugger provides you tools to access such training process and have a complete control over your model. See [Using SageMaker Debugger and SageMaker Experiments for iterative model pruning](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_iterative_model_pruning/iterative_model_pruning_resnet.ipynb) notebook for the full example and more information.
+- Through the model pruning process using Debugger and `smdebug`, you can iteratively identify the importance of weights and cut neurons below a threshold you define. This process allows you to train the model with significantly fewer neurons, which means a lighter, more efficient, faster, and cheaper model without compromising accuracy. The following graph of accuracy versus the number of parameters is produced in Studio. It shows that the model accuracy started from about 0.9 with 12 million parameters (the data point moves from right to left along with the pruning process), improved during the first few pruning iterations, kept the quality of accuracy until it cut the number of parameters down to 6 million, and started sacrificing the accuracy afterwards. ![Debugger Iterative Model Pruning using ResNet](docs/resources/results_resnet.png?raw=true) Debugger provides you with tools to access the training process and take complete control over your model. See the [Using SageMaker Debugger and SageMaker Experiments for iterative model pruning](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/pytorch_iterative_model_pruning/iterative_model_pruning_resnet.ipynb) notebook for the full example and more information.

- Use Debugger with XGBoost in SageMaker Studio to save feature importance values and plot them in a notebook during training.
![Debugger XGBoost Visualization Example](docs/resources/xgboost_feature_importance.png?raw=true)

@@ -248,10 +246,10 @@

| Section | Description |
| --- | --- |
-| [SageMaker Training](docs/sagemaker.md) | SageMaker users, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger |
-| Frameworks | See the frameworks pages for details on what's supported and how to modify your training script if applicable |
-| [APIs for Saving Tensors](docs/api.md) | Full description of our APIs on saving tensors |
-| [Programming Model for Analysis](docs/analysis.md) | For description of the programming model provided by the APIs that enable you to perform interactive exploration of tensors saved, as well as to write your own Rules monitoring your training jobs. |
+| [SageMaker Training](docs/sagemaker.md) | If you use SageMaker, we recommend you start with this page on how to run SageMaker training jobs with SageMaker Debugger. |
+| Frameworks | See the frameworks pages for details on what's supported and how to modify your training script if applicable. |
+| [APIs for Saving Tensors](docs/api.md) | Full description of our APIs on saving tensors. |
+| [Programming Model for Analysis](docs/analysis.md) | Description of the programming model provided by the APIs with which you can interactively explore saved tensors and write your own rules to monitor your training jobs. |

## License

diff --git a/docs/analysis.md b/docs/analysis.md
index d00a13c64..ee76fca3d 100644
--- a/docs/analysis.md
+++ b/docs/analysis.md
@@ -1,15 +1,15 @@
-# Programming Model for Analysis
+# Programming model for analysis

-This page describes the programming model that SageMaker Debugger provides for your analysis, and introduces you to the constructs of Trial, Tensor and Rule.
+This page describes the programming model that Amazon SageMaker Debugger provides for your analysis, and introduces you to the constructs of trial, tensor, and rule.

## Table of Contents
* [Trial](#Trial)
	* [Path of trial](#Path-of-trial)
		* [SageMaker training job](#SageMaker-training-job)
-		* [Non SageMaker training jobs](#Non-SageMaker-training-jobs)
+		* [Non-SageMaker training jobs](#Non-SageMaker-training-jobs)
	* [Creating a trial object](#Creating-a-trial-object)
-		* [Creating S3 trial](#Creating-S3-trial)
-		* [Creating local trial](#Creating-local-trial)
+		* [Creating an Amazon S3 trial](#Creating-S3-trial)
+		* [Creating a local trial](#Creating-local-trial)
		* [Restricting analysis to a range of steps](#Restricting-analysis-to-a-range-of-steps)
	* [Trial API](#Trial-API)
		* [tensor_names](#tensor_names)
@@ -37,7 +37,7 @@ This page describes the programming model that SageMaker Debugger provides for y
		* [workers](#workers-1)
		* [prev_steps](#prev_steps)
* [Rules](#Rules)
-	* [Built In Rules](#Built-In-Rules)
+	* [Built-in rules](#Built-In-Rules)
	* [Writing a custom rule](#Writing-a-custom-rule)
		* [Constructor](#Constructor)
		* [Function to invoke at a given step](#Function-to-invoke-at-a-given-step)
* [Exceptions](#Exceptions)
* [Utils](#Utils)
	* [Enable or disable refresh of tensors in a trial](#Enable-or-disable-refresh-of-tensors-in-a-trial)

## Trial
-Trial is an object which lets you query for tensors for a given training job, specified by the path where smdebug's artifacts are saved.
-Trial is capable of loading new tensors as and when they become available at the given path, allowing you to do both offline as well as realtime analysis.
+A trial is an object which lets you query for tensors for a given training job, specified by the path where `smdebug`'s artifacts are saved.
+A trial can load new tensors as they become available at the given path, allowing you to do both offline and real-time analysis.

### Path of trial

-#### SageMaker training job
-When running a SageMaker job this path is on S3. SageMaker saves data from your training job locally on the training instance first and uploads them to an S3 location in your account. When you start a SageMaker training job with the python SDK, you can control this path using the parameter `s3_output_path` in the `DebuggerHookConfig` object. This is an optional parameter, if you do not pass this the python SDK will populate a default location for you. If you do pass this, make sure the bucket is in the same region as where the training job is running. If you're not using the python SDK, set this path for the parameter `S3OutputPath` in the `DebugHookConfig` section of `CreateTrainingJob` API. SageMaker takes this path and appends training_job_name and "debug-output" to it to ensure we have a unique path for each training job.
+#### SageMaker training jobs
+When running a SageMaker job, this path is on Amazon S3. SageMaker saves data from your training job locally on the training instance first and uploads it to an Amazon S3 location in your account. When you start a SageMaker training job with the Python SDK, you can control this path using the parameter `s3_output_path` in the `DebuggerHookConfig` object. This is an optional parameter; if you do not pass this, the Python SDK populates a default location for you. If you do pass this, make sure the bucket is in the same Region as the training job. If you're not using the Python SDK, set this path for the parameter `S3OutputPath` in the `DebugHookConfig` section of the `CreateTrainingJob` API. SageMaker takes this path and appends `training_job_name` and `debug-output` to it to ensure a unique path for each training job.
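For example, with the Python SDK this looks like the following sketch (the bucket name is a placeholder):

```python
from sagemaker.debugger import DebuggerHookConfig

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-debugger-bucket/outputs"  # placeholder; use a bucket in the job's Region
)
# Then pass it to your estimator: Estimator(..., debugger_hook_config=hook_config)
```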
-#### Non SageMaker training jobs
-If you are not running a SageMaker training job, this is the path you pass as `out_dir` when you create a smdebug [`Hook`](api.md#hook). Just like when creating the hook, you can pass either a local path or an S3 path (as `s3://bucket/prefix`).
+#### Non-SageMaker training jobs
+If you are not running a SageMaker training job, this is the path you pass as `out_dir` when you create a `smdebug` [`hook`](api.md#hook). You can pass either a local path or an Amazon S3 path (as `s3://bucket/prefix`).

### Creating a trial object
-There are two types of trials you can create: LocalTrial or S3Trial depending on the path. We provide a wrapper method to create the appropriate trial.
+There are two types of trials you can create: `LocalTrial` or `S3Trial`, depending on the path. We provide a wrapper method to create the appropriate trial.

The parameters you have to provide are:
-- `path`: path can be a local path or an S3 path of the form `s3://bucket/prefix`. You should see directories such as `collections`, `events` and `index` at this path once the training job starts.
-- `name`: name can be any string. It is to help you manage different trials. This is an optional parameter, which defaults to the basename of the path if not passed. Please make sure to give it a unique name to prevent confusion.
+- `path`: The path can be a local path or an Amazon S3 path of the form `s3://bucket/prefix`. You should see directories such as `collections`, `events`, and `index` at this path once the training job starts.
+- `name`: A name can be any string. The name helps you manage different trials. This is an optional parameter, which defaults to the basename of the path if it is not passed. Make sure to give it a unique name to prevent confusion.

-#### Creating S3 trial
+#### Creating an Amazon S3 trial
```python
from smdebug.trials import create_trial
trial = create_trial(path='s3://smdebug-testing-bucket/outputs/resnet', name='resnet_training_run')
```

-#### Creating local trial
+#### Creating a local trial
```python
from smdebug.trials import create_trial
trial = create_trial(path='/home/ubuntu/smdebug_outputs/resnet', name='resnet_training_run')
```

#### Restricting analysis to a range of steps
You can optionally pass `range_steps` to restrict your analysis to a certain range of steps.
-Note that if you do so, Trial will not load data from other steps.
+Note that if you do so, the trial will not load data from other steps.

*Examples*
- `range_steps=(100, None)`: This will load all steps after 100
@@ -95,12 +95,12 @@ tr = create_trial(path='s3://smdebug-testing-bucket/outputs/resnet', name='resne

### Trial API
-Here's a list of methods that the Trial API provides which helps you load data for analysis. Please click on the method to see all the parameters it takes and a detailed description. If you are not familiar with smdebug constructs, you might want to review [this doc](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) before going through this page.
+Here's a list of methods that the Trial API provides to help you load data for analysis. Click on a method to see a detailed description of the method and all the parameters it takes. If you are not familiar with `smdebug` constructs, review [this API documentation](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) before going through this page.

| Method | Description |
| ------------- |-------------|
| [trial.tensor_names()](#tensor_names) | See names of all tensors available |
-| [trial.tensor(name)](#tensor) | Retrieve smdebug Tensor object |
+| [trial.tensor(name)](#tensor) | Retrieve `smdebug` tensor object |
| [trial.has_tensor(name)](#has_tensor) | Query for whether tensor was saved |
| [trial.steps()](#steps) | Query steps for which data was saved |
| [trial.modes()](#modes) | Query modes for which data was saved |
@@ -110,12 +110,12 @@ Here's a list of methods that the Trial API provides which helps you load data f
| [trial.workers()](#workers) | Query list of workers from the data saved |
| [trial.collections()](#collections) | Query list of collections saved from the training job |
| [trial.collection(name)](#collection) | Retrieve a single collection saved from the training job |
-| [trial.wait\_for\_steps(steps)](#wait\_for\_steps) | Wait till the requested steps are available |
+| [trial.wait\_for\_steps(steps)](#wait\_for\_steps) | Wait until the requested steps are available |
| [trial.has\_passed\_step(step)](#has\_passed\_step) | Query whether the requested step is available |

#### tensor_names
-Retrieves names of tensors saved
+Retrieves names of tensors saved.
```python
trial.tensor_names(step=None,
                   mode=modes.GLOBAL,
                   regex=None,
                   collection=None)
```

-###### Arguments
+##### Arguments
All arguments to this method are optional. You are not required to pass any of these arguments as keyword arguments.

-- `step (int)` If you want to retrieve the list of tensors saved at a particular step, pass the step number as an integer. This step number will be treated as step number corresponding to the mode passed below. By default it is treated as global step.
+- `step (int)`: If you want to retrieve the list of tensors saved at a particular step, pass the step number as an integer. This step number is treated as the step number corresponding to the mode passed in the `mode` parameter. By default, it is treated as the global step.
+- `mode (smdebug.modes enum value)`: If you want to retrieve the list of tensors saved for a particular mode, pass the mode here as `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT`, or `smd.modes.GLOBAL`.
+- `regex (str or list[str])`: You can filter tensors matching regex expressions by passing a regex expression as a string or list of strings. You can only pass one of `regex` or `collection` parameters.
+- `collection (Collection or str)`: You can filter tensors belonging to a collection by either passing a collection object or the name of collection as a string. You can only pass one of `regex` or `collection` parameters.

-###### Returns
-`list[str]`: List of strings representing names of tensors matching the given arguments. Arguments are processed as follows: get the list of tensor names for given step and mode, saved for given step matching all the given arguments, i.e. intersection of tensors matching each of the parameters.
+##### Returns
+`list[str]`: List of strings representing names of tensors matching the given arguments. Arguments are processed as follows: get the list of tensor names for the given step and mode, then keep only those matching all the other arguments; that is, the intersection of the tensors matching each of the parameters.

-###### Examples
-- `trial.tensor_names()` Returns all tensors saved for any step or mode.
-- `trial.tensor_names(step=10, mode=modes.TRAIN)` Returns tensors saved for training step 10
-- `trial.tensor_names(regex='relu')` Returns all tensors matching the regex pattern `relu` saved for any step or mode.
-- `trial.tensor_names(collection='gradients')` Returns tensors from collection "gradients"
-- `trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax')` Returns tensor saved for 10th training step which matches the regex `softmax`
+##### Examples
+- `trial.tensor_names()`: Returns all tensors saved for any step or mode.
+- `trial.tensor_names(step=10, mode=modes.TRAIN)`: Returns tensors saved for training step 10.
+- `trial.tensor_names(regex='relu')`: Returns all tensors matching the regex pattern `relu` saved for any step or mode.
+- `trial.tensor_names(collection='gradients')`: Returns tensors from collection `gradients`.
+- `trial.tensor_names(step=10, mode=modes.TRAIN, regex='softmax')`: Returns tensors saved for the 10th training step which match the regex `softmax`.

#### tensor
-Retrieve the `smdebug.core.tensor.Tensor` object by the given name `tname`. You can review all the methods that this Tensor object provides [here](#Tensor-1).
+Retrieve the `smdebug.core.tensor.Tensor` object by the given name `tname`. You can review all the methods that this tensor object provides [here](#Tensor-1).
```python
trial.tensor(tname)
```
-###### Arguments
-- `tname (str)` Takes the name of tensor
+##### Arguments
+- `tname (str)`: Takes the name of the tensor.

-###### Returns
-`smdebug.core.tensor.Tensor` object which has [this API](#Tensor-1)
+##### Returns
+`smdebug.core.tensor.Tensor` object, which has [this API](#Tensor-1).

#### has_tensor
-Query whether the trial has a tensor by the given name
+Query whether the trial has a tensor by the given name.
```python
trial.has_tensor(tname)
```
-###### Arguments
-- `tname (str)` Takes the name of tensor
+##### Arguments
+- `tname (str)`: Takes the name of the tensor.

-###### Returns
+##### Returns
`bool`: `True` if the tensor is seen by the trial so far, else `False`.
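Taken together, these query methods let you discover what was saved before you pull values. The following is a small sketch; the tensor name is a placeholder, and the `modes` import is assumed from the `smd.modes.TRAIN` style usage shown in this document:

```python
from smdebug import modes

# List the gradient tensors saved at training step 10.
print(trial.tensor_names(step=10, mode=modes.TRAIN, collection="gradients"))

# Guard lookups with has_tensor before retrieving the tensor object.
if trial.has_tensor("CrossEntropyLoss:0"):  # placeholder tensor name
    print(trial.tensor("CrossEntropyLoss:0").value(10, mode=modes.TRAIN))
```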
#### steps
-Retrieve a list of steps seen by the trial
+Retrieve a list of steps seen by the trial.

```python
trial.steps(mode=None)
```

-###### Arguments
-- `mode (smdebug.modes enum value)` Passing a mode here allows you want to retrieve the list of steps seen by a trial for that mode
-If this is not passed, returns steps for all modes.
+##### Arguments
+- `mode (smdebug.modes enum value)`: By passing a mode here, you can retrieve the list of steps seen by a trial for that mode.
+If you do not pass this, it returns steps for all modes.

-###### Returns
-`list[int]` List of integers representing step numbers. If a mode was passed, this returns steps within that mode, i.e. mode steps.
+##### Returns
+`list[int]`: List of integers representing step numbers. If a mode was passed, this returns steps within that mode, called mode steps.
Each of these mode steps has a global step number associated with it. The global step represents the sequence of steps across all modes executed by the job.

#### modes
-Retrieve a list of modes seen by the trial
+Retrieve a list of modes seen by the trial.

```python
trial.modes()
```

-###### Returns
-`list[smdebug.modes enum value]` List of modes for which data was saved from the training job across all steps seen.
+##### Returns
+`list[smdebug.modes enum value]`: List of modes for which data was saved from the training job across all steps seen.

#### mode
-Given a global step number you can identify the mode for that step using this method.
+Given a global step number, you can identify the mode for that step using this method.

```python
trial.mode(global_step=100)
```

-###### Arguments
-- `global_step (int)` Takes the global step as an integer
+##### Arguments
+- `global_step (int)`: Takes the global step as an integer.

-###### Returns
-`smdebug.modes enum value` of the given global step
+##### Returns
+`smdebug.modes enum value` of the given global step.

#### mode_step
-Given a global step number you can identify the `mode_step` for that step using this method.
+Given a global step number, you can identify the `mode_step` for that step using this method.

```python
trial.mode_step(global_step=100)
```

-###### Arguments
-- `global_step (int)` Takes the global step as an integer
+##### Arguments
+- `global_step (int)`: Takes the global step as an integer.

-###### Returns
-`int`: An integer representing `mode_step` of the given global step. Typically used in conjunction with `mode` method.
+##### Returns
+`int`: An integer representing `mode_step` of the given global step. Typically used in conjunction with the `mode` method.

#### global_step
-Given a mode and a mode_step number you can retrieve its global step using this method.
+Given a mode and a `mode_step` number, you can retrieve its global step using this method.

```python
trial.global_step(mode=modes.GLOBAL, mode_step=100)
```

-###### Arguments
-- `mode (smdebug.modes enum value)` Takes the mode as enum value
-- `mode_step (int)` Takes the mode step as an integer
+##### Arguments
+- `mode (smdebug.modes enum value)`: Takes the mode as an `enum` value.
+- `mode_step (int)`: Takes the mode step as an integer.

-###### Returns
-`int` An integer representing `global_step` of the given mode and mode_step.
+##### Returns
+`int`: An integer representing the `global_step` of the given mode and `mode_step`.
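These three methods translate between the two numbering schemes. A minimal sketch of the round trip, reusing `trial` and the `modes` import from the earlier examples:

```python
from smdebug import modes

for gs in trial.steps():                                   # global step numbers
    m = trial.mode(global_step=gs)                         # mode that ran this step
    ms = trial.mode_step(global_step=gs)                   # step number within that mode
    assert trial.global_step(mode=m, mode_step=ms) == gs   # round-trips back
```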
#### workers
-Query for all the worker processes from which data was saved by smdebug during multi worker training.
+Query for all the worker processes from which data was saved by `smdebug` during multi-worker training.

```python
trial.workers()
```

-###### Returns
-`list[str]` A sorted list of names of worker processes from which data was saved. If using TensorFlow Mirrored Strategy for multi worker training, these represent names of different devices in the process. For Horovod, torch.distributed and similar distributed training approaches, these represent names of the form `worker_0` where 0 is the rank of the process.
+##### Returns
+`list[str]`: A sorted list of names of worker processes from which data was saved. If you are using TensorFlow Mirrored Strategy for multi-worker training, these represent names of different devices in the process. For Horovod, torch.distributed, and similar distributed training approaches, these represent names of the form `worker_0`, where 0 is the rank of the process.

#### collections
-List the collections from the trial. Note that tensors part of these collections may not necessarily have been saved from the training job. Whether a collection was saved or not depends on the configuration of the Hook during training.
+List the collections from the trial. Note that tensors that are part of these collections may not necessarily have been saved from the training job. Whether a collection was saved or not depends on the configuration of the hook during training.

```python
trial.collections()
```

-###### Returns
-`dict[str -> Collection]` A dictionary indexed by the name of the collection, with the Collection object as the value. Please refer [Collection API](api.md#Collection) for more details.
+##### Returns
+`dict[str -> Collection]`: A dictionary indexed by the name of the collection, with the collection object as the value. See [Collection API](api.md#Collection) for more details.

#### collection
-Get a specific collection from the trial. Note that tensors which are part of this collection may not necessarily have been saved from the training job. Whether this collection was saved or not depends on the configuration of the Hook during training.
+Get a specific collection from the trial. Note that tensors which are part of this collection may not necessarily have been saved from the training job. Whether this collection was saved or not depends on the configuration of the hook during training.

```python
trial.collection(coll_name)
```

-###### Arguments
-- `coll_name (str)` Name of the collection
+##### Arguments
+- `coll_name (str)`: Name of the collection.

-###### Returns
-`Collection` The requested Collection object. Please refer [Collection API](api.md#Collection) for more details.
+##### Returns
+`Collection`: The requested collection object. See [Collection API](api.md#Collection) for more details.
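For example, to inspect what each collection was configured to include, you can iterate over the returned dictionary. A sketch; the `tensor_names` attribute is assumed from the [Collection API](api.md#Collection) referenced above:

```python
for name, coll in trial.collections().items():
    # Tensors listed here were requested by the collection; whether they
    # were actually saved depends on the hook configuration during training.
    print(name, sorted(coll.tensor_names))
```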
#### wait\_for\_steps
-This method allows you to wait for steps before proceeding. You might want to use this method if you want to wait for smdebug to see the required steps so you can then query and analyze the tensors saved by that step. This method blocks till all data from the steps are seen by smdebug.
+This method allows you to wait for steps before proceeding. Use this method if you want to wait for `smdebug` to see the required steps so you can then query and analyze the tensors saved by that step. This method blocks until `smdebug` sees all data from the steps.

```python
trial.wait_for_steps(required_steps, mode=modes.GLOBAL)
```

-###### Arguments
-- `required_steps (list[int])` Step numbers to wait for
-- `mode (smdebug.modes enum value)` The mode to which given step numbers correspond to. This defaults to modes.GLOBAL.
+##### Arguments
+- `required_steps (list[int])`: Step numbers for which to wait.
+- `mode (smdebug.modes enum value)`: The mode to which the given step numbers correspond. This defaults to `modes.GLOBAL`.

-###### Returns
-None, but it only returns after we know definitely whether we have seen the steps.
+##### Returns
+None, but it only returns after we know whether we have seen the steps.

-###### Exceptions raised
-`StepUnavailable` and `NoMoreData`. See [Exceptions](#exceptions) section for more details.
+##### Exceptions raised
+`StepUnavailable` and `NoMoreData`. See the [Exceptions](#exceptions) section for more details.

#### has\_passed\_step
```python
trial.has_passed_step(step, mode=modes.GLOBAL)
```

-###### Arguments
-- `step (int)` The step number to check if the trial has passed it
-- `mode (smdebug.modes enum value)` The mode to which given step number corresponds to. This defaults to modes.GLOBAL.
+##### Arguments
+- `step (int)`: The step number to check if the trial has passed.
+- `mode (smdebug.modes enum value)`: The mode to which a given step number corresponds. This defaults to `modes.GLOBAL`.

-###### Returns
-`smdebug.core.tensor.StepState enum value` which can take one of three values `UNAVAILABLE`, `AVAILABLE` and `NOT_YET_AVAILABLE`.
+##### Returns
+`smdebug.core.tensor.StepState enum value`, which can take one of three values: `UNAVAILABLE`, `AVAILABLE`, and `NOT_YET_AVAILABLE`.

TODO@Nihal describe these in detail
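A sketch of how `wait_for_steps` and `has_passed_step` combine for realtime monitoring while a job is still writing data. It assumes `UNAVAILABLE` means a step will never arrive, and that `StepState` and `NoMoreData` are importable as shown:

```python
from smdebug import modes
from smdebug.core.tensor import StepState
from smdebug.exceptions import NoMoreData

step = 0
try:
    while trial.has_passed_step(step, mode=modes.GLOBAL) != StepState.UNAVAILABLE:
        trial.wait_for_steps([step], mode=modes.GLOBAL)  # blocks until the data arrives
        print(step, trial.tensor_names(step=step))
        step += 1
except NoMoreData:
    print("The training job has ended; no more steps will appear.")
```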
## Tensor
-An smdebug `Tensor` object can be retrieved through the `trial.tensor(name)` API. It is uniquely identified by the string representing name.
+An `smdebug` tensor object can be retrieved through the `trial.tensor(name)` API. It is uniquely identified by the string representing its name.
It provides the following methods.

| Method | Description|
| ---- | ----- |
| [steps()](#steps-1) | Query steps for which tensor was saved. |
| [value(step)](#value) | Get the value of the tensor at a given step as a `numpy` array. |
| [reduction_value(step)](#reduction_value) | Get the reduction value of the chosen tensor at a particular step. |
| [reduction_values(step)](#reduction_values) | Get all reduction values saved for the chosen tensor at a particular step. |
| [values(mode)](#values) | Get the values of the tensor for all steps of a given mode. |
| [workers(step)](#workers-1) | Get all the workers for which this tensor was saved at a given step. |
| [prev\_steps(step, n)](#prev_steps) | Get the last `n` step numbers of a given mode from a given step. |

### Tensor API

#### steps
-Query for the steps at which the given tensor was saved
+Query for the steps at which the given tensor was saved.

```python
trial.tensor(name).steps(mode=ModeKeys.GLOBAL, show_incomplete_steps=False)
```

-###### Arguments
-- `mode (smdebug.modes enum value)` The mode whose steps to return for the given tensor. Defaults to `modes.GLOBAL`
-- `show_incomplete_steps (bool)` This parameter is relevant only for distributed training. By default this method only returns the steps which have been received from all workers. But if this parameter is set to True, this method will return steps received from at least one worker.
+##### Arguments
+- `mode (smdebug.modes enum value)`: The mode whose steps to return for the given tensor. Defaults to `modes.GLOBAL`.
+- `show_incomplete_steps (bool)`: This parameter is relevant only for distributed training. By default, this method only returns the steps which have been received from all workers. But if this parameter is set to `True`, this method returns steps received from at least one worker.

-###### Returns
-`list[int]` A list of steps at which the given tensor was saved
+##### Returns
+`list[int]`: A list of steps at which the given tensor was saved.

#### value
-Get the value of the tensor at a given step as a numpy array
+Get the value of the tensor at a given step as a `numpy` array.

```python
trial.tensor(name).value(step_num, mode=ModeKeys.GLOBAL, worker=None)
```

-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step, this is possible with the method: `trial.tensor(name).workers(step, mode)`
+##### Arguments
+- `step_num (int)`: The step number whose value is to be returned for the mode passed through the next parameter.
+- `mode (smdebug.modes enum value)`: The mode applicable to the step number passed in `step_num`. Defaults to `modes.GLOBAL`.
+- `worker (str)`: This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You can also query the workers which saved a value for the tensor at a specific step with the method `trial.tensor(name).workers(step, mode)`.

-###### Returns
-`numpy.ndarray` The value of tensor at the given step and worker (if the training job saved data from multiple workers)
+##### Returns
+`numpy.ndarray`: The value of the tensor at the given step and worker (if the training job saved data from multiple workers).
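For example, to trace a single tensor across all of training (a sketch; the tensor name is a placeholder):

```python
from smdebug import modes

loss = trial.tensor("CrossEntropyLoss:0")  # placeholder tensor name
for step in loss.steps(mode=modes.TRAIN):
    print(step, loss.value(step, mode=modes.TRAIN))
```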
#### reduction_value
-Get the reduction value of the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. The different reductions you can query for are the same as what are allowed in [ReductionConfig](api.md#reductionconfig) when saving tensors.
-This API thus allows you to access the reduction you might have saved instead of the full tensor. If you had saved the full tensor, it will calculate the requested reduction at the time of this call.
+Get the reduction value of the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. The different reductions for which you can query are the same as those allowed in [ReductionConfig](api.md#reductionconfig) when saving tensors.
+This API thus allows you to access the reduction you might have saved instead of the full tensor. If you saved the full tensor, it calculates the requested reduction at the time of this call.

-Reduction names allowed are `min`, `max`, `mean`, `prod`, `std`, `sum`, `variance` and `l1`, `l2` representing the norms.
+Reduction names allowed are `min`, `max`, `mean`, `prod`, `std`, `sum`, `variance`, and `l1`, `l2` representing the norms.

-Each of these can be retrieved for the absolute value of the tensor or the original tensor. Above was an example to get the mean of the absolute value of the tensor. `abs` can be set to `False` if you want to see the `mean` of the actual tensor.
+Each of these can be retrieved for the absolute value of the tensor or the original tensor. For example, passing `reduction_name='mean'` with `abs=True` gets the mean of the absolute value of the tensor; set `abs` to `False` if you want the `mean` of the actual tensor.

-If you had saved the tensor without any reduction, then you can retrieve the actual tensor as a numpy array and compute any reduction you might be interested in. In such a case you do not need this method.
+If you saved the tensor without any reduction, then you can retrieve the actual tensor as a `numpy` array and compute any reduction in which you might be interested. In such a case, you do not need this method.

```python
trial.tensor(name).reduction_value(step_num, reduction_name, mode=modes.GLOBAL, worker=None, abs=False)
```

-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `reduction_name (str)` The name of the reduction to query for. This can be one of `min`, `max`, `mean`, `std`, `variance`, `sum`, `prod` and the norms `l1`, `l2`.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step, this is possible with the method: `trial.tensor(name).workers(step, mode)`
-- `abs (bool)` If abs is True, this method tries to return the reduction passed through `reduction_name` after taking the absolute value of the tensor. It defaults to `False`.
+##### Arguments
+- `step_num (int)`: The step number whose value is to be returned for the mode passed through the next parameter.
+- `reduction_name (str)`: The name of the reduction for which to query. This can be one of `min`, `max`, `mean`, `std`, `variance`, `sum`, `prod`, and the norms `l1`, `l2`.
+- `mode (smdebug.modes enum value)`: The mode applicable to the step number passed in `step_num`. Defaults to `modes.GLOBAL`.
+- `worker (str)`: This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You can also query the workers which saved a value for the tensor at a specific step with the method `trial.tensor(name).workers(step, mode)`.
-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `reduction_name (str)` The name of the reduction to query for. This can be one of `min`, `max`, `mean`, `std`, `variance`, `sum`, `prod` and the norms `l1`, `l2`.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step, this is possible with the method: `trial.tensor(name).workers(step, mode)`
-- `abs (bool)` If abs is True, this method tries to return the reduction passed through `reduction_name` after taking the absolute value of the tensor. It defaults to `False`.
+##### Arguments
+- `step_num (int)`: The step number whose value is to be returned for the mode passed through the next parameter.
+- `reduction_name (str)`: The name of the reduction for which to query. This can be one of `min`, `max`, `mean`, `std`, `variance`, `sum`, `prod`, and the norms `l1`, `l2`.
+- `mode (smdebug.modes enum value)`: The mode applicable to the step number passed in the preceding parameter. Defaults to `modes.GLOBAL`.
+- `worker (str)`: This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You can also query the workers which saved a value for the tensor at a specific step with the method `trial.tensor(name).workers(step, mode)`.
+- `abs (bool)`: If `abs` is `True`, this method tries to return the reduction passed through `reduction_name` after taking the absolute value of the tensor. It defaults to `False`.

-###### Returns
-`numpy.ndarray` The reduction value of tensor at the given step and worker (if the training job saved data from multiple workers) as a 1x1 numpy array. If this reduction was saved for the tensor during training as part of specification through reduction config, it will be loaded and returned. If the given reduction was not saved then, but the full tensor was saved, the reduction will be computed on the fly and returned. If both the chosen reduction and full tensor are not available, this method raises `TensorUnavailableForStep` exception.
+##### Returns
+`numpy.ndarray`: The reduction value of the tensor at the given step and worker (if the training job saved data from multiple workers) as a 1x1 `numpy` array. If this reduction was saved for the tensor during training as part of specification through reduction config, it is loaded and returned. If the given reduction was not saved but the full tensor was, the reduction is computed on the fly and returned. If both the chosen reduction and full tensor are not available, this method raises a `TensorUnavailableForStep` exception.

#### shape
Get the shape of the chosen tensor at a particular step.

@@ -365,13 +366,13 @@ Get the shape of the chosen tensor at a particular step.
trial.tensor(name).shape(step_num, mode=modes.GLOBAL, worker=None)
```

-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step, this is possible with the method: `trial.tensor(name).workers(step, mode)`
+##### Arguments
+- `step_num (int)`: The step number whose value is to be returned for the mode passed through the next parameter.
+- `mode (smdebug.modes enum value)`: The mode applicable to the step number passed in the preceding parameter. Defaults to `modes.GLOBAL`.
+- `worker (str)`: This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You can also query the workers which saved a value for the tensor at a specific step with the method `trial.tensor(name).workers(step, mode)`.

-###### Returns
-`tuple(int)` If only the shape of this tensor was saved through `save_shape` configuration in ReductionConfig, it will be returned. If the full tensor was saved, then shape will be computed and returned today.
+##### Returns
+`tuple(int)`: If only the shape of this tensor was saved through `save_shape` configuration in `ReductionConfig`, it is returned. If the full tensor was saved, then the shape is computed and returned. If both the shape and full tensor are not available, this method raises a `TensorUnavailableForStep` exception.

#### values
Get the values of the tensor for all steps of a given mode.

@@ -380,27 +381,27 @@ Get the values of the tensor for all steps of a given mode.
trial.tensor(name).values(mode=modes.GLOBAL, worker=None)
```

-###### Arguments
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step, this is possible with the method: `trial.tensor(name).workers(step, mode)`
+##### Arguments
+- `mode (smdebug.modes enum value)`: The mode for which to return the saved values. Defaults to `modes.GLOBAL`.
+- `worker (str)`: This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You can also query the workers which saved a value for the tensor at a specific step with the method `trial.tensor(name).workers(step, mode)`.

-###### Returns
-`dict[int -> numpy.ndarray]` A dictionary with step numbers as keys and numpy arrays representing the value of the tensor as values.
+##### Returns
+`dict[int -> numpy.ndarray]`: A dictionary with step numbers as keys and `numpy` arrays representing the value of the tensor as values.

#### reduction_values
-Get all reduction values saved for the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. Please go through the description of the method `reduction_value` for more details.
+Get all reduction values saved for the chosen tensor at a particular step. A reduction value is a tensor reduced to a single value through reduction or aggregation operations. See the description of the method `reduction_value` for more details.

```python
trial.tensor(name).reduction_values(step_num, mode=modes.GLOBAL, worker=None)
```

-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step, this is possible with the method: `trial.tensor(name).workers(step, mode)`
+##### Arguments
+- `step_num (int)`: The step number whose value is to be returned for the mode passed through the next parameter.
+- `mode (smdebug.modes enum value)`: The mode applicable to the step number passed in the preceding parameter. Defaults to `modes.GLOBAL`.
+- `worker (str)`: This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You can also query the workers which saved a value for the tensor at a specific step with the method `trial.tensor(name).workers(step, mode)`.

-###### Returns
-`dict[(str, bool) -> numpy.ndarray]` A dictionary with keys being tuples of the form `(reduction_name, abs)` to a 1x1 numpy ndarray value. `abs` here is a boolean that denotes whether the reduction was performed on the absolute value of the tensor or not. Note that this method only returns the reductions which were saved from the training job. It does not compute all known reductions and return them if only the raw tensor was saved.
+##### Returns
+`dict[(str, bool) -> numpy.ndarray]`: A dictionary mapping tuples of the form `(reduction_name, abs)` to 1x1 `numpy.ndarray` values. `abs` here is a boolean that denotes whether the reduction was performed on the absolute value of the tensor or not. Note that this method only returns the reductions which were saved from the training job. It does not compute all known reductions and return them if only the raw tensor was saved.

#### shapes
Get the shapes of the tensor for all steps of a given mode.

@@ -409,53 +410,53 @@ Get the shapes of the tensor for all steps of a given mode.
trial.tensor(name).shapes(mode=modes.GLOBAL, worker=None)
```

-###### Arguments
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
-- `worker (str)` This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You might also be interested in querying the workers which saved a value for the tensor at a specific step, this is possible with the method: `trial.tensor(name).workers(step, mode)`
+##### Arguments
+- `mode (smdebug.modes enum value)`: The mode for which to return the saved shapes. Defaults to `modes.GLOBAL`.
+- `worker (str)`: This parameter is only applicable for distributed training. You can retrieve the value of the tensor from a specific worker by passing the worker name. You can query all the workers seen by the trial with the `trial.workers()` method. You can also query the workers which saved a value for the tensor at a specific step with the method `trial.tensor(name).workers(step, mode)`.

-###### Returns
-`dict[int -> tuple(int)]` A dictionary with step numbers as keys and tuples of ints representing the shapes of the tensor as values.
+##### Returns
+`dict[int -> tuple(int)]`: A dictionary with step numbers as keys and tuples of ints representing the shapes of the tensor as values.

#### workers
-Get all the workers for which this tensor was saved at a given step
+Get all the workers for which this tensor was saved at a given step.

```python
trial.tensor(name).workers(step_num, mode=modes.GLOBAL)
```

-###### Arguments
-- `step_num (int)` The step number whose value is to be returned for the mode passed through the next parameter.
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
+##### Arguments
+- `step_num (int)`: The step number whose value is to be returned for the mode passed through the next parameter.
+- `mode (smdebug.modes enum value)`: The mode applicable to the step number passed in the preceding parameter. Defaults to `modes.GLOBAL`.

-###### Returns
-`list[str]` A list of worker names for which the tensor was saved at the given step.
+##### Returns
+`list[str]`: A list of worker names for which the tensor was saved at the given step.

#### prev_steps
-Get the last n step numbers of a given mode from a given step.
+Get the last `n` step numbers of a given mode from a given step.

```python
trial.tensor(name).prev_steps(step, n, mode=modes.GLOBAL)
```

-###### Arguments
-- `step (int)` The step number whose value is to be returned for the mode passed.
-- `n (int)` Number of previous steps to return
-- `mode (smdebug.modes enum value)` The mode applicable for the step number passed above. Defaults to `modes.GLOBAL`
+##### Arguments
+- `step (int)`: The step number whose value is to be returned for the mode passed.
+- `n (int)`: The number of previous steps to return.
+- `mode (smdebug.modes enum value)`: The mode applicable to the step number passed in the preceding parameter. Defaults to `modes.GLOBAL`.

-###### Returns
-`list[int]` A list of size at most n representing the previous steps for the given step and mode. Note that this list can be of size less than n if there were only less than n steps saved before the given step in this trial.
+##### Returns
+`list[int]`: A list of size at most `n` representing the previous steps for the given step and mode. Note that this list can have fewer than `n` entries if fewer than `n` steps were saved before the given step in this trial.

## Rules
-Rules are the medium by which SageMaker Debugger executes a certain piece of code regularly on different steps of a training job. A rule is assigned to a trial and can be invoked at each new step of the trial. It can also access other trials for its evaluation. You can evaluate a rule using tensors from the current step or any step before the current step. Please ensure your logic respects these semantics, else you will get a `TensorUnavailableForStep` exception as the data would not yet be available for future steps.
+Rules are the medium by which SageMaker Debugger executes a certain piece of code regularly on different steps of a training job. A rule is assigned to a trial and can be invoked at each new step of the trial. It can also access other trials for its evaluation. You can evaluate a rule using tensors from the current step or any step before the current step. Ensure that your logic respects these semantics; otherwise, you get a `TensorUnavailableForStep` exception because the data is not yet available for future steps.

-### Built In Rules
+### Built-in rules
Please refer to the built-in rules that SageMaker provides [here](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#built-in-rules).

### Writing a custom rule
-Writing a rule involves implementing the [Rule interface](../smdebug/rules/rule.py). Below, let us look at a simplified version of a VanishingGradient rule.
+Writing a rule involves implementing the [Rule interface](../smdebug/rules/rule.py). In the following sections, we look at a simplified version of a `VanishingGradient` rule.

-##### Constructor
-Creating a rule involves first inheriting from the base `Rule` class provided by smdebug.
-For this example rule here, we do not need to look at any other trials, so we set `other_trials` to None.
+#### Constructor
+Creating a rule involves first inheriting from the base `Rule` class provided by `smdebug`.
+For this example rule, we do not need to look at any other trials, so we set `other_trials` to `None`.
```python
from smdebug.rules import Rule

class VanishingGradientRule(Rule):
    def __init__(self, base_trial, threshold=0.0000001):
        super().__init__(base_trial, other_trials=None)
        self.threshold = float(threshold)
```

-Please note that apart from `base_trial` and `other_trials` (if required), we require all
-arguments of the rule constructor to take a string as value. You can parse them to the type
-that you want from the string. This means if you want to pass a list of strings, you might want to pass them as a comma separated string. This restriction is being enforced so as to let you create and invoke rules from json using Sagemaker's APIs.
+Apart from `base_trial` and `other_trials` (if required), we require all
+arguments of the rule constructor to take a string as a value. You can parse them to the type
+that you want from the string. This means that if you want to pass a list of strings, you might want to pass them as a comma-separated string. This restriction is enforced so you can create and invoke rules from JSON using SageMaker's APIs.

##### Function to invoke at a given step
-In this function you can implement the core logic of what you want to do with these tensors.
-It should return a boolean value `True` or `False`, where `True` means the rule evaluation condition has been met. When you invoke these rules through SageMaker, the rule evaluation ends when the rule evaluation condition is met. SageMaker creates a Cloudwatch event for every rule evaluation job, which can be used to define actions that you might want to take based on the state of the rule.
+In this function, you can implement the core logic of what you want to do with these tensors.
+It should return a boolean value `True` or `False`, where `True` means the rule evaluation condition has been met. When you invoke these rules through SageMaker, the rule evaluation ends when the rule evaluation condition is met. SageMaker creates an Amazon CloudWatch event for every rule evaluation job, which can be used to define actions that you might want to take based on the state of the rule.

-A simplified version of the actual invoke function for `VanishingGradientRule` is below:
+A simplified version of the actual invoke function for `VanishingGradientRule` is as follows:

```python
def invoke_at_step(self, step):
@@ -487,10 +488,10 @@ A simplified version of the actual invoke function for `VanishingGradientRule` i
    return False
```

-That's it, writing a rule is as simple as that.
+That's it! Writing a rule is as simple as that.

### Invoking a rule through SageMaker
-After you've written your rule, you can ask SageMaker to evaluate the rule against your training job by either using SageMaker Python SDK as
+After you've written your rule, you can ask SageMaker to evaluate the rule against your training job by using the SageMaker Python SDK:
```
estimator = Estimator(
    ...
@@ -507,7 +508,7 @@ estimator = Estimator(
    }
)
```
-If you're using the SageMaker API directly to evaluate the rule, then you can specify the rule configuration [`DebugRuleConfigurations`](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DebugRuleConfiguration.html) in the CreateTrainingJob API request as:
+If you're using the SageMaker API directly to evaluate the rule, then you can specify the rule configuration [`DebugRuleConfigurations`](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DebugRuleConfiguration.html) in the `CreateTrainingJob` API request:
```
"DebugRuleConfigurations": [
    {
@@ -516,7 +517,7 @@ If you're using the SageMaker API directly to evaluate the rule, then you can sp
        "VolumeSizeInGB": 30,
        "RuleEvaluatorImage": "864354269164.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "RuleParameters": {
-           "source_s3_uri": "s3://path/to/vanishing_gradient_rule.py",
+           "source_s3_uri": "s3://path/to/vanishing_gradient_rule.py",
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
@@ -525,7 +526,7 @@ If you're using the SageMaker API directly to evaluate the rule, then you can sp
```

#### Invoking a rule outside of SageMaker through `invoke_rule`
-You might want to invoke the rule locally during development. We provide a function to invoke rules easily. Refer [smdebug/rules/rule_invoker.py](../smdebug/rules/rule_invoker.py). The invoke function has the following syntax. It takes a instance of a Rule and invokes it for a series of steps one after the other.
+You can invoke the rule locally during development. We provide a function to invoke rules easily. See [smdebug/rules/rule_invoker.py](../smdebug/rules/rule_invoker.py). The invoke function takes an instance of a rule and invokes it for a series of steps one after the other, and has the following syntax:

```python
from smdebug.rules import invoke_rule
@@ -536,15 +537,15 @@ rule_obj = VanishingGradientRule(trial, threshold=0.0001)
invoke_rule(rule_obj, start_step=0, end_step=None)
```

-###### Arguments
-- `rule_obj (Rule)` An instance of a subclass of `smdebug.rules.Rule` that you want to invoke.
-- `start_step (int)` Global step number to start invoking the rule from. Note that this refers to a global step. This defaults to 0.
-- `end_step (int or None)`: Global step number to end the invocation of rule before. To clarify, `end_step` is an exclusive bound. The rule is invoked at `end_step`. This defaults to `None` which means run till the end of the job.
-- `raise_eval_cond (bool)` This parameter controls whether to raise the exception `RuleEvaluationConditionMet` when raised by the rule, or to catch it and log the message and move to the next step. Defaults to `False`, which implies that the it catches the exception, logs that the evaluation condition was met for a step and moves on to evaluate the next step.
+##### Arguments
+- `rule_obj (Rule)`: An instance of a subclass of `smdebug.rules.Rule` that you want to invoke.
+- `start_step (int)`: A global step number from which to start invoking the rule. Note that this refers to a global step. This defaults to 0.
+- `end_step (int or None)`: The global step number before which to end the invocation of the rule. To clarify, `end_step` is an exclusive bound; the rule is not invoked at `end_step`. This defaults to `None`, which means it runs until the end of the job.
+- `raise_eval_cond (bool)`: This parameter controls whether to propagate the `RuleEvaluationConditionMet` exception when the rule raises it, or to catch it, log the message, and move to the next step. It defaults to `False`, which implies that it catches the exception, logs that the evaluation condition was met for a step, and moves on to evaluate the next step.

## Exceptions
-smdebug is designed to be aware that tensors required to evaluate a rule may not be available at every step. Hence, it raises a few exceptions which allow us to control what happens when a tensor is missing. These are available in the `smdebug.exceptions` module. You can import them as follows:
+`smdebug` is designed to be aware that the tensors required to evaluate a rule may not be available at every step. Hence, it raises a few exceptions which allow us to control what happens when a tensor is missing. These are available in the `smdebug.exceptions` module. You can import them as follows:

```python
from smdebug.exceptions import *
```

Here are the exceptions (along with others) and their meaning:

-- `TensorUnavailableForStep` : This means that the tensor requested is not available for the step. It may have been or will be saved for a different step number. You can check which steps tensor is saved for by `trial.tensor('tname').steps()` [api](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#steps-1). Note that this exception implies that the requested tensor will never become available for this step in the future.
+- `TensorUnavailableForStep`: This means that the tensor requested is not available for the step. It may have been or will be saved for a different step number. You can check which steps the tensor is saved for with `trial.tensor('tname').steps()` [api](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md#steps-1). Note that this exception implies that the requested tensor will never become available for this step in the future.

-- `TensorUnavailable` : This means that this tensor has not been saved from the training job. Note that if you have a `SaveConfig` which saves a certain tensor only after the time you queried for the tensor, you might get a `TensorUnavailable` exception even if the tensor may become available later for some step.
+- `TensorUnavailable`: This means that this tensor has not been saved from the training job. Note that if you have a `SaveConfig` which saves a certain tensor only after the time you queried for the tensor, you might get a `TensorUnavailable` exception even if the tensor may become available later for some step.

- `StepUnavailable`: This means that the step was not saved from the training job. No tensor will be available for this step.

- `StepNotYetAvailable`: This means that the step has not yet been seen from the training job. It may be available in the future if the training is still going on. We automatically load new data as and when it becomes available. This step may either become available in the future, or the exception might change to `StepUnavailable`.

-- `NoMoreData` : This will be raised when the training ends. Once you see this, you will know that there will be no more steps and no more tensors saved.
+- `NoMoreData`: This is raised when the training ends. Once you see this, you know that there are no more steps and no more tensors saved.

- `RuleEvaluationConditionMet`: This is raised when the rule invocation returns `True` for some step.

-- `MissingCollectionFiles`: This is raised when no data was saved by the training job. Check that the `Hook` was configured correctly before starting the training job.
+- `MissingCollectionFiles`: This is raised when no data was saved by the training job. Check that the `hook` was configured correctly before starting the training job.
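+
+For example, a minimal sketch of tolerating missing data while reading values (assumes a `trial` object and an illustrative tensor name):
+
+```python
+from smdebug.exceptions import StepUnavailable, TensorUnavailableForStep
+
+for step in range(0, 1000, 100):
+    try:
+        print(step, trial.tensor("loss").value(step))
+    except (StepUnavailable, TensorUnavailableForStep):
+        # Nothing was saved for this step; skip it.
+        continue
+```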
## Utils

### Enable or disable refresh of tensors in a trial

-By default smdebug refreshes tensors each time you try to query the tensor.
-It looks for whether this tensor is saved for new steps and if so fetches them.
+By default, `smdebug` refreshes tensors each time you try to query the tensor.
+It looks for whether this tensor is saved for new steps and, if so, fetches them.
If you know the saved data will not change (stopped the machine learning job), or are not interested in the latest data, you can stop the refreshing of tensors as follows:

`no_refresh` takes a trial or a list of trials, which should not be refreshed.
-Anything executed inside the with `no_refresh` block will not be refreshed.
+Anything executed inside a `with no_refresh` block is not refreshed.

```python
from smdebug.analysis.utils import no_refresh
with no_refresh(trials):
    pass
```

-Similarly if you want to refresh tensors only within a block, you can do:
+Similarly, if you want to refresh tensors only within a block, you can do so as follows:

```python
from smdebug.analysis.utils import refresh
with refresh(trials):
    pass
```

-During rule invocation smdebug waits till the current step is available and then turns off refresh to ensure that you do not get different results for methods like `trial.tensor(name).steps()` and run into subtle issues.
+During rule invocation, `smdebug` waits until the current step is available and then turns off refresh to ensure that you do not get different results for methods like `trial.tensor(name).steps()` and run into subtle issues.
diff --git a/docs/api.md b/docs/api.md
index fe01532df..c04c74ace 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -3,14 +3,14 @@
- [Glossary](#glossary)
- [Hook](#hook)
-  - [Creating a Hook](#creating-a-hook)
-  - [Hook when using SageMaker Python SDK](#hook-when-using-sagemaker-python-sdk)
-  - [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk)
-  - [Hook from Python constructor](#hook-from-python-constructor)
-  - [Common Hook API](#common-hook-api)
-  - [TensorFlow specific Hook API](#tensorflow-specific-hook-api)
-  - [MXNet specific Hook API](#mxnet-specific-hook-api)
-  - [PyTorch specific Hook API](#pytorch-specific-hook-api)
+  - [Creating a hook](#creating-a-hook)
+  - [Hook when using the SageMaker Python SDK](#hook-when-using-the-sagemaker-python-sdk)
+  - [Configuring a hook using the SageMaker Python SDK](#configuring-a-hook-using-the-sagemaker-python-sdk)
+  - [Hook from a Python constructor](#hook-from-a-python-constructor)
+  - [Common hook API](#common-hook-api)
+  - [TensorFlow-specific hook API](#tensorflow-specific-hook-api)
+  - [MXNet-specific hook API](#mxnet-specific-hook-api)
+  - [PyTorch-specific hook API](#pytorch-specific-hook-api)
- [Modes](#modes)
- [Collection](#collection)
- [SaveConfig](#saveconfig)
@@ -20,41 +20,41 @@
The imports assume `import smdebug.{tensorflow,pytorch,mxnet,xgboost} as smd`.
-**Step**: Step means one the work done by the training job for one batch (i.e. forward and backward pass). (An exception is with TensorFlow's Session interface, where a step also includes the initialization session run calls). SageMaker Debugger is designed in terms of steps. When to save data is specified using steps as well as the invocation of Rules is on a step-by-step basis.
+**Step**: A *step* is the work done by the training job for one batch (a forward and backward pass). (An exception is TensorFlow's Session interface, where a step also includes the initialization session run calls.) SageMaker Debugger is designed in terms of steps. When to save data is specified using steps. The invocation of rules is on a step-by-step basis.

-**Hook**: The main class to pass as a callback object, or to create callback functions. It keeps track of collections and writes output files at each step. The current hook implementation does not support merging tensors from current job with tensors from previous job(s). Hence ensure that the 'out_dir' does not exist prior to instantiating the 'Hook' object.
+**Hook**: A *hook* is the main class to pass as a callback object, or to create callback functions. It keeps track of collections and writes output files at each step. The current hook implementation does not support merging tensors from a current job with tensors from previous job(s). Hence, ensure that the `out_dir` does not exist prior to instantiating the `Hook` object.
- `hook = smd.Hook(out_dir="/tmp/mnist_job")`

-**Mode**: One of "train", "eval", "predict", or "global". Helpful for segmenting data based on the phase
-you're in. Defaults to "global".
+**Mode**: One of `train`, `eval`, `predict`, or `global`. Helpful for segmenting data based on the phase
+you're in. Defaults to `global`.
- `train_mode = smd.modes.TRAIN`

-**Collection**: A group of tensors. Each collection contains its configuration for what tensors are part of it, and when to save them.
+**Collection**: A group of tensors. Each collection contains its configuration for which tensors are part of it and when to save them.
- `collection = hook.get_collection("losses")`

**SaveConfig**: A Python dict specifying how often to save losses and tensors.
- `save_config = smd.SaveConfig(save_interval=10)`

-**ReductionConfig**: Allows you to save a reduction, such as 'mean' or 'l1 norm', instead of the full tensor. Reductions are simple floats.
+**ReductionConfig**: Allows you to save a reduction, such as `mean` or `l1 norm`, instead of the full tensor. Reductions are simple floats.
- `reduction_config = smd.ReductionConfig(reductions=['min', 'max', 'mean'], norms=['l1'])`

**Trial**: The main interface to use when analyzing a completed training job. Access collections and tensors. See [trials documentation](analysis.md).
- `trial = smd.create_trial(out_dir="/tmp/mnist_job")`

-**Rule**: A condition to monitor the saved data for. It can trigger an exception when the condition is met, for example a vanishing gradient. See [rules documentation](analysis.md).
+**Rule**: A condition to watch for in the saved data. It can trigger an exception when the condition is met; for example, a vanishing gradient. See [rules documentation](analysis.md).

---

## Hook

-### Creating a Hook
-By using AWS Deep Learning Containers, you can directly run your own training script without any additional effort to make it compatible with the SageMaker Python SDK. For a detailed developer guide for this, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).
+### Creating a hook
+By using AWS Deep Learning Containers, you can directly run your own training script without any additional effort to make it compatible with the SageMaker Python SDK. For a detailed developer guide, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).

-However, for some advanced use cases where you need access to customized tensors from targeted parts of a training script, you can manually construct the hook object. The smdebug library provides hook classes to make this process simple and compatible with the SageMaker ecosystem and Debugger.
+However, for some advanced use cases in which you need access to customized tensors from targeted parts of a training script, you can manually construct the hook object. The `smdebug` library provides hook classes to make this process simple and compatible with the SageMaker ecosystem and Debugger.

#### Hook when using the SageMaker Python SDK
If you create a SageMaker job and specify the hook configuration in the SageMaker Estimator API as described in [AWS Docs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html),
-the CreateTrainingJob API operation containing the hook configuration will be automatically written to the training container.
+the CreateTrainingJob API operation containing the hook configuration is automatically written to the training container.

To capture tensors from your training model, paste the following code to the top or the main function of the training script.
```python
@@ -62,9 +62,9 @@
import smdebug.Framework as smd
hook = smd.HookClass.create_from_json_file()
```

-Depending on your choice of framework, `HookClass` need to be replaced by one of `KerasHook`, `SessionHook` or `EstimatorHook` for TensorFlow, and `Hook` for PyTorch, MXNet, and XGBoost.
+Depending on your choice of framework, `HookClass` must be replaced by one of `KerasHook`, `SessionHook`, or `EstimatorHook` for TensorFlow, and `Hook` for PyTorch, MXNet, and XGBoost.

-The framework in `smd.Framework` import refers to one of `tensorflow`, `mxnet`, `pytorch`, or `xgboost`.
+The framework in the `smd.Framework` import refers to one of `tensorflow`, `mxnet`, `pytorch`, or `xgboost`.

After choosing a framework and defining the hook object, you need to embed the hooks into target parts of your training script to retrieve tensors and to use with the SageMaker Debugger Python SDK.

For more information about constructing the hook depending on a framework of your choice, see the following pages.
* [TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md)
* [MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md)
* [PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md)
* [XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md)

-#### Configuring Hook using SageMaker Python SDK
+#### Configuring a hook using the SageMaker Python SDK
After you make the minimal changes to your training script, you can configure the hook with parameters to the SageMaker Debugger API operation, `DebuggerHookConfig`.

```python
from sagemaker.debugger import DebuggerHookConfig
hook_config = DebuggerHookConfig(
@@ -87,7 +87,7 @@ hook_config = DebuggerHookConfig(
    })
```

-The available hook parameters are listed in the following. The meaning of these parameters will be clear as you review the sections of documentation below. Note that all parameters below have to be strings. So for any parameter which accepts a list (such as save_steps, reductions, include_regex), the value needs to be given as strings separated by a comma between them.
+The available hook parameters are provided in the following list. Further information about these parameters is provided in the following sections. All parameters below have to be strings. For any parameter which accepts a list (such as `save_steps`, `reductions`, `include_regex`), the value must be given as strings separated by a comma between them.

```
dry_run
@@ -119,10 +119,10 @@ global.start_step
global.end_step
```

-#### Hook from Python constructor
+#### Hook from a Python constructor
See the framework-specific pages for more details.

-HookClass below can be one of `KerasHook`, `SessionHook`, `EstimatorHook` for TensorFlow, or is just `Hook` for MXNet, Pytorch and XGBoost.
+The following HookClass can be one of `KerasHook`, `SessionHook`, or `EstimatorHook` for TensorFlow. It's just `Hook` for MXNet, PyTorch, and XGBoost.

```python
hook = HookClass(
    out_dir,
    export_tensorboard = False,
    tensorboard_dir = None,
    dry_run = False,
    reduction_config = None,
    save_config = None,
    include_regex = None,
    include_collections = None,
    save_all = False,
    include_workers="one"
)
```
##### Arguments
-- `out_dir` (str): Path where to save tensors and metadata. This is a required argument. Please ensure that the 'out_dir' does not exist.
-- `export_tensorboard` (bool): Whether to export TensorBoard summaries (distributions and histograms for tensors saved, and scalar summaries for scalars saved). Defaults to `False`. Note that when running on SageMaker this parameter will be ignored. You will need to use the TensorBoardOutputConfig section in API to enable TensorBoard summaries. Refer [SageMaker page](sagemaker.md) for an example.
-- `tensorboard_dir` (str): Path where to save TensorBoard artifacts. If this is not passed and `export_tensorboard` is True, then TensorBoard artifacts are saved in `out_dir/tensorboard` . Note that when running on SageMaker this parameter will be ignored. You will need to use the TensorBoardOutputConfig section in API to enable TensorBoard summaries. Refer [SageMaker page](sagemaker.md) for an example.
-- `dry_run` (bool): If true, don't write any files
+- `out_dir` (str): Path to which to save tensors and metadata. This is a required argument. Please ensure that the `out_dir` does not exist.
+- `export_tensorboard` (bool): Whether to export TensorBoard summaries (distributions and histograms for tensors saved, and scalar summaries for scalars saved). Defaults to `False`. When running on SageMaker, this parameter is ignored. You must use the TensorBoardOutputConfig section in the API to enable TensorBoard summaries. See the [SageMaker page](sagemaker.md) for an example.
+- `tensorboard_dir` (str): Path to which to save TensorBoard artifacts. If this is not passed and `export_tensorboard` is `True`, then TensorBoard artifacts are saved in `out_dir/tensorboard`. When running on SageMaker, this parameter is ignored. You must use the TensorBoardOutputConfig section in the API to enable TensorBoard summaries. See the [SageMaker page](sagemaker.md) for an example.
+- `dry_run` (bool): If `True`, don't write any files.
- `reduction_config`: ([ReductionConfig](#reductionconfig) object) Specifies the reductions to be applied as default for tensors saved. A collection can have its own `ReductionConfig` object which overrides this for the tensors which belong to that collection.
- `save_config`: ([SaveConfig](#saveconfig) object) Specifies when to save tensors. A collection can have its own `SaveConfig` object which overrides this for the tensors which belong to that collection.
-- `include_regex` (list[str]): list of regex patterns which specify the tensors to save. Tensors whose names match these patterns will be saved
-- `include_collections` (list[str]): List of which collections to save specified by name
+- `include_regex` (list[str]): List of regex patterns which specify the tensors to save. Tensors whose names match these patterns are saved.
+- `include_collections` (list[str]): List of collections to save, specified by name.
- `save_all` (bool): Saves all tensors and collections. Increases the amount of disk space used, and can reduce the performance of the training job significantly, depending on the size of the model.
-- `include_workers` (str): Used for distributed training. It can take the values `one` or `all`. `one` means only the tensors from one chosen worker will be saved. This is the default behavior. `all` means tensors from all workers will be saved.
+- `include_workers` (str): Used for distributed training. It can take the values `one` or `all`. `one` means only the tensors from one chosen worker are saved. This is the default behavior. `all` means tensors from all workers are saved.
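+
+For example, a minimal sketch of constructing a hook directly (shown for PyTorch; the values are illustrative):
+
+```python
+import smdebug.pytorch as smd
+
+hook = smd.Hook(
+    out_dir="/tmp/my_job",  # must not already exist
+    save_config=smd.SaveConfig(save_interval=100),
+    include_collections=["weights", "gradients"],
+)
+```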

### Common Hook API
These methods are common for all hooks in any framework.
-Note that `smd` import below translates to `import smdebug.{framework} as smd`.
+Note that the `smd` import below translates to `import smdebug.{framework} as smd`.

| Method | Arguments | Behavior |
| --- | --- | --- |
-|`add_collection(collection)` | `collection (smd.Collection)` | Takes a Collection object and adds it to the CollectionManager that the Hook holds. Note that you should only pass in a Collection object for the same framework as the hook |
-|`get_collection(name)`| `name (str)` | Returns collection identified by the given name |
-|`get_collections()` | - | Returns all collection objects held by the hook |
-|`set_mode(mode)`| value of the enum `smd.modes` | Sets mode of the job, can be one of `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT` or `smd.modes.GLOBAL`. Refer [Modes](#modes) for more on that. |
-|`create_from_json_file(` ` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the json configuration of the hook, and creates hook from that configuration. This is an optional parameter. If this is not passed it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker you do not have to specify any path because this is the default path that SageMaker writes the hook configuration to.
-|`close()` | - | Closes all files that are currently open by the hook |
-| `save_scalar()` | `name (str)` `value (float)` `sm_metric (bool)`| Saves a scalar value by the given name. Passing `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. Note that when `sm_metric` is False, this scalar always resides only in your AWS account, but setting it to True saves the scalar also on AWS servers. The default value of `sm_metric` for this method is False. |
-| `save_tensor()`| `tensor_name (str)`, `tensor_value (numpy.array or numpy.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor or `save_tensor()`.|
+|`add_collection(collection)` | `collection (smd.Collection)` | Takes a collection object and adds it to the CollectionManager that the hook holds. Note that you should only pass in a collection object for the same framework as the hook. |
+|`get_collection(name)`| `name (str)` | Returns the collection identified by the given name. |
+|`get_collections()` | - | Returns all collection objects held by the hook. |
+|`set_mode(mode)`| value of the enum `smd.modes` | Sets the mode of the job. Can be one of `smd.modes.TRAIN`, `smd.modes.EVAL`, `smd.modes.PREDICT`, or `smd.modes.GLOBAL`. See [Modes](#modes) for more information. |
+|`create_from_json_file(` ` json_file_path=None)` | `json_file_path (str)` | Takes the path of a file which holds the JSON configuration of the hook and creates the hook from that configuration. This is an optional parameter. If this is not passed, it tries to get the file path from the value of the environment variable `SMDEBUG_CONFIG_FILE_PATH` and defaults to `/opt/ml/input/config/debughookconfig.json`. When training on SageMaker, you do not have to specify any path because this is the default path to which SageMaker writes the hook configuration.
+|`close()` | - | Closes all files that are currently open by the hook. |
+| `save_scalar()` | `name (str)` `value (float)` `sm_metric (bool)`| Saves a scalar value by the given name. Passing the `sm_metric=True` flag also makes this scalar available as a SageMaker Metric to show up in SageMaker Studio. When `sm_metric` is `False`, this scalar always resides only in your AWS account, but setting it to `True` also saves the scalar on AWS servers. The default value of `sm_metric` for this method is `False`. |
+| `save_tensor()`| `tensor_name (str)`, `tensor_value (numpy.array or numpy.ndarray)`, `collections_to_write (str or list[str])` | Manually save metrics tensors. The `record_tensor_value()` API is deprecated in favor of `save_tensor()`.|
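+
+For example, a minimal sketch of saving a custom scalar and tensor from a training loop (assumes a hook already exists; the names are illustrative):
+
+```python
+import numpy as np
+
+hook.save_scalar("my_train_accuracy", 0.93)
+hook.save_tensor("my_gradients", np.zeros((3, 3)), collections_to_write="default")
+```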

### TensorFlow specific Hook API
-Note that there are three types of Hooks in TensorFlow: SessionHook, EstimatorHook and KerasHook based on the TensorFlow interface being used for training. [This page](tensorflow.md) shows examples of each of these.
+Note that there are three types of hooks in TensorFlow: SessionHook, EstimatorHook, and KerasHook, based on the TensorFlow interface being used for training. [This page](tensorflow.md) shows examples of each of these.

| Method | Arguments | Returns | Behavior |
| --- | --- | --- | --- |
-| `wrap_optimizer(optimizer)` | `optimizer` (tf.train.Optimizer or tf.keras.Optimizer) | Returns the same optimizer object passed with a couple of identifying markers to help `smdebug`. This returned optimizer should be used for training. | When not using Zero Script Change environments, calling this method on your optimizer is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same optimizer object passed and does not change your optimization logic. If the hook is of type `KerasHook`, you can pass in either an object of type `tf.train.Optimizer` or `tf.keras.Optimizer`. If the hook is of type `SessionHook` or `EstimatorHook`, the optimizer can only be of type `tf.train.Optimizer`. This new
-| `add_to_collection(` `collection_name, variable)` | `collection_name (str)` : name of the collection to add to. `variable` parameter to pass to the collection's `add` method. | `None` | Calls the `add` method of a collection object. See [this section](#collection) for more. |
+| `wrap_optimizer(optimizer)` | `optimizer` (tf.train.Optimizer or tf.keras.Optimizer) | Returns the same optimizer object passed with a couple of identifying markers to help `smdebug`. This returned optimizer should be used for training. | When not using zero script change environments, calling this method on your optimizer is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same optimizer object passed and does not change your optimization logic. If the hook is of the `KerasHook` type, you can pass in either an object of the `tf.train.Optimizer` or `tf.keras.Optimizer` type. If the hook is of the `SessionHook` or `EstimatorHook` type, the optimizer can only be of the `tf.train.Optimizer` type. |
+| `add_to_collection(` `collection_name, variable)` | `collection_name (str)` : The name of the collection to add to. `variable` parameter to pass to the collection's `add` method. | `None` | Calls the `add` method of a collection object. See [this section](#collection) for more information. |

The following hook APIs are specific to training scripts using the TF 2.x GradientTape ([Example](tensorflow.md#TF 2.x GradientTape example)):

| Method | Arguments | Returns | Behavior |
| --- | --- | --- | --- |
-| `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using Zero Script Change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. Note that this method returns the same tape object passed.
+| `wrap_tape(tape)` | `tape` (tensorflow.python.eager.backprop.GradientTape) | Returns a tape object with three identifying markers to help `smdebug`. This returned tape should be used for training. | When not using zero script change environments, calling this method on your tape is necessary for SageMaker Debugger to identify and save gradient tensors. This method returns the same tape object passed.

-### MXNet specific Hook API
+### MXNet-specific hook API

| Method | Arguments | Behavior |
| --- | --- | --- |
| `register_block(block)` | `block (mx.gluon.Block)` | Calling this method applies the hook to the Gluon block representing the model, so SageMaker Debugger gets called by MXNet and can save the tensors required. |

-### PyTorch specific Hook API
+### PyTorch-specific hook API

| Method | Arguments | Behavior |
| --- | --- | --- |
| `register_module(module)` | `module (torch.nn.Module)` | Calling this method applies the hook to the Torch Module representing the model, so SageMaker Debugger gets called by PyTorch and can save the tensors required. |
-| `register_loss(loss_module)` | `loss_module (torch.nn.modules.loss._Loss)` | Calling this method applies the hook to the Torch Module representing the loss, so SageMaker Debugger can save losses |
+| `register_loss(loss_module)` | `loss_module (torch.nn.modules.loss._Loss)` | Calling this method applies the hook to the Torch Module representing the loss, so SageMaker Debugger can save losses. |

---

## Modes
Used to signify which part of training you're in, similar to Keras modes. `GLOBAL` mode is used as
-a default when no mode was set. Choose from
+a default when no mode was set. Choose from:
```python
smdebug.modes.TRAIN
smdebug.modes.EVAL
@@ -207,43 +207,43 @@ smdebug.modes.PREDICT
smdebug.modes.GLOBAL
```

-The modes enum is also available under the alias `smdebug.{framework}.modes`.
+The `modes` enum is also available under the alias `smdebug.{framework}.modes`.
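+
+For example, a minimal sketch of segmenting saved data by phase (assumes a hook created earlier in the script):
+
+```python
+import smdebug.pytorch as smd
+
+hook.set_mode(smd.modes.TRAIN)
+# ... run training steps ...
+hook.set_mode(smd.modes.EVAL)
+# ... run evaluation steps ...
+```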

---

## Collection
-The construct of a Collection groups tensors together. A Collection is identified by a string representing the name of the collection. It can be used to group tensors of a particular kind such as "losses", "weights", "biases", or "gradients". A Collection has its own list of tensors specified by include regex patterns, and other parameters determining how these tensors should be saved and when. Using collections enables you to save different types of tensors at different frequencies and in different forms. These collections are then also available during analysis so you can query a group of tensors at once.
+The construct of a collection groups tensors together. A collection is identified by a string representing the name of the collection. It can be used to group tensors of a particular kind such as losses, weights, biases, or gradients. A collection has its own list of tensors, specified by regex patterns to include, and other parameters determining how and when these tensors should be saved. Using collections enables you to save different types of tensors at different frequencies and in different forms. These collections are then also available during analysis so you can query a group of tensors at once.

There are a number of built-in collections that SageMaker Debugger manages by default. This means that the library takes care of identifying what tensors should be saved as part of that collection. You can also define custom collections in a couple of different ways.

You can specify which of these collections to save in the hook's `include_collections` parameter, or through the `collection_configs` parameter to the `DebuggerHookConfig` in the SageMaker Python SDK.

-### Built in Collections
-Below is a comprehensive list of the built-in collections that are managed by SageMaker Debugger. The Hook identifes the tensors that should be saved as part of that collection for that framework and saves them if they were requested.
+### Built-in collections
+The following is a comprehensive list of the built-in collections managed by SageMaker Debugger. The hook identifies the tensors that should be saved as part of that collection for that framework and saves them if they were requested.

-The names of these collections are all lower case strings.
+The names of these collections are all lowercase strings.

| Name | Supported by frameworks/hooks | Description |
|---|---|---|
-| `all` | all | Matches all tensors |
-| `default` | all | It's a default collection created, which matches the regex patterns passed as `include_regex` to the Hook |
-| `weights` | TensorFlow, PyTorch, MXNet | Matches all weights of the model |
-| `biases` | TensorFlow, PyTorch, MXNet | Matches all biases of the model |
-| `gradients` | TensorFlow, PyTorch, MXNet | Matches all gradients of the model. In TensorFlow when not using Zero Script Change environments, must use `hook.wrap_optimizer()`. |
-| `losses` | TensorFlow, PyTorch, MXNet | Saves the loss for the model |
-| `metrics` | TensorFlow's KerasHook, XGBoost | For KerasHook, saves the metrics computed by Keras for the model. For XGBoost, the evaluation metrics computed by the algorithm. |
-| `outputs` | TensorFlow's KerasHook | Matches the outputs of the model |
-| `layers` | TensorFlow's KerasHook | Input and output of intermediate convolutional layers |
-| `sm_metrics` | TensorFlow | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger will save these scalars both to the out_dir of the hook, as well as to SageMaker Metric. Note that the scalars passed here will be saved on AWS servers outside of your AWS account. |
-| `optimizer_variables` | TensorFlow's KerasHook | Matches all optimizer variables, currently only supported in Keras. |
+| `all` | all | Matches all tensors. |
+| `default` | all | The default collection created, which matches the regex patterns passed as `include_regex` to the hook. |
+| `weights` | TensorFlow, PyTorch, MXNet | Matches all weights of the model. |
+| `biases` | TensorFlow, PyTorch, MXNet | Matches all biases of the model. |
+| `gradients` | TensorFlow, PyTorch, MXNet | Matches all gradients of the model. In TensorFlow, when not using zero script change environments, must use `hook.wrap_optimizer()`. |
+| `losses` | TensorFlow, PyTorch, MXNet | Saves the loss for the model. |
+| `metrics` | TensorFlow's KerasHook, XGBoost | For KerasHook, saves the metrics computed by Keras for the model. For XGBoost, saves the evaluation metrics computed by the algorithm. |
+| `outputs` | TensorFlow's KerasHook | Matches the outputs of the model. |
+| `layers` | TensorFlow's KerasHook | Input and output of intermediate convolutional layers. |
+| `sm_metrics` | TensorFlow | You can add scalars that you want to show up in SageMaker Metrics to this collection. SageMaker Debugger saves these scalars both to the `out_dir` of the hook and to SageMaker Metrics. The scalars passed here are saved on AWS servers outside of your AWS account. |
+| `optimizer_variables` | TensorFlow's KerasHook | Matches all optimizer variables. Currently only supported in Keras. |
| `hyperparameters` | XGBoost | [Booster parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) |
+| `predictions` | XGBoost | Predictions on validation set (if provided). |
+| `labels` | XGBoost | Labels on validation set (if provided). |
+| `feature_importance` | XGBoost | Feature importance given by [get_score()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.get_score). |
+| `full_shap` | XGBoost | A matrix of (nsample, nfeatures + 1) with each record indicating the feature contributions ([SHAP values](https://github.com/slundberg/shap)) for that prediction. Computed on training data with [predict()](https://github.com/slundberg/shap). |
| `average_shap` | XGBoost | The sum of SHAP value magnitudes over all samples. Represents the impact each feature has on the model output. |
-| `trees` | XGBoost | Boosted tree model given by [trees_to_dataframe()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.trees_to_dataframe) |
+| `trees` | XGBoost | Boosted tree model given by [trees_to_dataframe()](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.trees_to_dataframe). |

### Default collections saved
The following collections are saved regardless of the hook configuration.

@@ -256,8 +256,8 @@ The following collections are saved regardless of the hook configuration.
| `XGBoost` | METRICS |

- If for some reason, you want to disable the saving of these collections, you can do so by setting end_step to 0 in the collection's SaveConfig.
+ If you want to disable the saving of these collections, you can do so by setting `end_step` to `0` in the collection's SaveConfig.
+ When using the SageMaker Python SDK, this would look like the following example:
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
hook_config = DebuggerHookConfig(
@@ -267,46 +267,46 @@ hook_config = DebuggerHookConfig(
    ]
)
```
- When configuring the Collection in your Python script, it would be as follows:
+ When configuring the collection in your Python script, it should resemble the following example:
```python
hook.get_collection("metrics").save_config.end_step = 0
```

-### Creating or retrieving a Collection
+### Creating or retrieving a collection

| Function | Behavior |
|---|---|
-| ```hook.get_collection(collection_name)``` | Returns the collection with the given name. Creates the collection with default configuration if it doesn't already exist. A new collection created by default does not match any tensor and is configured to save histograms and distributions along with the tensor if tensorboard support is enabled, and uses the reduction configuration and save configuration passed to the hook. |
+| ```hook.get_collection(collection_name)``` | Returns the collection with the given name. Creates the collection with the default configuration if it doesn't already exist. A new collection created by default does not match any tensor and is configured to save histograms and distributions along with the tensor if tensorboard support is enabled, and uses the reduction configuration and save configuration passed to the hook. |

-### Properties of a Collection
+### Properties of a collection

| Property | Description |
|---|---|
-| `tensor_names` | Get or set list of tensor names as strings |
-| `include_regex` | Get or set list of regexes to include. Tensors whose names match these regex patterns will be included in the collection |
-| `reduction_config` | Get or set the ReductionConfig object to be used for tensors part of this collection |
-| `save_config` | Get or set the SaveConfig object to be used for tensors part of this collection |
-| `save_histogram` | Get or set the boolean flag which determines whether to write histograms to enable histograms and distributions in TensorBoard, for tensors part of this collection. Only applicable if TensorBoard support is enabled.|
+| `tensor_names` | Get or set list of tensor names as strings. |
+| `include_regex` | Get or set list of regexes to include. Tensors whose names match these regex patterns are included in the collection. |
+| `reduction_config` | Get or set the ReductionConfig object to be used for tensors that are part of this collection. |
+| `save_config` | Get or set the SaveConfig object to be used for tensors that are part of this collection. |
+| `save_histogram` | Get or set the boolean flag which determines whether to write histograms to enable histograms and distributions in TensorBoard for tensors that are part of this collection. Only applicable if TensorBoard support is enabled.|

-### Methods on a Collection
+### Methods on a collection

| Method | Behavior |
|---|---|
| ```coll.include(regex)``` | Takes a regex string or a list of regex strings to match tensors to include in the collection. |
| ```coll.add(tensor)``` | **(TensorFlow only)** Takes an instance or list or set of tf.Tensor/tf.Variable/tf.MirroredVariable/tf.Operation to add to the collection. |
-| ```coll.add_keras_layer(layer, inputs=False, outputs=True)``` | **(tf.keras only)** Takes an instance of a tf.keras layer and logs input/output tensors for that module. By default, only outputs are saved. |
| -| ```coll.add_module_tensors(module, inputs=False, outputs=True)``` | **(PyTorch only)** Takes an instance of a PyTorch module and logs input/output tensors for that module. By default, only outputs are saved. |
-| ```coll.add_block_tensors(block, inputs=False, outputs=True)``` | **(MXNet only)** Takes an instance of a Gluon block,and logs input/output tensors for that module. By default, only outputs are saved. |
+| ```coll.add_keras_layer(layer, inputs=False, outputs=True)``` | **(tf.keras only)** Takes an instance of a tf.keras layer and logs input and output tensors for that module. By default, only outputs are saved. |
+| ```coll.add_module_tensors(module, inputs=False, outputs=True)``` | **(PyTorch only)** Takes an instance of a PyTorch module and logs input and output tensors for that module. By default, only outputs are saved. |
+| ```coll.add_block_tensors(block, inputs=False, outputs=True)``` | **(MXNet only)** Takes an instance of a Gluon block, and logs input and output tensors for that module. By default, only outputs are saved. |

-### Configuring Collection using SageMaker Python SDK
-Parameters to configure Collection are passed as below when using the SageMaker Python SDK.
+### Configuring a collection using the SageMaker Python SDK
+Parameters to configure a collection are passed as in the following example when using the SageMaker Python SDK.
```python
from sagemaker.debugger import CollectionConfig
coll_config = CollectionConfig(
    name="weights",
    parameters={ "parameter": "value" })
```
-The parameters can be one of the following. The meaning of these parameters will be clear as you review the sections of documentation below. Note that all parameters below have to be strings. So any parameter which accepts a list (such as save_steps, reductions, include_regex), needs to be given as strings separated by a comma between them.
+The parameters can be one of the following. The following sections provide more detail about these parameters. Note that all of the following parameters must be strings. Any parameter which accepts a list (such as `save_steps`, `reductions`, or `include_regex`) must be given as a single comma-separated string.

```
include_regex
@@ -340,11 +340,11 @@ global.end_step
## SaveConfig
The SaveConfig class customizes the frequency of saving tensors.
-The hook takes a SaveConfig object which is applied as default to all tensors included.
-A collection can also have a SaveConfig object which is applied to the collection's tensors.
+The hook takes a SaveConfig object, which is applied as a default to all included tensors.
+A collection can have a SaveConfig object which is applied to the collection's tensors.
You can also choose to have different configuration for when to save tensors based on the mode of the job.

-This class is available in the following namespaces `smdebug` and `smdebug.{framework}`.
+This class is available in the following namespaces: `smdebug` and `smdebug.{framework}`.

```python
import smdebug as smd
save_config = smd.SaveConfig(
save_steps = None,
)
```
-##### Arguments
+### Arguments
- `mode_save_configs` (dict): Used for advanced cases; see details below.
-- `save_interval` (int): How often, in steps, to save tensors. Defaults to 500. A step is saved if `step % save_interval == 0`
+- `save_interval` (int): How often, in steps, to save tensors. Defaults to 500. A step is saved if `step % save_interval == 0`.
- `start_step` (int): When to start saving tensors.
-- `end_step` (int): When to stop saving tensors, exclusive.
-- `save_steps` (list[int]): Specific steps to save tensors at. Union with save_interval.
+- `end_step` (int): When to stop saving tensors (exclusive).
+- `save_steps` (list[int]): Specific steps at which to save tensors. Union with `save_interval`.

-##### Examples
+### Examples

-- `SaveConfig()` will save at steps 0, 500, ...
-- `SaveConfig(save_interval=1)` will save at steps 0, 1, ...
-- `SaveConfig(save_interval=100, end_step=200)` will save at steps 0, 100
-- `SaveConfig(save_interval=100, end_step=201)` will save at steps 0, 100, 200
-- `SaveConfig(save_interval=100, start_step=150)` will save at steps 200, 300, ...
-- `SaveConfig(save_steps=[3, 7])` will save at steps 0, 3, 7, 500, ...
+- `SaveConfig()` saves at steps 0, 500, ...
+- `SaveConfig(save_interval=1)` saves at steps 0, 1, ...
+- `SaveConfig(save_interval=100, end_step=200)` saves at steps 0, 100
+- `SaveConfig(save_interval=100, end_step=201)` saves at steps 0, 100, 200
+- `SaveConfig(save_interval=100, start_step=150)` saves at steps 200, 300, ...
+- `SaveConfig(save_steps=[3, 7])` saves at steps 0, 3, 7, 500, ...

-##### Specifying different configuration based on mode
-There is also a more advanced use case, where you specify a different SaveConfig for each mode.
+### Specifying different configuration based on mode
+There is also a more advanced use case in which you specify a different SaveConfig for each mode.
It is best understood through an example:
```python
import smdebug as smd
smd.SaveConfig(mode_save_configs={
@@ -385,27 +385,27 @@ smd.SaveConfig(mode_save_configs={
})
```
Essentially, create a dictionary mapping modes to SaveConfigMode objects. The SaveConfigMode objects
-take the same four parameters (save_interval, start_step, end_step, save_steps) as the main object.
-Any mode not specified will default to the default configuration. If a mode is provided but not all
-params are specified, we use the default values for non-specified parameters.
+take the same four parameters (`save_interval`, `start_step`, `end_step`, and `save_steps`) as the main object.
+Any mode not specified falls back to the default configuration. If a mode is provided but not all
+parameters are specified, we use the default values for non-specified parameters.

-#### Configuration using SageMaker Python SDK
-Refer [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring Collection using SageMaker Python SDK](#configuring-collection-using-sagemaker-python-sdk)
+### Configuration using the SageMaker Python SDK
+See [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring a collection using the SageMaker Python SDK](#configuring-a-collection-using-the-sagemaker-python-sdk).

---

## ReductionConfig
ReductionConfig allows the saving of certain reductions of tensors instead
-of saving the full tensor. The motivation here is to reduce the amount of data
-saved, and increase the speed in cases where you don't need the full
-tensor. The reduction operations which are computed in the training process
+of saving the full tensor to reduce the amount of data
+saved and increase the speed in cases where you don't need the full
+tensor. The reduction operations are computed in the training process
and then saved. During analysis, these are available as reductions of
the original tensor.
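+
+For example, a minimal sketch of retrieving a saved reduction during analysis might look like the following (the output path, tensor name, and step are illustrative):
+```python
+from smdebug.trials import create_trial
+
+trial = create_trial("/path/to/out_dir")
+# Retrieve the mean of the tensor at step 100, as saved by the ReductionConfig.
+mean_at_100 = trial.tensor("conv0_weight").reduction_value(100, "mean")
+```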
-Please note that using reduction config means that you will not have
-the full tensor available during analysis, so this can restrict what you can do with the tensor saved. You can choose to also save the raw tensor along with the reductions if you so desire.
+Using ReductionConfig means that you do not have
+the full tensor available during analysis, so this can restrict what you can do with the saved tensor. You can also save the raw tensor along with the reductions.

-The hook takes a ReductionConfig object which is applied as default to all tensors included.
+The hook takes a ReductionConfig object, which is applied as a default to all included tensors.
A collection can also have its own ReductionConfig object which is applied
to the tensors belonging to that collection.

@@ -420,24 +420,24 @@ reduction_config = smd.ReductionConfig(
)
```

-##### Arguments
-- `reductions` (list[str]): Takes names of reductions, choosing from "min", "max", "median", "mean", "std", "variance", "sum", "prod"
-- `abs_reductions` (list[str]): Same as reductions, except the reduction will be computed on the absolute value of the tensor
-- `norms` (list[str]): Takes names of norms to compute, choosing from "l1", "l2"
-- `abs_norms` (list[str]): Same as norms, except the norm will be computed on the absolute value of the tensor
-- `save_raw_tensor` (bool): Saves the tensor directly, in addition to other desired reductions
+### Arguments
+- `reductions` (list[str]): Takes names of reductions, choosing from `min`, `max`, `median`, `mean`, `std`, `variance`, `sum`, and `prod`.
+- `abs_reductions` (list[str]): Similar to `reductions`, except the reductions are computed on the absolute value of the tensor.
+- `norms` (list[str]): Takes the names of norms to compute, choosing from `l1` and `l2`.
+- `abs_norms` (list[str]): Similar to `norms`, except the norm is computed on the absolute value of the tensor.
+- `save_raw_tensor` (bool): Saves the tensor directly, in addition to other desired reductions.

-For example,
+The following example saves the standard deviation and variance, the mean of the absolute value, and the l1 norm:
`ReductionConfig(reductions=['std', 'variance'], abs_reductions=['mean'], norms=['l1'])`
-will save the standard deviation and variance, the mean of the absolute value, and the l1 norm.

-#### Configuration using SageMaker Python SDK
-The reductions are passed as part of the "reductions" parameter to HookParameters or Collection Parameters.
-Refer [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring Collection using SageMaker Python SDK](#configuring-collection-using-sagemaker-python-sdk) for more on that.
-The parameter "reductions" can take a comma separated string consisting of the following values:
+### Configuration using the SageMaker Python SDK
+The reductions are passed as part of the `reductions` parameter to HookParameters or CollectionParameters.
+See [Configuring Hook using SageMaker Python SDK](#configuring-hook-using-sagemaker-python-sdk) and [Configuring a collection using the SageMaker Python SDK](#configuring-a-collection-using-the-sagemaker-python-sdk) for more information.
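+
+For example, a minimal sketch of passing reductions through the `CollectionConfig` parameters (the collection name and reduction values here are illustrative):
+```python
+from sagemaker.debugger import CollectionConfig
+
+# Save only the mean, max, and l1 norm reductions for this collection,
+# instead of the full tensors.
+coll_config = CollectionConfig(
+    name="weights",
+    parameters={"reductions": "mean,max,l1"})
+```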
+
+The `reductions` parameter can take a comma-separated string consisting of the following values:
```
min
max
median
mean
std
variance
sum
prod
l1
l2
abs_min
abs_max
abs_median
abs_mean
abs_std
abs_variance
abs_sum
abs_prod
abs_l1
abs_l2
```

## Frameworks

-For details on what's supported for different framework, go here:
+For details on what's supported for different frameworks, see the following pages:
* [TensorFlow](tensorflow.md)
* [PyTorch](pytorch.md)
* [MXNet](mxnet.md)
diff --git a/docs/env_var.md b/docs/env_var.md
index 81ac526fc..fd7d84620 100644
--- a/docs/env_var.md
+++ b/docs/env_var.md
@@ -1,21 +1,20 @@
-## Environment Variables
+# Environment variables

-#### `USE_SMDEBUG`:
+## `USE_SMDEBUG`:

-When using official [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/) which support the [Zero Script Change experience](sagemaker.md#zero-script-change), SageMaker Debugger can be disabled by setting this variable to `0`. In such a case, the hook is disabled regardless of what configuration is given to the job through the SageMaker Python SDK. By default this is set to `1` signifying True.
+When using official [SageMaker Framework Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) and [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/) which support the [Zero Script Change experience](sagemaker.md#zero-script-change), you can disable SageMaker Debugger by setting this variable to `0`. In such a case, the hook is disabled regardless of what configuration is given to the job through the SageMaker Python SDK. By default, this is set to `1`, signifying `True`.

-#### `SMDEBUG_CONFIG_FILE_PATH`:
+## `SMDEBUG_CONFIG_FILE_PATH`:

-Contains the path to the JSON file that describes the smdebug hook.
+Contains the path to the JSON file that describes the `smdebug` hook.

-At the minimum, the JSON config should contain the path where smdebug should output tensors.
-Example:
+At the minimum, the JSON config should contain the path where `smdebug` should output tensors, as shown in the following example:

`{ "LocalPath": "/my/smdebug_hook/path" }`

-In SageMaker environment, this path is set to point to a pre-defined location containing a valid JSON.
-In non-SageMaker environment, SageMaker-Debugger is not used if this environment variable is not set and
+In a SageMaker environment, this path is set to point to a pre-defined location containing a valid JSON.
+In a non-SageMaker environment, Debugger is not used if this environment variable is not set and
a hook is not created manually.

Sample JSON from which a hook can be created:
@@ -47,62 +46,60 @@ Sample JSON from which a hook can be created:
```

-#### `TENSORBOARD_CONFIG_FILE_PATH`:
+## `TENSORBOARD_CONFIG_FILE_PATH`:

-Contains the path to the JSON file that specifies where TensorBoard artifacts need to
+This contains the path to the JSON file that specifies where TensorBoard artifacts need to
be placed.

Sample JSON file:

`{ "LocalPath": "/my/tensorboard/path" }`

-In SageMaker environment, the presence of this JSON is necessary to log any Tensorboard artifact.
+In a SageMaker environment, the presence of this JSON is necessary to log any TensorBoard artifact.
By default, this path is set to point to a pre-defined location in SageMaker.

-tensorboard_dir can also be passed while creating the hook using the API or
-in the JSON specified in SMDEBUG_CONFIG_FILE_PATH.
For this, export_tensorboard should be set to True.
-This option to set tensorboard_dir is available in both, SageMaker and non-SageMaker environments.
+`tensorboard_dir` can also be passed while creating the hook using the API or
+in the JSON specified in `SMDEBUG_CONFIG_FILE_PATH`. For this, `export_tensorboard` should be set to `True`.
+This option to set `tensorboard_dir` is available in both SageMaker and non-SageMaker environments.

-#### `CHECKPOINT_CONFIG_FILE_PATH`:
+## `CHECKPOINT_CONFIG_FILE_PATH`:

Contains the path to the JSON file that specifies where training checkpoints need to
-be placed. This is used in the context of spot training.
-
-Sample JSON file:
+be placed. This is used in the context of spot training. The following is a sample JSON file:

`{ "LocalPath": "/my/checkpoint/path" }`

-In SageMaker environment, the presence of this JSON is necessary to save checkpoints.
+In a SageMaker environment, the presence of this JSON is necessary to save checkpoints.
By default, this path is set to point to a pre-defined location in SageMaker.

-#### `SAGEMAKER_METRICS_DIRECTORY`:
+## `SAGEMAKER_METRICS_DIRECTORY`:

-Contains the path to the directory where metrics will be recorded for consumption by SageMaker Metrics.
-This is relevant only in SageMaker environment, where this variable points to a pre-defined location.
+This contains the path to the directory where metrics are recorded for consumption by SageMaker Metrics.
+This is relevant only in a SageMaker environment, where this variable points to a pre-defined location.

-**Note**: The environment variables below are applicable for versions > 0.4.14
+**Note**: The following environment variables are applicable for versions > 0.4.14.

-#### `SMDEBUG_TRAINING_END_DELAY_REFRESH`:
+## `SMDEBUG_TRAINING_END_DELAY_REFRESH`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. This environment variable
specifies how many seconds to wait before refreshing the index files to check if training has ended
-and the tensor is available. By default value, this value is set to 1.
+and the tensor is available. By default, this value is set to 1.

-#### `SMDEBUG_INCOMPLETE_STEP_WAIT_WINDOW`:
+## `SMDEBUG_INCOMPLETE_STEP_WAIT_WINDOW`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. A trial checks to see if a step
-specified in the smdebug hook has been completed. This environment variable
-specifies the maximum number of incomplete steps that the trial will wait for before marking
-half of them as complete. Default: 1000
+specified in the `smdebug` hook has been completed. This environment variable
+specifies the maximum number of incomplete steps for which the trial waits before marking
+half of them as complete. Default: 1000.

-#### `SMDEBUG_MISSING_EVENT_FILE_RETRY_LIMIT`:
+## `SMDEBUG_MISSING_EVENT_FILE_RETRY_LIMIT`:

During analysis, a [trial](analysis.md) is created to query for tensors from a specified directory. This
directory contains collections, events, and index files. All the tensor
data is stored in the event files.
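+
+As a minimal sketch, these analysis-time variables can be set in the environment before creating a trial (the values and the S3 path are illustrative):
+```python
+import os
+
+# Illustrative values; tune them for your environment.
+os.environ["SMDEBUG_TRAINING_END_DELAY_REFRESH"] = "5"
+os.environ["SMDEBUG_INCOMPLETE_STEP_WAIT_WINDOW"] = "500"
+
+from smdebug.trials import create_trial
+trial = create_trial("s3://my-bucket/smdebug-outputs")
+```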
diff --git a/docs/mxnet.md b/docs/mxnet.md
index d418647d2..6fdf2293a 100644
--- a/docs/mxnet.md
+++ b/docs/mxnet.md
@@ -2,7 +2,7 @@
## Contents
- [Support](#support)
-- [How to Use](#how-to-use)
+- [How to use](#how-to-use)
- [Example](#example)
- [Full API](#full-api)

@@ -10,27 +10,27 @@
## Support

-- Zero Script Change experience where you need no modifications to your training script is supported in the official [AWS Deep Learning Container for MXNet](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers).
-- This library itself supports the following versions when you use our API which requires a few minimal changes to your training script: MXNet 1.4, 1.5, 1.6, and 1.7.
-- Only Gluon models are supported
-- When the Gluon model is hybridized, inputs and outputs of intermediate layers can not be saved
-- Parameter server based distributed training is not yet supported
+- The official [AWS Deep Learning Container for MXNet](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) supports a zero script change experience, in which you need no modifications to your training script.
+- This library supports the following versions when you use our API, which requires a few minimal changes to your training script: MXNet 1.4, 1.5, 1.6, and 1.7.
+- This library only supports Gluon models.
+- When the Gluon model is hybridized, inputs and outputs of intermediate layers can't be saved.
+- This library does not yet support parameter server-based distributed training.

---

-## How to Use
-### Using Zero Script Change containers
-In this case, you don't need to do anything to get the hook running. You are encouraged to configure the hook from the SageMaker python SDK so you can run different jobs with different configurations without having to modify your script. If you want access to the hook to configure certain things which can not be configured through the SageMaker SDK, you can retrieve the hook as follows.
+## How to use
+### Using zero script change containers
+In this case, you don't need to do anything to get the hook running. We encourage you to configure the hook from the SageMaker Python SDK so you can run different jobs with different configurations without having to modify your script. If you want access to the hook to configure certain things which cannot be configured through the SageMaker SDK, you can retrieve the hook as follows:
```
import smdebug.mxnet as smd
hook = smd.Hook.create_from_json_file()
```
-Note that you can create the hook from smdebug's python API as is being done in the next section even in such containers.
+Even in these containers, you can create the hook from `smdebug`'s Python API, as shown in the following section.

### Bring your own container experience
#### 1. Create a hook
-If using SageMaker, you will configure the hook in SageMaker's python SDK using the Estimator class. Instantiate it with
-`smd.Hook.create_from_json_file()`. Otherwise, call the hook class constructor, `smd.Hook()`.
+If you're using SageMaker, you configure the hook in SageMaker's Python SDK using the Estimator class. Instantiate it with
+`smd.Hook.create_from_json_file()`. Otherwise, call the hook class constructor: `smd.Hook()`.

#### 2. Register the model to the hook
Call `hook.register_block(net)`.

@@ -44,7 +44,7 @@ For a full list of actions that the hook APIs offer to construct hooks and save
## Example
```python
#######################################
-# Creating a hook.
Refer `API for Saving Tensors` page for more on this
+# Creating a hook. See the `API for Saving Tensors` page for more information.
import smdebug.mxnet as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################
@@ -65,7 +65,7 @@ trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': args.lr})

#######################################
-# Here we register the block to smdebug
+# Register the block to smdebug
hook.register_block(net)
#######################################

@@ -97,6 +97,6 @@ for i in range(args.epochs):
---

## Full API
-See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig
+See the [API for Saving Tensors](api.md) page for details about hooks, collections, SaveConfig, and ReductionConfig.
-See the [Analysis](analysis) page for details about analyzing a training job.
+See the [Analysis](analysis.md) page for details about analyzing a training job.
diff --git a/docs/pytorch.md b/docs/pytorch.md
index f312d19e1..23e06d9e6 100644
--- a/docs/pytorch.md
+++ b/docs/pytorch.md
@@ -2,39 +2,39 @@
## Contents
- [Support](#support)
-- [How to Use](#how-to-use)
-- [Module Loss Example](#module-loss-example)
-- [Functional Loss Example](#functional-loss-example)
+- [How to use](#how-to-use)
+- [Module loss example](#module-loss-example)
+- [Functional loss example](#functional-loss-example)
- [Full API](#full-api)

## Support

### Versions
-- Zero Script Change experience where you need no modifications to your training script is supported in the official [AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers).
+- The official [AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) supports a zero script change experience, in which you need no modifications to your training script.

-- The library itself supports the following versions when using changes to the training script: PyTorch 1.2, 1.3, 1.4, 1.5, and 1.6.
+- The library supports the following versions when using changes to the training script: PyTorch 1.2, 1.3, 1.4, 1.5, and 1.6.

---

-## How to Use
-### Using Zero Script Change containers
-In this case, you don't need to do anything to get the hook running. You are encouraged to configure the hook from the SageMaker python SDK so you can run different jobs with different configurations without having to modify your script. If you want access to the hook to configure certain things which can not be configured through the SageMaker SDK, you can retrieve the hook as follows.
+## How to use
+### Using zero script change containers
+In this case, you don't need to do anything to get the hook running. We encourage you to configure the hook from the SageMaker Python SDK so you can run different jobs with different configurations without having to modify your script. If you want access to the hook to configure certain things which can't be configured through the SageMaker SDK, you can retrieve the hook as follows:
```
import smdebug.pytorch as smd
hook = smd.Hook.create_from_json_file()
```
-Note that you can create the hook from smdebug's python API as is being done in the next section even in such containers.
+Even in these containers, you can create the hook from `smdebug`'s Python API, as shown in the following section.

### Bring your own container experience
#### 1. Create a hook
-If using SageMaker, you will configure the hook in SageMaker's python SDK using the Estimator class.
Instantiate it with
-`smd.Hook.create_from_json_file()`. Otherwise, call the hook class constructor, `smd.Hook()`.
+If you're using SageMaker, configure the hook in SageMaker's Python SDK using the Estimator class. Instantiate it with
+`smd.Hook.create_from_json_file()`. Otherwise, call the hook class constructor: `smd.Hook()`.

#### 2. Register the model to the hook
Call `hook.register_module(net)`.

#### 3. Register your loss function to the hook
-If using a loss which is a subclass of `nn.Module`, call `hook.register_loss(loss_criterion)` once before starting training.\
-If using a loss which is a subclass of `nn.functional`, call `hook.record_tensor_value(loss)` after each training step.
+If you're using a loss which is a subclass of `nn.Module`, call `hook.register_loss(loss_criterion)` once before starting training.\
+If you're using a functional loss from `nn.functional`, call `hook.record_tensor_value(loss)` after each training step.

#### 4. Take actions using the hook APIs

@@ -42,10 +42,10 @@ For a full list of actions that the hook APIs offer to construct hooks and save
---

-## Module Loss Example
+## Module loss example
```python
#######################################
-# Creating a hook. Refer `API for Saving Tensors` page for more on this
+# Creating a hook. See the `API for Saving Tensors` page for more information.
import smdebug.pytorch as smd
hook = smd.Hook(out_dir=args.out_dir)
#######################################
@@ -121,5 +121,5 @@ for (inputs, labels) in trainloader:
---

## Full API
-See the [API for Saving Tensors](api.md) page for details about Hook, Collection, SaveConfig, and ReductionConfig.
+See the [API for Saving Tensors](api.md) page for details about hooks, collections, SaveConfig, and ReductionConfig.
See the [Analysis](analysis.md) page for details about analyzing a training job.
diff --git a/docs/sagemaker.md b/docs/sagemaker.md
index e8b4db36c..727938ba4 100644
--- a/docs/sagemaker.md
+++ b/docs/sagemaker.md
@@ -1,31 +1,31 @@
-## Running SageMaker jobs with Amazon SageMaker Debugger
+# Running Amazon SageMaker jobs with Amazon SageMaker Debugger

-### Outline
+## Outline
- [Configuring SageMaker Debugger](#configuring-sagemaker-debugger)
- [Saving data](#saving-data)
  - [Saving built-in collections that we manage](#saving-built-in-collections-that-we-manage)
  - [Saving reductions for a custom collection](#saving-reductions-for-a-custom-collection)
  - [Enabling TensorBoard summaries](#enabling-tensorboard-summaries)
- [Rules](#rules)
-  - [Built In Rules](#built-in-rules)
-  - [Custom Rules](#custom-rules)
-- [Interactive Exploration](#interactive-exploration)
+  - [Built-in rules](#built-in-rules)
+  - [Custom rules](#custom-rules)
+- [Interactive exploration](#interactive-exploration)
- [SageMaker Studio](#sagemaker-studio)
-- [TensorBoard Visualization](#tensorboard-visualization)
-- [Example Notebooks](#example-notebooks)
+- [TensorBoard visualization](#tensorboard-visualization)
+- [Example notebooks](#example-notebooks)

## Configuring SageMaker Debugger

-Regardless of which of the two above ways you have enabled SageMaker Debugger, you can configure it using the SageMaker python SDK. There are two aspects to this configuration.
-- You can specify what tensors to be saved, when they should be saved and in what form they should be saved.
-- You can specify which Rule you want to monitor your training job with. This can be either a built in rule that SageMaker provides, or a custom rule that you can write yourself.
+Regardless of how you have enabled SageMaker Debugger, you can configure it using the SageMaker Python SDK. There are two aspects to this configuration.
+- You can specify which tensors to save, when to save them, and in what form to save them.
+- You can specify the rule with which you want to monitor your training job. This can be either a built-in rule that SageMaker provides or a custom rule that you can write yourself.

-### Saving Data
+### Saving data

-SageMaker Debugger gives you a powerful and flexible API to save the tensors you choose at the frequencies you want. These configurations are made available in the SageMaker Python SDK through the `DebuggerHookConfig` class.
+SageMaker Debugger gives you a powerful and flexible API to save the tensors you choose at the frequencies you want. These configurations are available in the SageMaker Python SDK through the `DebuggerHookConfig` class.

#### Saving built-in collections that we manage

-Learn more about these built in collections [here](api.md).
+To learn more about these built-in collections, see [api.md](api.md).

```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
@@ -63,7 +63,7 @@ sagemaker_estimator.fit()
```

#### Saving reductions for a custom collection
-You can define your collection of tensors. You can also choose to save certain reductions of tensors only instead of saving the full tensor. You may choose to do this to reduce the amount of data saved. Please note that when you save reductions, unless you pass the flag `save_raw_tensor`, only these reductions will be available for analysis. The raw tensor will not be saved.
+You can define your collection of tensors. You can also choose to save only certain reductions of tensors instead of saving the full tensor. You may choose to do this to reduce the amount of data saved. Please note that when you save reductions, unless you pass the flag `save_raw_tensor`, only these reductions are available for analysis. The raw tensors are not saved.

```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
@@ -94,16 +94,14 @@ sagemaker_estimator.fit()
```

#### Enabling TensorBoard summaries
-SageMaker Debugger can automatically generate tensorboard scalar summaries,
-distributions and histograms for tensors saved. This can be enabled by
-passing a `TensorBoardOutputConfig` object when creating an Estimator as follows.
+SageMaker Debugger can automatically generate TensorBoard scalar summaries, distributions, and histograms for saved tensors. You can enable this by passing a `TensorBoardOutputConfig` object when creating an estimator as follows.
You can also choose to disable or enable histograms specifically for different collections.
-By default a collection has `save_histogram` flag set to True.
-Note that scalar summaries are added to TensorBoard for all `ScalarCollections` and any scalar saved through `hook.save_scalar`.
-Refer [API](api.md) for more details on scalar collections and `save_scalar` method.
+By default, a collection has the `save_histogram` flag set to `True`.
+Scalar summaries are added to TensorBoard for all `ScalarCollections` and any scalar saved through `hook.save_scalar`.
+See the [API](api.md) for more details on scalar collections and the `save_scalar` method.

-The below example saves weights and gradients as full tensors, and also saves the gradients as histograms and distributions to visualize in TensorBoard.
-These will be saved to the location passed in `TensorBoardOutputConfig` object.
+The following example saves weights and gradients as full tensors and also saves the gradients as histograms and distributions to visualize in TensorBoard.
+These are saved to the location passed in the `TensorBoardOutputConfig` object.
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig
hook_config = DebuggerHookConfig(
@@ -134,29 +132,29 @@ sagemaker_estimator = sm.tensorflow.TensorFlow(
sagemaker_estimator.fit()
```

-For more details, refer our [API page](api.md).
+For more details, see the [API page](api.md).

### Rules
-Here are some examples on how to run Rules with your training jobs.
+The following examples demonstrate how to run rules with your training jobs.

-Note that passing a `CollectionConfig` object to the Rule as `collections_to_save`
+Note that passing a `CollectionConfig` object to the rule as `collections_to_save`
is equivalent to passing it to the `DebuggerHookConfig` object as `collection_configs`.
This is just a shortcut for your convenience.

-#### Built-in Rules
-To find a full list of built-in rules that you can use with the SageMaker Python SDK, see the [List of Debugger Built-in Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) page.
+#### Built-in rules
+For a full list of built-in rules that you can use with the SageMaker Python SDK, see the [List of Debugger Built-in Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html).

-#### Running built-in SageMaker Rules
-You can run a SageMaker built-in Rule as follows using the `Rule.sagemaker` method.
-The first argument to this method is the base configuration that is associated with the Rule.
-We configure them as much as possible.
-You can take a look at the ruleconfigs that we populate for all built-in rules [here](https://github.com/awslabs/sagemaker-debugger-rulesconfig).
+#### Running built-in SageMaker rules
+You can run a SageMaker built-in rule as follows using the `Rule.sagemaker` method.
+The first argument to this method is the base configuration that is associated with the rule.
+
+To examine the `ruleconfigs` that we populate for all built-in rules, see the [sagemaker-debugger-rulesconfig](https://github.com/awslabs/sagemaker-debugger-rulesconfig) repository.
You can choose to customize these parameters using the other parameters.

-These rules are run on our pre-built Docker images which are listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html).
+These rules run on our pre-built Docker images, which are listed in [Use Debugger Docker Images for Built-in or Custom Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html).
You are not charged for the instances when running SageMaker built-in rules.
-A list of all our built-in rules are provided [below](#built-in-rules).
+A list of all our built-in rules is provided [in the Built-in rules section](#built-in-rules).

```python
from sagemaker.debugger import Rule, CollectionConfig, rule_configs
@@ -188,23 +186,23 @@ sagemaker_estimator = sm.tensorflow.TensorFlow(
sagemaker_estimator.fit()
```

-#### Custom Rules
+#### Custom rules

-You can write your own rule custom made for your application and provide it, so SageMaker can monitor your training job using your rule. To do so, you need to understand the programming model that `smdebug` provides. Our page on [Programming Model for Analysis](analysis.md) describes the APIs that we provide to help you write your own rule.
+You can write your own custom rule for your application and provide it so SageMaker can monitor your training job using your rule. To do so, you need to understand the programming model that `smdebug` provides. Our [Programming Model for Analysis](analysis.md) page describes the APIs that we provide to help you write your own rule.

Please refer to [this example notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/tensorflow_keras_custom_rule/tf-keras-custom-rule.ipynb) for a demonstration of creating your custom rule and running it on SageMaker.

-#### Running custom Rules
-To run a custom rule, you have to provide a few additional parameters.
-Key parameters to note are a file which has the implementation of your Rule class `source`,
- the name of the Rule class (`rule_to_invoke`), the type of instance to run the Rule job on (`instance_type`),
- the size of the volume on that instance (`volume_size_in_gb`), and the docker image to use for running this job (`image_uri`).
+#### Running custom rules
+To run a custom rule, you must provide a few additional parameters.
+Key parameters to note are a file which has the implementation of your rule class (`source`),
+ the name of the rule class (`rule_to_invoke`), the type of instance on which to run the rule job (`instance_type`),
+ the size of the volume on that instance (`volume_size_in_gb`), and the Docker image to use for running this job (`image_uri`).

-Please refer to the documentation [here](https://github.com/aws/sagemaker-python-sdk/blob/391733efd433c5e26afb56102c76ab7472f94b3d/src/sagemaker/debugger.py#L190) for more details.
+Please refer to the [documentation](https://github.com/aws/sagemaker-python-sdk/blob/391733efd433c5e26afb56102c76ab7472f94b3d/src/sagemaker/debugger.py#L190) for more details.

We have pre-built Docker images that you can use to run your custom rules.
-These are listed [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html).
+These are listed in [Use Debugger Docker Images for Built-in or Custom Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-docker-images-rules.html).
You can also choose to build your own Docker image for custom rule evaluation.
-Please refer to the repository [SageMaker Debugger Rules Container](https://github.com/awslabs/sagemaker-debugger-rules-container) for instructions on how to build such an image.
+Please refer to the [SageMaker Debugger Rules Container](https://github.com/awslabs/sagemaker-debugger-rules-container) repository for instructions on how to build such an image.

```python
from sagemaker.debugger import Rule, CollectionConfig
@@ -242,12 +240,12 @@ sagemaker_estimator = sm.tensorflow.TensorFlow(
sagemaker_estimator.fit()
```

-For more details, refer our [Analysis page](analysis.md).
+For more details, see the [Analysis page](analysis.md).

-## Interactive Exploration
+## Interactive exploration

-`smdebug` SDK also allows you perform interactive and real-time exploration of the data saved. You can choose to inspect the tensors saved, or visualize them through your custom plots.
+The `smdebug` SDK also allows you to perform interactive and real-time exploration of the saved data. You can choose to inspect the saved tensors or visualize them through your custom plots.
+You can retrieve these tensors as `numpy` arrays, allowing you to use your favorite analysis libraries right in a SageMaker notebook instance. The following example notebooks demonstrate this:

-- [Real-time anaysis in a notebook during training](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/mxnet_realtime_analysis/mxnet-realtime-analysis.ipynb)
+- [Real-time analysis in a notebook during training](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/mxnet_realtime_analysis/mxnet-realtime-analysis.ipynb)
- [Interactive tensor analysis in a notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/mnist_tensor_analysis/mnist_tensor_analysis.ipynb)

@@ -255,13 +253,13 @@
SageMaker Debugger is on by default for supported training jobs on the official SageMaker Framework containers (or AWS Deep Learning Containers) during SageMaker training jobs. In this default scenario, SageMaker Debugger takes the losses and metrics from your training job and publishes them to SageMaker Metrics, allowing you to track these metrics in SageMaker Studio.
-You can also see the status of Rules you have enabled for your training job right in the Studio. [Here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-visualization.html) are screenshots of that experience.
+You can also see the status of rules you have enabled for your training job right in the Studio, as shown in [these screenshots](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-visualization.html).

-## TensorBoard Visualization
+## TensorBoard visualization

-If you have enabled TensorBoard outputs for your training job through SageMaker Debugger, TensorBoard artifacts will automatically be generated for the tensors saved.
+If you have enabled TensorBoard outputs for your training job through SageMaker Debugger, TensorBoard artifacts are automatically generated for the tensors saved.
You can then point your TensorBoard instance to that S3 location and review the visualizations for the tensors saved.

-## Example Notebooks
+## Example notebooks

-We have a bunch of [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) here demonstrating different aspects of SageMaker Debugger.
+These [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) demonstrate different aspects of SageMaker Debugger.
diff --git a/docs/tensorflow.md b/docs/tensorflow.md
index 590630b77..4aa4b8622 100644
--- a/docs/tensorflow.md
+++ b/docs/tensorflow.md
@@ -1,35 +1,35 @@
# Tensorflow

## Contents
-- [What SageMaker Debugger Supports](#support)
+- [What SageMaker Debugger supports](#support)
- [Debugger on AWS Deep Learning Containers with TensorFlow](#debugger-dlc)
-  - [Debugger Built-in Tensor Collections for TensorFlow](#tf-built-in-collection)
-- [Debugger on SageMaker Training Containers and Custom Containers](#debugger-script-change)
-- [Code Examples](#examples)
+  - [Debugger Built-in Tensor Collections for TensorFlow](#tf-built-in-collection)
+- [Debugger on SageMaker training containers and custom containers](#debugger-script-change)
+- [Code examples](#examples)
- [References](#references)

---

-## Amazon SageMaker Debugger Support for TensorFlow
+## Amazon SageMaker Debugger support for TensorFlow

-Amazon SageMaker Debugger python SDK and its client library `smdebug` now fully support TensorFlow 2.3 with the latest version release.
+The Amazon SageMaker Debugger Python SDK and its client library `smdebug` now fully support TensorFlow 2.3 with the latest version release.
-Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use Keras API or pure TensorFlow API (in eager mode or non-eager mode), you can directly run them on the AWS Deep Learning Containers. +Using Debugger, you can access tensors of any kind for TensorFlow models, from the Keras model zoo to your own custom model, and save them using Debugger built-in or custom tensor collections. You can run your training script on [the official AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) where Debugger can automatically capture tensors from your training job. It doesn't matter whether your TensorFlow models use the Keras API or a pure TensorFlow API (in eager mode or non-eager mode); either way, you can directly run them on the AWS Deep Learning Containers. -Debugger and its client library `smdebug` support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). +Debugger and its `smdebug` client library support debugging your training job on other AWS training containers and custom containers. In this case, a hook registration process is required to manually add the hook features to your training script. For a full list of AWS TensorFlow containers to use Debugger, see [SageMaker containers to use Debugger with script mode](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-aws-containers). For a complete guide for using custom containers, see [Use Debugger in Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-bring-your-own-container.html). -### New Features supported by Debugger -- The latest TensorFlow version fully covered by Debugger is 2.3.0 -- Debug training jobs with the TensorFlow framework or Keras TensorFlow -- Debug training jobs with the TensorFlow eager or non-eager mode -- New built-in tensor collections: `inputs`, `outputs`, `layers`, `gradients` -- New hook APIs to save tensors, in addition to scalars: `save_tensors`, `save_scalar` +### New features supported by Debugger +- The latest TensorFlow version fully covered by Debugger is 2.3.0. +- Debug training jobs with the TensorFlow framework or Keras TensorFlow. +- Debug training jobs with the TensorFlow eager or non-eager mode. +- New built-in tensor collections include `inputs`, `outputs`, `layers`, and `gradients`. +- New hook APIs to save tensors, in addition to scalars, include `save_tensors` and `save_scalar`. 
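+
+A minimal sketch of the new scalar-saving API, assuming a hook has already been created (the metric name and value are illustrative):
+```python
+# Save a custom scalar to the hook's out_dir; sm_metric=True
+# also publishes it to SageMaker Metrics.
+hook.save_scalar("custom_accuracy", 0.92, sm_metric=True)
+```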
--- -## Using Debugger on AWS Deep Learning Containers with TensorFlow +## Using Debugger on AWS Deep Learning Containers with TensorFlow -The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide of using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow) in the Amazon SageMaker Developer Guide. +The Debugger built-in rules and hook features are fully integrated with the AWS Deep Learning Containers. You can run your training script without any script changes. When running training jobs on those Deep Learning Containers, Debugger registers its hooks automatically to your training script in order to retrieve tensors. To find a comprehensive guide to using the high-level SageMaker TensorFlow estimator with Debugger, see [Amazon SageMaker Debugger with TensorFlow](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html#debugger-zero-script-change-TensorFlow). The following code example provides the base structure for a SageMaker TensorFlow estimator with Debugger. @@ -61,14 +61,14 @@ tf_estimator = TensorFlow( ) tf_estimator.fit("s3://bucket/path/to/training/data") ``` ->**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the SageMaker python SDK v2 and `smdebug` v0.9.2. It is highly recommended to upgrade the packages by executing the following command line. +>**Note**: The SageMaker TensorFlow estimator and the Debugger collections in the example are based on the SageMaker Python SDK v2 and `smdebug` v0.9.2. It is highly recommended to upgrade the packages by executing the following command line. ```bash pip install -U sagemaker pip install -U smdebug ``` -If you are using Jupyter Notebook, put exclamation mark at the front of the code lines and restart your kernel. +If you are using a Jupyter Notebook, put an exclamation mark at the front of the code lines and restart your kernel. -#### Available Tensor Collections for TensorFlow +### Available Tensor Collections for TensorFlow **Note**: The SageMaker TensorFlow estimator and the Debugger collections in this example are based on the latest `smdebug` library. We highly recommend that you upgrade the packages by running the following commands at the command line: ``` @@ -77,9 +77,9 @@ pip install -U smdebug ``` If you are using a Jupyter Notebook, put an exclamation mark (!) at the beginning of the code string and restart your kernel. For more information about the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). -### Debugger Built-in Tensor Collections for TensorFlow +### Debugger Built-in Tensor Collections for TensorFlow -The following table lists the pre-configured tensor collections for TensorFlow models. You can pick any tensor collections by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default out_dir of the hook. +The following table lists the pre-configured tensor collections for TensorFlow models. 
You can pick any tensor collection by specifying the `name` parameter of `CollectionConfig()` as shown in the previous base code example. SageMaker Debugger will save these tensors to the default `out_dir` of the hook.

| Name | Description|
| --- | --- |
@@ -104,12 +104,12 @@ For a full list of available tensor collection parameters, see [Configuring Coll

---

-## Using Debugger on SageMaker Training Containers and Custom Containers
+## Using Debugger on SageMaker training containers and custom containers

If you want to run your own training script or custom containers other than the AWS Deep Learning Containers in the previous option, you can use any of the following options:

-- **Option 1** - Use the SageMaker TensorFlow training containers with training script modification
-- **Option 2** - Use your custom container with modified training script and push the container to Amazon ECR.
+- **Option 1**: Use the SageMaker TensorFlow training containers with training script modification.
+- **Option 2**: Use your custom container with a modified training script and push the container to Amazon ECR.

For both options, you need to manually register the Debugger hook to your training script. Depending on the TensorFlow and Keras API operations used to construct your model, you need to pick the right TensorFlow hook class, register the hook, and then save the tensors.

@@ -121,15 +121,15 @@ For both options, you need to manually register the Debugger hook to your traini
3. [Register the hook to model.fit()](#register-a-hook)

-### Step 1: Create a hook
+### Step 1: Create a hook

-To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the smdebug `hook` as `callbacks`, as shown in the following subsections.
+To create the hook constructor, add the following code to your training script. This enables the `smdebug` tools for TensorFlow and creates a TensorFlow `hook` object. When you run the `fit()` API for training, specify the `smdebug` `hook` as `callbacks`, as shown in the following subsections.

Depending on the TensorFlow versions and the Keras API that you use in your training script, you need to choose the right hook class. The hook constructors for TensorFlow that you can choose are `smd.KerasHook`, `smd.SessionHook`, and `smd.EstimatorHook`.

#### KerasHook

-If you use the Keras model zoo and a Keras `model.fit()` API, use `KerasHook`. `KerasHook` is available for the Keras model with the TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced in the TensorFlow framework version 2.0. You can set the smdebug Keras hook constructor by adding the following code to your training script.
+If you use the Keras model zoo and a Keras `model.fit()` API, use `KerasHook`. `KerasHook` is available for the Keras model with the TensorFlow backend interface. `KerasHook` covers the eager execution modes and the gradient tape features that are introduced in the TensorFlow framework version 2.0. You can set the `smdebug` Keras hook constructor by adding the following code to your training script.
Place this code line before `model.compile()`:

```python
import smdebug.tensorflow as smd
@@ -142,7 +142,7 @@ To learn how to fully implement the hook in your training script, see the [Keras

#### SessionHook

-If your model is created in TensorFlow version 1.x with the low-level approach (not using the Keras API), use `SessionHook`. `SessionHook` is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`, as shown following:
+If your model is created in TensorFlow version 1.x with the low-level approach (not using the Keras API), use `SessionHook`. `SessionHook` is for the TensorFlow 1.x monitored training session API, `tf.train.MonitoredSessions()`, as shown in the following example:

```python
import smdebug.tensorflow as smd
@@ -155,7 +155,7 @@ To learn how to fully implement the hook into your training script, see the [Ten

#### EstimatorHook

-If you have a model using the `tf.estimator()` API, use `EstimatorHook`. `EstimatorHook` is available for any TensorFlow framework versions that support the `tf.estimator()` API, as shown following:
+If you have a model using the `tf.estimator()` API, use `EstimatorHook`. `EstimatorHook` is available for any TensorFlow framework versions that support the `tf.estimator()` API, as shown in the following example:

```python
import smdebug.tensorflow as smd
hook = smd.EstimatorHook.create_from_json_file()
```
To learn how to fully implement the hook into your training script, see the [simple MNIST training script with the Tensorflow estimator](https://github.com/awslabs/sagemaker-debugger/blob/master/examples/tensorflow/sagemaker_byoc/simple.py).

-### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors
+### Step 2: Wrap the optimizer and the gradient tape to retrieve gradient tensors

-The smdebug TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework.
+The `smdebug` TensorFlow hook provides tools to manually retrieve `gradients` tensors specific to the TensorFlow framework.

-If you want to save `gradients` (for example, from the Keras Adam optimizer) wrap it with the hook as shown following:
+If you want to save `gradients` (for example, from the Keras Adam optimizer), wrap it with the hook as shown in the following example:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=args.lr)
optimizer = hook.wrap_optimizer(optimizer)
```

-If you want to save gradients and outputs tensors from the TensorFlow `GradientTape` feature, wrap `tf.GradientTape` with the smdebug `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example:
+If you want to save gradient and output tensors from the TensorFlow `GradientTape` feature, wrap `tf.GradientTape` with the `smdebug` `hook.wrap_tape` method and save using the `hook.save_tensor` function. The input of `hook.save_tensor` is in (tensor_name, tensor_value, collections_to_write="default") format. For example:
```python
with hook.wrap_tape(tf.GradientTape(persistent=True)) as tape:
logits = model(data, training=True)
@@ -186,13 +186,13 @@ grads = tape.gradient(loss_value, model.variables)
hook.save_tensor("grads", grads, "gradients")
```

-These smdebug hook wrapper functions capture the gradient tensors, not affecting your optimization logic at all.
+These `smdebug` hook wrapper functions capture the gradient tensors without affecting your optimization logic.

For examples of code structures that you can use to apply the hook wrappers, see the [Code Examples](#examples) section.

-### Step 3: Register the hook to model.fit()
+### Step 3: Register the hook to model.fit()

-To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This will pass the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example:
+To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` API. This passes the SageMaker Debugger hook as a Keras callback. Similarly, add `hooks=[hook]` to the `MonitoredSession()`, `tf.function()`, and `tf.estimator()` APIs. For example:

```python
model.fit(X_train, Y_train,
@@ -210,13 +210,13 @@ For a full list of actions that the hook APIs offer to construct hooks and save

---

-## Code Examples
+## Code examples

The following code examples show the base structures that you can use for hook registration in various TensorFlow training scripts. If you want to use the high-level Debugger features with zero script change on AWS Deep Learning Containers, see [Use Debugger in AWS Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-container.html).

### Keras API (tf.keras)

-The following code example shows how to register the smdebug `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](api.md#modes).
+The following code example shows how to register the `smdebug` `KerasHook` for the Keras `model.fit()`. You can also set the hook mode to track stored tensors in different phases of a training job. For a list of available hook modes, see [smdebug modes](api.md#modes).

```python
import smdebug.tensorflow as smd
@@ -239,7 +239,7 @@ model.evaluate(x_test, y_test, callbacks=[hook])

### Keras GradientTape example for TensorFlow 2.0 and later

-The following code example shows how to register the smdebug `KerasHook` by wrapping the TensorFlow `GradientTape()` with the smdebug `hook.wrap_tape()` API.
+The following code example shows how to register the `smdebug` `KerasHook` by wrapping the TensorFlow `GradientTape()` with the `smdebug` `hook.wrap_tape()` API.

```python
import smdebug.tensorflow as smd
@@ -261,9 +261,9 @@ model = tf.keras.models.Sequential([ ... ])
hook.save_tensor(tensor_name="accuracy", tensor_value=acc, collections_to_write="default")
```

-### Monitored Session (tf.train.MonitoredSession)
+### Monitored session (tf.train.MonitoredSession)

-The following code example shows how to register the smdebug `SessionHook`.
+The following code example shows how to register the `smdebug` `SessionHook`.

```python
import smdebug.tensorflow as smd
@@ -284,7 +284,7 @@ sess.run([loss, ...])

### Estimator (tf.estimator.Estimator)

-The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job.
@@ -284,7 +284,7 @@ sess.run([loss, ...])

### Estimator (tf.estimator.Estimator)

-The following code example shows how to register the smdebug `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of training job. For a list of available hook modes, see [smdebug modes](api.md#modes).
+The following code example shows how to register the `smdebug` `EstimatorHook`. You can also set the hook mode to track stored tensors in different phases of a training job. For a list of available hook modes, see [smdebug modes](api.md#modes).

```python
import smdebug.tensorflow as smd
@@ -306,8 +306,8 @@ estimator.evaluate(input_fn=eval_input_fn, steps=args.steps, hooks=[hook])

## References

-### The smdebug API for saving tensors
-See the [API for saving tensors](api.md) page for details about the Hooks, Collection, SaveConfig, and ReductionConfig.
+### The `smdebug` API for saving tensors
+See the [API for saving tensors](api.md) page for details about hooks, collections, `SaveConfig`, and `ReductionConfig`.
See the [Analysis](analysis.md) page for details about analyzing a training job.

### TensorFlow References
diff --git a/docs/xgboost.md b/docs/xgboost.md
index 2ec65157a..ed771e1e6 100644
--- a/docs/xgboost.md
+++ b/docs/xgboost.md
@@ -2,23 +2,29 @@

## Contents

-- [SageMaker Example](#sagemaker-example)
+- [SageMaker example](#sagemaker-example)
- [Full API](#full-api)

-## SageMaker Example
+## SageMaker example

### Use XGBoost as a built-in algorithm

-The XGBoost algorithm can be used 1) as a built-in algorithm, or 2) as a framework such as MXNet, PyTorch, or Tensorflow.
-If SageMaker XGBoost is used as a built-in algorithm in container version `0.90-2` or later, Amazon SageMaker Debugger will be available by default (i.e., zero code change experience).
-See [XGBoost Algorithm AWS docmentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) for more information on how to use XGBoost as a built-in algorithm.
-See [Amazon SageMaker Debugger examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) for sample notebooks that demonstrate debugging and monitoring capabilities of Amazon SageMaker Debugger.
-See [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) for more information on how to configure the Amazon SageMaker Debugger from the Python SDK.
+The XGBoost algorithm can be used as:
+1) A built-in algorithm
+2) A framework such as MXNet, PyTorch, or TensorFlow
+
+If SageMaker XGBoost is used as a built-in algorithm in container version `0.90-2` or later, Amazon SageMaker Debugger is available by default (that is, a zero code change experience).
+
+For more information on how to use XGBoost as a built-in algorithm, see the [XGBoost Algorithm AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).
+
+For sample notebooks that demonstrate the debugging and monitoring capabilities of Amazon SageMaker Debugger, see [Amazon SageMaker Debugger examples](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
+
+For more information on how to configure Amazon SageMaker Debugger from the Python SDK, see the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).

### Use XGBoost as a framework

-When SageMaker XGBoost is used as a framework, it is recommended that the hook is configured from the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).
-By using SageMaker Python SDK, you can run different jobs (e.g., Processing jobs) on the SageMaker platform.
+When SageMaker XGBoost is used as a framework, we recommend that you configure the hook from the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).
+By using the SageMaker Python SDK, you can run different jobs (such as processing jobs) on the SageMaker platform.
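As a sketch of what that SDK-side configuration can look like: the values below (bucket, role, entry point, framework version, and collection names) are placeholders, and `DebuggerHookConfig`, `CollectionConfig`, and the `XGBoost` estimator come from the SageMaker Python SDK.

```python
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
from sagemaker.xgboost import XGBoost

# Tell Debugger where to write tensors and which collections to save.
hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/smdebug-output",  # hypothetical bucket
    collection_configs=[
        CollectionConfig(name="metrics"),
        CollectionConfig(name="feature_importance"),
    ],
)

estimator = XGBoost(
    entry_point="train.py",    # hypothetical training script
    framework_version="1.2-1",
    role="MySageMakerRole",    # hypothetical IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    debugger_hook_config=hook_config,
)
estimator.fit({"train": "s3://my-bucket/train"})  # hypothetical input channel
```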
You can retrieve the hook as follows.

```python
import xgboost as xgb
@@ -44,7 +50,7 @@ Alternatively, you can also create the hook from `smdebug`'s Python API as shown

### Use the Debugger hook

-If you are in a non-SageMaker environment, or even in SageMaker, if you want to configure the hook in a certain way in script mode, you can use the full Debugger hook API as follows.
+In a non-SageMaker environment, or in SageMaker script mode when you want to configure the hook in a specific way, you can use the full Debugger hook API as follows.
```python
import xgboost as xgb
from smdebug.xgboost import Hook
@@ -82,11 +88,11 @@ def __init__(
```
Initializes the hook. Pass this object as a callback to `xgboost.train()`.

-* `out_dir` (str): A path into which tensors and metadata will be written.
+* `out_dir` (str): A path into which tensors and metadata are written.
* `export_tensorboard` (bool): Whether to use TensorBoard logs.
* `tensorboard_dir` (str): Where to save TensorBoard logs.
* `dry_run` (bool): If true, evaluations are not actually saved to disk.
-* `reduction_config` (ReductionConfig object): Not supported in XGBoost and will be ignored.
+* `reduction_config` (ReductionConfig object): Not supported in XGBoost and is ignored.
* `save_config` (SaveConfig object): See the [Common API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md).
* `include_regex` (list[str]): List of additional regexes to save.
* `include_collections` (list[str]): List of collections to save.
@@ -96,5 +102,6 @@ Initializes the hook. Pass this object as a callback to `xgboost.train()`.
* `train_data` (DMatrix object): Data to be trained.
-* `validation_data` (DMatrix object): Validation set for which metrics will evaluated during training.
+* `validation_data` (DMatrix object): Validation set for which metrics will be evaluated during training.
-See the [Common API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) page for details about Collection, SaveConfig, and ReductionConfig.\
+See the [Common API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md) page for details about `Collection`, `SaveConfig`, and `ReductionConfig`.\
+
See the [Analysis](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md) page for details about analyzing a training job.
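To make these parameters concrete, here is a minimal sketch that constructs the hook directly and passes it to `xgboost.train()`. The output path, collection names, and synthetic data are illustrative only:

```python
import numpy as np
import xgboost as xgb
from smdebug.xgboost import Hook

# Toy data, for illustration only.
X = np.random.rand(200, 8)
y = (X.sum(axis=1) > 4.0).astype(int)
dtrain = xgb.DMatrix(X[:160], label=y[:160])
dvalid = xgb.DMatrix(X[160:], label=y[160:])

hook = Hook(
    out_dir="/tmp/smdebug-xgb",  # hypothetical output path
    include_collections=["metrics", "feature_importance"],
    train_data=dtrain,
    validation_data=dvalid,
)

params = {"objective": "binary:logistic", "eval_metric": "logloss"}
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=50,
    evals=[(dtrain, "train"), (dvalid, "validation")],
    callbacks=[hook],  # the hook saves tensors as training progresses
)
```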
diff --git a/examples/mxnet/README.md b/examples/mxnet/README.md
index a282a073a..1792359a9 100644
--- a/examples/mxnet/README.md
+++ b/examples/mxnet/README.md
@@ -1,17 +1,17 @@
# Examples

-## Example notebooks
-Please refer to the example notebooks in [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger)
+## Notebooks
+Please refer to the example notebooks in the [Amazon SageMaker examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).

-## Example scripts
-The above notebooks come with example scripts which can be used through SageMaker. Some more example scripts are here in [scripts/](scripts/)
+## Scripts
+The notebooks in the preceding section come with example scripts that you can use through SageMaker. For more example scripts, see the [scripts/](scripts/) directory.

-## Example configurations for saving tensors through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Saving tensors through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for saving tensors through the hook, see [docs/sagemaker.md](../docs/sagemaker.md).

-## Example configurations for running rules through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Running rules through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for running rules, see [docs/sagemaker.md](../docs/sagemaker.md).

-## Example for running rule locally
+## Running rules locally

```
from smdebug.rules import invoke_rule
diff --git a/examples/pytorch/README.md b/examples/pytorch/README.md
index a282a073a..f240526e7 100644
--- a/examples/pytorch/README.md
+++ b/examples/pytorch/README.md
@@ -1,17 +1,17 @@
# Examples

-## Example notebooks
-Please refer to the example notebooks in [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger)
+## Notebooks
+Please refer to the example notebooks in the [Amazon SageMaker examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).

-## Example scripts
-The above notebooks come with example scripts which can be used through SageMaker. Some more example scripts are here in [scripts/](scripts/)
+## Scripts
+The notebooks in the preceding section come with example scripts that you can use through SageMaker. For more example scripts, see the [scripts/](scripts/) directory.

-## Example configurations for saving tensors through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Saving tensors through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for saving tensors through the hook, see [docs/sagemaker.md](../docs/sagemaker.md).

-## Example configurations for running rules through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Running rules through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for running rules, see [docs/sagemaker.md](../docs/sagemaker.md).

-## Example for running rule locally
+## Running rules locally

```
from smdebug.rules import invoke_rule
diff --git a/examples/tensorflow/README.md b/examples/tensorflow/README.md
index a282a073a..3c753e93f 100644
--- a/examples/tensorflow/README.md
+++ b/examples/tensorflow/README.md
@@ -1,17 +1,17 @@
# Examples

-## Example notebooks
-Please refer to the example notebooks in [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger)
+## Notebooks
+Please refer to the example notebooks in the [Amazon SageMaker examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
-## Example scripts
-The above notebooks come with example scripts which can be used through SageMaker. Some more example scripts are here in [scripts/](scripts/)
+## Scripts
+The notebooks in the preceding section come with example scripts that you can use through SageMaker. For more example scripts, see the [scripts/](scripts/) directory.

-## Example configurations for saving tensors through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Saving tensors through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for saving tensors through the hook, see [docs/sagemaker.md](../docs/sagemaker.md).

-## Example configurations for running rules through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Running rules through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for running rules, see [docs/sagemaker.md](../docs/sagemaker.md).

-## Example for running rule locally
+## Running rules locally

```
from smdebug.rules import invoke_rule
diff --git a/examples/tensorflow2/README.md b/examples/tensorflow2/README.md
index a282a073a..8eb8a1186 100644
--- a/examples/tensorflow2/README.md
+++ b/examples/tensorflow2/README.md
@@ -1,17 +1,17 @@
# Examples

-## Example notebooks
-Please refer to the example notebooks in [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger)
+## Notebooks
+Please refer to the example notebooks in the [Amazon SageMaker examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).

-## Example scripts
-The above notebooks come with example scripts which can be used through SageMaker. Some more example scripts are here in [scripts/](scripts/)
+## Scripts
+The notebooks in the preceding section come with example scripts that you can use through SageMaker. For more example scripts, see the [scripts/](scripts/) directory.

-## Example configurations for saving tensors through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Saving tensors through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for saving tensors through the hook, see [docs/sagemaker.md](../docs/sagemaker.md).

-## Example configurations for running rules through [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
-Example configurations for saving tensors through the hook are available at [docs/sagemaker.md](../docs/sagemaker.md)
+## Running rules through the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk)
+For example configurations for running rules, see [docs/sagemaker.md](../docs/sagemaker.md).
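As a quick sketch of the SDK route: the entry point, role, and instance settings below are placeholder values, and `Rule.sagemaker` and `rule_configs` come from the SageMaker Python SDK.

```python
from sagemaker.debugger import Rule, rule_configs
from sagemaker.tensorflow import TensorFlow

# Attach a built-in Debugger rule to the training job.
rules = [Rule.sagemaker(rule_configs.loss_not_decreasing())]

estimator = TensorFlow(
    entry_point="train.py",    # hypothetical training script
    role="MySageMakerRole",    # hypothetical IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.3",
    py_version="py37",
    rules=rules,
)
estimator.fit()
```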
-## Example for running rule locally +## Running rules locally ``` from smdebug.rules import invoke_rule diff --git a/examples/xgboost/README.md b/examples/xgboost/README.md index 9818d371a..023158d2f 100644 --- a/examples/xgboost/README.md +++ b/examples/xgboost/README.md @@ -1,2 +1,2 @@ -## Example Notebooks -Please refer to the example notebooks in [Amazon SageMaker Examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger) +# Example notebooks +Please refer to the example notebooks in the [Amazon SageMaker examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger).
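The `invoke_rule` snippet shown under "Running rules locally" in the READMEs above can be completed along the following lines. This is a minimal sketch: the trial path is a hypothetical local `out_dir`, and the toy rule simply flags large gradients (see docs/analysis.md for the rule APIs).

```python
from smdebug.rules import Rule, invoke_rule
from smdebug.trials import create_trial

class GradientsAboveThreshold(Rule):
    """Toy rule: fires when any gradient's mean absolute value exceeds a threshold."""

    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(collection="gradients"):
            value = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
            if value > self.threshold:
                return True  # True signals that the rule condition is met
        return False

trial = create_trial("/tmp/smdebug-run")  # hypothetical local out_dir written by a hook
invoke_rule(GradientsAboveThreshold(trial, threshold=10.0), start_step=0, end_step=100)
```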