diff --git a/README.md b/README.md
index 4ceafdea9..3efd2f844 100644
--- a/README.md
+++ b/README.md
@@ -37,56 +37,49 @@ to improve the training performance and resource utilization.
## Why DLRover?
-### Fault Tolerance to Improve the Stability of Job
+### Fault Tolerance to Reduce the Downtime of a Large Scale Training Job
DLRover can restore training when a process fails without stopping the
training job. The actions DLRover takes to restore training are:
-1. Diagnose the failure reason.
+1. Automatically diagnose the failure reason.
2. Restart the process, not the node, when the failure is due to software errors.
3. Restart the failed nodes when the failure is due to hardware errors.
For details, see the [experiments](docs/tech_report/fault_tolerance_exps.md)
on fault tolerance and elasticity.
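+
+The sketch below illustrates the restore policy described above. It is a
+simplified illustration with assumed error categories, not DLRover's
+implementation:
+
+```python
+# Assumed error categories, for illustration only.
+SOFTWARE_ERRORS = {"process crash", "CUDA OOM", "NCCL timeout"}
+HARDWARE_ERRORS = {"GPU fault", "node not ready", "NIC down"}
+
+
+def restore_action(failure_reason: str) -> str:
+    """Map a diagnosed failure reason to a restore action (steps 2 and 3)."""
+    if failure_reason in SOFTWARE_ERRORS:
+        return "restart the training process on the same node"
+    if failure_reason in HARDWARE_ERRORS:
+        return "relaunch the failed node and restart its process"
+    return "keep diagnosing"  # step 1: the reason is not yet known
+
+
+print(restore_action("NCCL timeout"))  # restart the training process ...
+print(restore_action("GPU fault"))     # relaunch the failed node ...
+```
+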
-#### Fault Tolerance of PyTorch Distributed Training
+#### Fault Tolerance and Flash Checkpoint to Reduce Downtime of PyTorch Training
-DLRover supports fault tolerance of the process failure and the node failure
-to restore trainig. Compared with restarting a new job, DLRover can
-reduce the overhead to schedule all Pods, pull image and
-install packages on all nodes.
+In addition to fault tolerance, DLRover provides flash checkpoint to
+save and load checkpoints in seconds. With flash checkpoint, the training can
+save checkpoints frequently, which reduces the number of rolled-back steps when
+training resumes from the latest checkpoint after a failure. The actions of flash checkpoint are:
-| Step to restore training | Failure without DLRover | Node failure with DLRover | Process failure with DLRover |
-|:-------------------------:|:-------------------------:|:---------------------------------:|:---------------------------------:|
-| Restore action | Restart Job | Restart failed nodes | Restart training process |
-| Schedule node, pull image and install packages | All nodes | Only new nodes | No |
-| Node health check | No | All nodes execute a simple allgtather task | All nodes execute a allgtather simple task |
-| Start training process | Yes | Yes | Yes |
+1. Asynchronously persist the checkpoint to storage.
+2. Persist the checkpoint to storage once the training process fails.
+3. Load the checkpoint from host memory after the training process restarts.
+
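+Below is a minimal sketch of the idea with plain PyTorch and a background
+thread; it is an illustration of the mechanism, not the DLRover flash
+checkpoint API. The state dict is first copied to host memory inside the
+training loop, and a background thread persists that copy to storage without
+blocking training:
+
+```python
+import threading
+
+import torch
+
+
+def copy_to_host(state_dict):
+    """Copy tensors to CPU (host) memory; this is the fast, blocking part."""
+    return {
+        key: value.detach().cpu().clone() if torch.is_tensor(value) else value
+        for key, value in state_dict.items()
+    }
+
+
+def persist_async(host_state, path):
+    """Write the host-memory copy to storage without blocking the training loop."""
+    thread = threading.Thread(target=torch.save, args=(host_state, path), daemon=True)
+    thread.start()
+    return thread
+
+
+model = torch.nn.Linear(8, 8)
+# Inside the training loop, checkpoints can be taken every few steps because
+# the blocking part only copies tensors to host memory.
+host_copy = copy_to_host(model.state_dict())
+persist_async(host_copy, "/tmp/ckpt.pt")
+# If the process fails, the latest host copy can still be flushed to storage, and
+# the restarted process can reload from host memory instead of remote storage.
+```
+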
+After applying the fault tolerance and flash checkpoint of DLRover, **the overall goodput
+of the largest-scale training job using thousands of GPUs increased from 69% to 95%**.
+The goodput is the time spent computing useful new steps divided by the elapsed time of the training job.
+The downtime details are shown below:
-
+![dlrover-goodput-performance](docs/figures/dlrover-goodput-performance.jpg)
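+
+As an illustration with assumed numbers: a job that runs for 30 days and spends
+28.5 of those days computing useful new steps has a goodput of 28.5 / 30 = 95%,
+while at 69% goodput the same 30 days yield only about 20.7 days of useful steps.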
-#### Fault Tolerance of TensorFlow PS Distributed Training
+#### Fault Tolerance Improves the Stability of TensorFlow PS Training
-DLRover can recover failed parameter servers and workers to
-resume training. Compared with manual restarting jobs, DLRover
-can reduce the overhead to restore the training.
+DLRover can recover failed parameter servers and workers to resume training:
-| Step to restore training | Failure without DLRover | PS failure with DLRover | Worker failure with DLRover |
-|:-----------------------------------------------:|:-------------------------:|:--------------------------:|:---------------------------:|
-| Restore action | Restart Job | Restart failed PS | Restart failed workers |
-| Schedule node, pull image and install packages | All nodes | Only new PS | Only new workers |
-| Start session | all nodes | all nodes | Only new workers |
-| Initialize Graph | Yes | Yes | Only new workers |
-| Restore checkpoint | Yes | Yes | No |
+1. Automatically launch a Pod with more memory to recover a node that fails with OOM.
+2. Reassign the training data of a failed worker to other workers, as sketched below.
+3. Automatically scale up the parameter servers to fit the model size.
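+
+Below is a minimal sketch of the idea behind reassigning a failed worker's data.
+It assumes a simple in-memory shard queue and is only an illustration, not
+DLRover's dynamic data sharding implementation:
+
+```python
+from collections import deque
+
+
+class ShardDispatcher:
+    """Hands out data shards and re-queues the shards of a failed worker."""
+
+    def __init__(self, num_shards):
+        self.todo = deque(range(num_shards))  # shards not yet assigned
+        self.doing = {}                       # worker_id -> set of shard ids
+
+    def assign(self, worker_id):
+        if not self.todo:
+            return None
+        shard = self.todo.popleft()
+        self.doing.setdefault(worker_id, set()).add(shard)
+        return shard
+
+    def report_done(self, worker_id, shard):
+        self.doing[worker_id].discard(shard)
+
+    def recover_worker(self, worker_id):
+        # Put the unfinished shards of the failed worker back into the queue.
+        for shard in self.doing.pop(worker_id, set()):
+            self.todo.appendleft(shard)
+
+
+dispatcher = ShardDispatcher(num_shards=8)
+shard = dispatcher.assign(worker_id=1)    # worker 1 receives shard 0
+dispatcher.recover_worker(worker_id=1)    # worker 1 fails; shard 0 is re-queued
+print(dispatcher.assign(worker_id=2))     # worker 2 now receives shard 0
+```
+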
-What's more, DLRover also can automatic diagnose the reason of failure. For example,
-the OOM is the common error due to user's insufficient memory configuration.
-DLRover can automatically launch a Pod with more memory to recover the OOM node.
At AntGroup, DLRover manages hundreds of DL training jobs every day on a customized Kubernetes cluster.
-Except for the failed job resulting from code errors, the rate of completed jobs raise 89%
-with tf-operator in KubeFlow to 95%. Other unrecoverable failure reasons of a job are data error,
+Excluding jobs that fail due to code errors, *the rate of completed jobs increased from 89%
+with tf-operator in KubeFlow to 95%*. Other unrecoverable failure causes of a job are data errors,
NaN loss of the model, network breakdown, and so on.
@@ -170,7 +163,7 @@ into environments. We can run `dlrover-run` like
```bash
NODE_RANK=$RANK DLROVER_MASTER_ADDR=$MASTER_ADDR:$MASTER_PORT \
-dlrover-run --standalone --network-check --nnodes=$NODE_NUM --nproc_per_node=$NUM_TRAINERS train_script.py
+dlrover-run --network-check --nnodes=$NODE_NUM --nproc_per_node=$NUM_TRAINERS train_script.py
```
**Note**:
@@ -212,7 +205,7 @@ Please refer to the [DEVELOPMENT](docs/developer_guide.md)
## Quick Start
-[Train a PyTorch Model on Kubernetes.](docs/tutorial/torch_on_cloud.md)
+[Train a PyTorch Model on Kubernetes.](docs/tutorial/torch_elasticjob_on_k8s.md)
[Train a GPT Model on Kubernetes.](docs/tutorial/torch_ddp_nanogpt.md)
diff --git a/docs/figures/dlrover-goodput-performance.jpg b/docs/figures/dlrover-goodput-performance.jpg
new file mode 100644
index 000000000..23ba1714e
Binary files /dev/null and b/docs/figures/dlrover-goodput-performance.jpg differ
diff --git a/docs/tech_report/fault_tolerance_exps.md b/docs/tech_report/fault_tolerance_exps.md
index f32d8d1ec..35c575320 100644
--- a/docs/tech_report/fault_tolerance_exps.md
+++ b/docs/tech_report/fault_tolerance_exps.md
@@ -7,7 +7,7 @@ of DLRover ElasticJob. In the experiments, we use the chaos engineering toolkit
## Preliminary
- Create a k8s cluster and configure cluster credentials on your local computer.
-- Deploy DLRover ElasticJob on the k8s cluster with the [tutorial](torch_on_cloud.md).
+- Deploy DLRover ElasticJob on the k8s cluster with the [tutorial](../tutorial/torch_elasticjob_on_k8s.md).
- Build the image with chaosblade like the [example](../../examples/pytorch/mnist/mnist_chaos.dockerfile).
## Experiments of PyTorch Distributed Job