diff --git a/TOC.md b/TOC.md
index 78f9c391e0ee9..d0ebc0b7f05f7 100644
--- a/TOC.md
+++ b/TOC.md
@@ -229,6 +229,7 @@
- [Identify Expensive Queries Using Top SQL](/dashboard/top-sql.md)
- [Identify Expensive Queries Using Logs](/identify-expensive-queries.md)
- [Save and Restore the On-Site Information of a Cluster](/sql-plan-replayer.md)
+ - [Understanding Stale Read and safe-ts in TiKV](/troubleshoot-stale-read.md)
- [Support Resources](/support.md)
- Performance Tuning
- Tuning Guide
diff --git a/grafana-tidb-dashboard.md b/grafana-tidb-dashboard.md
index 0ac514942c3da..5a25959b7d7a1 100644
--- a/grafana-tidb-dashboard.md
+++ b/grafana-tidb-dashboard.md
@@ -110,9 +110,22 @@ To understand the key metrics displayed on the TiDB dashboard, check the followi
### KV Request
+The following metrics relate to requests sent to TiKV. Retried requests are counted multiple times.
+
- KV Request OPS: the execution times of a KV request, displayed according to TiKV
- KV Request Duration 99 by store: the execution time of a KV request, displayed according to TiKV
- KV Request Duration 99 by type: the execution time of a KV request, displayed according to the request type
+- Stale Read Hit/Miss Ops:
+ - **hit**: the number of requests per second that successfully execute a stale read
+ - **miss**: the number of requests per second that attempt a stale read but fail
+- Stale Read Req Ops:
+ - **cross-zone**: the number of requests per second that attempt a stale read in a remote zone
+ - **local**: the number of requests per second that attempt a stale read in the local zone
+- Stale Read Req Traffic:
+ - **cross-zone-in**: the incoming traffic of responses to requests that attempt a stale read in a remote zone
+ - **cross-zone-out**: the outgoing traffic of requests that attempt a stale read in a remote zone
+ - **local-in**: the incoming traffic of responses to requests that attempt a stale read in the local zone
+ - **local-out**: the outgoing traffic of requests that attempt a stale read in the local zone
### PD Client
diff --git a/grafana-tikv-dashboard.md b/grafana-tikv-dashboard.md
index 9c96efb19c1b3..5d06066a6d474 100644
--- a/grafana-tikv-dashboard.md
+++ b/grafana-tikv-dashboard.md
@@ -406,6 +406,20 @@ This section provides a detailed description of these key metrics on the **TiKV-
- Total pessimistic locks memory size: The memory size occupied by the in-memory pessimistic locks
- In-memory pessimistic locking result: The result of only saving pessimistic locks to memory. `full` means the number of times that the pessimistic lock is not saved to memory because the memory limit is exceeded.
+### Resolved-TS
+
+- Resolved-TS worker CPU: The CPU utilization of the resolved-ts worker threads
+- Advance-TS worker CPU: The CPU utilization of the advance-ts worker threads
+- Scan lock worker CPU: The CPU utilization of the scan lock worker threads
+- Max gap of resolved-ts: The maximum time difference between the resolved-ts of all active Regions in this TiKV and the current time
+- Max gap of safe-ts: The maximum time difference between the safe-ts of all active Regions in this TiKV and the current time
+- Min Resolved TS Region: The ID of the Region with the minimum resolved-ts
+- Min Safe TS Region: The ID of the Region with the minimum safe-ts
+- Check Leader Duration: The time spent on processing check leader requests, measured from when a request is sent to when its response is received
+- Max gap of resolved-ts in Region leaders: The maximum time difference between the resolved-ts of all active Regions in this TiKV and the current time, only for Region leaders
+- Min Leader Resolved TS Region: The ID of the Region with the minimum resolved-ts, only for Region leaders
+- Lock heap size: The memory footprint of the heap that tracks locks in the resolved-ts module
+
### Memory
- Allocator Stats: The statistics of the memory allocator
diff --git a/media/stale-read/example-ops.png b/media/stale-read/example-ops.png
new file mode 100644
index 0000000000000..90b5cf3c6c272
Binary files /dev/null and b/media/stale-read/example-ops.png differ
diff --git a/media/stale-read/example-ts-gap.png b/media/stale-read/example-ts-gap.png
new file mode 100644
index 0000000000000..3f51d7d3754ee
Binary files /dev/null and b/media/stale-read/example-ts-gap.png differ
diff --git a/media/stale-read/metrics-hit-miss.png b/media/stale-read/metrics-hit-miss.png
new file mode 100644
index 0000000000000..304d7a9482255
Binary files /dev/null and b/media/stale-read/metrics-hit-miss.png differ
diff --git a/media/stale-read/metrics-ops.png b/media/stale-read/metrics-ops.png
new file mode 100644
index 0000000000000..4f0615fdb1973
Binary files /dev/null and b/media/stale-read/metrics-ops.png differ
diff --git a/media/stale-read/traffic.png b/media/stale-read/traffic.png
new file mode 100644
index 0000000000000..a2359974c298f
Binary files /dev/null and b/media/stale-read/traffic.png differ
diff --git a/releases/release-6.5.4.md b/releases/release-6.5.4.md
index af883d144ae4c..7ffb480aef4d9 100644
--- a/releases/release-6.5.4.md
+++ b/releases/release-6.5.4.md
@@ -26,7 +26,7 @@ Quick access: [Quick start](https://docs.pingcap.com/tidb/v6.5/quick-start-with-
+ TiKV
- Use gzip compression for `check_leader` requests to reduce traffic [#14553](https://github.com/tikv/tikv/issues/14553) @[you06](https://github.com/you06)
- - Add the `Max gap of safe-ts` and `Min safe ts region` metrics and introduce the `tikv-ctl get_region_read_progress` command to better observe and diagnose the status of resolved-ts and safe-ts [#15082](https://github.com/tikv/tikv/issues/15082) @[ekexium](https://github.com/ekexium)
+ - Add the `Max gap of safe-ts` and `Min safe ts region` metrics and introduce the `tikv-ctl get-region-read-progress` command to better observe and diagnose the status of resolved-ts and safe-ts [#15082](https://github.com/tikv/tikv/issues/15082) @[ekexium](https://github.com/ekexium)
- Expose some RocksDB configurations in TiKV that allow users to disable features such as TTL and periodic compaction [#14873](https://github.com/tikv/tikv/issues/14873) @[LykxSassinator](https://github.com/LykxSassinator)
- Avoid holding mutex when writing Titan manifest files to prevent affecting other threads [#15351](https://github.com/tikv/tikv/issues/15351) @[Connor1996](https://github.com/Connor1996)
diff --git a/stale-read.md b/stale-read.md
index 7e5ad7a920498..bb1d2fb33ed38 100644
--- a/stale-read.md
+++ b/stale-read.md
@@ -52,6 +52,12 @@ advance-ts-interval = "20s" # The default value is "20s". You can set it to a sm
>
> Decreasing the preceding TiKV configuration item will lead to an increase in TiKV CPU usage and traffic between nodes.
+
+For more information about the internals of Resolved TS and diagnostic techniques, see [Understanding Stale Read and safe-ts in TiKV](/troubleshoot-stale-read.md).
## Restrictions
When a Stale Read query for a table is pushed down to TiFlash, the query will return an error if this table has newer DDL operations executed after the read timestamp specified by the query. This is because TiFlash only supports reading data from the tables with the latest schemas.
diff --git a/tikv-control.md b/tikv-control.md
index 0c90642bb59c8..b45d8eb76e63a 100644
--- a/tikv-control.md
+++ b/tikv-control.md
@@ -652,3 +652,40 @@ From the output above, you can see that the information of the damaged SST file
+ In the `sst meta` part, `14` means the SST file number; `552997` means the file size, followed by the smallest and largest sequence numbers and other meta-information.
+ The `overlap region` part shows the information of the Region involved. This information is obtained through the PD server.
+ The `suggested operations` part provides you suggestion to clean up the damaged SST file. You can take the suggestion to clean up files and restart the TiKV instance.
+
+### Get the state of a Region's `RegionReadProgress`
+
+Starting from v6.5.4, v7.1.2, and v7.3.0, TiKV introduces the `get-region-read-progress` subcommand to get up-to-date details of the resolver and `RegionReadProgress` of a Region. You need to specify a Region ID and the address of a TiKV node, both of which you can obtain from Grafana (the `Min Resolved TS Region` and `Min Safe TS Region` metrics) or from `DataIsNotReady` logs.
+
+- `--log` (optional): If specified, TiKV logs the smallest `start_ts` of locks in the Region's resolver on this TiKV at the `INFO` level. This option helps you identify locks that might block resolved-ts from advancing.
+
+- `--min-start-ts` (optional): If specified, TiKV filters out locks whose `start_ts` is smaller than this value from the logs. You can use this option to specify a transaction of interest for logging. The default value is `0`, which means no filtering.
+
+The following is an example:
+
+```bash
+./tikv-ctl --host 127.0.0.1:20160 get-region-read-progress -r 14 --log --min-start-ts 0
+```
+
+The output is as follows:
+
+```log
+Region read progress:
+ exist: true,
+ safe_ts: 0,
+ applied_index: 92,
+ pending front item (oldest) ts: 0,
+ pending front item (oldest) applied index: 0,
+ pending back item (latest) ts: 0,
+ pending back item (latest) applied index: 0,
+ paused: false,
+Resolver:
+ exist: true,
+ resolved_ts: 0,
+ tracked index: 92,
+ number of locks: 0,
+ number of transactions: 0,
+ stopped: false,
+```
+
+This subcommand is useful for diagnosing issues related to Stale Read and safe-ts. For details, see [Understanding Stale Read and safe-ts in TiKV](/troubleshoot-stale-read.md).
\ No newline at end of file
diff --git a/troubleshoot-stale-read.md b/troubleshoot-stale-read.md
new file mode 100644
index 0000000000000..589ee1d839115
--- /dev/null
+++ b/troubleshoot-stale-read.md
@@ -0,0 +1,215 @@
+---
+title: Understanding Stale Read and safe-ts in TiKV
+summary: Introduce the principles of Stale Read and safe-ts in TiKV and provide troubleshooting tips and examples for diagnosing common issues related to Stale Read.
+---
+
+# Understanding Stale Read and safe-ts in TiKV
+
+In this guide, you can learn about Stale Read and safe-ts in TiKV and how to diagnose common issues related to Stale Read.
+
+## Overview of Stale Read and safe-ts
+
+[Stale Read](/stale-read.md) is a mechanism that TiDB applies to read historical versions of data stored in TiDB. In TiKV, Stale Read relies on [safe-ts](#what-is-safe-ts). If a read request on a Region peer carries a timestamp (ts) that is less than or equal to the Region's safe-ts, TiDB can safely read the data from the peer. TiKV implements this safety guarantee by ensuring that safe-ts is always less than or equal to [resolved-ts](#what-is-resolved-ts).
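+
+For example, assuming a table `t` exists, the following query reads the data of `t` as it was 5 seconds ago. If the specified timestamp is not greater than the safe-ts of a local replica, the request can be served by that replica. See [Stale Read](/stale-read.md) for the complete syntax and usage:
+
+```sql
+-- Read a snapshot of table `t` as of 5 seconds ago.
+SELECT * FROM t AS OF TIMESTAMP NOW() - INTERVAL 5 SECOND;
+```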
+
+## Understand safe-ts and resolved-ts
+
+This section explains the concepts and maintenance of safe-ts and resolved-ts.
+
+### What is safe-ts?
+
+The safe-ts is a timestamp that each peer in a Region maintains. It ensures that all transactions with a timestamp less than this value have been applied locally, which enables local Stale Read.
+
+### What is resolved-ts?
+
+The resolved-ts is a timestamp that guarantees all transactions with a timestamp less than this value have been applied by the leader. Unlike safe-ts, which is a peer concept, resolved-ts is only maintained by the Region leader. Followers might have a smaller apply index than the leader, so resolved-ts cannot be directly treated as safe-ts in followers.
+
+### The maintenance of safe-ts
+
+The `RegionReadProgress` module maintains safe-ts. The Region leader maintains resolved-ts and periodically sends its resolved-ts, the minimum apply index required to validate this resolved-ts, and the Region itself to the `RegionReadProgress` modules of all replicas via the CheckLeader RPC.
+
+When a peer applies data, it updates the apply index and checks if any pending resolved-ts can become the new safe-ts.
+
+### The maintenance of resolved-ts
+
+The Region leader uses a resolver to manage resolved-ts. The resolver tracks locks in the LOCK CF (Column Family) by receiving change logs when Raft entries are applied. When it is initialized, the resolver scans the entire Region to track existing locks.
+
+## Diagnose Stale Read issues
+
+This section introduces how to diagnose Stale Read issues using Grafana and `tikv-ctl`.
+
+### Identify issues
+
+In the [**TiDB** > **KV Request** dashboard](/grafana-tidb-dashboard.md#kv-request) in Grafana, the following panels show the hit rate, OPS, and traffic of Stale Read:
+
+![Stale Read Hit/Miss OPS](/media/stale-read/metrics-hit-miss.png)
+
+![Stale Read Req OPS](/media/stale-read/metrics-ops.png)
+
+![Stale Read Req Traffic](/media/stale-read/traffic.png)
+
+For more information about the preceding metrics, see [TiDB monitoring metrics](/grafana-tidb-dashboard.md#kv-request).
+
+When Stale Read issues occur, you might notice changes in the preceding metrics. The most direct indicator is a WARN log from TiDB, which reports `DataIsNotReady` with a Region ID and the `safe-ts` it encounters.
+
+### Common causes
+
+The most common causes that can impact the effectiveness of Stale Read are as follows:
+
+- Transactions that take a long time to commit.
+- Transactions that stay active for a long time before they commit.
+- Delays in delivering CheckLeader information from the leader to the followers.
+
+### Use Grafana to diagnose
+
+In the [**TiKV-Details** > **Resolved-TS** dashboard](/grafana-tikv-dashboard.md#resolved-ts), you can identify the Region with the smallest resolved-ts and safe-ts on each TiKV. If these timestamps lag significantly behind the current time, you need to check the details of these Regions using `tikv-ctl`.
+
+### Use `tikv-ctl` to diagnose
+
+`tikv-ctl` provides up-to-date details of the resolver and `RegionReadProgress`. For more information, see [Get the state of a Region's `RegionReadProgress`](/tikv-control.md#get-the-state-of-a-regions-regionreadprogress).
+
+The following is an example:
+
+```bash
+./tikv-ctl --host 127.0.0.1:20160 get-region-read-progress -r 14 --log --min-start-ts 0
+```
+
+The output is as follows:
+
+```log
+Region read progress:
+ exist: true,
+ safe_ts: 0,
+ applied_index: 92,
+ pending front item (oldest) ts: 0,
+ pending front item (oldest) applied index: 0,
+ pending back item (latest) ts: 0,
+ pending back item (latest) applied index: 0,
+ paused: false,
+Resolver:
+ exist: true,
+ resolved_ts: 0,
+ tracked index: 92,
+ number of locks: 0,
+ number of transactions: 0,
+ stopped: false,
+```
+
+The preceding output helps you determine:
+
+- Whether locks are blocking resolved-ts.
+- Whether the apply index is too small to update safe-ts.
+- Whether the leader is sending a sufficiently updated resolved-ts when a follower peer exists.
+
+## Troubleshooting tips
+
+### Handle slow transaction commit
+
+A transaction that takes a long time to commit is often a large transaction. The prewrite phase of such a slow transaction leaves some locks, and it takes too long before the commit phase cleans up these locks. To troubleshoot this issue, you can identify the transaction to which the locks belong and pinpoint the reason why the locks still exist, for example, by checking logs.
+
+The following lists some actions you can take:
+
+- Specify the `--log` option in the `tikv-ctl` command and check TiKV logs to find the specific locks and their `start_ts`.
+- Search for the `start_ts` in both TiDB and TiKV logs to identify issues with the transaction.
+
+ If a query takes over 60 seconds, an `expensive_query` log is printed with the SQL statement. You can use the start_ts value to match the log. The following is an example:
+
+ ```log
+    [2023/07/17 19:32:09.403 +08:00] [WARN] [expensivequery.go:145] [expensive_query] [cost_time=60.025022732s] [cop_time=0.00346666s] [process_time=8.358409508s] [wait_time=0.013582596s] [request_count=278] [total_keys=9943616] [process_keys=9943360] [num_cop_tasks=278] [process_avg_time=0.030066221s] [process_p90_time=0.045296042s] [process_max_time=0.052828934s] [process_max_addr=192.168.31.244:20160] [wait_avg_time=0.000048858s] [wait_p90_time=0.00006057s] [wait_max_time=0.00040991s] [wait_max_addr=192.168.31.244:20160] [stats=t:442916666913587201] [conn=2826881778407440457] [user=root] [database=test] [table_ids="[100]"] [txn_start_ts=442916790435840001] [mem_max="2514229289 Bytes (2.34 GB)"] [sql="update t set b = b + 1"]
+ ```
+
+- Use the [`CLUSTER_TIDB_TRX`](/information-schema/information-schema-tidb-trx.md#cluster_tidb_trx) table to find active transactions if you cannot get enough information about the locks from the logs, as shown in the example after this list.
+- Execute [`SHOW PROCESSLIST`](/sql-statements/sql-statement-show-processlist.md) to view the sessions connected to the same TiDB server and the time each session has spent on its current statement. However, it does not show the `start_ts`.
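+
+For example, the following query lists the oldest active transactions in the cluster through the `CLUSTER_TIDB_TRX` table. The columns used here are a subset of the table schema; see the linked document for the exact columns available in your version:
+
+```sql
+-- List the ten oldest active transactions across all TiDB instances.
+-- The ID column is the start_ts of the transaction.
+SELECT INSTANCE, ID, START_TIME, STATE, SESSION_ID, CURRENT_SQL_DIGEST_TEXT
+FROM information_schema.cluster_tidb_trx
+ORDER BY START_TIME
+LIMIT 10;
+```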
+
+If the locks exist due to ongoing large transactions, consider modifying your application logic, because these locks can hinder the progress of resolved-ts.
+
+If the locks do not belong to any ongoing transactions, it might be due to a coordinator (TiDB) crashing after it prewrites the locks. In this case, TiDB will automatically resolve the locks. No action is required unless the problem persists.
+
+### Handle long-lived transactions
+
+Transactions that remain active for a long time can block the advance of resolved-ts, even if they eventually commit quickly. This is because the `start_ts` of these long-lived transactions is used to calculate resolved-ts.
+
+To address this issue:
+
+- Identify the transaction: Begin by pinpointing the transaction associated with the locks and understanding why the locks exist. Checking logs can be particularly helpful.
+
+- Examine the application logic: If the prolonged transaction duration is a result of your application logic, consider revising it to prevent such occurrences.
+
+- Address slow queries: If the transaction duration is extended due to slow queries, prioritize resolving these queries to alleviate the issue. The query example after this list shows one way to find them.
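+
+The following sketch queries the `CLUSTER_PROCESSLIST` table in `information_schema` to list statements that have been running for a long time on any TiDB instance. The 60-second threshold is only an illustrative value; adjust it to your workload:
+
+```sql
+-- Find statements that have been running for more than 60 seconds
+-- across all TiDB instances.
+SELECT INSTANCE, ID, DB, TIME, INFO
+FROM information_schema.cluster_processlist
+WHERE TIME > 60
+ORDER BY TIME DESC;
+```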
+
+### Address CheckLeader issues
+
+To address CheckLeader issues, you can check the network between TiKV nodes and the **Check Leader Duration** metric in the [**TiKV-Details** > **Resolved-TS** dashboard](/grafana-tikv-dashboard.md#resolved-ts).
+
+## Example
+
+Suppose that you observe an increasing miss rate in the **Stale Read Hit/Miss Ops** panel as follows:
+
+![Example: Stale Read OPS](/media/stale-read/example-ops.png)
+
+You can first check the **Max gap of resolved-ts** and **Min Resolved TS Region** metrics in the [**TiKV-Details** > **Resolved-TS** dashboard](/grafana-tikv-dashboard.md#resolved-ts):
+
+![Example: Max Resolved TS gap](/media/stale-read/example-ts-gap.png)
+
+From the preceding metrics, you can find that Region `3121` and some other Regions have not updated their resolved-ts in time.
+
+To get more details about the state of Region `3121`, you can run the following command:
+
+```bash
+./tikv-ctl --host 127.0.0.1:20160 get-region-read-progress -r 3121 --log
+```
+
+The output is as follows:
+
+```log
+Region read progress:
+ exist: true,
+ safe_ts: 442918444145049601,
+ applied_index: 2477,
+ read_state.ts: 442918444145049601,
+ read_state.apply_index: 1532,
+ pending front item (oldest) ts: 0,
+ pending front item (oldest) applied index: 0,
+ pending back item (latest) ts: 0,
+ pending back item (latest) applied index: 0,
+ paused: false,
+ discarding: false,
+Resolver:
+ exist: true,
+ resolved_ts: 442918444145049601,
+ tracked index: 2477,
+ number of locks: 480000,
+ number of transactions: 1,
+ stopped: false,
+```
+
+A notable observation here is that `applied_index` equals the `tracked index` in the resolver, which means the resolver has kept up with the apply progress. Therefore, the resolver appears to be the root of this issue. You can also see that there is 1 transaction that leaves 480000 locks in this Region, which might be the cause.
+
+To get the exact transaction and the keys of some of the locks, you can check TiKV logs and grep for `locks with`. The output is as follows:
+
+```log
+[2023/07/17 21:16:44.257 +08:00] [INFO] [resolver.rs:213] ["locks with the minimum start_ts in resolver"] [keys="[74800000000000006A5F7280000000000405F6, ... , 74800000000000006A5F72800000000000EFF6, 74800000000000006A5F7280000000000721D9, 74800000000000006A5F72800000000002F691]"] [start_ts=442918429687808001] [region_id=3121]
+```
+
+From the TiKV log, you can get the `start_ts` of the transaction, which is `442918429687808001`. To get more information about the statement and the transaction, you can grep for this `start_ts` in TiDB logs. The output is as follows:
+
+```log
+[2023/07/17 21:16:18.287 +08:00] [INFO] [2pc.go:685] ["[BIG_TXN]"] [session=2826881778407440457] ["key sample"=74800000000000006a5f728000000000000000] [size=319967171] [keys=10000000] [puts=10000000] [dels=0] [locks=0] [checks=0] [txnStartTS=442918429687808001]
+
+[2023/07/17 21:16:22.703 +08:00] [WARN] [expensivequery.go:145] [expensive_query] [cost_time=60.047172498s] [cop_time=0.004575113s] [process_time=15.356963423s] [wait_time=0.017093811s] [request_count=397] [total_keys=20000398] [process_keys=10000000] [num_cop_tasks=397] [process_avg_time=0.038682527s] [process_p90_time=0.082608262s] [process_max_time=0.116321331s] [process_max_addr=192.168.31.244:20160] [wait_avg_time=0.000043057s] [wait_p90_time=0.00004007s] [wait_max_time=0.00075014s] [wait_max_addr=192.168.31.244:20160] [stats=t:442918428521267201] [conn=2826881778407440457] [user=root] [database=test] [table_ids="[106]"] [txn_start_ts=442918429687808001] [mem_max="2513773983 Bytes (2.34 GB)"] [sql="update t set b = b + 1"]
+```
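+
+If you want to confirm when this transaction started, you can convert the TSO timestamp from the logs into wall-clock time with the `TIDB_PARSE_TSO()` function:
+
+```sql
+-- Convert the start_ts found in the logs into a readable time.
+SELECT TIDB_PARSE_TSO(442918429687808001);
+```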
+
+At this point, you have located the statement that caused the problem. To check it further, you can execute the [`SHOW PROCESSLIST`](/sql-statements/sql-statement-show-processlist.md) statement. The output is as follows:
+
+```sql
++---------------------+------+---------------------+--------+---------+------+------------+---------------------------+
+| Id | User | Host | db | Command | Time | State | Info |
++---------------------+------+---------------------+--------+---------+------+------------+---------------------------+
+| 2826881778407440457 | root | 192.168.31.43:58641 | test | Query | 48 | autocommit | update t set b = b + 1 |
+| 2826881778407440613 | root | 127.0.0.1:45952 | test | Execute | 0 | autocommit | select * from t where a=? |
+| 2826881778407440619 | root | 192.168.31.43:60428 | | Query | 0 | autocommit | show processlist |
++---------------------+------+---------------------+--------+---------+------+------------+---------------------------+
+```
+
+The output shows that someone is executing an unexpected `UPDATE` statement (`update t set b = b + 1`), which results in a large transaction and hinders Stale Read.
+
+To resolve this issue, you can stop the application that is running this `UPDATE` statement.
\ No newline at end of file