8.5 performance highlights: update the table for IO jitter improvement #19899

Merged on Jan 8, 2025 (16 commits)
Changes from 13 commits
106 changes: 81 additions & 25 deletions tidb-cloud/v8.5-performance-highlights.md
@@ -83,31 +83,87 @@ TiDB v8.5.0 introduces multiple enhancements to mitigate the impact of cloud dis

### Test results

The failover time of the IO latency jitter is 30% shorter, and P99/999 latency is reduced by 70% or more.

- Test results without IO latency jitter improvement

| Workload | Failover time | QPS drop rate | Maximum latency (P999) during failover | Maximum latency (P99) during failover |
| --- | --- | --- | --- | --- |
| IO delay of 1 s lasts for 10 mins | 4 mins | 99% | 1 min | 56 s |
| IO delay of 500 ms lasts for 10 mins | 4 mins | 99% | 54 s | 7.8 s |
| IO delay of 100 ms lasts for 10 mins | Failover not achieved | 99% | 32 s | 26 s |
| IO delay of 50 ms lasts for 10 mins | Failover not achieved | 97% | 13.2 s | 6.7 s |
| IO delay of 10 ms lasts for 10 mins | Failover not achieved | 94% | 3 s | 1.45 s |
| IO delay of 5 ms lasts for 10 mins | Failover not achieved | 81% | 462 ms | 246 ms |
| IO delay of 2 ms lasts for 10 mins | Failover not achieved | 38% | 232 ms | 22.9 ms |

- Test results with IO latency jitter improvement

| Workload | Failover time | QPS drop rate | Maximum latency (P999) during failover | Maximum latency (P99) during failover |
| --- | --- | --- | --- | --- |
| IO delay of 1 s lasts for 10 mins | 3 mins | 93% | 4.66 s | 929 ms |
| IO delay of 500 ms lasts for 10 mins | 2 mins | 92% | 7.22 s | 894 ms |
| IO delay of 100 ms lasts for 10 mins | 3 mins | 80% | 7.53 s | 1.7 s |
| IO delay of 50 ms lasts for 10 mins | 3 mins | 53% | 1.36 s | 238 ms |
| IO delay of 10 ms lasts for 10 mins | 3 mins | 18% | 69 ms | 25 ms |
| IO delay of 5 ms lasts for 10 mins | 2 mins | 29% | 37.9 ms | 10 ms |
| IO delay of 2 ms lasts for 10 mins | Almost no effect | 1% | 14 ms | 7.9 ms |

Failovers are now available in multiple IO delay scenarios, and P99/P999 latency during impacts is reduced by up to 98%.

In the following table of test results, the **Current** column shows the results with improvements to reduce IO latency jitter, while the **Original** column shows the results without these improvements:

<table>
<thead>
<tr>
<th rowspan="2">Workload description</th>
<th colspan="2">Failover time</th>
<th colspan="2">Maximum latency during impacts (P999)</th>
<th colspan="2">Maximum latency during impacts (P99)</th>
</tr>
<tr>
<th>Current</th>
<th>Original</th>
<th>Current</th>
<th>Original</th>
<th>Current</th>
<th>Original</th>
</tr>
</thead>
<tbody>
<tr>
<td>IO delay of 2 ms lasts for 10 mins</td>
<td>Almost no effect</td>
<td>Failover not available</td>
<td>14 ms</td>
<td>232 ms</td>
<td>7.9 ms</td>
<td>22.9 ms</td>
</tr>
<tr>
<td>IO delay of 5 ms lasts for 10 mins</td>
<td>2 mins</td>
<td>Failover not available</td>
<td>37.9 ms</td>
<td>462 ms</td>
<td>10 ms</td>
<td>246 ms</td>
</tr>
<tr>
<td>IO delay of 10 ms lasts for 10 mins</td>
<td>3 mins</td>
<td>Failover not available</td>
<td>69 ms</td>
<td>3 s</td>
<td>25 ms</td>
<td>1.45 s</td>
</tr>
<tr>
<td>IO delay of 50 ms lasts for 10 mins</td>
<td>3 mins</td>
<td>Failover not available</td>
<td>1.36 s</td>
<td>13.2 s</td>
<td>238 ms</td>
<td>6.7 s</td>
</tr>
<tr>
<td>IO delay of 100 ms lasts for 10 mins</td>
<td>3 mins</td>
<td>Failover not available</td>
<td>7.53 s</td>
<td>32 s</td>
<td>1.7 s</td>
<td>26 s</td>
</tr>
</tbody>
</table>
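
The "up to 98%" figure can be checked against the table with a little arithmetic. The following is a minimal sketch, not part of the original test tooling: the latency values are copied from the table and converted to milliseconds, and the names `P999_MS`, `P99_MS`, `reduction_pct`, and `reductions` are hypothetical and introduced only for illustration.

```python
# Latency values copied from the table above, converted to milliseconds.
# Each entry maps an IO delay workload to (current, original).
P999_MS = {
    "2 ms": (14, 232),
    "5 ms": (37.9, 462),
    "10 ms": (69, 3000),
    "50 ms": (1360, 13200),
    "100 ms": (7530, 32000),
}
P99_MS = {
    "2 ms": (7.9, 22.9),
    "5 ms": (10, 246),
    "10 ms": (25, 1450),
    "50 ms": (238, 6700),
    "100 ms": (1700, 26000),
}


def reduction_pct(current_ms, original_ms):
    """Return the percentage by which latency dropped relative to the original."""
    return (1 - current_ms / original_ms) * 100


def reductions(table):
    """Map each workload to its latency reduction, rounded to one decimal place."""
    return {
        workload: round(reduction_pct(current, original), 1)
        for workload, (current, original) in table.items()
    }


if __name__ == "__main__":
    for name, table in (("P999", P999_MS), ("P99", P99_MS)):
        for workload, pct in reductions(table).items():
            print(f"{name}, IO delay of {workload}: {pct}% lower")
```

With these inputs, the largest reduction is the P99 latency for the 10 ms IO delay workload (25 ms versus 1.45 s, roughly 98%), which matches the headline figure. The reductions vary noticeably across workloads, so the 98% value is an upper bound rather than a typical result.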

![Maximum latency during impacts (P999)](/media/tidb-cloud/v8.5-maximum-latency-during-impacts-p999.png)

![Maximum latency during impacts (P99)](/media/tidb-cloud/v8.5-maximum-latency-during-impacts-p99.png)

### Further improvements

Due to the inherent risk of physical disk damage, the cloud disk jitter issue is unavoidable. To mitigate its impact, TiKV introduces a [slow node detection mechanism](https://docs.pingcap.com/tidb/v8.5/pd-scheduling-best-practices#troubleshoot-tikv-node). This mechanism uses [evict-slow-store-scheduler](https://docs.pingcap.com/tidb/v8.5/pd-control#scheduler-show--add--remove--pause--resume--config--describe) to detect and manage slow nodes, reducing the effects of cloud disk jitter.

The severity of disk jitter might also be highly related to users' workload profiles. In latency-sensitive scenarios, designing applications in conjunction with TiDB features can further minimize the impact of IO jitter on applications. For example, in read-heavy and latency-sensitive environments, adjusting the [tikv_client_read_timeout](/system-variables.md#tikv_client_read_timeout-new-in-v740) system variable according to latency requirements and using stale reads or follower reads can enable faster failover retries to other replica peers for KV requests sent from TiDB. This reduces the impact of IO jitter on a single TiKV node and helps improve query latency. Note that the effectiveness of this feature depends on the workload profile, which should be evaluated before implementation.
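
As an illustration of the read-heavy case above, the following minimal sketch builds the session-level statements an application might issue after connecting. It rests on several assumptions that should be verified against the TiDB documentation for your version: `jitter_tolerant_session_sql` is a hypothetical helper (not a TiDB API), follower reads are assumed to be enabled through the `tidb_replica_read` system variable, stale reads are assumed to be enabled through the `tidb_read_staleness` system variable (a negative number of seconds), and the 100 ms timeout and 5 s staleness are placeholder values rather than recommendations.

```python
def jitter_tolerant_session_sql(read_timeout_ms, mode="follower", staleness_s=5):
    """Build session-level SET statements for a latency-sensitive, read-heavy session.

    Hypothetical helper for illustration only. It returns SQL strings and
    does not connect to a database.
    """
    if read_timeout_ms < 0:
        raise ValueError("read_timeout_ms must be non-negative")

    # A shorter read timeout lets TiDB give up on a slow TiKV node sooner
    # and retry the KV request on another replica peer.
    statements = [
        f"SET SESSION tikv_client_read_timeout = {int(read_timeout_ms)};"
    ]

    if mode == "follower":
        # Assumption: follower reads are enabled through tidb_replica_read.
        statements.append("SET SESSION tidb_replica_read = 'follower';")
    elif mode == "stale":
        # Assumption: tidb_read_staleness takes a negative number of seconds.
        statements.append(
            f"SET SESSION tidb_read_staleness = {-abs(int(staleness_s))};"
        )
    else:
        raise ValueError("mode must be 'follower' or 'stale'")

    return statements


if __name__ == "__main__":
    for statement in jitter_tolerant_session_sql(100, mode="stale", staleness_s=5):
        print(statement)
```

Setting the variables at session scope keeps the change limited to the latency-sensitive connections, which makes it easier to evaluate the effect on a given workload before applying it more broadly.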

Additionally, cloud users can reduce the probability of jitter by choosing cloud disks with higher performance.

## Batch processing
