8.5 performance highlights: update the table for IO jitter improvement (
qiancai authored Jan 8, 2025
1 parent b279bdc commit 711632a
Showing 1 changed file with 77 additions and 25 deletions.
102 changes: 77 additions & 25 deletions tidb-cloud/v8.5-performance-highlights.md
@@ -83,31 +83,83 @@ TiDB v8.5.0 introduces multiple enhancements to mitigate the impact of cloud dis
### Test results
The failover time of the IO latency jitter is 30% shorter, and P99/999 latency is reduced by 70% or more.
- Test results without IO latency jitter improvement
| Workload | Failover time | QPS drop rate | Maximum latency (P999) during failover | Maximum latency (P99) during failover |
| --- | --- | --- | --- | --- |
| IO delay of 1 s lasts for 10 mins | 4 mins | 99% | 1 min | 56 s |
| IO delay of 500 ms lasts for 10 mins | 4 mins | 99% | 54 s | 7.8 s |
| IO delay of 100 ms lasts for 10 mins | Failover not achieved | 99% | 32 s | 26 s |
| IO delay of 50 ms lasts for 10 mins | Failover not achieved | 97% | 13.2 s | 6.7 s |
| IO delay of 10 ms lasts for 10 mins | Failover not achieved | 94% | 3 s | 1.45 s |
| IO delay of 5 ms lasts for 10 mins | Failover not achieved | 81% | 462 ms | 246 ms |
| IO delay of 2 ms lasts for 10 mins | Failover not achieved | 38% | 232 ms | 22.9 ms |
- Test results with IO latency jitter improvement
| Workload | Failover time | QPS drop rate | Maximum latency (P999) during failover | Maximum latency (P99) during failover |
| --- | --- | --- | --- | --- |
| IO delay of 1 s lasts for 10 mins | 3 mins | 93% | 4.66 s | 929 ms |
| IO delay of 500 ms lasts for 10 mins | 2 mins | 92% | 7.22 s | 894 ms |
| IO delay of 100 ms lasts for 10 mins | 3 mins | 80% | 7.53 s | 1.7 s |
| IO delay of 50 ms lasts for 10 mins | 3 mins | 53% | 1.36 s | 238 ms |
| IO delay of 10 ms lasts for 10 mins | 3 mins | 18% | 69 ms | 25 ms |
| IO delay of 5 ms lasts for 10 mins | 2 mins | 29% | 37.9 ms | 10 ms |
| IO delay of 2 ms lasts for 10 mins | Almost no effect | 1% | 14 ms | 7.9 ms |
Failovers are now available in multiple IO delay scenarios, and P99/999 latency during impacts is reduced by up to 98%.
In the following table of test results, the **Current** column shows the results with improvements to reduce IO latency jitter, while the **Original** column shows the results without these improvements:
<table>
<thead>
<tr>
<th rowspan="2">Workload description</th>
<th colspan="2">Failover time</th>
<th colspan="2">Maximum latency during impacts (P999)</th>
<th colspan="2">Maximum latency during impacts (P99)</th>
</tr>
<tr>
<th>Current</th>
<th>Original</th>
<th>Current</th>
<th>Original</th>
<th>Current</th>
<th>Original</th>
</tr>
</thead>
<tbody>
<tr>
<td>IO delay of 2 ms lasts for 10 mins</td>
<td>Almost no effect</td>
<td>Failover not available</td>
<td>14 ms</td>
<td>232 ms</td>
<td>7.9 ms</td>
<td>22.9 ms</td>
</tr>
<tr>
<td>IO delay of 5 ms lasts for 10 mins</td>
<td>2 mins</td>
<td>Failover not available</td>
<td>37.9 ms</td>
<td>462 ms</td>
<td>10 ms</td>
<td>246 ms</td>
</tr>
<tr>
<td>IO delay of 10 ms lasts for 10 mins</td>
<td>3 mins</td>
<td>Failover not available</td>
<td>69 ms</td>
<td>3 s</td>
<td>25 ms</td>
<td>1.45 s</td>
</tr>
<tr>
<td>IO delay of 50 ms lasts for 10 mins</td>
<td>3 mins</td>
<td>Failover not available</td>
<td>1.36 s</td>
<td>13.2 s</td>
<td>238 ms</td>
<td>6.7 s</td>
</tr>
<tr>
<td>IO delay of 100 ms lasts for 10 mins</td>
<td>3 mins</td>
<td>Failover not available</td>
<td>7.53 s</td>
<td>32 s</td>
<td>1.7 s</td>
<td>26 s</td>
</tr>
</tbody>
</table>
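The "up to 98%" figure can be cross-checked against the table above. The following is a minimal sketch that recomputes the relative reduction for each row from the **Original** and **Current** values; the helper names (`to_ms`, `reduction_pct`, `best_reduction`) are illustrative and not part of TiDB or its tooling.

```python
# (current, original) latency pairs copied from the comparison table.
ROWS = {
    "2 ms":   {"P999": ("14 ms", "232 ms"),   "P99": ("7.9 ms", "22.9 ms")},
    "5 ms":   {"P999": ("37.9 ms", "462 ms"), "P99": ("10 ms", "246 ms")},
    "10 ms":  {"P999": ("69 ms", "3 s"),      "P99": ("25 ms", "1.45 s")},
    "50 ms":  {"P999": ("1.36 s", "13.2 s"),  "P99": ("238 ms", "6.7 s")},
    "100 ms": {"P999": ("7.53 s", "32 s"),    "P99": ("1.7 s", "26 s")},
}

UNIT_TO_MS = {"ms": 1.0, "s": 1000.0}


def to_ms(text):
    """Convert a string such as '1.45 s' or '25 ms' to milliseconds."""
    value, unit = text.split()
    return float(value) * UNIT_TO_MS[unit]


def reduction_pct(current, original):
    """Relative latency reduction of current versus original, in percent."""
    cur, orig = to_ms(current), to_ms(original)
    return (orig - cur) / orig * 100.0


def best_reduction(rows):
    """Largest reduction across all rows and both percentiles."""
    return max(
        reduction_pct(cur, orig)
        for metrics in rows.values()
        for cur, orig in metrics.values()
    )


if __name__ == "__main__":
    for delay, metrics in ROWS.items():
        for name, (cur, orig) in metrics.items():
            print(f"IO delay {delay:>6} {name:>4}: {reduction_pct(cur, orig):5.1f}% lower")
    print(f"best case: {best_reduction(ROWS):.1f}% lower")
```

The largest reduction comes from the P99 latency in the 10 ms scenario (1.45 s down to 25 ms), which is roughly 98% and matches the summary sentence.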
### Further improvements
Due to the inherent risk of physical disk damage, the cloud disk jitter issue is unavoidable. To mitigate its impact, TiKV introduces a [slow node detection mechanism](https://docs.pingcap.com/tidb/v8.5/pd-scheduling-best-practices#troubleshoot-tikv-node). This mechanism uses [evict-slow-store-scheduler](https://docs.pingcap.com/tidb/v8.5/pd-control#scheduler-show--add--remove--pause--resume--config--describe) to detect and manage slow nodes, reducing the effects of cloud disk jitter.
The severity of disk jitter might also be highly related to users' workload profiles. In latency-sensitive scenarios, designing applications in conjunction with TiDB features can further minimize the impact of IO jitter on applications. For example, in read-heavy and latency-sensitive environments, adjusting the [`tikv_client_read_timeout`](/system-variables.md#tikv_client_read_timeout-new-in-v740) system variable according to latency requirements and using stale reads or follower reads can enable faster failover retries to other replica peers for KV requests sent from TiDB. This reduces the impact of IO jitter on a single TiKV node and helps improve query latency. Note that the effectiveness of this feature depends on the workload profile, which should be evaluated before implementation.
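As a rough illustration of the tuning described above, the sketch below builds the session-level `SET` statements an application might issue before running latency-sensitive reads. The function name `read_tuning_statements` and the values used (a 200 ms timeout, 5 seconds of staleness) are hypothetical, chosen only for illustration; the sketch assumes `tikv_client_read_timeout` is given in milliseconds and that `tidb_replica_read` and `tidb_read_staleness` are the variables used to enable follower reads and stale reads. Suitable values depend on the workload and should be evaluated before use.

```python
# A subset of the values documented for tidb_replica_read (assumption: not exhaustive).
REPLICA_READ_MODES = ("leader", "follower", "leader-and-follower", "closest-replicas")


def read_tuning_statements(read_timeout_ms, replica_read="follower", staleness_s=None):
    """Build session-level SET statements for a read-heavy, latency-sensitive session.

    read_timeout_ms: timeout for a single KV read request, in milliseconds.
    replica_read:    which replicas may serve reads.
    staleness_s:     if given, allow reads of data up to this many seconds old.
    """
    if read_timeout_ms < 0:
        raise ValueError("read_timeout_ms must be non-negative")
    if replica_read not in REPLICA_READ_MODES:
        raise ValueError(f"unsupported replica_read mode: {replica_read}")

    statements = [
        f"SET SESSION tikv_client_read_timeout = {int(read_timeout_ms)};",
        f"SET SESSION tidb_replica_read = '{replica_read}';",
    ]
    if staleness_s is not None:
        # tidb_read_staleness takes a non-positive number of seconds.
        statements.append(f"SET SESSION tidb_read_staleness = {-abs(int(staleness_s))};")
    return statements


if __name__ == "__main__":
    # Hypothetical settings: 200 ms read timeout, follower reads, 5 s staleness.
    for stmt in read_tuning_statements(200, "follower", staleness_s=5):
        print(stmt)
```

The generated statements can be executed through any MySQL-compatible client at the start of a session. With a short read timeout, a request stuck on a TiKV node with slow IO can be retried on another replica sooner.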
Additionally, users [deploying TiDB on public cloud](https://docs.pingcap.com/tidb/dev/best-practices-on-public-cloud) can reduce the probability of jitter by choosing cloud disks with higher performance.
## Batch processing