title | summary | aliases
---|---|---
Troubleshoot a TiFlash Cluster | Learn common operations when you troubleshoot a TiFlash cluster. |
This section describes some commonly encountered issues when using TiFlash, their causes, and their solutions.
The issue might occur for different reasons. It is recommended that you troubleshoot it by following the steps below:
- Check whether your system is CentOS8.

  CentOS8 does not have the `libnsl.so` system library. You can manually install it via the following command:

  {{< copyable "shell-regular" >}}
  dnf install libnsl
- Check your system's `ulimit` parameter setting.

  {{< copyable "shell-regular" >}}
  ulimit -n 1000000
- Use the PD Control tool to check whether any TiFlash instance has failed to go offline on the node (same IP and port), and force such instances to go offline. For detailed steps, refer to Scale in a TiFlash cluster.
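The first two checks above can be scripted. The following is a minimal sketch, not an official tool: the helper names `nofile_ok` and `has_libnsl` are hypothetical, the threshold of 1000000 is taken from the `ulimit` command shown above, and the library lookup assumes that `ldconfig` is available on the machine.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; the names are illustrative only.

# Succeeds if the given open-file limit meets the required minimum.
# The default minimum (1000000) mirrors the `ulimit -n 1000000` command above.
nofile_ok() {
    local current="$1" required="${2:-1000000}"
    [ "$current" = "unlimited" ] && return 0
    [ "$current" -ge "$required" ]
}

# Succeeds if the dynamic linker cache lists libnsl (assumes ldconfig exists).
has_libnsl() {
    ldconfig -p 2>/dev/null | grep -q 'libnsl\.so'
}

if nofile_ok "$(ulimit -n)"; then
    echo "open-file limit: OK ($(ulimit -n))"
else
    echo "open-file limit too low: $(ulimit -n); consider 'ulimit -n 1000000'"
fi

if has_libnsl; then
    echo "libnsl: found"
else
    echo "libnsl: not found; on CentOS8, consider 'dnf install libnsl'"
fi
```

The script only reports findings. It does not change the limit or install anything, so it is safe to run before you decide which fix applies.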
If the above methods cannot resolve your issue, save the TiFlash log files and email them to [email protected] for more information.
This is because TiFlash is in an abnormal state caused by configuration errors or environment issues. Take the following steps to identify the faulty component:
- Check whether PD enables the `Placement Rules` feature:

  {{< copyable "shell-regular" >}}
  echo 'config show replication' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>

  The expected result is `"enable-placement-rules": "true"`. If not enabled, enable the Placement Rules feature.

- Check whether the TiFlash process is working correctly by viewing `UpTime` on the TiFlash-Summary monitoring panel.

- Check whether the TiFlash proxy status is normal through `pd-ctl`.

  {{< copyable "shell-regular" >}}
  echo "store" | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>

  The TiFlash proxy's `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`. You can check this information to confirm a TiFlash proxy.

- Check whether `pd buddy` can correctly print the logs (the log path is the value of `log` in the [flash.flash_cluster] configuration item; the default log path is under the `tmp` directory configured in the TiFlash configuration file).

- Check whether the number of configured replicas is less than or equal to the number of TiKV nodes in the cluster. If not, PD cannot replicate data to TiFlash:

  {{< copyable "shell-regular" >}}
  echo 'config placement-rules show' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>

  Reconfirm the value of `default: count`.

  Note:

  After the placement rules feature is enabled, the previously configured `max-replicas` and `location-labels` no longer take effect. To adjust the replica policy, use the interface related to placement rules.

- Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the `low-space-ratio` parameter), PD cannot schedule data to this TiFlash node.
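Two of the checks above are easy to automate once you have the command output. The following is a minimal sketch under stated assumptions: the helper names `placement_rules_enabled` and `has_enough_space` are hypothetical, the first helper assumes that the `config show replication` output contains the `"enable-placement-rules"` key in the form quoted above (with or without quotation marks around `true`), and the 20% threshold is the default described above, so adjust it if `low-space-ratio` is changed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative only.

# Reads `config show replication` output on stdin.
# Succeeds if the Placement Rules feature is reported as enabled.
placement_rules_enabled() {
    grep -Eq '"enable-placement-rules"[[:space:]]*:[[:space:]]*"?true"?'
}

# Succeeds if the remaining space is at least 20% of the capacity.
# Arguments: available space and total capacity, in the same unit.
has_enough_space() {
    local available="$1" capacity="$2"
    # available / capacity >= 20%  is equivalent to  available * 100 >= capacity * 20
    [ $((available * 100)) -ge $((capacity * 20)) ]
}

# Demo with sample data (not real cluster output).
sample='{ "enable-placement-rules": "true" }'
if printf '%s\n' "$sample" | placement_rules_enabled; then
    echo "Placement Rules: enabled"
fi
if has_enough_space 15 100; then
    echo "disk space: sufficient"
else
    echo "disk space: below 20% of capacity"
fi
```

In practice, you would pipe the output of the `config show replication` command shown above into `placement_rules_enabled`, and pass the available and total sizes of the disk that holds the TiFlash `store` directory (for example, from `df -Pk`) to `has_enough_space`.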
This is because large amounts of data are being written to the cluster, which causes TiFlash queries to encounter locks and require retries.
You can set the query timestamp to one second earlier in TiDB. For example, if the current time is '2020-04-08 20:15:01', you can execute `set @@tidb_snapshot='2020-04-08 20:15:00';` before you execute the query. This makes fewer TiFlash queries encounter a lock and mitigates the risk of unstable query time.
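If you generate the statement from a script, the timestamp arithmetic can be sketched as follows. This is only an illustration: the function name `tidb_snapshot_stmt` is hypothetical, and the `-d "@<epoch>"` syntax assumes GNU `date`. The output uses the local time zone of the machine that runs the script, which is assumed to match the time zone of the TiDB session.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a `set @@tidb_snapshot` statement for a
# timestamp one second earlier than the given Unix epoch time.
# Assumes GNU date (for the -d "@<epoch>" syntax).
tidb_snapshot_stmt() {
    local epoch="$1" ts
    ts=$(date -d "@$((epoch - 1))" '+%Y-%m-%d %H:%M:%S') || return 1
    printf "set @@tidb_snapshot='%s';\n" "$ts"
}

# Print the statement for the current time.
tidb_snapshot_stmt "$(date +%s)"
```

You can paste the printed statement into your TiDB session before running the query.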
If the load pressure on TiFlash is too heavy and causes TiFlash data replication to fall behind, some queries might return the `Region Unavailable` error.
In this case, you can balance the load pressure by adding more TiFlash nodes.
Take the following steps to handle data file corruption:
- Refer to Take a TiFlash node down to take the corresponding TiFlash node down.
- Delete the related data of the TiFlash node.
- Redeploy the TiFlash node in the cluster.