From 93fa0c4da18ce2dd7ef5963983659b648742423c Mon Sep 17 00:00:00 2001
From: 3AceShowHand
Date: Fri, 13 Sep 2024 12:36:21 +0800
Subject: [PATCH 1/8] add claim check raw value format doc

---
 ticdc/ticdc-sink-to-kafka.md | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/ticdc/ticdc-sink-to-kafka.md b/ticdc/ticdc-sink-to-kafka.md
index 41dd1aa469613..d110cad9c2efc 100644
--- a/ticdc/ticdc-sink-to-kafka.md
+++ b/ticdc/ticdc-sink-to-kafka.md
@@ -531,4 +531,21 @@ If the message contains the `claimCheckLocation` field, the Kafka consumer reads

 }
 ```

-The `key` and `value` fields contain the encoded large message, which should have been sent to the corresponding field in the Kafka message. Consumers can parse the data in these two parts to restore the content of the large message.
+The `key` and `value` fields correspond to the same fields in the Kafka message. TiCDC encodes the `key` and `value` into one JSON object to deliver a complete message, and consumers can parse these two parts to restore the content of the large message. Only the Open Protocol encodes the `key` field; for other protocols, the `key` field is empty.
+
+#### Only send the value to external storage
+
+Starting from v8.4.0, TiCDC supports the `claim-check-raw-value` parameter, which defaults to `false`. You can set it to `true` only when the protocol is not Open Protocol; otherwise, an error occurs.
+
+An example configuration is as follows:
+
+```toml
+protocol = "simple"
+
+[sink.kafka-config.large-message-handle]
+large-message-handle-option = "claim-check"
+claim-check-storage-uri = "s3://claim-check-bucket"
+claim-check-raw-value = true
+```
+
+When this parameter is set to `true`, the changefeed sends the value of Kafka messages directly to external storage. On the consumer side, the data can be read directly from external storage and consumed. This reduces the CPU overhead introduced by JSON serialization and deserialization.
\ No newline at end of file

From f1b00d39cbcc35ab561b7bfdf28905efb5ce422b Mon Sep 17 00:00:00 2001
From: 3AceShowHand
Date: Sat, 14 Sep 2024 11:43:09 +0800
Subject: [PATCH 2/8] add description about checksum v2

---
 ticdc/ticdc-integrity-check.md | 26 +++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/ticdc/ticdc-integrity-check.md b/ticdc/ticdc-integrity-check.md
index 36a271dbc942a..e4df665c6b0a2 100644
--- a/ticdc/ticdc-integrity-check.md
+++ b/ticdc/ticdc-integrity-check.md
@@ -5,15 +5,7 @@ summary: Introduce the implementation principle and usage of the TiCDC data inte

 # TiCDC Data Integrity Validation for Single-Row Data

-Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a checksum algorithm to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream and currently supports the Avro protocol.

-## Implementation principles

-After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of a row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. If the two checksums are equal, it indicates that the data is consistent during the transmission from TiDB to TiCDC.
- -TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka Consumer reads data, it calculates a new checksum using the same algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka Consumer. - -For more information about the algorithm of the checksum, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). +Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a checksum algorithm to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream and currently supports the simple and Avro protocol. ## Enable the feature @@ -67,6 +59,22 @@ TiCDC disables data integrity validation by default. To disable this feature aft The preceding configuration only takes effect for newly created sessions. After all clients writing to TiDB have reconnected, the messages written by changefeed to Kafka will no longer include the checksum for the corresponding data. +## Checksum V2 + +Starting from v8.4.0, TiDB and TiCDC introduced a new Checksum verification algorithm. When the Checksum feature is enabled, this new algorithm is used by default for Checksum calculation and verification. + +The reason for introducing the new algorithm is that with the previous Checksum calculation method, TiCDC could not correctly verify the Old Value portion of Update/Delete events that occurred after Add Column/Drop Column DDL. Checksum V2 can handle this scenario correctly. + +After upgrading clusters from earlier versions to v8.4.0, TiDB will use Checksum V2 by default to calculate the Checksum for newly written data and write it to TiKV. TiCDC supports handling both V1 and V2 Checksums simultaneously, without affecting external operations. This feature only impacts the internal implementation details of TiDB and TiCDC, with no changes to the Checksum verification method for downstream Kafka consumers. + +## Checksum V1 Implementation principles + +Starting from v7.1.0, to v8.4.0, TiDB and TiCDC use Checksum V1 as the default method for the checksum verification and calculation. After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of a row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. If the two checksums are equal, it indicates that the data is consistent during the transmission from TiDB to TiCDC. + +TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka Consumer reads data, it calculates a new checksum using the same algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka Consumer. + +For more information about the algorithm of the checksum, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). 
+ ## Algorithm for checksum calculation The pseudocode for the checksum calculation algorithm is as follows: From b3b6f344f8457db1888bff43e9d8074f20d1b56e Mon Sep 17 00:00:00 2001 From: Aolin Date: Mon, 23 Sep 2024 14:17:31 +0800 Subject: [PATCH 3/8] align with Chinese Signed-off-by: Aolin --- ticdc/ticdc-integrity-check.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/ticdc/ticdc-integrity-check.md b/ticdc/ticdc-integrity-check.md index e4df665c6b0a2..8572f9c79efdd 100644 --- a/ticdc/ticdc-integrity-check.md +++ b/ticdc/ticdc-integrity-check.md @@ -5,7 +5,7 @@ summary: Introduce the implementation principle and usage of the TiCDC data inte # TiCDC Data Integrity Validation for Single-Row Data -Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a checksum algorithm to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream and currently supports the simple and Avro protocol. +Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a [checksum algorithm](#checksum-algorithm) to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream and currently supports the simple and Avro protocol. For more information about the algorithm of the checksum, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). ## Enable the feature @@ -59,21 +59,21 @@ TiCDC disables data integrity validation by default. To disable this feature aft The preceding configuration only takes effect for newly created sessions. After all clients writing to TiDB have reconnected, the messages written by changefeed to Kafka will no longer include the checksum for the corresponding data. -## Checksum V2 +## Checksum algorithms -Starting from v8.4.0, TiDB and TiCDC introduced a new Checksum verification algorithm. When the Checksum feature is enabled, this new algorithm is used by default for Checksum calculation and verification. +### Checksum V1 -The reason for introducing the new algorithm is that with the previous Checksum calculation method, TiCDC could not correctly verify the Old Value portion of Update/Delete events that occurred after Add Column/Drop Column DDL. Checksum V2 can handle this scenario correctly. +Before v8.4.0, TiDB and TiCDC use Checksum V1 for checksum calculation and verification. -After upgrading clusters from earlier versions to v8.4.0, TiDB will use Checksum V2 by default to calculate the Checksum for newly written data and write it to TiKV. TiCDC supports handling both V1 and V2 Checksums simultaneously, without affecting external operations. This feature only impacts the internal implementation details of TiDB and TiCDC, with no changes to the Checksum verification method for downstream Kafka consumers. +After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of a row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. 
If the two checksums are equal, it indicates that the data is consistent during the transmission from TiDB to TiCDC. -## Checksum V1 Implementation principles +TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka Consumer reads data, it calculates a new checksum using the same algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka Consumer. -Starting from v7.1.0, to v8.4.0, TiDB and TiCDC use Checksum V1 as the default method for the checksum verification and calculation. After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of a row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. If the two checksums are equal, it indicates that the data is consistent during the transmission from TiDB to TiCDC. +### Checksum V2 -TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka Consumer reads data, it calculates a new checksum using the same algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka Consumer. +Starting from v8.4.0, TiDB and TiCDC introduce Checksum V2 to address issues with Checksum V1 in verifying old values in Update or Delete events after Add Column or Drop Column operations. -For more information about the algorithm of the checksum, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). +For new clusters created in v8.4.0 or later, or clusters upgraded to v8.4.0, TiDB uses Checksum V2 by default when single-row data checksum verification is enabled. TiCDC supports handling both Checksum V1 and V2. This change only affects TiDB and TiCDC internal implementation and does not impact checksum calculation methods for downstream Kafka consumers. ## Algorithm for checksum calculation From c2c3d356b1c1465f8f4b2160513c5b9672a72976 Mon Sep 17 00:00:00 2001 From: Aolin Date: Mon, 23 Sep 2024 16:01:42 +0800 Subject: [PATCH 4/8] Apply suggestions from code review --- ticdc/ticdc-integrity-check.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ticdc/ticdc-integrity-check.md b/ticdc/ticdc-integrity-check.md index 8572f9c79efdd..e70ebb2a6024b 100644 --- a/ticdc/ticdc-integrity-check.md +++ b/ticdc/ticdc-integrity-check.md @@ -5,7 +5,7 @@ summary: Introduce the implementation principle and usage of the TiCDC data inte # TiCDC Data Integrity Validation for Single-Row Data -Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a [checksum algorithm](#checksum-algorithm) to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream and currently supports the simple and Avro protocol. For more information about the algorithm of the checksum, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). +Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a [checksum algorithm](#checksum-algorithms) to validate the integrity of single-row data. 
This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream and currently supports the simple and Avro protocol. For more information about the algorithm of the checksum, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). ## Enable the feature From 6098d1d8e747504119b0b5ab464380fb57812b0f Mon Sep 17 00:00:00 2001 From: Aolin Date: Wed, 25 Sep 2024 17:01:20 +0800 Subject: [PATCH 5/8] Apply suggestions from code review --- ticdc/ticdc-integrity-check.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ticdc/ticdc-integrity-check.md b/ticdc/ticdc-integrity-check.md index e70ebb2a6024b..22dfaad0e1970 100644 --- a/ticdc/ticdc-integrity-check.md +++ b/ticdc/ticdc-integrity-check.md @@ -67,11 +67,11 @@ Before v8.4.0, TiDB and TiCDC use Checksum V1 for checksum calculation and verif After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of a row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. If the two checksums are equal, it indicates that the data is consistent during the transmission from TiDB to TiCDC. -TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka Consumer reads data, it calculates a new checksum using the same algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka Consumer. +TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka Consumer reads data, it calculates a new checksum using the same CRC32 algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka Consumer. ### Checksum V2 -Starting from v8.4.0, TiDB and TiCDC introduce Checksum V2 to address issues with Checksum V1 in verifying old values in Update or Delete events after Add Column or Drop Column operations. +Starting from v8.4.0, TiDB and TiCDC introduce Checksum V2 to address issues with Checksum V1 in verifying old values in Update or Delete events after `ADD COLUMN` or `DROP COLUMN` operations. For new clusters created in v8.4.0 or later, or clusters upgraded to v8.4.0, TiDB uses Checksum V2 by default when single-row data checksum verification is enabled. TiCDC supports handling both Checksum V1 and V2. This change only affects TiDB and TiCDC internal implementation and does not impact checksum calculation methods for downstream Kafka consumers. 
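The consumer-side check described in these patches (recomputing the row checksum with the same CRC32 algorithm and comparing it with the checksum carried in the message) can be sketched in Go as follows. This is a minimal illustration rather than TiCDC's actual implementation: it assumes the columns are already encoded into bytes according to the rules in the document's "Algorithm for checksum calculation" section, and it assumes the IEEE CRC32 polynomial.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// verifyRowChecksum recomputes the CRC32 checksum over the encoded column
// values of one row and compares it with the checksum carried in the message.
func verifyRowChecksum(encodedColumns [][]byte, expected uint32) error {
	var checksum uint32
	for _, col := range encodedColumns {
		// Feed each encoded column into the running checksum, in order.
		checksum = crc32.Update(checksum, crc32.IEEETable, col)
	}
	if checksum != expected {
		return fmt.Errorf("checksum mismatch: calculated %d, message carries %d", checksum, expected)
	}
	return nil
}

func main() {
	// Hypothetical row with two already-encoded columns. In a real consumer,
	// the expected checksum comes from the received message itself; here it is
	// precomputed only so that the example runs.
	columns := [][]byte{{0x01, 0x00, 0x00, 0x00}, []byte("hello")}
	expected := crc32.Update(crc32.Update(0, crc32.IEEETable, columns[0]), crc32.IEEETable, columns[1])

	if err := verifyRowChecksum(columns, expected); err != nil {
		panic(err)
	}
	fmt.Println("row data is consistent")
}
```

Folding the columns into a running checksum one at a time keeps the check streaming-friendly; the order in which columns are fed in must match the order TiDB used, or the two checksums will not match even for identical data.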
From ac57ce55b9a505c6e108283df737039c0a31749a Mon Sep 17 00:00:00 2001 From: Aolin Date: Fri, 27 Sep 2024 16:13:13 +0800 Subject: [PATCH 6/8] Apply suggestions from code review Co-authored-by: Grace Cai --- ticdc/ticdc-integrity-check.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/ticdc/ticdc-integrity-check.md b/ticdc/ticdc-integrity-check.md index 22dfaad0e1970..77902f97c38bc 100644 --- a/ticdc/ticdc-integrity-check.md +++ b/ticdc/ticdc-integrity-check.md @@ -5,7 +5,7 @@ summary: Introduce the implementation principle and usage of the TiCDC data inte # TiCDC Data Integrity Validation for Single-Row Data -Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a [checksum algorithm](#checksum-algorithms) to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. The data integrity validation feature only supports changefeeds that use Kafka as the downstream and currently supports the simple and Avro protocol. For more information about the algorithm of the checksum, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). +Starting from v7.1.0, TiCDC introduces the data integrity validation feature, which uses a [checksum algorithm](#checksum-algorithms) to validate the integrity of single-row data. This feature helps verify whether any error occurs in the process of writing data from TiDB, replicating it through TiCDC, and then writing it to a Kafka cluster. Currently, only changefeeds using Kafka as the downstream and Simple or Avro as the protocol support this feature. For more information about the checksum algorithm, see [Algorithm for checksum calculation](#algorithm-for-checksum-calculation). ## Enable the feature @@ -65,15 +65,15 @@ TiCDC disables data integrity validation by default. To disable this feature aft Before v8.4.0, TiDB and TiCDC use Checksum V1 for checksum calculation and verification. -After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of a row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. If the two checksums are equal, it indicates that the data is consistent during the transmission from TiDB to TiCDC. +After you enable the checksum integrity validation feature for single-row data, TiDB uses the CRC32 algorithm to calculate the checksum of each row and writes it to TiKV along with the data. TiCDC reads the data from TiKV and recalculates the checksum using the same algorithm. If the two checksums are equal, it indicates that the data is consistent during the transmission from TiDB to TiCDC. -TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka Consumer reads data, it calculates a new checksum using the same CRC32 algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka Consumer. +TiCDC then encodes the data into a specific format and sends it to Kafka. After the Kafka consumer reads the data, it calculates a new checksum using the same CRC32 algorithm as TiDB. If the new checksum is equal to the checksum in the data, it indicates that the data is consistent during the transmission from TiCDC to the Kafka consumer. 
### Checksum V2

Starting from v8.4.0, TiDB and TiCDC introduce Checksum V2 to address issues with Checksum V1 in verifying old values in Update or Delete events after `ADD COLUMN` or `DROP COLUMN` operations.

-For new clusters created in v8.4.0 or later, or clusters upgraded to v8.4.0, TiDB uses Checksum V2 by default when single-row data checksum verification is enabled. TiCDC supports handling both Checksum V1 and V2. This change only affects TiDB and TiCDC internal implementation and does not impact checksum calculation methods for downstream Kafka consumers.
+For clusters created in v8.4.0 or later, or clusters upgraded to v8.4.0, TiDB uses Checksum V2 by default when single-row data checksum verification is enabled. TiCDC supports handling both Checksum V1 and V2. This change only affects TiDB and TiCDC internal implementation and does not affect checksum calculation methods for downstream Kafka consumers.

From 84cc6ba8ac65917072a707ff1d9b3d61635c9482 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Fri, 27 Sep 2024 16:13:26 +0800
Subject: [PATCH 7/8] Discard changes to ticdc/ticdc-sink-to-kafka.md

---
 ticdc/ticdc-sink-to-kafka.md | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/ticdc/ticdc-sink-to-kafka.md b/ticdc/ticdc-sink-to-kafka.md
index d110cad9c2efc..41dd1aa469613 100644
--- a/ticdc/ticdc-sink-to-kafka.md
+++ b/ticdc/ticdc-sink-to-kafka.md
@@ -531,21 +531,4 @@ If the message contains the `claimCheckLocation` field, the Kafka consumer reads

 }
 ```

-The `key` and `value` fields correspond to the same fields in the Kafka message. TiCDC encodes the `key` and `value` into one JSON object to deliver a complete message, and consumers can parse these two parts to restore the content of the large message. Only the Open Protocol encodes the `key` field; for other protocols, the `key` field is empty.
-
-#### Only send the value to external storage
-
-Starting from v8.4.0, TiCDC supports the `claim-check-raw-value` parameter, which defaults to `false`. You can set it to `true` only when the protocol is not Open Protocol; otherwise, an error occurs.
-
-An example configuration is as follows:
-
-```toml
-protocol = "simple"
-
-[sink.kafka-config.large-message-handle]
-large-message-handle-option = "claim-check"
-claim-check-storage-uri = "s3://claim-check-bucket"
-claim-check-raw-value = true
-```
-
-When this parameter is set to `true`, the changefeed sends the value of Kafka messages directly to external storage. On the consumer side, the data can be read directly from external storage and consumed. This reduces the CPU overhead introduced by JSON serialization and deserialization.
\ No newline at end of file
+The `key` and `value` fields contain the encoded large message, which should have been sent to the corresponding field in the Kafka message. Consumers can parse the data in these two parts to restore the content of the large message.

From d3fd102b07f1f1a83cdfb9e714fe25f321ba9c42 Mon Sep 17 00:00:00 2001
From: Aolin
Date: Fri, 27 Sep 2024 16:13:59 +0800
Subject: [PATCH 8/8] Apply suggestions from code review

Co-authored-by: Grace Cai
---
 ticdc/ticdc-integrity-check.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ticdc/ticdc-integrity-check.md b/ticdc/ticdc-integrity-check.md
index 77902f97c38bc..e93360fca381b 100644
--- a/ticdc/ticdc-integrity-check.md
+++ b/ticdc/ticdc-integrity-check.md
@@ -71,7 +71,7 @@ TiCDC then encodes the data into a specific format and sends it to Kafka. After
### Checksum V2

-Starting from v8.4.0, TiDB and TiCDC introduce Checksum V2 to address issues with Checksum V1 in verifying old values in Update or Delete events after `ADD COLUMN` or `DROP COLUMN` operations.
+Starting from v8.4.0, TiDB and TiCDC introduce Checksum V2 to address issues of Checksum V1 in verifying old values in Update or Delete events after `ADD COLUMN` or `DROP COLUMN` operations.

For clusters created in v8.4.0 or later, or clusters upgraded to v8.4.0, TiDB uses Checksum V2 by default when single-row data checksum verification is enabled. TiCDC supports handling both Checksum V1 and V2. This change only affects TiDB and TiCDC internal implementation and does not affect checksum calculation methods for downstream Kafka consumers.
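For the `claim-check-raw-value` behavior described in [PATCH 1/8] above, the consumer-side flow is: obtain the external storage address of the large message, read the object, and treat its content as the raw message value, with no JSON envelope to parse. The sketch below is illustrative only and makes two assumptions beyond what the patches state: the consumer has already extracted the storage address from the received Kafka message, and the object is reachable over HTTP, for example through a pre-signed URL; an S3 SDK client would follow the same flow.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchRawValue reads a large message value directly from external storage.
// Unlike the default claim-check format, there is no JSON envelope to parse:
// the object content is the value itself, encoded by the configured protocol.
func fetchRawValue(location string) ([]byte, error) {
	resp, err := http.Get(location)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch %s: unexpected status %s", location, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Hypothetical object address extracted from a received Kafka message.
	value, err := fetchRawValue("https://claim-check-bucket.example.com/large-message-0001")
	if err != nil {
		fmt.Println("read from external storage failed:", err)
		return
	}
	// Decode `value` with the protocol decoder (for example, the simple
	// protocol), exactly as if it had arrived in the Kafka message value field.
	fmt.Printf("fetched %d bytes of raw value\n", len(value))
}
```

Consuming the object bytes directly like this is what avoids the JSON serialization and deserialization overhead that the patch mentions.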