feat(cdc): support INCLUDE TIMESTAMP for MySQL, PG and MongoDB cdc table #16833

Merged
merged 19 commits into from
May 27, 2024

Conversation

StrikeW
Contributor

@StrikeW StrikeW commented May 20, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

  • Support INCLUDE TIMESTAMP [AS] on MySQL, PG and MongoDB CDC tables to ingest the upstream commit timestamp.
  • For historical data, the commit timestamp is filled with 1970-01-01 00:00:00+00:00, aligned with Flink.

Example:

  • MySQL and PG
CREATE TABLE mytable (v1 int primary key, v2 varchar)
include timestamp as commit_ts
from pg_source table 'public.mytable';

dev=> select * from mytable order by v1;
 v1 | v2 |         commit_ts
----+----+---------------------------
  1 | aa | 1970-01-01 00:00:00+00:00
  2 | bb | 1970-01-01 00:00:00+00:00
  3 | cc | 2024-05-20 09:01:08+00:00
  4 | dd | 2024-05-20 09:01:08+00:00
  • MongoDB
CREATE TABLE test (_id JSONB PRIMARY KEY, payload JSONB)
include timestamp as commit_ts
WITH (
  connector = 'mongodb-cdc',
  mongodb.url = 'mongodb://localhost:27017/?replicaSet=rs0',
  collection.name = 'test.*'
);

dev=> select * from test;
                 _id                  |                                      payload                                      |         commit_ts
--------------------------------------+-----------------------------------------------------------------------------------+---------------------------
 {"$oid": "664c48e87d2c84adfabfc03f"} | {"_id": {"$oid": "664c48e87d2c84adfabfc03f"}, "data": "mydata", "name": "ssssss"} | 2024-05-21 08:18:25+00:00
 {"$oid": "660125a80f048c7c7eff4a6a"} | {"_id": {"$oid": "660125a80f048c7c7eff4a6a"}, "name": "aa"}                       | 1970-01-01 00:00:00+00:00
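
Because historical rows carry the epoch placeholder rather than a real commit time, a query can tell them apart from rows that arrived through the change stream. A minimal sketch, assuming the mytable table and commit_ts column from the MySQL and PG example above; the is_historical alias is illustrative and not something this PR adds:

```sql
-- Flag rows whose commit_ts is the placeholder that this PR assigns to
-- historical data (1970-01-01 00:00:00+00:00) instead of an upstream commit time.
-- Assumption: no genuine upstream change carries exactly the epoch timestamp.
SELECT
    v1,
    v2,
    commit_ts,
    commit_ts = '1970-01-01 00:00:00+00:00'::timestamptz AS is_historical  -- illustrative alias
FROM mytable
ORDER BY v1;
```

With the sample rows shown above, v1 = 1 and v1 = 2 would be flagged as historical, while v1 = 3 and v1 = 4 would not.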

close: #16359
related: #16654
fix: #16850

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

  • Support INCLUDE TIMESTAMP [AS] on MySQL, PG and MongoDB CDC tables to ingest the upstream commit timestamp.
  • For historical data, the commit timestamp is filled with 1970-01-01 00:00:00+00:00.

@StrikeW StrikeW changed the title feat(cdc): support INCLUDE TIMESTAMP for MySQL and PG cdc table feat(cdc): support INCLUDE TIMESTAMP for MySQL and PG cdc table (WIP) May 20, 2024
@neverchanje
Contributor

neverchanje commented May 21, 2024

thanks @StrikeW

As per the original user request confirmed by @lmatz, we should include 3 metadata fields,

  • database_name,
  • table_name,
  • op_ts(commit time)

See also https://nightlies.apache.org/flink/flink-cdc-docs-master/docs/connectors/cdc-connectors/mysql-cdc/#available-metadata

I am thinking if the two "static" fields, database_name and table_name, can be evaluated in the frontend as constants, rather than actually being materialized. cc @xiangjinwu

@xiangjinwu
Contributor

xiangjinwu commented May 21, 2024

> I am thinking if the two "static" fields, database_name and table_name, can be evaluated in the frontend as constants, rather than actually being materialized. cc @xiangjinwu

Good point. As long as they are persisted as part of the source/table catalog, they could be supported similarly to current_schema (or table-qualified tableoid as in #11222).

However, why can't the user just write select commit_ts, 'mytable' from mytable; instead of select commit_ts, mytable.table_name from mytable;? My guess is that the user intends to ingest multiple external tables into a single RisingWave table and use the table_name column to tell them apart (probably with GROUP BY), which requires materialization and does not appear to be supported yet.
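
The constant-label alternative described above can be sketched as follows. This assumes two hypothetical CDC tables, mytable_a and mytable_b, each created with include timestamp as commit_ts as in the PR description; the view name, labels and aliases are illustrative, and none of this is part of the PR:

```sql
-- Hypothetical tables mytable_a and mytable_b, both with columns (v1, v2, commit_ts).
-- Each branch attaches its table name as a string constant, so the label is
-- materialized in the view rather than ingested as a metadata column.
CREATE MATERIALIZED VIEW all_changes AS
SELECT v1, v2, commit_ts, 'mytable_a' AS table_name FROM mytable_a
UNION ALL
SELECT v1, v2, commit_ts, 'mytable_b' AS table_name FROM mytable_b;

-- Group by the label to separate the upstream tables again.
SELECT table_name, count(*) AS row_count, max(commit_ts) AS latest_commit
FROM all_changes
GROUP BY table_name;
```

This keeps one CDC table per upstream table and combines them afterwards, so it does not cover the case of routing several external tables into a single table at ingestion time.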

@StrikeW
Contributor Author

StrikeW commented May 21, 2024

> thanks @StrikeW
>
> As per the original user request confirmed by @lmatz, we should include 3 metadata fields,
>
>   • database_name,
>   • table_name,
>   • op_ts(commit time)
>
> See also https://nightlies.apache.org/flink/flink-cdc-docs-master/docs/connectors/cdc-connectors/mysql-cdc/#available-metadata
>
> I am thinking if the two "static" fields, database_name and table_name, can be evaluated in the frontend as constants, rather than actually being materialized. cc @xiangjinwu

I left a comment about the other metadata fields in #16654 (comment).

@StrikeW StrikeW changed the title feat(cdc): support INCLUDE TIMESTAMP for MySQL and PG cdc table (WIP) feat(cdc): support INCLUDE TIMESTAMP for MySQL, PG and MongoDB cdc table May 21, 2024
@StrikeW StrikeW added the user-facing-changes Contains changes that are visible to users label May 21, 2024
@StrikeW StrikeW requested review from xiangjinwu and tabVersion May 23, 2024 01:46
@StrikeW StrikeW force-pushed the siyuan/cdc-metadata-columns branch from 6ad4ff5 to 41c3a49 Compare May 23, 2024 06:49
@StrikeW StrikeW force-pushed the siyuan/cdc-metadata-columns branch from 755b6a6 to d4c15d3 Compare May 23, 2024 09:43
@stdrc stdrc self-requested a review May 24, 2024 10:14
Contributor

@tabVersion tabVersion left a comment

The implementation generally LGTM. Please also update the release doc part.

@StrikeW StrikeW added this pull request to the merge queue May 27, 2024
Merged via the queue into main with commit 1401d56 May 27, 2024
33 of 34 checks passed
@StrikeW StrikeW deleted the siyuan/cdc-metadata-columns branch May 27, 2024 14:31
Labels
type/feature user-facing-changes Contains changes that are visible to users
Development

Successfully merging this pull request may close these issues.

bug(cdc-connector): InstanceNotFoundException CDC connector with additional columns
5 participants