Source versioning: Postgres, MySQL and Load generator #647

bobbyiliev · 2024-09-03T09:40:17Z

Initial implementation for the source versioning refactor as per #646

The main changes to consider:

- Marking the `table` attribute as optional and deprecated for both the MySQL and Postgres sources.
- Introducing a new `all_tables` bool attribute for the MySQL and load generator sources. Previously, the load generator sources (auction, marketing, tpch) always used `FOR ALL TABLES`, and the MySQL source defaulted to `FOR ALL TABLES` whenever no `table` blocks were defined. The `all_tables` attribute allows us to create sources without any tables defined, as required by the source versioning work.
- Introducing the new `materialize_source_table_{mysql|postgres|load_generator}` resource, which allows us to run `CREATE TABLE ... FROM SOURCE ...`.
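
To make the changes above concrete, here is a minimal Go sketch of the clause selection and the new statement. It is not the provider's actual code: `quoteIdent`, `tablesClause` and `createTableFromSource` are hypothetical names, the `REFERENCE` clause is an assumption about the final SQL grammar, and letting `all_tables` win over explicit tables is a choice made only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent double-quotes a SQL identifier, escaping embedded quotes.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// tablesClause picks the table clause for a CREATE SOURCE statement.
// allTables mirrors the new all_tables attribute; tables stands in for the
// deprecated table blocks. With neither set, no clause is emitted, so the
// source is created without any tables. In this sketch allTables wins if
// both are set (an assumption, not confirmed provider behavior).
func tablesClause(allTables bool, tables []string) string {
	switch {
	case allTables:
		return "FOR ALL TABLES"
	case len(tables) > 0:
		quoted := make([]string, len(tables))
		for i, t := range tables {
			quoted[i] = quoteIdent(t)
		}
		return "FOR TABLES (" + strings.Join(quoted, ", ") + ")"
	default:
		return ""
	}
}

// createTableFromSource renders the statement a source table resource could
// issue. The REFERENCE clause is an assumption about the final SQL grammar.
func createTableFromSource(table, source, upstreamSchema, upstreamTable string) string {
	return fmt.Sprintf("CREATE TABLE %s FROM SOURCE %s (REFERENCE %s.%s);",
		quoteIdent(table), quoteIdent(source),
		quoteIdent(upstreamSchema), quoteIdent(upstreamTable))
}

func main() {
	fmt.Println(tablesClause(true, nil))
	fmt.Println(tablesClause(false, []string{"orders"}))
	fmt.Println(createTableFromSource("orders", "mysql_src", "shop", "orders"))
}
```

With `all_tables` unset and no tables given, `tablesClause` returns an empty string, which corresponds to the "source without any tables" case that the source table resources then fill in one table at a time.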

Things that are still pending: #646

morsapaes · 2024-09-11T15:21:15Z

docs/resources/source_table.md

@@ -29,7 +29,7 @@ description: |-
 - `ownership_role` (String) The owernship role of the object.
 - `region` (String) The region to use for the resource connection. If not set, the default region is used.
 - `schema_name` (String) The identifier for the table schema in Materialize. Defaults to `public`.
- `text_columns` (List of String) Columns to be decoded as text.
+- `text_columns` (List of String) Columns to be decoded as text. Not supported for the load generator sources, if the source is a load generator, the attribute will be ignored.


Similar to sources, we might want a source table resource per source type? The existing source-level options will basically shift to source table-level.

Yes indeed, I was just thinking about this. With the MySQL and Postgres sources, it is probably fine, but as soon as we add Kafka and Webhook sources, the logic will get out of hand.

Will refactor this to have a separate source table resource per source!

rjobanp

nice work!

rjobanp · 2024-09-16T13:07:37Z

docs/resources/source_kafka.md

+- `start_offset` (List of Number, Deprecated) Read partitions from the specified offset. Deprecated: Use the new materialize_source_table_kafka resource instead.
+- `start_timestamp` (Number, Deprecated) Use the specified value to set `START OFFSET` based on the Kafka timestamp. Deprecated: Use the new materialize_source_table_kafka resource instead.


These two options are currently only possible on the top-level `CREATE SOURCE` statement for Kafka sources, not yet on a per-table basis. Allowing them per table will require a non-trivial amount of additional refactoring, so I'm unsure whether we will do that work until a customer requests it.

Ah yes! Good catch! Thank you!

rjobanp · 2024-09-16T13:08:26Z

docs/resources/source_mysql.md

@@ -53,12 +53,12 @@ resource "materialize_source_mysql" "test" {
 - `comment` (String) **Public Preview** Comment on an object in the database.
 - `database_name` (String) The identifier for the source database in Materialize. Defaults to `MZ_DATABASE` environment variable if set or `materialize` if environment variable is not set.
 - `expose_progress` (Block List, Max: 1) The name of the progress collection for the source. If this is not specified, the collection will be named `<src_name>_progress`. (see [below for nested schema](#nestedblock--expose_progress))
- `ignore_columns` (List of String, Deprecated) Ignore specific columns when reading data from MySQL. Can only be updated in place when also updating a corresponding `table` attribute. Deprecated: Use the new materialize_source_table resource instead.
+- `ignore_columns` (List of String, Deprecated) Ignore specific columns when reading data from MySQL. Can only be updated in place when also updating a corresponding `table` attribute. Deprecated: Use the new materialize_source_table_mysql resource instead.


FYI, this option is also being renamed in MaterializeInc/materialize#29438, but the old name will be aliased to the new one, so this shouldn't break.

Sounds good! I will go ahead and use exclude columns for the new source table resource!

rjobanp · 2024-09-16T13:09:12Z

docs/resources/source_table_kafka.md

+- `start_offset` (List of Number) Read partitions from the specified offset.
+- `start_timestamp` (Number) Use the specified value to set `START OFFSET` based on the Kafka timestamp.


these aren't currently available on a per-table basis for kafka sources

rjobanp · 2024-09-16T13:09:34Z

docs/resources/source_table_kafka.md

+- `schema_name` (String) The identifier for the source schema in Materialize. Defaults to `public`.
+- `start_offset` (List of Number) Read partitions from the specified offset.
+- `start_timestamp` (Number) Use the specified value to set `START OFFSET` based on the Kafka timestamp.
+- `upstream_schema_name` (String) The schema of the table in the upstream database.


what does this refer to for kafka sources? we might just want to omit it since the upstream reference should just be the kafka topic name

Good catch, this was an oversight on my end in the schema for the Kafka source table resource.

rjobanp · 2024-09-16T13:10:49Z

pkg/materialize/source_table_kafka.go

+	startOffset      []int
+	startTimestamp   int


these two aren't used below and also aren't possible on the statement

rjobanp · 2024-09-16T13:13:55Z

docs/guides/materialize_source_table.md

+
+This guide will walk you through the process of migrating your existing source table definitions to the new `materialize_source_table_{source}` resource.
+
+For each source type (e.g., MySQL, Postgres, etc.), you will need to create a new `materialize_source_table_{source}` resource for each table that was previously defined within the source resource. This ensures that the tables are preserved during the migration process.


Suggested change

-For each source type (e.g., MySQL, Postgres, etc.), you will need to create a new `materialize_source_table_{source}` resource for each table that was previously defined within the source resource. This ensures that the tables are preserved during the migration process.
+For each source type (e.g., MySQL, Postgres, etc.), you will need to create a new `materialize_source_table_{source}` resource for each table that was previously defined within the source resource. This ensures that the tables are preserved during the migration process. For Kafka sources, you will need to create at least one `materialize_source_table_kafka` table to hold data for the kafka topic.

@morsapaes might have better wording for this but I think we should be clear that this migration needs to happen for sources that previously didn't have subsources too (e.g. kafka)
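
The rule discussed here (one source table per previously defined table, plus at least one table for a Kafka source that never had subsources) can be sketched as a small planning helper. This is only an illustration under stated assumptions: `sourceDef`, `sourceTable` and `planSourceTables` are hypothetical types and functions, and the `<source>_table` naming for the Kafka case is made up.

```go
package main

import "fmt"

// sourceDef is a hypothetical, simplified view of an existing source.
type sourceDef struct {
	Name   string
	Kind   string   // "mysql", "postgres", "load_generator" or "kafka"
	Tables []string // tables previously defined inside the source resource
}

// sourceTable describes one new resource to create during the migration.
type sourceTable struct {
	ResourceType string
	Name         string
	Source       string
}

// planSourceTables returns one source table per previously defined table.
// A Kafka source had no subsources, so it gets a single table for its topic.
func planSourceTables(src sourceDef) []sourceTable {
	resourceType := "materialize_source_table_" + src.Kind
	names := src.Tables
	if src.Kind == "kafka" && len(names) == 0 {
		names = []string{src.Name + "_table"} // naming scheme is made up
	}
	plan := make([]sourceTable, 0, len(names))
	for _, n := range names {
		plan = append(plan, sourceTable{ResourceType: resourceType, Name: n, Source: src.Name})
	}
	return plan
}

func main() {
	sources := []sourceDef{
		{Name: "mysql_src", Kind: "mysql", Tables: []string{"orders", "users"}},
		{Name: "events", Kind: "kafka"},
	}
	for _, src := range sources {
		for _, t := range planSourceTables(src) {
			fmt.Printf("%s %q (source %q)\n", t.ResourceType, t.Name, t.Source)
		}
	}
}
```

The Kafka branch is the point of the suggestion: a source with no previously defined tables still needs a table after the migration, which a plain "one per existing table" loop would miss.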

rjobanp · 2024-09-16T13:15:49Z

docs/guides/materialize_source_table.md

+
+The same approach can be used for other source types such as Postgres, eg. `materialize_source_table_postgres`.
+
+## Automated Migration Process (TBD)


nice - this is great! We will probably want to figure out how to tell them that they will be able to coordinate the 'automated' migration process with their field-engineer representative if they go down this path

rjobanp · 2024-09-17T14:16:09Z

@morsapaes @bobbyiliev let's discuss this PR at the sources & sinks meeting this week - we should decide when it makes sense to merge this - my thinking is we should do so whenever we move into private preview for the source versioning feature. But if we want to merge sooner and just have a disclaimer that the things mentioned as 'deprecated' here are not actually yet deprecated, that could work too

bobbyiliev · 2024-09-30T07:19:16Z

One thing we can consider here, as per the old tracking issue #391, is to take the chance to decide whether we still want to rename some of the attributes in the new source table resources:

Lists: For attributes that use lists, we have more singular names than plural ones.

| Attribute | Resource | Type | Plural | Comment |
| --- | --- | --- | --- | --- |
| start_offset | materialize_source_kafka | List of Strings | | In Materialize the attribute is singular `START OFFSET` even though it is a list of strings |
| header | materialize_source_kafka | List of Strings | | In Materialize the attribute is singular `HEADER` even though it is a list of strings |
| text_columns | materialize_source_postgres | List of Strings | X | |

Blocks: We had decided that these should be singular. Some blocks use plural names, so this could be a good chance to rename those attributes in the new load generator source table resource:

| Attribute | Resource | Type | Plural |
| --- | --- | --- | --- |
| auction_options | materialize_source_load_generator | Block | X |
| counter_options | materialize_source_load_generator | Block | X |
| marketing_options | materialize_source_load_generator | Block | X |
| tpch_options | materialize_source_load_generator | Block | X |
| check_options | materialize_webhook | Block | X |

bobbyiliev changed the title ~~Source versioning~~ [WIP] Source versioning Sep 3, 2024

bobbyiliev mentioned this pull request Sep 3, 2024

Add source versioning design doc #645

Merged

bobbyiliev force-pushed the source-versioning branch 3 times, most recently from c4a61c8 to e202851 Compare September 9, 2024 13:33

morsapaes reviewed Sep 11, 2024

bobbyiliev changed the title ~~[WIP] Source versioning~~ Source versioning: Postgres, MySQL and Load generator Sep 13, 2024

bobbyiliev marked this pull request as ready for review September 13, 2024 12:29

bobbyiliev requested a review from a team as a code owner September 13, 2024 12:29

bobbyiliev requested review from arusahni and rjobanp and removed request for a team September 13, 2024 12:29

rjobanp reviewed Sep 16, 2024

bobbyiliev added 17 commits September 17, 2024 16:12

Source versioning initial implementation

0686460

Use source table instead of table from source

f3412d6

MySQL source: separate for tables and all tables

c1a08c2

Loadgen source: add all tables bool attr

b438f6d

Add tests

008284e

Add more tests for mysql and loadgen

912ac32

Add ignore columns for MySQL

4df16c1

Add source_id logic

89c2c2f

Add source table migration guide

2810df8

Add deprecated message

dab29b0

Ignore text columns for load gen source tables

19fcdab

Refactor source table for individual sources

c83e517

Add datasource

2a492ef

Format examples

f7ae45a

Add Kafka source table resource

011a424

Review updates

14a4024

Update guide migration guide

fc49ffa

bobbyiliev added 2 commits September 17, 2024 16:12

Update guide migration guide

7ad7474

Add import examples for Kafka source tables

7870d48

bobbyiliev force-pushed the source-versioning branch from 0b63a29 to 7870d48 Compare September 17, 2024 13:14

bobbyiliev added 4 commits September 20, 2024 16:11

Add upstream mysql and postgres table names

0157f27

Fix unit tests

0343736

Add Kafka upstream references

38fdbdb

Add integration tests

5b43cee

bobbyiliev force-pushed the source-versioning branch 2 times, most recently from 6d60efb to 61cd55c Compare September 24, 2024 09:37

Fix failing test

9b96939

bobbyiliev force-pushed the source-versioning branch from 61cd55c to 9b96939 Compare September 24, 2024 09:43

bobbyiliev added 3 commits September 24, 2024 13:02

Extend data source to include upstream names

e82ea93

Small updates

b56e44c

Switch back to latest image

a467a3e
