Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

materialize-bigquery: JSON column type for array and objects #2324

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

williamhbaker
Copy link
Member

@williamhbaker williamhbaker commented Feb 3, 2025

Description:

When the BigQuery materialization connector was initially built, BigQuery did not have a native JSON column type, so arrays and objects were materialized as "stringified" JSON.

There is now a native JSON column type, so we can use that for these kinds of fields.

The main complication here is preserving the behavior of pre-existing materializations, while allowing new ones to use the JSON column type. The objects_and_arrays_as_json feature flag was added previously, and that flag will toggle using the old stringified columns instead of the JSON columns.

Most of the diff here is mechanical changes needed to thread through the value of that objects_and_arrays_as_json flag to where it needs to be.

The rollout of this change will be:

  1. Set all existing bigquery materializations to have the flag no_objects_and_arrays_as_json, which was added in materialize-bigquery: don't serialize JSON nulls #2323.
  2. Shortly after that, this pull request should be merged so that new tasks use the default behavior going forward of using JSON columns.

Closes #2319
Closes #2333

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

Manually tested that an existing materialization can transition successfully from having no feature flag set to having the no_objects_and_arrays_as_json set. Trying to set objects_and_arrays_as_json on an existing materialization fails because of the newly-added compatibility check for existing root document columns, and the change to not allow them to be migrated. A brand new task with objects_and_arrays_as_json set works fine with JSON columns from the start.


This change is Reviewable

When the BigQuery materialization connector was initially built, BigQuery did
not have a native JSON column type, so arrays and objects were materialized as
"stringified" JSON.

There is now a native JSON column type, so we can use that for these kinds of
fields.

The main complication here is preserving the behavior of pre-existing
materializations, while allowing new ones to use the JSON column type. The
`objects_and_arrays_as_json` feature flag was added previously, and that flag
will toggle using the old stringified columns instead of the JSON columns.

Most of the diff here is mechanical changes needed to thread through the value
of that `objects_and_arrays_as_json` flag to where it needs to be.
@@ -32,77 +35,86 @@ type transactor struct {
loggedStorageApiMessage bool
}

func newTransactor(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this diff is just from wrapping the newTransactor behavior in a closure so that it has access to the previously computed dialect, templates, and flag value.

There are no currently known cases where this should be allowed, and it is
likely to break situations where the document is currently materialized as a
string column and attempts to be migrated to a JSON column, where string -> JSON
migrations results in a JSON string rather than an actual JSON object.
@williamhbaker williamhbaker marked this pull request as ready for review February 10, 2025 17:29
@mdibaiee
Copy link
Member

@williamhbaker regarding release process, why do we not set the flag on tasks now, and then release this with objects_and_arrays_as_json as the default?

@williamhbaker
Copy link
Member Author

@williamhbaker regarding release process, why do we not set the flag on tasks now, and then release this with objects_and_arrays_as_json as the default?

That's a good point, since I did this a little strangely and already added the flag in a prior PR we can do it like you suggest, and not have to have another follow-up.

I'll update the PR and release process description for that!

…fault

New materializations created without the "no_objects_and_arrays_as_json" will
now use JSON columns for objects and arrays.
@williamhbaker
Copy link
Member Author

@mdibaiee I finished updating this PR to have objects_and_arrays_as_json be the default, let me know what you think. I'll update existing tasks to have no_objects_and_arrays_as_json before merging this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle JSON fields in Bigquery materialize-bigquery: object and array fields as JSON instead of string
2 participants