Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

materialize-bigquery: don't serialize JSON nulls #2323

Merged
merged 3 commits into from
Feb 10, 2025
Merged

Conversation

williamhbaker
Copy link
Member

@williamhbaker williamhbaker commented Feb 3, 2025

Description:

The current system of serializing JSON null values explicitly results in those values being loaded to BigQuery tables as JSON null rather than an actual SQL null. The intention is for these values to be SQL null, and that can be accomplished by omitting the fields entirely from the serialized documents.

Also included in this PR is the addition of a feature flag objects_and_arrays_as_strings which will be used in a later change (see #2324) that will begin to create object and array columns as JSON columns, rather than their string representations in string columns. Right now the feature flag doesn't do anything, but the configuration structure is being added so that pre-existing tasks can have it set to maintain their existing behavior, and have newly created tasks not have it set and use JSON columns instead.

Closes #2126

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

Serializing an explicit JSON `null` in staged files causes the loaded value from
those files to be a JSON `null` instead of a SQL `null`.

We want these to be SQL `null` instead, and omitting the value from the staged
files will do that. It also saves us a little bit of work when there are lots of
fields that are `null`.

Closes #2126
Adds a feature flag for materializing objects and arrays as strings.

This is the current behavior, and the feature flag is being added so that all
pre-existing tasks can have it be set to `true`. A subsequent change will enable
the flag's behavior, where if it is not set (which will be the default going
forward), objects and arrays will instead be materialized into JSON columns.
@williamhbaker williamhbaker changed the title materialize-bigquery: objects_and_arrays_as_strings feature flag materialize-bigquery: don't serialize JSON nulls Feb 3, 2025
@williamhbaker williamhbaker marked this pull request as ready for review February 3, 2025 19:07
}

type advancedConfig struct {
FeatureFlags string `json:"feature_flags,omitempty" jsonschema:"title=Feature Flags,description=This property is intended for Estuary internal use. You should only modify this field as directed by Estuary support."`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not an advanced property specifically for materializing objects and arrays as strings without an exposed JSONSchema? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this same thought initially, and there was some internal discussion about it here.

The TL;DR is that there's a desire to standardize around using this feature flag convention for this sort of thing, as we have been & will be doing more of for changing behaviors of SQL captures. More specifically what I mean re: "this sort of thing" is controlling behaviors that we don't expect users to ever interact with directly, either for preserving existing compatibility, or selectively enabling new code paths.

@williamhbaker
Copy link
Member Author

I changed the feature flag from objects_and_arrays_as_strings and objects_and_arrays_as_json based on my better understanding of how these feature flags should be used. I suppose I could/should also include the behavior change it will toggle in this same PR instead of a follow-up, but since I've already got this one ready and then next one (#2324) is fully drafted I'm going to keep them as they are, except for the change of the flag to objects_and_arrays_as_json.

Copy link
Member

@willdonnelly willdonnelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -16,7 +17,10 @@ import (
"google.golang.org/api/option"
)

// Config represents the endpoint configuration for BigQuery.
var featureFlagDefaults = map[string]bool{
"objects_and_arrays_as_json": false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is fairly self-explanatory in this case, but I've been trying to maintain a convention of describing the impact of a flag here in the defaults map.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, I added a comment here to explain what the flag does.

…re flag

Changes the feature flag `objects_and_arrays_as_string` to
`objects_and_arrays_as_json`, to allow for a flow of setting existing tasks to
`no_objects_and_arrays_as_json`, and then eventually making the JSON version the
default.
@williamhbaker williamhbaker merged commit c9586be into main Feb 10, 2025
1 check passed
@williamhbaker williamhbaker deleted the wb/mat-bq-flag branch February 10, 2025 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

materialize-bigquery: null values for variant column are being materialized as JSON nulls
3 participants