Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BigQuery Loader V2 #877

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Add BigQuery Loader V2 #877

wants to merge 1 commit into from

Conversation

pondzix
Copy link
Contributor

@pondzix pondzix commented May 7, 2024

No description provided.

Copy link

netlify bot commented May 7, 2024

Deploy Preview for snowplow-docs ready!

Name Link
🔨 Latest commit dc1045e
🔍 Latest deploy log https://app.netlify.com/sites/snowplow-docs/deploys/664c6ecadde50a0008164d4d
😎 Deploy Preview https://deploy-preview-877--snowplow-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@pondzix pondzix force-pushed the bq_loader_v2 branch 3 times, most recently from 1a692b2 to 43ead88 Compare May 7, 2024 11:44
@pondzix pondzix force-pushed the bq_loader_v2 branch 3 times, most recently from 8f27c18 to 71dc92d Compare May 10, 2024 12:29
@pondzix pondzix requested review from istreeter and stanch May 10, 2024 13:09
Copy link
Collaborator

@stanch stanch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also update this page? https://docs.snowplow.io/docs/storing-querying/storage-options/ BigQuery should be now an option for the 3 clouds

- `9999999` is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one).

If you create a new schema `1-0-2` that reverts the offending changes and is again compatible with `1-0-0`, the data for events with that schema will be written to the original column as expected.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:::tip
You might find that some of your schemas were evolved incorrectly in the past, which results in the creation of these “recovery” columns after the upgrade. To address this for a given schema, create a new _minor_ schema version that reverts the breaking changes introduced in previous versions. (Or, if you want to keep the breaking change, create a new _major_ schema version.) You can set it to [supersede](/docs/understanding-tracking-design/versioning-your-data-structures/amending/index.md#marking-the-schema-as-superseded) the previous version(s), so that events are automatically validated against the new schema.
:::

Do we also need instructions similar to these? https://docs.snowplow.io/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/upgrade-guides/6-0-0-upgrade-guide/#identifying-schemas-that-need-patching @istreeter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That section "Identifying schemas that need patching" was written long ago before we had the supercededBy feature. I would think now we have supercededBy, we shouldn't recommend to anyone to patch schemas.

Instead, the user can publish the fixed schema as a new major version, and configure the old schemas to automatically upgrade to the new major version. This is consistent with our belief that schemas are immutable. And it means we don't need to write a new igluctl feature.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My doubt was not specifically about patching but whether running igluctl would be useful to identify issues with schemas... Because I think it says what the incompatibilities are? With just a recovery column, how will the users know what to fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

igluctl has an existing sub-command verify-parquet

igluctl static verify-parquet <schema_dir>

It prints out warnings like

'com.test/test/jsonschema/1-0-1'. Changes: Incompatible type change String to Double at /id"

Unfortunately, it only works if the schemas are written into files on the local file system. It does not connect to remote Iglus.

We've not really planned what happens when existing users migrate to the new loader. And that is semi-deliberate: it is a lot of work to gather all the requirements and plan for all the things that could go wrong. So I've been delaying it until there is time to work on it properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stanch @istreeter as BQ Loader v2 relies on the same schema compatibility model/rules as parquet/Databricks/lake loader, we could use verify-parquet igluctl command and recommend it to users to detect incompatibilities. But this parquet in the name is confusing! Should we also do some renaming in igluctl then to make it more generic? And then also adjust docs

@stanch
Copy link
Collaborator

stanch commented Jun 11, 2024

This will now need a rebase due to #915

@stanch stanch added the do not merge Flag to denote a Issue or PR which should not yet be merged (usually pending a release) label Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge Flag to denote a Issue or PR which should not yet be merged (usually pending a release)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants