
Support for Storage Write API #288

Open
SamoylovMD opened this issue Mar 13, 2023 · 18 comments

@SamoylovMD

The official BigQuery documentation states:

For new projects, we recommend using the BigQuery Storage Write API instead of the tabledata.insertAll method. The Storage Write API has lower pricing and more robust features, including exactly-once delivery semantics. The tabledata.insertAll method is still fully supported.

It looks like the currently used write method may be gradually decommissioned in the not-so-distant future. Also, the new write method costs roughly half as much as the old one.

Here is the documentation for the Storage Write API.

Is this on your roadmap, and if not, how should the community request it?

@ekapratama93

ekapratama93 commented Apr 14, 2023

Supporting this API is a good idea, since BigQuery now supports auto-merging CDC data, so a temporary table is no longer required.

https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality

@jrkinley

Thoughts on adding a new BigQueryWriter implementation for the new Storage Write API that can be enabled in configuration, as opposed to modifying the existing writers AdaptiveBigQueryWriter and SimpleBigQueryWriter that use the legacy tabledata.insertAll API?
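For illustration, a minimal sketch of what that config-gated selection could look like. The useStorageWriteApi property, StorageWriteApiWriter class, and LegacyInsertAllWriter stub are all hypothetical names, not actual connector code:

```java
import java.util.Map;

// Hypothetical sketch: choose a writer implementation from a config flag.
// Names here are invented; the real connector would wire in its existing
// AdaptiveBigQueryWriter/SimpleBigQueryWriter instead of these stubs.
interface BigQueryWriter {
    void write(Map<String, Object> row);
}

class StorageWriteApiWriter implements BigQueryWriter {   // hypothetical new writer
    public void write(Map<String, Object> row) { /* Storage Write API path */ }
}

class LegacyInsertAllWriter implements BigQueryWriter {   // stands in for existing writers
    public void write(Map<String, Object> row) { /* tabledata.insertAll path */ }
}

class WriterFactory {
    static BigQueryWriter create(Map<String, String> config) {
        boolean useStorageWriteApi =
                Boolean.parseBoolean(config.getOrDefault("useStorageWriteApi", "false"));
        return useStorageWriteApi ? new StorageWriteApiWriter()
                                  : new LegacyInsertAllWriter();
    }
}
```

Keeping the new path behind a flag like this would leave the existing writers untouched for current users.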

@james-johnston-thumbtack

Supporting this API is a good idea, since BigQuery now supports auto-merging CDC data, so a temporary table is no longer required.

https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality

This seems really compelling because it would greatly simplify the operation of the connector when upsertEnabled or deleteEnabled is set. Reading the blog, it sounds like they have essentially abstracted the same MERGE operations behind the new API, so the connector doesn't have to do them anymore: the max_staleness value sounds an awful lot like mergeIntervalMs!
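For context, max_staleness is set with plain BigQuery DDL on the target table. A minimal sketch using the Java client (the table name is a placeholder, and this is BigQuery table configuration rather than connector code):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class SetMaxStaleness {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // my_dataset.my_table is a placeholder. max_staleness bounds how long
        // BigQuery may defer applying buffered CDC upserts/deletes before a
        // read sees them, conceptually close to the connector's mergeIntervalMs.
        bigquery.query(QueryJobConfiguration.of(
                "ALTER TABLE my_dataset.my_table "
                        + "SET OPTIONS (max_staleness = INTERVAL 15 MINUTE)"));
    }
}
```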

@sendbird-sehwankim

They used to have the Storage Write API feature in this release (https://github.com/confluentinc/kafka-connect-bigquery/tree/v2.6.0-rc-2faec09), but later removed it.
Now there's a new connector on Confluent Cloud called BigQuery Sink Connector V2, which supports only the Storage Write API. That plugin is not open source and is only available on Confluent Cloud.
I guess Confluent has no plans to add this feature to this open-source plugin.

@C0urante

@yashmayya @ashwinpankaj any idea if this feature will be limited to Confluent Cloud only, or if it will be merged+released to this open source repo as well?

@jvigneshr

jvigneshr commented Feb 1, 2024

We do not plan to open-source the new connector in the near future.

@C0urante

C0urante commented Feb 1, 2024

@jvigneshr the connector is already open source (check the license file); I'm guessing you mean you no longer plan to maintain this project or at least release new features for it?

@jvigneshr

Apologies. I meant the new BigQuery V2 connector. (Edited the previous comment too)

@C0urante

C0urante commented Feb 1, 2024

Nobody's asking about V2. Are you or are you not still maintaining this project?

@magnich

magnich commented Feb 1, 2024

We are still supporting the project. The Storage Write API is a completely new API without a valid migration path from the existing connector, and was therefore built as a new connector in Confluent Cloud with other new features such as OAuth 2.0, support for schema contexts, and the reference subject naming strategy.

If we end up building a self-managed version of the connector it will be open source.

@C0urante

C0urante commented Feb 1, 2024

Uh huh. So if someone else implemented support for the Storage Write API with this project, it'd be reviewed and merged in a reasonable timeframe? (I personally doubt that the API is completely incompatible with this connector, and even if it is, a 3.0.0 release with some compatibility-breaking changes several years after 2.0.0 is completely reasonable.)

@criccomini

@jvigneshr @b-goyal can y'all weigh in on the issues you had with your 2.6 code? Would love to know what needs to be done to make it work for this connector.

@ragepati

ragepati commented Feb 8, 2024

Our initial approach was to replace the insertAll API with the Storage Write API in the existing BQ connector code base. During development, we observed certain incompatibilities between these two APIs. Some of these were not documented by Google.

An example is in the handling of data types: a timestamp represented as a String (e.g. 2023-12-15 13:14:15) can be ingested successfully into a DATETIME BigQuery column with the insertAll API, but not with the Storage Write API (unable to parse text). It was not feasible to identify all such differences, since the insertAll API does not document the set of literal values that it can cast to a DATETIME.
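For reference, the insertAll side of that comparison looks roughly like this with the Java client (table and column names are placeholders; event_time is assumed to be a DATETIME column):

```java
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

public class LegacyInsertAllExample {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // The legacy streaming API accepts this string for a DATETIME column.
        InsertAllResponse response = bigquery.insertAll(
                InsertAllRequest.newBuilder(TableId.of("my_dataset", "my_table"))
                        .addRow(Map.of("event_time", "2023-12-15 13:14:15"))
                        .build());
        if (response.hasErrors()) {
            System.err.println(response.getInsertErrors());
        }
    }
}
```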

We learned from Google that the newer API is not 100% backward compatible and there is no insertAll to Storage Write API migration guide.

So we built the v2 connector in Cloud as a new plugin type, along with other new features, making it explicit to customers that there are incompatible changes between the two versions. We also documented the supported data types for the Storage Write API.

@C0urante

C0urante commented Feb 8, 2024

I've just personally verified that, using a JsonStreamWriter without an explicitly-specified schema, the Java string 2023-12-15 13:14:15 can be successfully written to a DATETIME column.
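A rough sketch of that check (project, dataset, table, and column names are placeholders; the writer targets the table's default stream and fetches the schema itself):

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import org.json.JSONArray;
import org.json.JSONObject;

public class DatetimeWriteCheck {
    public static void main(String[] args) throws Exception {
        String table = "projects/my-project/datasets/my_dataset/tables/my_table";
        try (BigQueryWriteClient client = BigQueryWriteClient.create();
             // No TableSchema is passed here: the writer looks up the table's
             // schema itself and appends to the table's default stream.
             JsonStreamWriter writer = JsonStreamWriter.newBuilder(table, client).build()) {
            JSONObject row = new JSONObject();
            row.put("event_time", "2023-12-15 13:14:15"); // DATETIME column (placeholder name)
            ApiFuture<AppendRowsResponse> future = writer.append(new JSONArray().put(row));
            future.get(); // throws if the append was rejected
        }
    }
}
```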

That aside, if Storage Write API support were added as an opt-in feature (which, based on the reverted commits in #357, appears to have been the plan), surely these kinds of incompatibilities wouldn't be a problem?

At this point it seems like the motivation here is more to get people to pay for a proprietary fork of this project instead of continuing to maintain the open source variant. I don't think anyone who's using this project on their own would agree that they need to be "protected" from small incompatibilities in data type handling by being forced to switch over to a paid alternative instead of just tweaking a boolean property in their connector config.

@james-johnston-thumbtack

Definitely agree, and can confirm that as end-users, we would be fine with adjusting a few config options to opt in to the new storage write API support, and use config options to tweak data type conversions as needed.

@C0urante

Heads up to everyone involved here--Aiven has decided to fork the project, pull in the code that was removed in #357, and begin maintaining their own version of the connector. You can find it here. We've published a 2.6.0 release that contains support for the Storage Write API as a beta feature and would be happy to get feedback from anyone interested in trying it out.

cc @SamoylovMD @ekapratama93 @jrkinley @james-johnston-thumbtack @LarsKlingen @andrelu @Ironaki @agoloborodko @sendbird-sehwankim @quassy @aakarshg @whittid4 @corleyma @criccomini @bakuljajan

@magnich @ragepati @jvigneshr Feel free to close this issue if you have no plans on addressing it. It'd be nice to give people a clear signal about which fork they should contribute to/utilize if the Storage Write API is a priority to them.

@jvigneshr

When we release a self-managed version of the BQ v2 connector, we will make it open-source. It is on the product roadmap, but we don't have a timeline to share yet.

@C0urante

Then this issue should be closed, since you have no plans of addressing it on this project.
