diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_cross-cloud-diagram.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_cross-cloud-diagram.md new file mode 100644 index 0000000000..3ca880cf6c --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_cross-cloud-diagram.md @@ -0,0 +1,18 @@ +```mdx-code-block +import Admonition from '@theme/Admonition'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +import LoaderDiagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md'; +``` + + + + + + + + + + + + diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_deploy-overview.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_deploy-overview.md new file mode 100644 index 0000000000..5e0c62b196 --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_deploy-overview.md @@ -0,0 +1,22 @@ +```mdx-code-block +import {versions} from '@site/src/componentVersions'; +import CodeBlock from '@theme/CodeBlock'; +``` + +

The BigQuery Loader is published as a Docker image, which you can run on any {props.cloud} VM.

+ +{ +`docker pull snowplow/bigquery-loader-${props.stream}:${versions.bqLoader} +`} + +To run the loader, mount your config file into the docker image, and then provide the file path on the command line. + +{ +`docker run \\ + --mount=type=bind,source=/path/to/myconfig,destination=/myconfig \\ + snowplow/bigquery-loader-${props.stream}:${versions.bqLoader} \\ + --config=/myconfig/loader.hocon \\ + --iglu-config /myconfig/iglu.hocon +`} + +Where `loader.hocon` is loader's [configuration file](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/#configuring-the-loader) and `iglu.hocon` is [iglu resolver](/docs/pipeline-components-and-applications/iglu/iglu-resolver/index.md) configuration. diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md index 8d39607656..a21e963fa2 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md @@ -1,24 +1,15 @@ -At the high level, BigQuery loader reads enriched Snowplow events in real time and loads them in BigQuery using the Storage Write API. +```mdx-code-block +import Mermaid from '@theme/Mermaid'; +import Link from '@docusaurus/Link'; +``` +

The BigQuery Streaming Loader on {props.cloud} is a fully streaming application that continually pulls events from {props.stream} and writes them to BigQuery using the BigQuery Storage API.

-```mermaid +Enriched events\n(Pub/Sub stream)"]] - loader{{"BigQuery Loader\n(Loader, Mutator and Repeater apps)"}} - subgraph BigQuery + stream[["Enriched Events\n(${props.stream} stream)"]] + loader{{"BigQuery Loader"}} + subgraph bigquery [BigQuery] table[("Events table")] end - stream-->loader-->BigQuery -``` - -BigQuery loader consists of three applications: Loader, Mutator and Repeater. The following diagram illustrates the interaction between them and BigQuery: - -```mermaid -sequenceDiagram - loop - Note over Loader: Read a small batch of events - Loader-->>+Mutator: Communicate event types (via Pub/Sub) - Loader->>BigQuery: Send events using the Storage Write API - Mutator-->>-BigQuery: Adjust column types if necessary - Repeater->>BigQuery: Resend events that failed
because columns were not up to date - end -``` + stream-->loader-->|BigQuery Storage API|bigquery +`}/> diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_bigquery_config.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_bigquery_config.md new file mode 100644 index 0000000000..c515a8dc9a --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_bigquery_config.md @@ -0,0 +1,16 @@ + + output.good.project + Required. The GCP project to which the BigQuery dataset belongs + + + output.good.dataset + Required. The BigQuery dataset to which events will be loaded + + + output.good.table + Optional. Default value events. Name to use for the events table + + + output.good.credentials + Optional. Service account credentials (JSON). If not set, default credentials will be sourced from the usual locations, e.g. file pointed to by the GOOGLE_APPLICATION_CREDENTIALS environment variable + diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_common_config.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_common_config.md new file mode 100644 index 0000000000..3d1412863f --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_common_config.md @@ -0,0 +1,85 @@ +```mdx-code-block +import Link from '@docusaurus/Link'; +``` + + + batching.maxBytes + Optional. Default value 16000000. Events are emitted to BigQuery when the batch reaches this size in bytes + + + batching.maxDelay + Optional. Default value 1 second. Events are emitted to BigQuery after a maximum of this duration, even if the maxBytes size has not been reached + + + batching.uploadConcurrency + Optional. Default value 3. How many batches can we send simultaneously over the network to BigQuery + + + retries.setupErrors.delay + + Optional. Default value 30 seconds. + Configures exponential backoff on errors related to how BigQuery is set up for this loader. + Examples include authentication errors and permissions errors. + This class of errors are reported periodically to the monitoring webhook. + + + + retries.transientErrors.delay + + Optional. Default value 1 second. + Configures exponential backoff on errors that are likely to be transient. + Examples include server errors and network errors. + + + + retries.transientErrors.attempts + Optional. Default value 5. Maximum number of attempts to make before giving up on a transient error. + + + skipSchemas + Optional, e.g. ["iglu:com.example/skipped1/jsonschema/1-0-0"] or with wildcards ["iglu:com.example/skipped2/jsonschema/1-*-*"]. A list of schemas that won't be loaded to BigQuery. This feature could be helpful when recovering from edge-case schemas which for some reason cannot be loaded to the table. + + + monitoring.metrics.statsd.hostname + Optional. If set, the loader sends statsd metrics over UDP to a server on this host name. + + + monitoring.metrics.statsd.port + Optional. Default value 8125. If the statsd server is configured, this UDP port is used for sending metrics. + + + monitoring.metrics.statsd.tags.* + Optional. A map of key/value pairs to be sent along with the statsd metric. + + + monitoring.metrics.statsd.period + Optional. Default 1 minute. How often to report metrics to statsd. 
+ + + monitoring.metrics.statsd.prefix + Optional. Default snowplow.bigquery-loader. Prefix used for the metric name when sending to statsd. + + + monitoring.webhook.endpoint + Optional, e.g. https://webhook.example.com. The loader will send to the webhook a payload containing details of any error related to how BigQuery is set up for this loader. + + + monitoring.webhook.tags.* + Optional. A map of key/value strings to be included in the payload content sent to the webhook. + + + monitoring.sentry.dsn + Optional. Set to a Sentry URI to report unexpected runtime exceptions. + + + monitoring.sentry.tags.* + Optional. A map of key/value strings which are passed as tags when reporting exceptions to Sentry. + + + telemetry.disable + Optional. Set to true to disable telemetry. + + + telemetry.userProvidedId + Optional. See here for more information. + diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kafka_config.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kafka_config.md new file mode 100644 index 0000000000..2ec32c51f9 --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kafka_config.md @@ -0,0 +1,28 @@ + + input.topicName + Required. Name of the Kafka topic for the source of enriched events. + + + input.bootstrapServers + Required. Hostname and port of Kafka bootstrap servers hosting the source of enriched events. + + + input.consumerConf.* + Optional. A map of key/value pairs for any standard Kafka consumer configuration option. + + + output.bad.topicName + Required. Name of the Kafka topic that will receive failed events. + + + output.bad.bootstrapServers + Required. Hostname and port of Kafka bootstrap servers hosting the bad topic + + + output.bad.producerConf.* + Optional. A map of key/value pairs for any standard Kafka producer configuration option. + + + output.bad.maxRecordSize.* + Optional. Default value 1000000. Any single failed event sent to Kafka should not exceed this size in bytes + diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kinesis_config.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kinesis_config.md new file mode 100644 index 0000000000..f6cd9ab879 --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kinesis_config.md @@ -0,0 +1,52 @@ + + input.streamName + Required. Name of the Kinesis stream with the enriched events + + + input.appName + Optional, default snowplow-bigquery-loader. Name to use for the dynamodb table, used by the underlying Kinesis Consumer Library for managing leases. + + + input.initialPosition + Optional, default LATEST. Allowed values are LATEST, TRIM_HORIZON, AT_TIMESTAMP. When the loader is deployed for the first time, this controls from where in the kinesis stream it should start consuming events. On all subsequent deployments of the loader, the loader will resume from the offsets stored in the DynamoDB table. + + + input.initialPosition.timestamp + Required if input.initialPosition is AT_TIMESTAMP. A timestamp in ISO8601 format from where the loader should start consuming events. + + + input.retrievalMode + Optional, default Polling. Change to FanOut to enable the enhance fan-out feature of Kinesis. 
+ + + input.retrievalMode.maxRecords + Optional. Default value 1000. How many events the Kinesis client may fetch in a single poll. Only used when `input.retrievalMode` is Polling. + + + input.bufferSize + Optional. Default value 1. The number of batches of events which are pre-fetched from kinesis. The default value is known to work well. + + + output.bad.streamName + Required. Name of the Kinesis stream that will receive failed events. + + + output.bad.throttledBackoffPolicy.minBackoff + Optional. Default value 100 milliseconds. Initial backoff used to retry sending failed events if we exceed the Kinesis write throughput limits. + + + output.bad.throttledBackoffPolicy.maxBackoff + Optional. Default value 1 second. Maximum backoff used to retry sending failed events if we exceed the Kinesis write throughput limits. + + + output.bad.recordLimit + Optional. Default value 500. The maximum number of records we are allowed to send to Kinesis in 1 PutRecords request. + + + output.bad.byteLimit + Optional. Default value 5242880. The maximum number of bytes we are allowed to send to Kinesis in 1 PutRecords request. + + + output.bad.maxRecordSize.* + Optional. Default value 1000000. Any single event failed event sent to Kinesis should not exceed this size in bytes + diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_pubsub_config.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_pubsub_config.md new file mode 100644 index 0000000000..1a77895211 --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_pubsub_config.md @@ -0,0 +1,40 @@ + + input.subscription + Required, e.g. projects/myproject/subscriptions/snowplow-enriched. Name of the Pub/Sub subscription with the enriched events + + + input.parallelPullCount + Optional. Default value 1. Number of threads used internally by the pubsub client library for fetching events + + + input.bufferMaxBytes + Optional. Default value 10000000. How many bytes can be buffered by the loader app before blocking the pubsub client library from fetching more events. This is a balance between memory usage vs how efficiently the app can operate. The default value works well. + + + input.maxAckExtensionPeriod + Optional. Default value 1 hour. For how long the pubsub client library will continue to re-extend the ack deadline of an unprocessed event. + + + input.minDurationPerAckExtension + Optional. Default value 60 seconds. Sets min boundary on the value by which an ack deadline is extended. The actual value used is guided by runtime statistics collected by the pubsub client library. + + + input.maxDurationPerAckExtension + Optional. Default value 600 seconds. Sets max boundary on the value by which an ack deadline is extended. The actual value used is guided by runtime statistics collected by the pubsub client library. + + + output.bad.topic + Required, e.g. projects/myproject/topics/snowplow-bad. Name of the Pub/Sub topic that will receive failed events. + + + output.bad.batchSize + Optional. Default value 1000. Bad events are sent to Pub/Sub in batches not exceeding this count. + + + output.bad.requestByteThreshold + Optional. Default value 1000000. Bad events are sent to Pub/Sub in batches with a total size not exceeding this byte threshold + + + output.bad.maxRecordSize + Optional. Default value 10000000. 
Any single failed event sent to Pub/Sub should not exceed this size in bytes + diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md new file mode 100644 index 0000000000..32eb1392ec --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md @@ -0,0 +1,106 @@ +--- +title: "BigQuery Loader configuration reference" +sidebar_label: "Configuration reference" +sidebar_position: 1 +--- + +```mdx-code-block +import {versions} from '@site/src/componentVersions'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +import Admonition from '@theme/Admonition'; +import BigqueryConfig from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_bigquery_config.md'; +import PubsubConfig from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_pubsub_config.md'; +import KinesisConfig from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kinesis_config.md'; +import KafkaConfig from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_kafka_config.md'; +import CommonConfig from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/_common_config.md'; +``` + +

The configuration reference on this page is written for BigQuery Loader {`${versions.bqLoader}`}.

+ +### BigQuery configuration + + + + + + + + + + + +
ParameterDescription
+ +### Streams configuration + + + + + + + + + + + + + +
ParameterDescription
+
+ + + + + + + + + + + +
ParameterDescription
+
+ + + + + + + + + + + +
ParameterDescription
+ +:::info Event Hubs Authentication + +You can use the `input.consumerConf` and `output.bad.producerConf` options to configure authentication to Azure event hubs using SASL. For example: + +```json +"input.consumerConf": { + "security.protocol": "SASL_SSL" + "sasl.mechanism": "PLAIN" + "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"\$ConnectionString\" password=;" +} +``` + +::: + +
+
+ +## Other configuration options + + + + + + + + + + + +
ParameterDescription
diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md index 7ca5a84e00..5e68ef7e2b 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md @@ -1,316 +1,66 @@ --- title: "BigQuery Loader" -sidebar_position: 2 +sidebar_label: "BigQuery Loader" +sidebar_position: 2 --- ```mdx-code-block -import {versions} from '@site/src/componentVersions'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; import CodeBlock from '@theme/CodeBlock'; -import Diagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md'; +import LoaderDiagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md'; +import DeployOverview from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_deploy-overview.md'; ``` -Under the umbrella of Snowplow BigQuery Loader, we have a family of applications that can be used to load enriched Snowplow data into BigQuery. - - +## Overview + + + + + + + + + + + + + + + :::tip Schemas in BigQuery - For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/storing-querying/schemas-in-warehouse/index.md?warehouse=bigquery). - ::: -## Technical Architecture - -The available tools are: - -1. **Snowplow BigQuery StreamLoader**, a standalone Scala app that can be deployed on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine). -2. **Snowplow BigQuery Loader**, an alternative to StreamLoader, in the form of a [Google Cloud Dataflow](https://cloud.google.com/dataflow) job. -3. **Snowplow BigQuery Mutator**, a Scala app that performs table updates to add new columns as required. -4. **Snowplow BigQuery Repeater**, a Scala app that reads failed inserts (caused by _table update lag_) and re-tries inserting them into BigQuery after some delay, sinking failures into a dead-letter bucket. - -### Snowplow BigQuery StreamLoader - -- Reads Snowplow enriched events from a dedicated Pub/Sub subscription. -- Uses the JSON transformer from the [Snowplow Scala Analytics SDK](https://github.com/snowplow/snowplow-scala-analytics-sdk) to convert those enriched events into JSON. -- Uses [Iglu Client](https://github.com/snowplow/iglu-scala-client/) to fetch JSON schemas for self-describing events and entities. -- Uses [Iglu Schema DDL](https://github.com/snowplow/schema-ddl) to transform self-describing events and entities into BigQuery format. -- Writes transformed data into BigQuery. -- Writes all encountered Iglu types into a dedicated Pub/Sub topic (the `types` topic). -- Writes all data that failed to be validated against its schema into a dedicated `badRows` Pub/Sub topic. -- Writes all data that was successfully transformed, but could not be loaded into a dedicated `failedInserts` topic. - -### Snowplow BigQuery Loader - -An [Apache Beam](https://beam.apache.org/) job intended to run on Google Cloud Dataflow. An alternative to the StreamLoader application, it has the same algorithm. 
- -### Snowplow BigQuery Mutator - -The Mutator app is in charge of performing automatic table updates, which means you do not have to pause loading and manually update the table every time you're adding a new custom self-describing event or entity. - -- Reads messages from a dedicated subscription to the `types` topic. -- Finds out if a message contains a type that has not been encountered yet (by checking internal cache). -- If a message contains a new type, double-checks it with the connected BigQuery table. -- If the type is not in the table, fetches its JSON schema from an Iglu registry. -- Transforms the JSON schema into BigQuery column definition. -- Adds the column to the connected BigQuery table. - -### Snowplow BigQuery Repeater - -The Repeater app is in charge of handling failed inserts. It reads ready-to-load events from a dedicated subscription on the `failedInserts` topic and re-tries inserting them into BigQuery to overcome 'table update lag'. - -#### Table update lag - -The loader app inserts data into BigQuery in near real-time. At the same time, it sinks messages containing information about the fields of an event into the `types` topic. It can take up to 10-15 seconds for Mutator to fetch, parse the message and execute an `ALTER TABLE` statement against the table. Additionally, the new column takes some time to propagate and become visible to all workers trying to write to it. - -If a new type arrives from the input subscription in this period of time, BigQuery might reject the row containing it and it will be sent to the `failedInserts` topic. This topic contains JSON objects _ready to be loaded into BigQuery_ (ie not canonical Snowplow Enriched event format). - -In order to load this data again from `failedInserts` to BigQuery you can use Repeater, which reads a subscription on `failedInserts` and performs `INSERT` statements. - -Repeater has several important behavior aspects: +## Configuring the loader -- If a pulled record is not a valid Snowplow event, it will result into a `loader_recovery_error` bad row. -- If a pulled record is a valid event, Repeater will wait some time (15 minutes by default) after the `etl_tstamp` before attempting to re-insert it, in order to let Mutator do its job. -- If the database responds with an error, the row will get transformed into a `loader_recovery_error` bad row. -- All entities in the dead-letter bucket are valid Snowplow [bad rows](https://github.com/snowplow/snowplow-badrows). +The loader config file is in HOCON format, and it allows configuring many different properties of how the loader runs. -### Topics, subscriptions and message formats +The simplest possible config file just needs a description of your pipeline inputs and outputs: -The Snowplow BigQuery Loader apps use Pub/Sub topics and subscriptions to store intermediate data and communicate with each other. + + -
KindPopulated byConsumed byData format
Input subscriptionEnriched events topicLoader / StreamLoadercanonical TSV + JSON enriched format
Types topicLoader / StreamLoaderTypes subscriptioniglu:com.snowplowanalytics.snowplow/shredded_type/jsonschema/1-0-0
Types subscriptionTypes topicMutatoriglu:com.snowplowanalytics.snowplow/shredded_type/jsonschema/1-0-0
Bad row topicLoader / StreamLoaderGCS Loaderiglu:com.snowplowanalytics.snowplow.badrows/loader_iglu_error/jsonschema/2-0-0
iglu:com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/2-0-0
iglu:com.snowplowanalytics.snowplow.badrows/loader_runtime_error/jsonschema/1-0-1
Failed insert topicLoader / StreamLoaderFailed insert subscriptionBigQuery JSON
Failed insert subscriptionFailed insert topicRepeaterBigQuery JSON
- -## Setup guide - -### Configuration file - -Loader / StreamLoader, Mutator and Repeater accept the same configuration file in HOCON format. An example of a minimal configuration file can look like this: - -```json -{ - "projectId": "com-acme" - - "loader": { - "input": { - "subscription": "enriched-sub" - } - - "output": { - "good": { - "datasetId": "snowplow" - "tableId": "events" - } - - "bad": { - "topic": "bad-topic" - } - - "types": { - "topic": "types-topic" - } - - "failedInserts": { - "topic": "failed-inserts-topic" - } - } - } - - "mutator": { - "input": { - "subscription": "types-sub" - } - - "output": { - "good": ${loader.output.good} # will be automatically inferred - } - } - - "repeater": { - "input": { - "subscription": "failed-inserts-sub" - } - - "output": { - "good": ${loader.output.good} # will be automatically inferred - - "deadLetters": { - "bucket": "gs://dead-letter-bucket" - } - } - } - - "monitoring": {} # disabled -} +```json reference +https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config/config.kinesis.minimal.hocon ``` -The loader takes command line arguments `--config` with a path to the configuration hocon file and `--resolver` with a path to the Iglu resolver file. If you are running the docker image then you should mount the configuration files into the container: - -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json -`} - -Or you can pass the whole config as a base64-encoded string using the `--config` option, like so: - -{ -`docker run \\ - -v /path/to/resolver.json:/resolver.json \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ - --config=ewogICJwcm9qZWN0SWQiOiAiY29tLWFjbWUiCgogICJsb2FkZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZW5yaWNoZWQtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogewogICAgICAgICJkYXRhc2V0SWQiOiAic25vd3Bsb3ciCiAgICAgICAgInRhYmxlSWQiOiAiZXZlbnRzIgogICAgICB9CgogICAgICAiYmFkIjogewogICAgICAgICJ0b3BpYyI6ICJiYWQtdG9waWMiCiAgICAgIH0KCiAgICAgICJ0eXBlcyI6IHsKICAgICAgICAidG9waWMiOiAidHlwZXMtdG9waWMiCiAgICAgIH0KCiAgICAgICJmYWlsZWRJbnNlcnRzIjogewogICAgICAgICJ0b3BpYyI6ICJmYWlsZWQtaW5zZXJ0cy10b3BpYyIKICAgICAgfQogICAgfQogIH0KCiAgIm11dGF0b3IiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAidHlwZXMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCiAgICB9CiAgfQoKICAicmVwZWF0ZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZmFpbGVkLWluc2VydHMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCgogICAgICAiZGVhZExldHRlcnMiOiB7CiAgICAgICAgImJ1Y2tldCI6ICJnczovL2RlYWQtbGV0dGVyLWJ1Y2tldCIKICAgICAgfQogICAgfQogIH0KCiAgIm1vbml0b3JpbmciOiB7fSAjIGRpc2FibGVkCn0= \\ - --resolver=/resolver.json -`} +
+ -The `--config` command option is actually optional. For some setups it is more convenient to provide configuration parameters using JVM system properties or environment variables, as documented in [the Lightbend config readme](https://github.com/lightbend/config/blob/v1.4.1/README.md). - -For example, to override the `repeater.input.subscription` setting using system properties: - -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json \\ - -Drepeater.input.subscription="failed-inserts-sub" -`} - -Or to use environment variables for every setting: - -{ -`docker run \\ - -v /path/to/resolver.json:/resolver.json \\ - snowplow/snowplow-bigquery-repeater:${versions.bqLoader} \\ - --resolver=/resolver.json \\ - -Dconfig.override_with_env_vars=true -`} - -See the [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/snowplow-bigquery-loader-configuration-reference/index.md) for more details and advanced settings. - -### Command line options - -All apps accept a config HOCON as specified above, and an Iglu resolver config passed via the `--resolver` option. The latter must conform to the `iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3` schema. - -#### StreamLoader - -StreamLoader accepts `--config` and `--resolver` arguments, as well as any JVM system properties that can be used to override the configuration. - -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json \\ - -Dconfig.override_with_env_vars=true -`} - -The `--config` flag is optional, but if missing, all configuration options must be specified in some other way (system properties or environment variables). - -#### The Dataflow Loader - -The Dataflow Loader accepts the same two arguments as StreamLoader and [any other](https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options) supported by Google Cloud Dataflow. - -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-loader:${versions.bqLoader} \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json \\ - --labels={"key1":"val1","key2":"val2"} # optional Dataflow args -`} - -The optional `labels` argument is an example of a Dataflow natively supported argument. It accepts a JSON with key-value pairs that will be used as [labels](https://cloud.google.com/compute/docs/labeling-resources) to the Cloud Dataflow job. - -This can be launched from any machine authenticated to submit Dataflow jobs. - -#### Mutator - -Mutator has three subcommands: `listen`, `create` and `add-column`. - -##### `listen` - -`listen` is the primary command and is used to automate table migrations. - -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\ - listen \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json \\ - --verbose # optional, for debugging only -`} - -##### `add-column` - -`add-column` can be used once to add a column to the table specified via the `loader.output.good` setting. This should eliminate the risk of table update lag and the necessity to run a Repeater, but requires 'manual' intervention. 
- -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\ - add-column \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json \\ - --shred-property=CONTEXTS \\ - --schema="iglu:com.acme/app_context/jsonschema/1-0-0" -`} - -The specified schema must be present in one of the Iglu registries in the resolver configuration. - -##### `create` - -`create` creates an empty table with `atomic` structure. It can optionally be partitioned by a `TIMESTAMP` field. - -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-mutator:${versions.bqLoader} \\ - create \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json \\ - --partitionColumn=load_tstamp \\ # optional TIMESTAMP column by which to partition the table - --requirePartitionFilter # optionally require a filter on the partition column in all queries -`} - -See the Google documentation for more information about [partitioned tables](https://cloud.google.com/bigquery/docs/creating-partitioned-tables). - -#### Repeater - -We recommend constantly running Repeater on a small / cheap node or Docker container. - -{ -`docker run \\ - -v /path/to/configs:/configs \\ - snowplow/snowplow-bigquery-repeater:${versions.bqLoader} \\ - --config=/configs/bigquery.hocon \\ - --resolver=/configs/resolver.json \\ - --bufferSize=20 \\ # size of the batch to send to the dead-letter bucket - --timeout=20 \\ # duration after which bad rows will be sunk into the dead-letter bucket - --backoffPeriod=900 \\ # seconds to wait before attempting an insert (calculated against etl_tstamp) - --verbose # optional, for debugging only -`} - -`bufferSize`, `timeout` and `backoffPeriod` are optional parameters. - -### Docker support - -All applications are available as Docker images on Docker Hub, based on Ubuntu Focal and OpenJDK 11: +```json reference +https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config/config.pubsub.minimal.hocon +``` -{ -`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader} -$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader} -$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader} -$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader} -`} + + -

We also provide an alternative lightweight set of images based on Google's "distroless" base image, which may provide some security advantages for carrying fewer dependencies. These images are distinguished with the {`${versions.bqLoader}-distroless`} tag:

+```json reference +https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config/config.azure.minimal.hocon +``` -{ -`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader}-distroless -$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader}-distroless -$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader}-distroless -$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader}-distroless -`} +
+
-Mutator, Repeater and Streamloader are also available as fatjar files attached to [releases](https://github.com/snowplow-incubator/snowplow-bigquery-loader/releases) in the project's Github repository. +See the [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for all possible configuration parameters. diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-3-0/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-3-0/index.md index fbe3d6ae41..3349338f51 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-3-0/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-3-0/index.md @@ -1,7 +1,7 @@ --- title: "BigQuery Loader (0.3.x)" date: "2020-03-11" -sidebar_position: 30 +sidebar_position: 40 --- Please be aware that we have identified a security vulnerability in BigQuery Repeater in this version, which we've fixed in version [0.4.2](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-4-0/index.md). You can find more details on our [Discourse forum](https://discourse.snowplow.io/t/important-notice-snowplow-bigquery-loader-vulnerability-and-fix/3783). diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-4-0/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-4-0/index.md index f9fe4af5b9..83d4a0b4d8 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-4-0/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-4-0/index.md @@ -1,7 +1,7 @@ --- title: "BigQuery Loader (0.4.x)" date: "2020-03-11" -sidebar_position: 20 +sidebar_position: 30 --- ## Technical Architecture diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-5-0/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-5-0/index.md index e0501b2af0..187fbd7f68 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-5-0/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-5-0/index.md @@ -1,7 +1,7 @@ --- title: "BigQuery Loader (0.5.x)" date: "2020-05-18" -sidebar_position: 10 +sidebar_position: 20 --- ## Technical Architecture diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-6-0/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-6-0/index.md index 1d3eb4869b..32663c5bed 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-6-0/index.md +++ 
b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-0-6-0/index.md @@ -1,7 +1,7 @@ --- title: "BigQuery Loader (0.6.x)" date: "2021-10-06" -sidebar_position: 0 +sidebar_position: 10 --- ## Technical Architecture diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/_diagram.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/_diagram.md new file mode 100644 index 0000000000..d5c9778a0c --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/_diagram.md @@ -0,0 +1,24 @@ +At the high level, BigQuery loader reads enriched Snowplow events in real time and loads them in BigQuery using the [legacy streaming API](https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery). + +```mermaid +flowchart LR + stream[["Enriched events\n(Pub/Sub stream)"]] + loader{{"BigQuery Loader\n(Loader, Mutator and Repeater apps)"}} + subgraph BigQuery + table[("Events table")] + end + stream-->loader-->BigQuery +``` + +BigQuery loader consists of three applications: Loader, Mutator and Repeater. The following diagram illustrates the interaction between them and BigQuery: + +```mermaid +sequenceDiagram + loop + Note over Loader: Read a small batch of events + Loader-->>+Mutator: Communicate event types (via Pub/Sub) + Loader->>BigQuery: Send events using the Storage Write API + Mutator-->>-BigQuery: Adjust column types if necessary + Repeater->>BigQuery: Resend events that failed
because columns were not up to date + end +``` diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/snowplow-bigquery-loader-configuration-reference/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/configuration-reference/index.md similarity index 100% rename from docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/snowplow-bigquery-loader-configuration-reference/index.md rename to docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/configuration-reference/index.md diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md new file mode 100644 index 0000000000..88eed982ba --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md @@ -0,0 +1,315 @@ +--- +title: "BigQuery Loader (1.x)" +sidebar_position: 0 +--- + +```mdx-code-block +import {versions} from '@site/src/componentVersions'; +import CodeBlock from '@theme/CodeBlock'; +import Diagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/_diagram.md'; +``` + +Under the umbrella of Snowplow BigQuery Loader, we have a family of applications that can be used to load enriched Snowplow data into BigQuery. + + + +:::tip Schemas in BigQuery + +For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/storing-querying/schemas-in-warehouse/index.md?warehouse=bigquery). + +::: + +## Technical Architecture + +The available tools are: + +1. **Snowplow BigQuery StreamLoader**, a standalone Scala app that can be deployed on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine). +2. **Snowplow BigQuery Loader**, an alternative to StreamLoader, in the form of a [Google Cloud Dataflow](https://cloud.google.com/dataflow) job. +3. **Snowplow BigQuery Mutator**, a Scala app that performs table updates to add new columns as required. +4. **Snowplow BigQuery Repeater**, a Scala app that reads failed inserts (caused by _table update lag_) and re-tries inserting them into BigQuery after some delay, sinking failures into a dead-letter bucket. + +### Snowplow BigQuery StreamLoader + +- Reads Snowplow enriched events from a dedicated Pub/Sub subscription. +- Uses the JSON transformer from the [Snowplow Scala Analytics SDK](https://github.com/snowplow/snowplow-scala-analytics-sdk) to convert those enriched events into JSON. +- Uses [Iglu Client](https://github.com/snowplow/iglu-scala-client/) to fetch JSON schemas for self-describing events and entities. +- Uses [Iglu Schema DDL](https://github.com/snowplow/schema-ddl) to transform self-describing events and entities into BigQuery format. +- Writes transformed data into BigQuery. +- Writes all encountered Iglu types into a dedicated Pub/Sub topic (the `types` topic). +- Writes all data that failed to be validated against its schema into a dedicated `badRows` Pub/Sub topic. +- Writes all data that was successfully transformed, but could not be loaded into a dedicated `failedInserts` topic. 
+ +### Snowplow BigQuery Loader + +An [Apache Beam](https://beam.apache.org/) job intended to run on Google Cloud Dataflow. An alternative to the StreamLoader application, it has the same algorithm. + +### Snowplow BigQuery Mutator + +The Mutator app is in charge of performing automatic table updates, which means you do not have to pause loading and manually update the table every time you're adding a new custom self-describing event or entity. + +- Reads messages from a dedicated subscription to the `types` topic. +- Finds out if a message contains a type that has not been encountered yet (by checking internal cache). +- If a message contains a new type, double-checks it with the connected BigQuery table. +- If the type is not in the table, fetches its JSON schema from an Iglu registry. +- Transforms the JSON schema into BigQuery column definition. +- Adds the column to the connected BigQuery table. + +### Snowplow BigQuery Repeater + +The Repeater app is in charge of handling failed inserts. It reads ready-to-load events from a dedicated subscription on the `failedInserts` topic and re-tries inserting them into BigQuery to overcome 'table update lag'. + +#### Table update lag + +The loader app inserts data into BigQuery in near real-time. At the same time, it sinks messages containing information about the fields of an event into the `types` topic. It can take up to 10-15 seconds for Mutator to fetch, parse the message and execute an `ALTER TABLE` statement against the table. Additionally, the new column takes some time to propagate and become visible to all workers trying to write to it. + +If a new type arrives from the input subscription in this period of time, BigQuery might reject the row containing it and it will be sent to the `failedInserts` topic. This topic contains JSON objects _ready to be loaded into BigQuery_ (ie not canonical Snowplow Enriched event format). + +In order to load this data again from `failedInserts` to BigQuery you can use Repeater, which reads a subscription on `failedInserts` and performs `INSERT` statements. + +Repeater has several important behavior aspects: + +- If a pulled record is not a valid Snowplow event, it will result into a `loader_recovery_error` bad row. +- If a pulled record is a valid event, Repeater will wait some time (15 minutes by default) after the `etl_tstamp` before attempting to re-insert it, in order to let Mutator do its job. +- If the database responds with an error, the row will get transformed into a `loader_recovery_error` bad row. +- All entities in the dead-letter bucket are valid Snowplow [bad rows](https://github.com/snowplow/snowplow-badrows). + +### Topics, subscriptions and message formats + +The Snowplow BigQuery Loader apps use Pub/Sub topics and subscriptions to store intermediate data and communicate with each other. + +
KindPopulated byConsumed byData format
Input subscriptionEnriched events topicLoader / StreamLoadercanonical TSV + JSON enriched format
Types topicLoader / StreamLoaderTypes subscriptioniglu:com.snowplowanalytics.snowplow/shredded_type/jsonschema/1-0-0
Types subscriptionTypes topicMutatoriglu:com.snowplowanalytics.snowplow/shredded_type/jsonschema/1-0-0
Bad row topicLoader / StreamLoaderGCS Loaderiglu:com.snowplowanalytics.snowplow.badrows/loader_iglu_error/jsonschema/2-0-0
iglu:com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/2-0-0
iglu:com.snowplowanalytics.snowplow.badrows/loader_runtime_error/jsonschema/1-0-1
Failed insert topicLoader / StreamLoaderFailed insert subscriptionBigQuery JSON
Failed insert subscriptionFailed insert topicRepeaterBigQuery JSON
+ +## Setup guide + +### Configuration file + +Loader / StreamLoader, Mutator and Repeater accept the same configuration file in HOCON format. An example of a minimal configuration file can look like this: + +```json +{ + "projectId": "com-acme" + + "loader": { + "input": { + "subscription": "enriched-sub" + } + + "output": { + "good": { + "datasetId": "snowplow" + "tableId": "events" + } + + "bad": { + "topic": "bad-topic" + } + + "types": { + "topic": "types-topic" + } + + "failedInserts": { + "topic": "failed-inserts-topic" + } + } + } + + "mutator": { + "input": { + "subscription": "types-sub" + } + + "output": { + "good": ${loader.output.good} # will be automatically inferred + } + } + + "repeater": { + "input": { + "subscription": "failed-inserts-sub" + } + + "output": { + "good": ${loader.output.good} # will be automatically inferred + + "deadLetters": { + "bucket": "gs://dead-letter-bucket" + } + } + } + + "monitoring": {} # disabled +} +``` + +The loader takes command line arguments `--config` with a path to the configuration hocon file and `--resolver` with a path to the Iglu resolver file. If you are running the docker image then you should mount the configuration files into the container: + +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json +`} + +Or you can pass the whole config as a base64-encoded string using the `--config` option, like so: + +{ +`docker run \\ + -v /path/to/resolver.json:/resolver.json \\ + snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\ + --config=ewogICJwcm9qZWN0SWQiOiAiY29tLWFjbWUiCgogICJsb2FkZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZW5yaWNoZWQtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogewogICAgICAgICJkYXRhc2V0SWQiOiAic25vd3Bsb3ciCiAgICAgICAgInRhYmxlSWQiOiAiZXZlbnRzIgogICAgICB9CgogICAgICAiYmFkIjogewogICAgICAgICJ0b3BpYyI6ICJiYWQtdG9waWMiCiAgICAgIH0KCiAgICAgICJ0eXBlcyI6IHsKICAgICAgICAidG9waWMiOiAidHlwZXMtdG9waWMiCiAgICAgIH0KCiAgICAgICJmYWlsZWRJbnNlcnRzIjogewogICAgICAgICJ0b3BpYyI6ICJmYWlsZWQtaW5zZXJ0cy10b3BpYyIKICAgICAgfQogICAgfQogIH0KCiAgIm11dGF0b3IiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAidHlwZXMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCiAgICB9CiAgfQoKICAicmVwZWF0ZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZmFpbGVkLWluc2VydHMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCgogICAgICAiZGVhZExldHRlcnMiOiB7CiAgICAgICAgImJ1Y2tldCI6ICJnczovL2RlYWQtbGV0dGVyLWJ1Y2tldCIKICAgICAgfQogICAgfQogIH0KCiAgIm1vbml0b3JpbmciOiB7fSAjIGRpc2FibGVkCn0= \\ + --resolver=/resolver.json +`} + +The `--config` command option is actually optional. For some setups it is more convenient to provide configuration parameters using JVM system properties or environment variables, as documented in [the Lightbend config readme](https://github.com/lightbend/config/blob/v1.4.1/README.md). 
+ +For example, to override the `repeater.input.subscription` setting using system properties: + +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json \\ + -Drepeater.input.subscription="failed-inserts-sub" +`} + +Or to use environment variables for every setting: + +{ +`docker run \\ + -v /path/to/resolver.json:/resolver.json \\ + snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x} \\ + --resolver=/resolver.json \\ + -Dconfig.override_with_env_vars=true +`} + + +### Command line options + +All apps accept a config HOCON as specified above, and an Iglu resolver config passed via the `--resolver` option. The latter must conform to the `iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3` schema. + +#### StreamLoader + +StreamLoader accepts `--config` and `--resolver` arguments, as well as any JVM system properties that can be used to override the configuration. + +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json \\ + -Dconfig.override_with_env_vars=true +`} + +The `--config` flag is optional, but if missing, all configuration options must be specified in some other way (system properties or environment variables). + +#### The Dataflow Loader + +The Dataflow Loader accepts the same two arguments as StreamLoader and [any other](https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options) supported by Google Cloud Dataflow. + +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-loader:${versions.bqLoader1x} \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json \\ + --labels={"key1":"val1","key2":"val2"} # optional Dataflow args +`} + +The optional `labels` argument is an example of a Dataflow natively supported argument. It accepts a JSON with key-value pairs that will be used as [labels](https://cloud.google.com/compute/docs/labeling-resources) to the Cloud Dataflow job. + +This can be launched from any machine authenticated to submit Dataflow jobs. + +#### Mutator + +Mutator has three subcommands: `listen`, `create` and `add-column`. + +##### `listen` + +`listen` is the primary command and is used to automate table migrations. + +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x} \\ + listen \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json \\ + --verbose # optional, for debugging only +`} + +##### `add-column` + +`add-column` can be used once to add a column to the table specified via the `loader.output.good` setting. This should eliminate the risk of table update lag and the necessity to run a Repeater, but requires 'manual' intervention. + +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x} \\ + add-column \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json \\ + --shred-property=CONTEXTS \\ + --schema="iglu:com.acme/app_context/jsonschema/1-0-0" +`} + +The specified schema must be present in one of the Iglu registries in the resolver configuration. + +##### `create` + +`create` creates an empty table with `atomic` structure. It can optionally be partitioned by a `TIMESTAMP` field. 
+ +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x} \\ + create \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json \\ + --partitionColumn=load_tstamp \\ # optional TIMESTAMP column by which to partition the table + --requirePartitionFilter # optionally require a filter on the partition column in all queries +`} + +See the Google documentation for more information about [partitioned tables](https://cloud.google.com/bigquery/docs/creating-partitioned-tables). + +#### Repeater + +We recommend constantly running Repeater on a small / cheap node or Docker container. + +{ +`docker run \\ + -v /path/to/configs:/configs \\ + snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x} \\ + --config=/configs/bigquery.hocon \\ + --resolver=/configs/resolver.json \\ + --bufferSize=20 \\ # size of the batch to send to the dead-letter bucket + --timeout=20 \\ # duration after which bad rows will be sunk into the dead-letter bucket + --backoffPeriod=900 \\ # seconds to wait before attempting an insert (calculated against etl_tstamp) + --verbose # optional, for debugging only +`} + +`bufferSize`, `timeout` and `backoffPeriod` are optional parameters. + +### Docker support + +All applications are available as Docker images on Docker Hub, based on Ubuntu Focal and OpenJDK 11: + +{ +`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} +$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader1x} +$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x} +$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x} +`} + +

We also provide an alternative lightweight set of images based on Google's "distroless" base image, which may provide some security advantages because they carry fewer dependencies. These images are distinguished by the {`${versions.bqLoader1x}-distroless`} tag:

+ +{ +`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x}-distroless +$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader1x}-distroless +$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x}-distroless +$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x}-distroless +`} + +Mutator, Repeater and Streamloader are also available as fatjar files attached to [releases](https://github.com/snowplow-incubator/snowplow-bigquery-loader/releases) in the project's Github repository. diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/index.md index 290646281a..f6f578d7f4 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/index.md @@ -6,4 +6,9 @@ sidebar_custom_props: outdated: true --- +```mdx-code-block +import DocCardList from '@theme/DocCardList'; + + +``` diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/1-0-x-upgrade-guide/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/1-0-x-upgrade-guide/index.md similarity index 97% rename from docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/1-0-x-upgrade-guide/index.md rename to docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/1-0-x-upgrade-guide/index.md index 6fde64d0cb..88e1871849 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/1-0-x-upgrade-guide/index.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/1-0-x-upgrade-guide/index.md @@ -6,7 +6,7 @@ sidebar_position: 0 ## Configuration -The only breaking change from the 0.6.x series is the new format of the configuration file. That used to be a self-describing JSON but is now HOCON. Additionally, some app-specific command-line arguments have been incorporated into the config, such as Repeater's `--failedInsertsSub` option. For more details, see the [setup guide](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md#setup-guide) and [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/snowplow-bigquery-loader-configuration-reference/index.md). +The only breaking change from the 0.6.x series is the new format of the configuration file. That used to be a self-describing JSON but is now HOCON. Additionally, some app-specific command-line arguments have been incorporated into the config, such as Repeater's `--failedInsertsSub` option. For more details, see the [setup guide](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md#setup-guide) and [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/configuration-reference/index.md). 
Using Repeater as an example, if your configuration for 0.6.x looked like this: diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md new file mode 100644 index 0000000000..78988451ba --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md @@ -0,0 +1,96 @@ +--- +title: "2.0.0 upgrade guide" +sidebar_position: -20 +--- + +## Configuration + +BigQuery Loader 2.0.0 brings changes to the loading setup. It is no longer necessary to configure and deploy three independent applications (Loader, Repeater and Mutator in [1.X](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md)) to load your data into BigQuery. +Starting from 2.0.0, only one application is needed, which naturally introduces some breaking changes to the configuration file structure. + +See the [configuration reference](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/configuration-reference/index.md) for all possible configuration parameters +and the minimal [configuration samples](https://github.com/snowplow-incubator/snowplow-bigquery-loader/blob/v2/config) for each of the supported cloud environments. + +## Infrastructure + +In addition to the Repeater and Mutator applications themselves, the following infrastructure components become obsolete: +* The `types` Pub/Sub topic connecting Loader and Mutator. +* The `failedInserts` Pub/Sub topic connecting Loader and Repeater. +* The `deadLetter` GCS bucket used by Repeater to store data that repeatedly failed to be inserted into BigQuery. + +## Events table format + +Starting from 2.0.0, BigQuery Loader changes its output column naming strategy. For example, for the [ad_click event](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.media/ad_click_event/jsonschema/1-0-0): + +* Before the upgrade, the corresponding column would be named `unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1_0_0`. +* After the upgrade, the new column will be named `unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1`. + +All self-describing events and entities will be loaded into the new major-version-oriented columns. The old full-version-oriented columns will remain unchanged, but no new data will be loaded into them. (The 2.0.0 loader simply ignores these columns.) + +The new column naming scheme has several advantages: +* Fewer columns are created (BigQuery has a limit on the total number of columns) +* No need to update data models (or use complex macros) every time a new minor version of a schema is created + +The catch is that you have to follow the rules of schema evolution more strictly to ensure data from different schema versions can fit in the same column — see below. + +:::tip Consolidating old and new columns + +If you are using [Snowplow dbt models](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/index.md), they will automatically consolidate the data between `_1_0_0` and `_1` style columns, because they look at the major version prefix (e.g. `_1`), which is common to both. 
+ +If you are not using Snowplow dbt models but still use dbt, you can employ [this macro](https://github.com/snowplow/dbt-snowplow-utils#combine_column_versions-source) to manually aggregate the data across old and new columns. + +::: + +## Recovery columns + +### What is schema evolution? + +One of Snowplow’s key features is the ability to [define custom schemas and validate events against them](/docs/understanding-your-pipeline/schemas/index.md). Over time, users often evolve their schemas, e.g. by adding new fields or changing existing fields. To accommodate these changes, BigQuery Loader 2.0.0 automatically adjusts the database tables in the warehouse. + +There are two main types of schema changes: + +**Breaking**: The schema version has to be changed in a major way (`1-2-3` → `2-0-0`). As of BigQuery Loader 2.0.0, each major schema version has its own column (`..._1`, `..._2`, etc., for example: `contexts_com_snowplowanalytics_ad_click_1`). + +**Non-breaking**: The schema version can be changed in a minor way (`1-2-3` → `1-3-0` or `1-2-3` → `1-2-4`). Data is stored in the same database column. + +The loader formats the incoming data according to the latest version of the schema it has seen (for a given major version, e.g. `1-*-*`). For example, if a batch contains events with schema versions `1-0-0`, `1-0-1` and `1-0-2`, the loader derives the output schema based on version `1-0-2`. Then the loader instructs BigQuery to adjust the database column and load the data. + +### Recovering from invalid schema evolution + +Let's consider these two schemas as an example of breaking schema evolution (changing the type of a field from `integer` to `string`) using the same major version (`1-0-0` and `1-0-1`): + +```json +{ +  // 1-0-0 +  "properties": { +    "a": {"type": "integer"} +  } +} +``` + +```json +{ +  // 1-0-1 +  "properties": { +    "a": {"type": "string"} +  } +} +``` + +With BigQuery Loader 1.x, data for each version would go to its own column — no issue. With BigQuery Loader 2.x, there is only one column. But strings and integers can’t coexist! + +To avoid crashing or losing data, BigQuery Loader 2.0.0 handles this by creating a new column for the data with schema `1-0-1`, e.g. `contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999`, where: + - `1_0_1` is the version of the offending schema; + - `9999999` is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one). + +If you create a new schema `1-0-2` that reverts the offending changes and is again compatible with `1-0-0`, the data for events with that schema will be written to the original column as expected. +:::tip +You might find that some of your schemas were evolved incorrectly in the past, which results in the creation of these “recovery” columns after the upgrade. To address this for a given schema, create a new _minor_ schema version that reverts the breaking changes introduced in previous versions. (Or, if you want to keep the breaking change, create a new _major_ schema version.) You can set it to [supersede](/docs/understanding-tracking-design/versioning-your-data-structures/amending/index.md#marking-the-schema-as-superseded) the previous version(s), so that events are automatically validated against the new schema. +::: +:::note + +If events with incorrectly evolved schemas never arrive, the recovery column will not be created. 
+ +::: + +You can read more about schema evolution and how recovery columns work [here](/docs/storing-querying/schemas-in-warehouse/index.md?warehouse=bigquery#versioning). diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/index.md b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/index.md new file mode 100644 index 0000000000..7bea6e4616 --- /dev/null +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/upgrade-guides/index.md @@ -0,0 +1,10 @@ +--- +title: "Upgrade guides" +sidebar_position: 40 +--- + +```mdx-code-block +import DocCardList from '@theme/DocCardList'; + + +``` diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/lake-loader/configuration-reference/_common_config.md b/docs/pipeline-components-and-applications/loaders-storage-targets/lake-loader/configuration-reference/_common_config.md index c0aa98aeab..862f40df09 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/lake-loader/configuration-reference/_common_config.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/lake-loader/configuration-reference/_common_config.md @@ -35,11 +35,11 @@ import Link from '@docusaurus/Link'; Optional. Default snowplow.lakeloader. Prefix used for the metric name when sending to statsd. - sentry.dsn + monitoring.sentry.dsn Optional. Set to a Sentry URI to report unexpected runtime exceptions. - sentry.tags.* + monitoring.sentry.tags.* Optional. A map of key/value strings which are passed as tags when reporting exceptions to Sentry. diff --git a/docs/pipeline-components-and-applications/loaders-storage-targets/snowflake-streaming-loader/configuration-reference/_common_config.md b/docs/pipeline-components-and-applications/loaders-storage-targets/snowflake-streaming-loader/configuration-reference/_common_config.md index 6ff31e9e04..ff3d79338c 100644 --- a/docs/pipeline-components-and-applications/loaders-storage-targets/snowflake-streaming-loader/configuration-reference/_common_config.md +++ b/docs/pipeline-components-and-applications/loaders-storage-targets/snowflake-streaming-loader/configuration-reference/_common_config.md @@ -68,11 +68,11 @@ import Link from '@docusaurus/Link'; Optional. A map of key/value strings to be included in the payload content sent to the webhook. - sentry.dsn + monitoring.sentry.dsn Optional. Set to a Sentry URI to report unexpected runtime exceptions. - sentry.tags.* + monitoring.sentry.tags.* Optional. A map of key/value strings which are passed as tags when reporting exceptions to Sentry. 
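To make the renamed keys concrete, here is a minimal sketch of how the `monitoring.sentry` block could nest in the loader's HOCON configuration. The values are placeholders and the surrounding structure is inferred from the parameter paths above, not taken from a shipped sample:

```json
{
  "monitoring": {
    "sentry": {
      "dsn": "https://public@sentry.example.com/1",
      "tags": {
        "pipeline": "prod"
      }
    }
  }
}
```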
diff --git a/docs/storing-querying/loading-process/index.md b/docs/storing-querying/loading-process/index.md index 7a3e4290da..a559e2e9f9 100644 --- a/docs/storing-querying/loading-process/index.md +++ b/docs/storing-querying/loading-process/index.md @@ -9,7 +9,8 @@ description: "A high level view of how Snowplow data is loaded into Redshift, Bi import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import RDBLoaderDiagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/_cross-cloud-diagram.md'; -import BigQueryLoaderDiagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md'; +import BigQueryLoaderDiagramV1 from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/_diagram.md'; +import BigQueryLoaderDiagramV2 from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_cross-cloud-diagram.md'; import LakeLoaderDiagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/lake-loader/_cross-cloud-diagram.md'; import SnowflakeStreamingLoaderDiagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/snowflake-streaming-loader/_cross-cloud-diagram.md'; ``` @@ -28,8 +29,14 @@ We load data into Redshift using the [RDB Loader](/docs/pipeline-components-and- We load data into BigQuery using the [BigQuery Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md). - - + + + + + + + + diff --git a/docs/storing-querying/querying-data/index.md b/docs/storing-querying/querying-data/index.md index 1c589503c9..fd4ec18354 100644 --- a/docs/storing-querying/querying-data/index.md +++ b/docs/storing-querying/querying-data/index.md @@ -80,12 +80,16 @@ You can query fields in the self-describing event like so: ```sql SELECT ... - unstruct_event_my_example_event_1_0_0.my_field, + unstruct_event_my_example_event_1.my_field, ... FROM ``` +:::note +Column name produced by previous versions of the BigQuery Loader (<2.0.0) would contain full schema version, e.g. `unstruct_event_my_example_event_1_0_0` +::: + @@ -174,7 +178,7 @@ You can query a single entity’s fields by extracting them like so: ```sql SELECT ... - contexts_my_entity_1_0_0[SAFE_OFFSET(0)].my_field AS my_field, + contexts_my_entity_1[SAFE_OFFSET(0)].my_field AS my_field, ... FROM @@ -190,8 +194,11 @@ SELECT FROM LEFT JOIN - unnest(contexts_my_entity_1_0_0) AS my_ent -- left join to avoid discarding events without values in this entity + unnest(contexts_my_entity_1) AS my_ent -- left join to avoid discarding events without values in this entity ``` +:::note +Column name produced by previous versions of the BigQuery Loader (<2.0.0) would contain full schema version, e.g. `contexts_my_entity_1_0_0`. +::: diff --git a/docs/storing-querying/schemas-in-warehouse/index.md b/docs/storing-querying/schemas-in-warehouse/index.md index 25b783edb6..7e14b3a6a5 100644 --- a/docs/storing-querying/schemas-in-warehouse/index.md +++ b/docs/storing-querying/schemas-in-warehouse/index.md @@ -108,9 +108,34 @@ For example, suppose you have the following field in the schema: It will be translated into a field called `last_name` (notice the underscore), of type `STRING`. + + + + +Each type of self-describing event and each type of entity get their own dedicated columns in the `events` table. 
The name of such a column is composed of the schema vendor, schema name and major schema version (more on versioning [later](#versioning)). + +Examples: + +| Kind | Schema | Resulting column | +|---|---|---| +| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1` | +| Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1` | + + + Each type of self-describing event and each type of entity get their own dedicated columns in the `events` table. The name of such a column is composed of the schema vendor, schema name and full schema version (more on versioning [later](#versioning)). + +Examples: + + | Kind | Schema | Resulting column | + |---|---|---| + | Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1_0_0` | + | Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1_0_0` | + + + The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. _(In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)_ @@ -120,13 +145,6 @@ All characters are converted to lowercase and all symbols (like `.`) are replace ::: -Examples: - -| Kind | Schema | Resulting column | -|---|---|---| -| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1_0_0` | -| Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1_0_0` | - For self-describing events, the column will be of a `RECORD` type, while for entities the type will be `REPEATED RECORD` (because an event can have more than one entity attached). Inside the record, there will be fields corresponding to the fields in the schema. Their types are determined according to the logic described [below](#types). @@ -155,6 +173,7 @@ For example, suppose you have the following field in the schema: It will be translated into a field called `last_name` (notice the underscore), of type `STRING`. + Each type of self-describing event and each type of entity get their own dedicated columns in the `events` table. The name of such a column is composed of the schema vendor, schema name and major schema version (more on versioning [later](#versioning)). @@ -332,7 +351,24 @@ Note that this behavior was introduced in RDB Loader 5.3.0. + + + +Because the column name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new column: + +| Schema | Resulting column | +|---|---| +| `com.example/button_press/jsonschema/1-0-0` | `unstruct_event_com_example_button_press_1` | +| `com.example/button_press/jsonschema/1-2-0` | `unstruct_event_com_example_button_press_1` | +| `com.example/button_press/jsonschema/2-0-0` | `unstruct_event_com_example_button_press_2` | + +When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing column automatically. For example, if you add a new optional field in the schema, a new optional field will be added to the `RECORD`. 
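As a hypothetical illustration of such a non-breaking change (the `com.example/button_press` schema comes from the table above, but its fields are invented for this sketch), adding an optional property in a new minor version keeps the data flowing into the same `unstruct_event_com_example_button_press_1` column, and the `RECORD` simply gains a new nullable field:

```json
{
  // com.example/button_press/jsonschema/1-0-0
  "properties": {
    "buttonId": {"type": "string"}
  }
}
```

```json
{
  // com.example/button_press/jsonschema/1-0-1 (adds an optional field)
  "properties": {
    "buttonId": {"type": "string"},
    "label": {"type": "string"}
  }
}
```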
+:::info Breaking changes + +::: + + Because the column name for the self-describing event or entity includes the full schema version, each version of a schema gets a new column: | Schema | Resulting column | @@ -345,12 +381,14 @@ If you are [modeling your data with dbt](/docs/modeling-your-data/modeling-your- :::info Breaking changes -While our recommendation is to use major schema versions to indicate breaking changes (e.g. changing a type of a field from a `string` to a `number`), this is not particularly relevant for BigQuery. Indeed, each schema version gets its own column, so there is no difference between major and minor versions. That said, we believe sticking to our recommendation is a good idea: +While our recommendation is to use major schema versions to indicate breaking changes (e.g. changing a type of a field from a `string` to a `number`), this is not particularly relevant for BigQuery Loader version 1.x. Indeed, each schema version gets its own column, so there is no difference between major and minor versions. That said, we believe sticking to our recommendation is a good idea: * Breaking changes might affect downstream consumers of the data, even if they don’t affect BigQuery -* In the future, you might decide to migrate to a different data warehouse where our rules are stricter (e.g. Databricks) +* Version 2 of the loader has stricter behavior that matches our loaders for other warehouses and lakes ::: + + diff --git a/docs/storing-querying/storage-options/index.md b/docs/storing-querying/storage-options/index.md index 50f0911f62..64c66d1ec0 100644 --- a/docs/storing-querying/storage-options/index.md +++ b/docs/storing-querying/storage-options/index.md @@ -25,6 +25,7 @@ The cloud selection is for where your Snowplow pipeline runs. The warehouse itse | Destination | Type | Loader application | Status | | --- | --- | --- | --- | | Redshift
_(including Redshift serverless)_ | Batching (recommended)
or micro-batching | [RDB Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) | Production-ready | +| BigQuery | Streaming | [BigQuery Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md) | Production-ready | | Snowflake | Batching (recommended)
or micro-batching | [Snowplow RDB Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) | Production-ready | | Snowflake | Streaming 🆕 | [Snowflake Streaming Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowflake-streaming-loader/index.md) | Early release | | Databricks | Batching (recommended)
or micro-batching | [Snowplow RDB Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) | Production-ready | @@ -46,6 +47,7 @@ The cloud selection is for where your Snowplow pipeline runs. The warehouse itse | Destination | Type | Loader application | Status | | --- | --- | --- | --- | +| BigQuery | Streaming | [BigQuery Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md) | Production-ready | | Snowflake | Micro-batching | [RDB Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) | Early release | | Snowflake | Streaming 🆕 | [Snowflake Streaming Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowflake-streaming-loader/index.md) | Early release | | Databricks | Micro-batching
_(via a [data lake](#data-lake-loaders))_ | [Lake Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/lake-loader/index.md) | Early release | diff --git a/src/componentVersions.js b/src/componentVersions.js index 4c0faf4353..766d96e941 100644 --- a/src/componentVersions.js +++ b/src/componentVersions.js @@ -29,7 +29,8 @@ export const versions = { snowbridge: '2.3.0', // Loaders - bqLoader: '1.7.1', + bqLoader: '2.0.0', + bqLoader1x: '1.7.1', esLoader: '2.1.2', gcsLoader: '0.5.5', postgresLoader: '0.3.3',