From 28cdc801c317e943f14e7eb6f59d21268b56302e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Haris=20Osmanagi=C4=87?= Date: Thu, 12 Sep 2024 14:57:36 +0200 Subject: [PATCH] Guidelines for writing a connector (#1802) --- docs/writing-a-connector-guidelines.md | 263 +++++++++++++++++++++++++ 1 file changed, 263 insertions(+) create mode 100644 docs/writing-a-connector-guidelines.md diff --git a/docs/writing-a-connector-guidelines.md b/docs/writing-a-connector-guidelines.md new file mode 100644 index 000000000..3c14473fa --- /dev/null +++ b/docs/writing-a-connector-guidelines.md @@ -0,0 +1,263 @@ +# Guidelines for writing a connector + +This document describes patterns and processes that have been successfully used +in many Conduit connectors over the past few years. The document assumes basic +knowledge about how Conduit works. + +## Start with the connector template + +The GitHub +repository [template](https://github.com/ConduitIO/conduit-connector-template/) +for Conduit connectors provides a basic project setup that includes placeholders +for the source connector code, destination connector code, connector +specifications, an example of a configuration, etc. + +Also included is a `Makefile` with commonly used targets and GitHub workflows +for linting the code, running unit tests, automatically merging minor dependabot +upgrades, etc. + +## Research the 3rd party system + +Researching the 3rd party system for which a Conduit connector is built helps +understand the capabilities and limitations of the system, which in turn results +in a better connector design and better work organization. + +Some questions that typically need to be answered: + +1. **How is the data organized in the system?** + + The way data is organized in the system will affect how the connector is + configured (e.g. what data is a user able to read or write to) and how the + hierarchy is mapped to records and collections in Conduit. +2. **What parameters are needed to connect to the system?** + + We generally recommend using connection strings/URIs, if available. The + reason for this is that modifying a connector's parameters is a matter of + changing an existing string, and not separate configuration parameters in a + connector. +3. **What APIs or drivers are available?** + + The choice is influenced by a number of factors, such as: + - Is there a driver available? If a public API and a driver are available, we + recommend using a driver, since it's a higher level of abstraction. + - In which language should the connector be written? + - If the language of choice is Go, is a pure Go driver available? A CGo + driver may make the usage and deployment of a connector more complex. + - Does the driver expose all the functionalities needed? + - How mature is the driver? +4. **What authentication methods should be supported?** + + If multiple authentication methods are available, then a decision needs to be + made about the methods that should be implemented first. Additionally, it + should be understood how expired credentials should be handled. For example, + a connector won't be able to handle an expired password, but a token can + sometimes be refreshed. +5. **Can the connector be isolated from other clients of the system?** + + In some cases a connector, as a client using a system, might affect other + clients, for example when getting messages from a message broker's queue, the + message will be delivered to the connector, while other clients might expect + the message. +6. **Are the source/destination specific features supported?** + + Source and destination connectors may have specific requirements (some of + them are outlined in later sections). When researching, attention should be + paid if those requirements can be fulfilled. + +## Development + +### Familiarize with the Connector SDK/connector protocol + +Conduit's Connector SDK is the Go software development kit for implementing a +connector for Conduit. If you want to implement a connector in another language +please check +the [connector protocol](https://github.com/conduitio/conduit-connector-protocol). + +### Write a test pipeline + +Having a test pipeline helps see the results of development sooner rather than +later. When working on a source connector, a file or log destination can be used +to build a test pipeline. When working on a destination, a file or generator +source can be used to manually or automatically generate some test data for the +destination connector. + +### Use built-in connectors as examples + +Conduit's [built-in connectors](https://conduit.io/docs/connectors/connector-list/) +are great references that can be used when writing connectors. They are also the +first ones to be updated with latest SDK and protocol changes. + +### Configuration + +A connector's configuration is a map in which keys are parameter names and +values are parameter values. Every source and destination needs to return the +parameters that it accepts, and, optionally, the validations that need to be run +on those. + +Conduit +offers [ParamGen](https://github.com/ConduitIO/conduit-commons/tree/main/paramgen), +a tool that generates the parameters map from a configuration struct. The SDK +contains a function, `sdk.Util.ParseConfig`, that can parse and validate a +user's configuration map into the configuration struct. + +### Middleware + +Conduit's Connector SDK adds default middleware that, by default, enables some +functionality. A connector developer should be familiar with the middleware, +especially with +the [schema related middleware](https://conduit.io/docs/connectors/configuration-parameters/schema-extraction). + +### Source connectors + +The following section summarizes best practices that are specific to source +connectors. + +#### Deciding how should a record position look like + +Every record read by a source connector has +a [position](https://conduit.io/docs/features/opencdc-record#fields) attached. +If a connector is stopped after that record is read, its position will be given +to the connector's `Open()` method the next time it starts. Hence, a position +needs to contain enough information for a source connector to resume reading +records from where it exactly stopped. + +A position is a slice of bytes that can represent any data structure. In Conduit +connectors, it's common to see that a position is actually a `struct`, that's +marshalled into a JSON string. + +#### Implementing snapshots + +Firstly, it should be clarified if supporting snapshots is a requirement or if +it's possible to do at all. If a connector is required to support snapshots, +then it's recommended to make it possible to turn off snapshots through the +connector configuration. + +The following things need to be taken into account when implementing a snapshot +procedure: + +- The snapshot needs to be consistent. +- The set of the existing data can be quite large. +- Restarting a connector during a snapshot should **not** require re-reading all + the data again. This is because in some destination connectors it may cause + data duplication, and it could be a significant performance overhead. + +#### Implementing change data capture (CDC) + +Change Data Capture (CDC) should be implemented so that the following criteria +is met: + +1. If a snapshot is needed, changes that happened while the + snapshot was running should be captured too. +2. Different types of changes might be possible (new data inserted, existing + data updated or deleted). +3. Changes that happened while a connector was stopped need to be read by the + connector when it starts up (assuming that the changes are still present in + the source system). + +Some source systems may provide a change log (e.g. WAL in PostgreSQL, binlog in +MySQL, etc.). Others may not and a different way to capture changes is needed. +Specifically, for connectors written for SQL databases, two patterns can be +used: + +1. **Triggers** + + With triggers, it's possible to capture all types of changes (creates, + updates and deletes). A trigger will write the event with necessary + metadata (operation performed, timestamp, etc.) into a "trigger table", that + will be read by the connector. + + An advantage of this approach is that all types of operations can be + captures. However, it may incur a performance penalty. +2. **A timestamp-based query** + + If the table a source connector is reading from has a timestamp column that + is updated whenever a new row is inserted or an existing row is updated, then + a query can be used to fetch changes. A disadvantage of this approach is that + delete operations cannot be captured. + +#### Iterator pattern + +A pattern that we found to be useful when writing source connectors is the +iterator pattern. The basic idea is that a source connector's `Read()` method +reads records through an iterator: + +```go +type Iterator interface { + HasNext() bool + Next() opencdc.Record +} + +type Source struct {} +func (s *Source) Read(ctx context.Context) (opencdc.Record, error) { + if s.iterator.HasNext(ctx) { + return opencdc.Record{}, sdk.ErrBackoffRetry + } + + return s.iterator.Next(ctx) +} +``` + +There are three implementations of the iterator interface: + +- `SnapshotIterator`: used when performing a snapshot +- `CDCIterator`: used in CDC +- `CombinedIterator`: an iterator that combines the above two and is able to + perform a snapshot and then switch to CDC. + +The iterator to be used is determined in the source's `Open()` method based on +the connector's configuration: if a snapshot is required, then a +`CombinedIterator` will be used, if not, then the `CDCIterator` will be used. + +There are multiple advantages of this approach. The source connector remains " +focused" on the higher level operations, such as configuration, opening the +connectors, tearing down, etc. The iterators contain the code that deals with +the source system itself and convert the data into `opencdc.Record`s. They also +take care of switching from snapshot mode to CDC mode. + +### Destination connectors + +#### Batching + +Batching can considerably improve the write performance and should be considered +when developing a destination connector. The Connector SDK +provides [batching middleware](https://conduit.io/docs/connectors/configuration-parameters/batching) +that automatically builds batches. + +When writing batches into a destination system, the order of records should +**not** be changed, even if grouping records would be useful in a way. For +example, in some connectors, it's useful to group records by operation, because +the destination system has different procedures for those operations. + +If a destination connector receives the following batch: + +```text +[create_1, update_1, update_2, create_2] +``` + +then this batch can be split into following batches: + +```text +Batch 1: [create_1] +Batch 2: [update_1, update_2] +Batch 3: [create_2] +``` + +This way, ordering is preserved and the connector can still take advantage of +batching. + +### Acceptance tests + +Conduit's Connector SDK contains a set of acceptance tests that verify many of a +connector's methods, such as: + +- Are missing configuration parameters detected? +- Can a source that is stopped during CDC be successfully restarted? +- Can a destination write a batch of records? + +An example of implemented acceptance tests can be found in +the [Kafka connector](https://github.com/ConduitIO/conduit-connector-kafka/blob/3233c37392e7249e9d64c9cae0374a0b47ca840f/acceptance_test.go). + +### Debugging the connector + +The steps for debugging a standalone connector have been +described [here](https://github.com/ConduitIO/conduit/discussions/1724#discussioncomment-10178484).