feat(destination): add destination connector #1

donatorsky · 2022-04-28T13:45:24Z

Description

Adds destination handling to Elasticsearch connector.

hariso

Nice work!

test/v5/connector_test.go

hariso · 2022-05-19T15:36:51Z

destination/config.go

+
+func ParseConfig(cfgRaw map[string]string) (_ Config, err error) {
+	cfg := Config{
+		Version:                cfgRaw[ConfigKeyVersion],


Given that ES itself reports its version (when you go to localhost:9200 for example), do we need the user to manually enter this?

We do, since it is not always available immediately. The server can require the client to be authorized first. In such a case, I would need to use some version of the client, see if it connects, get the version, and then switch to the proper version of the client.

Gotcha. However, the request to localhost:9200 is like any other request. It can fail only if the config is missing credentials, in which case the connector is not able to work anyway. In other words, we have three possibilities:

ES server requires auth, but the connector config doesn't have it: We will ping localhost:9200, get a 401 code back, and will let the user know that the config is invalid (missing creds).

ES doesn't require auth, the connector config doesn't have it: all good

ES requires auth, connector config has it: pinging the host to get the version will work.

That assumes that, e.g., ES SDK v8 (and future versions) can authenticate against ES server v5 (or any previous version in general). A similar thing happened to the MySQL connector for PHP after MySQL 8.0 was released and changed the default auth plugin (https://stackoverflow.com/questions/49083573/php-7-2-2-mysql-8-0-pdo-gives-authentication-method-unknown-to-the-client-ca), which led to a kind of false positive: the user and password are correct, but an additional configuration parameter is required.

However, I think we are good to go with this. I see it currently as a rare case, since Elasticsearch is fairly careful with its changes (they either do not change features that already work well or announce changes early enough). I've created issue #4 for that.
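For reference, the version lookup discussed in this thread could look roughly like the sketch below. It assumes the root endpoint answers with JSON that carries `version.number` and that the server uses basic auth. `detectMajorVersion`, `newFakeServer`, and `errUnauthorized` are hypothetical names for illustration, not code from this PR, and the fake server only stands in for Elasticsearch so the example runs offline.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// errUnauthorized signals that the server wants credentials the config lacks.
var errUnauthorized = errors.New("server requires credentials that the config does not provide")

// rootInfo mirrors only the part of the root endpoint response we need.
type rootInfo struct {
	Version struct {
		Number string `json:"number"`
	} `json:"version"`
}

// detectMajorVersion (hypothetical helper) pings the root endpoint, optionally
// with basic auth, and returns the major version, e.g. "7" for "7.17.3".
func detectMajorVersion(client *http.Client, host, user, pass string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, host, nil)
	if err != nil {
		return "", fmt.Errorf("build request: %w", err)
	}
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", fmt.Errorf("ping %s: %w", host, err)
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		// fall through to decoding below
	case http.StatusUnauthorized:
		return "", errUnauthorized
	default:
		return "", fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	var info rootInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return "", fmt.Errorf("decode response: %w", err)
	}
	major := strings.SplitN(info.Version.Number, ".", 2)[0]
	if major == "" {
		return "", errors.New("response carries no version number")
	}
	return major, nil
}

// newFakeServer stands in for Elasticsearch: it demands basic auth and then
// reports the given version.
func newFakeServer(version string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, _, ok := r.BasicAuth(); !ok {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		fmt.Fprintf(w, `{"version":{"number":%q}}`, version)
	}))
}

func main() {
	srv := newFakeServer("7.17.3")
	defer srv.Close()

	_, err := detectMajorVersion(srv.Client(), srv.URL, "", "")
	fmt.Println("without credentials:", err)

	major, err := detectMajorVersion(srv.Client(), srv.URL, "elastic", "secret")
	fmt.Println("with credentials:", major, err)
}
```

The sketch does not cover the concern raised above, where a newer client cannot authenticate against an older server even with correct credentials.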

destination/destination.go

hariso · 2022-05-19T15:57:42Z

destination/destination.go

+	// Sort operations to ensure the order
+	d.operationsQueue.Sort()


It's great that you thought about this case. I have a question: the purpose of this sort is to make sure that operations come in order (e.g. that we don't have a delete before a create). However, even this sort cannot guarantee that since it's local. We may see out-of-order records coming in the batch before this one or after this one. For example:
batch 1: record deleted at 10 AM
batch 2: record created at 9 AM
Simply changing the batch size changes the effect of this sort. And since we can't guarantee the effect of it, maybe we shouldn't do it. WDYT?

That's correct. The sort is there to preserve the order of events as much as possible, but there is still an edge case: two consecutive events are sent, the second one arrives earlier, fills the buffer, and its change is applied first.

I'm afraid there is not much to do about this. If we remove the sorting logic, we can get a much bigger mess where the situation you described happens more often. Increasing the buffer size and sorting minimizes this, on average. We could also hold back the last n records when applying changes and save them for the next batch, but we don't know whether the missing record is the last or the first.
A possible "solution" (a workaround, and still not perfect) is to implement in Conduit something similar to TCP/IP packets, where they are numbered and a client that gets no. 1 and no. 3 still waits for no. 2 (e.g. some BatchID attached to the Record). But even then:

There is no guarantee that no.2 will ever come, and a pipeline will be stuck. Timeouts, if added, will become a real pain when this happens frequently.

There is no guarantee that source connector received data in order. It would have to be implemented by actual source.

The order does not always matter. When Record.Key is not set, records are always created. This also solves it for certain use cases. For that, I think it would be useful to have a dedicated config "Always create" to disable the upsert logic even when Record.Key is available.
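To make the numbered-packets idea above concrete, here is a minimal sketch of a reorder buffer. It assumes every record carries a gap-free sequence number, which Conduit records do not have; `numbered`, `Seq`, and `reorderBuffer` are hypothetical names.

```go
package main

import "fmt"

// numbered is a hypothetical wrapper; the Seq field does not exist on real records.
type numbered struct {
	Seq     int
	Payload string
}

// reorderBuffer releases records strictly in sequence order and holds back
// anything that arrives ahead of a gap.
type reorderBuffer struct {
	next    int
	pending map[int]numbered
}

func newReorderBuffer(first int) *reorderBuffer {
	return &reorderBuffer{next: first, pending: make(map[int]numbered)}
}

// Push stores r and returns every record that is now deliverable in order.
func (b *reorderBuffer) Push(r numbered) []numbered {
	b.pending[r.Seq] = r
	var ready []numbered
	for {
		n, ok := b.pending[b.next]
		if !ok {
			return ready
		}
		ready = append(ready, n)
		delete(b.pending, b.next)
		b.next++
	}
}

func main() {
	buf := newReorderBuffer(1)
	fmt.Println(len(buf.Push(numbered{Seq: 1, Payload: "a"}))) // 1
	fmt.Println(len(buf.Push(numbered{Seq: 3, Payload: "c"}))) // 0, still waiting for no. 2
	fmt.Println(len(buf.Push(numbered{Seq: 2, Payload: "b"}))) // 2, releases no. 2 and no. 3
}
```

If no. 2 never arrives, no. 3 stays in `pending` forever, which is exactly the stuck-pipeline problem from the first point above.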

I agree with what you're saying. Let's add a comment to that line above, briefly explaining why we do that. If we ever start seeing the mentioned edge cases or performance degradation, we can work on it then; it's probably too early to work on it now anyway.

I'm curious why are we even concerned about records arriving out of order? Conduit guarantees message order, so unless the source produces the records out of order I don't see why we would need to sort them. Even if that is the case, it sounds like a bug that needs to be fixed by the source and not something the destination should address.

@donatorsky IMHO the point @lovromazgon made is a good one. Given that the sort is running always (so there's always some performance penalty), but that it's not a 100% effective solution (due to batches) and that another condition is that sources messed something up (because as Lovro said, Conduit itself guarantees message order), it looks like we shouldn't be doing it. What are your thoughts on this?

In your example you say that those two events were sent at those times. With that, you mean issued by Salesforce?

That's correct, issued by Salesforce, and they have these dates in the payload:
1st: Sent at 15:30:01 with payload {"created_at":"yyyy-mm-dd 15:30:01"}, received 15:30:12.
2nd: Sent at 15:30:02 with payload {"created_at":"yyyy-mm-dd 15:30:02"}, received 15:30:11.

IIUC you are saying that the Salesforce source connector can create records in an order that is different from the order of events as they happened in Salesforce. This sounds like a bug in the source connector to me, definitely not something the destination should care about; otherwise we would need to include this logic in each and every destination connector, just in case.

You are saying that Conduit manages the order of messages received (i.e. HTTP requests), but not the actual order of messages specified in the payload.

What I am trying to say is that the source connector is completely in charge of creating the records in whichever order it chooses. Conduit will make sure that the order remains the same all the way to the destination, so if the source connector produces record A and then record B, the destination will receive record A and then record B.

To sum up, the source connector has more information about the records it creates than any other component in the pipeline, so it is the best place to decide the correct order of records.

💯 If, for any reason, sorting is needed, then the source connector sounds like a better place for that. Plus, sorting will be done even when it's not needed, which affects performance (even if small) for no reason.

Updated: d7b7145
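The batch-size dependence that started this thread can be reproduced with a small sketch. It assumes each operation carries a timestamp and that every batch is sorted on its own before being applied; `op` and `applyInBatches` are hypothetical names, not the connector's types.

```go
package main

import (
	"fmt"
	"sort"
)

// op is a hypothetical operation; At is the hour at which the event happened.
type op struct {
	Action string
	At     int
}

// applyInBatches cuts ops into batches of batchSize (must be > 0), sorts each
// batch by timestamp, and returns the order in which operations get applied.
func applyInBatches(ops []op, batchSize int) []op {
	var applied []op
	for start := 0; start < len(ops); start += batchSize {
		end := start + batchSize
		if end > len(ops) {
			end = len(ops)
		}
		batch := append([]op(nil), ops[start:end]...) // copy, keep the input intact
		sort.SliceStable(batch, func(i, j int) bool { return batch[i].At < batch[j].At })
		applied = append(applied, batch...)
	}
	return applied
}

func main() {
	// The delete (10 AM) arrives before the create (9 AM).
	incoming := []op{{"delete", 10}, {"create", 9}}

	fmt.Println(applyInBatches(incoming, 2)) // [{create 9} {delete 10}]
	fmt.Println(applyInBatches(incoming, 1)) // [{delete 10} {create 9}]
}
```

With both records in one batch the sort repairs the order; with a batch size of 1 it has no effect, so the result depends on a setting that has nothing to do with the data.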

internal/elasticsearch/factory_test.go

donatorsky · 2022-05-20T08:43:18Z

@hariso Some comments appeared twice, I removed duplicates 🙂

hariso · 2022-05-20T09:03:07Z

@hariso Some comments appeared twice, I removed duplicates 🙂

😕

README.md

hariso · 2022-06-02T12:43:06Z

destination/destination.go

+				)
+			} else {
+				d.operationsQueue[n].err = fmt.Errorf(
+					"item with key=%s create/upsert/delete failure: [%s] %s: %s",


Is it possible to know which operation exactly was it?

Yes, I can get it from the switch statement a few lines earlier. Will add this 🙂
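A minimal sketch of what that could look like, assuming the operation name is already known from the switch statement. `bulkItemError` and its fields are hypothetical placeholders for whatever failure details the bulk response provides.

```go
package main

import "fmt"

// bulkItemError is a hypothetical stand-in for the per-item failure details.
type bulkItemError struct {
	Status string
	Type   string
	Reason string
}

// itemError names the exact operation instead of "create/upsert/delete".
func itemError(operation, key string, e bulkItemError) error {
	return fmt.Errorf(
		"item with key=%s %s failure: [%s] %s: %s",
		key, operation, e.Status, e.Type, e.Reason,
	)
}

func main() {
	err := itemError("delete", "42", bulkItemError{
		Status: "404",
		Type:   "not_found",
		Reason: "document missing",
	})
	fmt.Println(err) // item with key=42 delete failure: [404] not_found: document missing
}
```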

hariso · 2022-06-02T12:46:29Z

destination/destination.go

+	// Execute operations
+	retriesLeft := d.config.Retries
+
+	for {


This for loop is for retries, right?

Not a blocker for this PR, but it would be nice to split this whole method into smaller ones, and make the retry logic stand out.

Created #7 for that
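One way the retry logic could stand out is a small helper like the sketch below. It assumes `Retries` means "additional attempts after the first one", which is a guess about the config semantics; `withRetries` and `errTemporary` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errTemporary is a placeholder failure used by the example.
var errTemporary = errors.New("temporary failure")

// withRetries calls fn once and then up to `retries` more times while it fails.
// It returns nil on the first success, or the last error wrapped with context.
func withRetries(retries int, fn func() error) error {
	if retries < 0 {
		retries = 0
	}
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", retries, err)
}

func main() {
	calls := 0
	err := withRetries(2, func() error {
		calls++
		if calls < 3 {
			return errTemporary // fail twice, then succeed
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```

The calling method then only has to describe one attempt, and backoff or partial-failure handling can be added inside the helper later.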

hariso · 2022-06-02T12:51:17Z

destination/destination.go

+	actionCreate  = "create"
+	actionCreated = "created"
+	actionUpdate  = "update"
+	actionUpdated = "updated"
+	actionDelete  = "delete"
+	actionDeleted = "deleted"


Why do we need both flavors (create and created, etc.)?

It seems not to be standardized, so I wanted to support as many reasonable possibilities as possible.

You mean, not standardized in the ES API?

No, these are values from the Record's metadata["action"] field. So it actually depends on creators of source connectors 😉 I've seen you use action=delete for S3 and insert, update, delete for Postgres while Salesforce uses created, updated etc.

Gotcha. We're working on getting this standardized through OpenCDC. So it would be good to mention the reason for the duplication in a comment.

@donatorsky Is there a special reason why the Salesforce connector uses the "d" version? The S3 and Pg connectors are in line with the OpenCDC guidelines here: https://github.com/ConduitIO/conduit/blob/main/docs/design-documents/20220309-opencdc.md?plain=1#L267-L272

It is because I get the "d" version directly from Salesforce in the query response.

From this PR's point of view, it seems I should drop the "d" versions, go with the OpenCDC guidelines, and update the Salesforce connector to translate SF ⇒ OpenCDC actions. Should I proceed with these steps?

@donatorsky That's right, I'd go with that. Internally, connectors are free to do what they want, but the input and output should match the Conduit APIs and the OpenCDC guidelines (which are work in progress obviously and will be refined over time).

Updated: 9fe58b3

And Salesforce counterpart:
conduitio-labs/conduit-connector-salesforce@f1d0dec
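For readers following along, the SF ⇒ OpenCDC translation agreed on in this thread could be sketched as below. The mapping is inferred from the constants in the diff hunk above and is not taken from the linked commits; `toOpenCDCAction` is a hypothetical name.

```go
package main

import "fmt"

// Action names the destination accepts after the "d" variants were dropped.
const (
	actionCreate = "create"
	actionUpdate = "update"
	actionDelete = "delete"
)

// toOpenCDCAction (hypothetical) translates the past-tense values Salesforce
// reports into the action names used on the record's metadata["action"].
func toOpenCDCAction(sfAction string) (string, error) {
	switch sfAction {
	case "created":
		return actionCreate, nil
	case "updated":
		return actionUpdate, nil
	case "deleted":
		return actionDelete, nil
	default:
		return "", fmt.Errorf("unsupported Salesforce action %q", sfAction)
	}
}

func main() {
	for _, a := range []string{"created", "updated", "deleted", "merged"} {
		action, err := toOpenCDCAction(a)
		fmt.Println(action, err)
	}
}
```

Doing the translation at the source keeps every destination free of per-source special cases, which is the point made above about inputs and outputs matching the Conduit APIs.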

hariso · 2022-06-02T13:01:45Z

internal/elasticsearch/client.go

+	// PrepareCreateOperation prepares insert operation definition
+	PrepareCreateOperation(item sdk.Record) (metadata interface{}, payload interface{}, err error)
+
+	// PrepareUpsertOperation prepares upsert operation definition
+	PrepareUpsertOperation(key string, item sdk.Record) (metadata interface{}, payload interface{}, err error)
+
+	// PrepareDeleteOperation prepares delete operation definition
+	PrepareDeleteOperation(key string) (metadata interface{}, err error)


These are bulk operations right?

These create a single change definition for bulk query. The bulk query itself is composed of these later.

Gotcha, I guessed so. It would be nice to have it in the name, or the comment at the very least.

The underlying implementation differs between ES versions, for the same reason as here:
#1 (comment)

It is definitely missing a comment saying that it is for bulk, despite the fact that bulk is the only operation supported by the connector. I'm adding this 🙂
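As an illustration of how such single definitions end up in one bulk query, the sketch below joins metadata and payload pairs into the newline-delimited JSON body that the Elasticsearch Bulk API expects. `bulkItem` and `buildBulkBody` are hypothetical names, and the plain maps stand in for the version-specific structs used in this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// bulkItem pairs the action metadata line with an optional payload line.
// Delete operations have no payload, so Payload stays nil for them.
type bulkItem struct {
	Metadata interface{}
	Payload  interface{}
}

// buildBulkBody writes one JSON document per line: metadata first, then the
// payload if there is one. Encoder.Encode appends the newline itself.
func buildBulkBody(items []bulkItem) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for i, it := range items {
		if err := enc.Encode(it.Metadata); err != nil {
			return nil, fmt.Errorf("item %d metadata: %w", i, err)
		}
		if it.Payload == nil {
			continue
		}
		if err := enc.Encode(it.Payload); err != nil {
			return nil, fmt.Errorf("item %d payload: %w", i, err)
		}
	}
	return buf.Bytes(), nil
}

func main() {
	body, err := buildBulkBody([]bulkItem{
		{ // upsert: update action with doc_as_upsert
			Metadata: map[string]interface{}{"update": map[string]string{"_index": "users", "_id": "1"}},
			Payload:  map[string]interface{}{"doc": map[string]string{"name": "Ann"}, "doc_as_upsert": true},
		},
		{ // delete: metadata line only
			Metadata: map[string]interface{}{"delete": map[string]string{"_index": "users", "_id": "2"}},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```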

test/v5/connector_test.go

hariso · 2022-06-02T13:17:19Z

internal/elasticsearch/v7/bulk_request_metadata.go

+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package v7


Is there a difference between the requests (and responses) across different versions, especially v5-v7?

If there are no changes, it looks like we can take advantage of that, and simplify the client code. v8 is the one being actively maintained, so if v5-v7 are the same, they will stay the same.

They are different, unfortunately :/ v6 introduces RetryOnConflict int for update action and v7 removes support for indices' types.

hariso · 2022-06-02T13:19:37Z

spec.go

+			destination.ConfigKeyVersion: {
+				Default:     "",
+				Required:    true,
+				Description: "The version of the Elasticsearch service.",


Would be nice to mention what are the possible values (and in which format, e.g. is it 8 or v8). It is mentioned in the ReadMe (which is great!) but these specs will be used to automatically populate options in a UI widget.

Gotcha, added 🙂

hariso · 2022-06-02T13:20:58Z

test/docker-compose.v5.overrides.yml

+version: '3.9'
+
+services:
+  kibana:


What's the purpose of this Kibana instance?

It is strictly for local development. I was going to exclude docker-compose*.overrides.yml but decided to leave it, maybe it will help someone in the future.

When running tests, Kibana is not started. Only locally, when explicitly defined.

That's a nice thing, so maybe you can mention it in the ReadMe (i.e. how to start Kibana for local dev).

hariso

I believe we're good to go with this PR, thanks for all the good work! I can't find all the comments, but three things are outstanding:

Comment about actions (created, create etc.)
Sorting records
Splitting that method (with the retry logic)

donatorsky added the enhancement New feature or request label Apr 28, 2022

donatorsky self-assigned this Apr 28, 2022

donatorsky force-pushed the feat/destination-connector branch 4 times, most recently from bdacc7c to e33d61a Compare May 4, 2022 14:22

donatorsky marked this pull request as draft May 5, 2022 12:13

donatorsky added the good first issue Good for newcomers label May 6, 2022

donatorsky marked this pull request as ready for review May 6, 2022 14:20

donatorsky force-pushed the feat/destination-connector branch 4 times, most recently from 2c75a34 to 2e3be3f Compare May 16, 2022 11:00

donatorsky force-pushed the feat/destination-connector branch from c3d333c to 26c3287 Compare May 17, 2022 10:33

hariso reviewed May 19, 2022

conduitio-labs deleted a comment from hariso May 20, 2022

donatorsky commented May 20, 2022

donatorsky added 8 commits May 31, 2022 09:24

feat(destination): add destination connector

c82e3c9

lint fixes

a9a7fd3

README.md update

2730637

lint fixes

6f16097

Test using Go 1.18

3e8c9a0

Update namespace

cc061ce

Expose spec

b57ebc6

Handle different data types

140befb

donatorsky added 9 commits May 31, 2022 09:24

Require config: ES version

d26d85f

Add tests

e7055a6

Retries tests

2d02281

Use Index instead of Create operations for ES v5 and v6

1c2b3bf

Update documentation

1c1b02d

Extract possible actions to constants

8f38690

Add missing comments

b6ab4c7

PF fixes: update Read Me

37310c3

PR fix: leave more details why events are sorted and why there is an …

737256c

…edge case

donatorsky force-pushed the feat/destination-connector branch from f50e26f to 737256c Compare May 31, 2022 07:25

donatorsky added 4 commits June 2, 2022 11:43

PR fixes: remove unnecessary information about unsupported functionality

8a9c155

PR fixes: update ReadMe formatting

15cec9c

gomod update

dfb9a4f

Tests fixes

e0d99ff

hariso reviewed Jun 2, 2022

donatorsky added 5 commits June 2, 2022 17:03

PR fix: list supported ES versions for cofniguration

c171fda

PR fix: include operation name in error

108987a

PR fix: update comment for bulk operations interface

b2a152a

Update ReadMe typos

8498243

PR fix: mention Kibana configuration for local development in ReadMe

6f83784

hariso approved these changes Jun 3, 2022

donatorsky added 5 commits June 3, 2022 14:03

PR fix: describe why some actions come with different namings

92f0ba9

Handle body close error

9425950

Proofreading: fix typos

a0add45

Standarize operations on Record

9fe58b3

PR fix: do not sort records

d7b7145

hariso approved these changes Jun 6, 2022

donatorsky merged commit a7bec5e into main Jun 8, 2022

donatorsky deleted the feat/destination-connector branch June 8, 2022 08:29

donatorsky added a commit that referenced this pull request Jun 8, 2022

feat(destination): add destination connector (#1)

7b3fb98
