Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEERDB_CLICKHOUSE_BINARY_FORMAT #2407

Merged
merged 4 commits into from
Jan 4, 2025
Merged

PEERDB_CLICKHOUSE_BINARY_FORMAT #2407

merged 4 commits into from
Jan 4, 2025

Conversation

serprex
Copy link
Contributor

@serprex serprex commented Jan 2, 2025

In #2181 we consolidated on transmitting bytea columns as base64 strings

ClickHouse supports binary data in strings. They need not be valid utf8

This adds PEERDB_CLICKHOUSE_BINARY_FORMAT with 3 formats: raw, hex, & base64
Default to base64 to avoid breaking existing setups

Also add caching logic for non-immediate dynamic settings. Helps here where we'd hit catalog for every bytea field processed

Included hex to demonstrate how other formats would fit in, but expect this feature to be removed once we move everyone to raw

))
}
case peerdbenv.BinaryFormatHex:
projection.WriteString(fmt.Sprintf("hex(base64Decode(JSONExtract(_peerdb_data, '%s', '%s'))) AS `%s`,", colName, clickHouseType, dstColName))
Copy link
Contributor Author

@serprex serprex Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be ideal to send hex to raw table in this case, but:

  1. that would mean passing down ctx/env somehow to json serialization
  2. things would get weird with raw tables being invalidated when changing PEERDB_BINARY_FORMAT
  3. base64 is actually a pretty good binary encoding when raw binary isn't available (relatively dense while being computationally straightforward compared to base85, generally available decoding functions on destination)

@serprex serprex force-pushed the ch-raw-binary branch 9 times, most recently from dd1a889 to d175ab7 Compare January 3, 2025 04:37
flow/peerdbenv/dynamicconf.go Outdated Show resolved Hide resolved
flow/peerdbenv/dynamicconf.go Show resolved Hide resolved
{
Name: "PEERDB_BINARY_FORMAT",
Description: "Binary field encoding; either raw, hex, or base64",
DefaultValue: "base64",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to raw as a TODO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by time we change default to raw we can probably remove this option

flow/peerdbenv/dynamicconf.go Outdated Show resolved Hide resolved
case peerdbenv.BinaryFormatHex:
encoded = strings.ToUpper(hex.EncodeToString(byteData))
default:
panic(fmt.Sprintf("unhandled binary format: %d", format))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a panic can cause issues with maintenance mode (relies on flow-worker being active to pause mirrors)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panic can never happen unless a logic error is introduced in code, & bad code can always panic anyways

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
In #2181 we consolidated on transmitting bytea columns as base64 strings

ClickHouse supports binary data in strings. They need not be valid utf8

This adds PEERDB_BINARY_FORMAT with 3 formats: raw, hex, & base64
Default to base64 to avoid breaking existing setups

Also add caching logic for non-immediate dynamic settings. Helps here where we'd hit catalog for every bytea field processed
e2e

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@serprex serprex changed the title PEERDB_BINARY_FORMAT PEERDB_CLICKHOUSE_BINARY_FORMAT Jan 4, 2025
@serprex serprex merged commit a4d6525 into main Jan 4, 2025
9 checks passed
@serprex serprex deleted the ch-raw-binary branch January 4, 2025 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants