Spec discussion: BATCH message type, first iteration #859
Replies: 5 comments 11 replies
-
Proposed `batch_config`:

```json
"batch_config": {
  "encoding": {
    "format": "jsonl",
    "compression": "gz"
  },
  "storage": {
    ...
  }
}
```

For CSV:

```json
"batch_config": {
  "encoding": {
    "format": "csv",
    "compression": "gz",
    "include_header": false,
    "delimiter": "|"
  },
  "storage": {
    ...
  }
}
```
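A minimal sketch (not part of the spec) of how a target might interpret the `encoding` block above. The function name and defaults are illustrative assumptions; only the key names (`format`, `compression`, `include_header`, `delimiter`) come from the proposal:

```python
import csv
import gzip
import io
import json

def iter_batch_records(raw: bytes, encoding: dict):
    """Decode one batch file's bytes per a hypothetical ``batch_config.encoding``."""
    if encoding.get("compression") == "gz":
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8")
    if encoding["format"] == "jsonl":
        # One JSON document per line.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
    elif encoding["format"] == "csv":
        reader = csv.reader(io.StringIO(text), delimiter=encoding.get("delimiter", ","))
        rows = list(reader)
        header = None
        if encoding.get("include_header", True) and rows:
            header, rows = rows[0], rows[1:]
        for row in rows:
            yield dict(zip(header, row)) if header else row
    else:
        raise ValueError(f"Unsupported format: {encoding['format']!r}")

# Example: a gzipped JSONL payload.
payload = gzip.compress(b'{"id": 1}\n{"id": 2}\n')
records = list(iter_batch_records(payload, {"format": "jsonl", "compression": "gz"}))
# records == [{"id": 1}, {"id": 2}]
```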
-
FYI, since record-level optimizations keep coming up, I've opened a new discussion thread to continue that topic:
-
@edgarrmondragon and team, for the storage layer at least, what about potentially using the

Filesystems supported:

My only hesitations are that:

Another possibility is that we completely insulate the users and the spec itself from the

In a first iteration we might support

The contract between tap and target would be that a target (or the SDK itself) supports one or more protocol prefixes for files in the
-
New thread for the basic BATCH message spec

Multiple files per message vs. one message for every file?

The current spec for BATCH messages requires the following sample structure:

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "manifest": [
    "path/to/batch/file/1",
    "path/to/batch/file/2"
  ]
}
```

Would this be functionally equivalent to the following (pretty-printed for readability)?

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/1"
}
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/2"
}
```

I prefer the second, and I ask because I'm struggling to define where the number of files in the

@aaronsteers wdyt? cc @meltano/engineering in case you have thoughts
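A sketch of why the one-message-per-file form can simplify the tap side: each message can be flushed as soon as its file is closed, with no manifest buffered until the end of the sync. The helper name is hypothetical, not from the spec:

```python
import json

def emit_batch_messages(stream: str, filepaths, encoding: dict):
    """Emit one BATCH message per file (the second form above).

    The tap never has to decide how many paths belong in one manifest;
    it simply emits a message whenever a batch file is finished.
    """
    for path in filepaths:
        yield json.dumps({
            "type": "BATCH",
            "stream": stream,
            "encoding": encoding,
            "filepath": path,
        })

# Usage: print messages as files become available.
for msg in emit_batch_messages("users", ["path/to/batch/file/1"], {"format": "jsonl"}):
    print(msg)
```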
-
@edgarrmondragon @aaronsteers this is excellent 👏

Two other things that I found useful when implementing

Something like:

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/1",
  "metadata": {
    "last_batch": false,
    "batch_size": 10000
  }
}
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/2",
  "metadata": {
    "last_batch": true,
    "batch_size": 367
  }
}
```
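A toy target-side handler showing how the proposed `metadata` keys could be used. `last_batch` and `batch_size` come from the suggestion above; the handler function and finalization behavior are illustrative assumptions:

```python
def handle_batch_message(message: dict, totals: dict):
    """Accumulate per-stream counts and finalize on the last batch.

    ``batch_size`` lets the loader report progress without counting records;
    ``last_batch`` signals when to finalize (e.g. swap a staging table into
    place, commit, emit STATE).  Both keys are proposals, not final spec.
    """
    stream = message["stream"]
    meta = message.get("metadata", {})
    totals[stream] = totals.get(stream, 0) + meta.get("batch_size", 0)
    if meta.get("last_batch"):
        return f"{stream}: finalized after {totals[stream]} records"
    return None

totals = {}
handle_batch_message({"stream": "users", "metadata": {"last_batch": False, "batch_size": 10000}}, totals)
result = handle_batch_message({"stream": "users", "metadata": {"last_batch": True, "batch_size": 367}}, totals)
# result == "users: finalized after 10367 records"
```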
-
Opening this discussion to review and finalize the BATCH message type spec.
Related to:
- `BATCH` message type (aka `FAST_SYNC` spec) #9

Tap config
Starting with a first supported file type of `.jsonl.gz`:

Target config
No changes needed. Details are provided within the `BATCH` message type.

Message type sample
Benefit of compression
Most modern data systems can read and write gzipped sources faster than raw/uncompressed files, because CPU cycles are relatively cheap compared with the IO saved when reading and writing compressed files. The savings are even more pronounced once network upload/download speeds are considered, as when uploading files to remote systems.
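A quick illustration of the size effect (illustrative numbers, not a benchmark; the sample records are invented):

```python
import gzip
import json

# Repetitive JSONL batch data -- shared keys and repeated values -- is
# exactly the kind of input gzip compresses well.
records = [{"id": i, "status": "active", "region": "us-east-1"} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

# The gzipped batch file is a small fraction of the raw size, so both disk IO
# and network transfer shrink accordingly.
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```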
Stream isolation in batch files
To streamline load operations, each batch file will only contain data from a single stream. This allows faster loads on the target side, since data validation and other logic can be performed on a per-file basis instead of per-record.
Negotiated capabilities
First iteration
In the first iteration, the user would need to confirm the capabilities of tap and target manually, and would need to set the tap config manually. The target would throw an error if any of these are true:
- the `BATCH` message type is not supported by the target
- the target does not support the `format` provided from the tap
- the `manifest`
Long-term path to automate tap/target batch negotiation
In the long term, the tap and target would both print information about supported batch capabilities when the CLI `<plugin-name> --about --format=json` is executed. An orchestrator would be able to run `--about` on both the tap and target, and would then be able to negotiate the most optimal batching strategy.

Note, per the below, that batching may still be performed by the orchestrator even if only one or neither of the plugins supports batching natively.
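A hedged sketch of that negotiation step, assuming a hypothetical `batch.formats` field in each plugin's `--about --format=json` output (the field name and preference order are illustrative):

```python
def negotiate_batch(tap_about: dict, target_about: dict):
    """Pick the first batch format both plugins claim to support.

    Returns ``None`` when either side lacks native batch support, in which
    case the orchestrator may still batch on the plugins' behalf.
    """
    tap_formats = tap_about.get("batch", {}).get("formats", [])
    target_formats = target_about.get("batch", {}).get("formats", [])
    common = [fmt for fmt in tap_formats if fmt in target_formats]
    return common[0] if common else None

fmt = negotiate_batch(
    {"batch": {"formats": ["jsonl.gz", "csv.gz"]}},
    {"batch": {"formats": ["csv.gz"]}},
)
# fmt == "csv.gz"
```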
Batch compatibility with legacy taps and targets
Batch capabilities will be available for legacy non-SDK taps and targets via the `meltano-map-transform` plugin, which itself inherits from SDK tap and target classes, and which would be able to serialize and deserialize batch files as needed.

Sample invocation with legacy tap and legacy target:
Sample invocation with SDK tap and legacy target:
Sample invocation with legacy tap and SDK target:
All of the above hybrid examples optimize the workload by keeping record messages out of the memory buffer between tap and target. The tap is able to write to disk as quickly as records are available, and the target only reads data from disk as quickly as it is able to load that data into the destination.