Spec discussion: BATCH message type, first iteration #859
Replies: 5 comments 11 replies
-
Proposed `batch_config`:

```json
"batch_config": {
  "encoding": {
    "format": "jsonl",
    "compression": "gz"
  },
  "storage": {
    ...
  }
}
```

For CSV:

```json
"batch_config": {
  "encoding": {
    "format": "csv",
    "compression": "gz",
    "include_header": false,
    "delimiter": "|"
  },
  "storage": {
    ...
  }
}
```
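A minimal sketch (not part of the spec) of how a target might interpret the `encoding` block above. The function name and defaults are illustrative assumptions; only the key names (`format`, `compression`, `include_header`, `delimiter`) come from the proposal:

```python
import csv
import gzip
import io
import json

def iter_batch_records(raw: bytes, encoding: dict):
    """Decode one batch file's bytes per a hypothetical ``batch_config.encoding``."""
    if encoding.get("compression") == "gz":
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8")
    if encoding["format"] == "jsonl":
        # One JSON document per line.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
    elif encoding["format"] == "csv":
        reader = csv.reader(io.StringIO(text), delimiter=encoding.get("delimiter", ","))
        rows = list(reader)
        header = None
        if encoding.get("include_header", True) and rows:
            header, rows = rows[0], rows[1:]
        for row in rows:
            yield dict(zip(header, row)) if header else row
    else:
        raise ValueError(f"Unsupported format: {encoding['format']!r}")

# Example: a gzipped JSONL payload.
payload = gzip.compress(b'{"id": 1}\n{"id": 2}\n')
records = list(iter_batch_records(payload, {"format": "jsonl", "compression": "gz"}))
# records == [{"id": 1}, {"id": 2}]
```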
-
FYI, since record-level optimizations keep coming up, I've opened a new discussion thread to continue that topic:
-
@edgarrmondragon and team, for the storage layer at least, what about potentially using the

Filesystems supported:

My only hesitations are that:

Another possibility is that we completely insulate the users and the spec itself from the

In a first iteration we might support

The contract between tap and target would be that a target (or the SDK itself) supports one or more protocol prefixes for files in the
-
New thread for the basic BATCH message spec

Multiple files per message vs. one message for every file?

The current spec for BATCH messages requires the following sample structure:

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "manifest": [
    "path/to/batch/file/1",
    "path/to/batch/file/2"
  ]
}
```

Would this be functionally equivalent to the following (pretty-printed for readability)?

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/1"
}
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/2"
}
```

I prefer the second, and I ask because I'm struggling to define where the number of files in the

@aaronsteers wdyt? cc @meltano/engineering in case you have thoughts
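A sketch of why the one-message-per-file form can simplify the tap side: each message can be flushed as soon as its file is closed, with no manifest buffered until the end of the sync. The helper name is hypothetical, not from the spec:

```python
import json

def emit_batch_messages(stream: str, filepaths, encoding: dict):
    """Emit one BATCH message per file (the second form above).

    The tap never has to decide how many paths belong in one manifest;
    it simply emits a message whenever a batch file is finished.
    """
    for path in filepaths:
        yield json.dumps({
            "type": "BATCH",
            "stream": stream,
            "encoding": encoding,
            "filepath": path,
        })

# Usage: print messages as files become available.
for msg in emit_batch_messages("users", ["path/to/batch/file/1"], {"format": "jsonl"}):
    print(msg)
```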
-
@edgarrmondragon @aaronsteers this is excellent 👏

Two other things that I found useful when implementing

Something like:

```json
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/1",
  "metadata": {
    "last_batch": false,
    "batch_size": 10000
  }
}
{
  "type": "BATCH",
  "stream": "users",
  "encoding": {
    ...
  },
  "filepath": "path/to/batch/file/2",
  "metadata": {
    "last_batch": true,
    "batch_size": 367
  }
}
```
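A toy target-side handler showing how the proposed `metadata` keys could be used. `last_batch` and `batch_size` come from the suggestion above; the handler function and finalization behavior are illustrative assumptions:

```python
def handle_batch_message(message: dict, totals: dict):
    """Accumulate per-stream counts and finalize on the last batch.

    ``batch_size`` lets the loader report progress without counting records;
    ``last_batch`` signals when to finalize (e.g. swap a staging table into
    place, commit, emit STATE).  Both keys are proposals, not final spec.
    """
    stream = message["stream"]
    meta = message.get("metadata", {})
    totals[stream] = totals.get(stream, 0) + meta.get("batch_size", 0)
    if meta.get("last_batch"):
        return f"{stream}: finalized after {totals[stream]} records"
    return None

totals = {}
handle_batch_message({"stream": "users", "metadata": {"last_batch": False, "batch_size": 10000}}, totals)
result = handle_batch_message({"stream": "users", "metadata": {"last_batch": True, "batch_size": 367}}, totals)
# result == "users: finalized after 10367 records"
```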
-
Opening this discussion to review and finalize the BATCH message type spec.
Related to:
- `BATCH` message type (aka `FAST_SYNC` spec) #9

Tap config
Starting with a first supported file type of `.jsonl.gz`:

Target config
No changes needed. Details are provided within the `BATCH` message type.

Message type sample
Benefit of compression
Most modern data systems can read and write gzipped sources faster than raw/uncompressed files, because CPU cycles are relatively cheap compared with the IO saved when reading and writing compressed files. The savings are even more pronounced once network upload/download speeds are considered, as when uploading files to remote systems.
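A quick illustration of the size effect (illustrative numbers, not a benchmark; the sample records are invented):

```python
import gzip
import json

# Repetitive JSONL batch data -- shared keys and repeated values -- is
# exactly the kind of input gzip compresses well.
records = [{"id": i, "status": "active", "region": "us-east-1"} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

# The gzipped batch file is a small fraction of the raw size, so both disk IO
# and network transfer shrink accordingly.
print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```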
Stream isolation in batch files
To streamline load operations, each batch file will only contain data from a single stream. This allows faster loads on the target side, since data validation and other logic can be performed on a per-file basis instead of per-record.
Negotiated capabilities
First iteration
In the first iteration, the user would need to confirm the capabilities of tap and target manually, and would need to set the tap config manually. The target would throw an error if any of these are true:
- the `BATCH` message type is not supported by the target
- the target does not support the `format` provided from the tap
- the `manifest`
Long-term path to automate tap/target batch negotiation
In the long term, the tap and target would both print information about supported batch capabilities when the CLI `<plugin-name> --about --format=json` is executed. An orchestrator would be able to run `--about` on both the tap and target, and would then be able to negotiate the most optimal batching strategy.

Note, per the below, that batching may still be performed by the orchestrator even if only one or neither of the plugins supports batching natively.
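A hedged sketch of that negotiation step, assuming a hypothetical `batch.formats` field in each plugin's `--about --format=json` output (the field name and preference order are illustrative):

```python
def negotiate_batch(tap_about: dict, target_about: dict):
    """Pick the first batch format both plugins claim to support.

    Returns ``None`` when either side lacks native batch support, in which
    case the orchestrator may still batch on the plugins' behalf.
    """
    tap_formats = tap_about.get("batch", {}).get("formats", [])
    target_formats = target_about.get("batch", {}).get("formats", [])
    common = [fmt for fmt in tap_formats if fmt in target_formats]
    return common[0] if common else None

fmt = negotiate_batch(
    {"batch": {"formats": ["jsonl.gz", "csv.gz"]}},
    {"batch": {"formats": ["csv.gz"]}},
)
# fmt == "csv.gz"
```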
Batch compatibility with legacy taps and targets
Batch capabilities will be available for legacy non-SDK taps and targets via the `meltano-map-transform` plugin, which itself inherits from SDK tap and target classes, and which would be able to serialize and deserialize batch files as needed.

Sample invocation with legacy tap and legacy target:
Sample invocation with SDK tap and legacy target:
Sample invocation with legacy tap and SDK target:
All of the above hybrid examples optimize the workload by keeping record messages out of the memory buffer between tap and target. The tap is able to write to disk as quickly as records are available, and the target only reads data from disk as quickly as it is able to load that data into the destination.