Adding alternatives to the exchange of record-oriented json messages #29
Comments
@Marnixvdb there's great interest around that! We've got #2 logged as an issue just for that. I think Arrow makes a lot of sense as an option for users.
Thanks @tayloramurphy. I've read the discussion and the PR, which proposes adding a BATCH message to enable a tap to send a manifest of files to process to the target. It doesn't talk about an alternative data format for the exchange of data, does it? While the BATCH format will be useful in some scenarios, it seems more complicated than necessary and it runs counter to the separation of concerns between taps and targets. I think it's simpler and more effective to instead (or additionally) add Arrow IPC as a format to pipe data from tap -> target. Arrow has been designed to be highly performant and also, like Singer, to be interoperable and language agnostic. It has matured quickly and is getting adopted widely. Arrow and Singer would be a great fit and, contrary to the BATCH proposal, the independence and composability of taps and targets would be preserved. (edited to add) Also: Arrow is typed, handles NULL values unambiguously and brings its schema along. Having to deal with more than one data format in the exchange between taps and targets would introduce some management overhead, but that should be trivial to solve for the engineers writing the pipelines. What do you think?
Hi @Marnixvdb! Overall, I think the prospect of leveraging the Arrow spec is very interesting. A PoC or reference implementation showcasing feature parity with the Singer spec would be a good start for an enhancement proposal.
Message types
There is definite overlap between Singer and Arrow message types, so that is a good sign. The ones supported by Arrow are:
The documentation mentions Custom Application Metadata, which might be useful for our case:
Metadata at the
Ecosystem
JSON support is ubiquitous and very stable in almost all programming languages. Arrow, however, may be less stable, and its support may not be as good across the board.
Separate processes for tap and target
To get the most out of Arrow IPC (e.g. zero-copy reads), the source and target processes have to share memory, but it's not clear how to achieve that with taps and targets running as individual applications, as they do today.
Arrow IPC is much faster than JSON, even as an interprocess format. It just does less serialization. So, everything is fine here.
@pkit Thanks for the data point. We have considered adding this (or accepting PRs to add it) to the Singer SDK: meltano/sdk#1684. It's definitely an interesting idea to explore.
As great and universal as JSON is for exchanging messages, the fact that Singer tap -> target communication requires a record-oriented JSON format is a big drawback (at least for me), as the unnecessary serialisation/deserialisation overhead becomes a real pain when processing (analytical) bulk data.
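To make the overhead concrete, here is a minimal sketch using only the Python standard library. The stream name `users`, the sample rows, and the helper names `to_singer_lines` and `to_columns` are made up for illustration; only the general shape of a Singer RECORD message (`type`, `stream`, `record`) is taken from the spec. It shows that every row is serialized and parsed on its own, with the field names repeated on every line, and that a target wanting columns for a bulk load has to pivot the rows itself.

```python
import json

# Hypothetical rows for a hypothetical "users" stream (illustration only).
rows = [
    {"id": 1, "name": "Ada", "score": 9.5},
    {"id": 2, "name": "Grace", "score": None},
    {"id": 3, "name": "Linus", "score": 7.0},
]


def to_singer_lines(stream, rows):
    """Tap side: serialize one RECORD message per row, one JSON document per line."""
    return [
        json.dumps({"type": "RECORD", "stream": stream, "record": row})
        for row in rows
    ]


def to_columns(lines):
    """Target side: parse every line, then pivot the records into columns.

    Assumes every record carries the same keys; a real target would need
    to handle missing keys to keep the columns aligned.
    """
    columns = {}
    for line in lines:
        message = json.loads(line)
        if message["type"] != "RECORD":
            continue
        for key, value in message["record"].items():
            columns.setdefault(key, []).append(value)
    return columns


lines = to_singer_lines("users", rows)
columns = to_columns(lines)
```

Each of the three lines carries the keys `id`, `name` and `score` again, and the target does one `json.loads` call per row before it can build anything column-shaped.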
I was wondering how much room/importance the community sees in extending the spec in this area.
My first thought would be to add the option to use Apache Arrow Inter Process Communication (IPC). For those unfamiliar with Arrow: Arrow is a standardised columnar memory specification, and IPC is a way of transferring Arrow record batches without the need for serialisation/deserialisation. As many data storage systems are adopting Arrow Flight, there will be a lot of value in data pipelines that use Arrow as the shared data layout in every step from extraction to loading.
Let me know if this is of interest, or if more information is needed, and I will add more detail.