Started building an infrastructure to ingest data into a DBSP circuit
from external data sources and to stream the outputs of the circuit to
external consumers.
The framework defines the APIs to integrate different transport
technologies (files, Kafka streams, database connections, etc.) and data
formats (CSV, bincode, JSON, etc.) into the DBSP input and output
pipelines. This PR implements only the input half of the API.
Overview
========
The data ingestion pipeline consists of two kinds of adapters: **data
transport** adapters and **data format** adapters.
```text
┌──────────────┐
│ controller │
│ ┌────────┐ │
│ │ catalog│ │
│ ├────────┤ │
│ │ config │ │
│ ├────────┤ │
control commands │ │ stats │ │
┌──┬────────┬─┬──────────────────────┤ └────────┘ │
│ │ │ │ │ │
│ │ │ │ └──────────────┘
│ │ │ │
│ │ │ │
│ │ │ │ ┌───────────┐
▼ │ │ │ │ │
┌────┴───┐ ▼ │ ┌──────┐ ┌────┴─┐ │
─────►│endpoint├──────┼──►│parser├───────►│handle│ ├───►
└────────┘ │ └──────┘ └────┬─┘ │
▼ │ │ circuit │
transport- ┌────────┐bytes ▼ ┌──────┐records ┌────┴─┐ │
specific ─────►│endpoint├─────────►│parser├───────►│handle│ │
protocol └────────┘ └──────┘ └────┬─┘ │
▲ ▲ │ │
│ │ └───────────┘
┌────┴────┐ ┌───┴────┐
│ input │ │ input │
│transport│ │ format │
└─────────┘ └────────┘
```
A data transport implements support for a specific streaming technology like
Kafka. It provides an API to create transport **endpoints**, which connect to
specified data sources, e.g., Kafka topics. An endpoint reads raw binary
data from the source and provides basic flow control and error reporting
facilities, but is agnostic to the contents or format of the data.
A data format adapter implements support for a data encapsulation format
like CSV, JSON, or bincode. It provides an API to create **parsers**, which
transform raw binary data into a stream of **records** and push this data to
the DBSP circuit.
The Controller component serves as a centralized control plane that
coordinates the creation, reconfiguration, and teardown of the pipeline and
implements runtime flow control. It instantiates the pipeline according to
a user-provided configuration (see below) and exposes an API to reconfigure
and monitor the pipeline at runtime.
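As a rough illustration of this lifecycle, the controller's public surface
could look something like the sketch below. All names here (`Controller`,
`PipelineConfig`, the method set) are assumptions for illustration, not the
exact API introduced by this PR:

```rust
// Hypothetical sketch of the controller's public surface; every name
// below is an illustrative assumption, not the PR's exact API.
use anyhow::Result;

/// Declarative pipeline description: one entry per input endpoint,
/// each naming a transport and a format.
pub struct PipelineConfig { /* ... */ }

pub struct Controller { /* catalog, config, stats, ... */ }

impl Controller {
    /// Instantiate the pipeline from a user-provided configuration:
    /// create each configured endpoint/parser pair and attach it to
    /// the circuit's input handles (in the real API this would also
    /// need access to the circuit's catalog of handles).
    pub fn with_config(_config: &PipelineConfig) -> Result<Self> {
        todo!("create endpoints and parsers per the config")
    }

    /// Runtime flow control: pause and resume all input endpoints.
    pub fn pause(&self) { todo!() }
    pub fn start(&self) { todo!() }

    /// Tear the pipeline down, disconnecting all endpoints.
    pub fn stop(self) -> Result<()> { todo!() }
}
```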
Adapter API
===========
The transport adapter API consists of two traits (sketched below):
* `InputTransport` is a factory trait that creates `InputEndpoint`
  instances.
* `InputEndpoint` represents an individual data connection, e.g., a file,
  an S3 bucket, or a Kafka topic.
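A minimal sketch of what these traits could look like; the method names,
signatures, and the `InputConsumer` callback trait are assumptions for
illustration, not the exact definitions from this PR:

```rust
// Hypothetical sketch of the transport adapter traits; signatures are
// illustrative assumptions, not the exact API from this PR.
use anyhow::Result;

/// Callbacks through which an endpoint pushes raw bytes downstream
/// and reports transport errors; it never interprets the bytes.
pub trait InputConsumer: Send {
    fn input(&mut self, data: &[u8]);
    fn error(&mut self, error: anyhow::Error);
    fn eoi(&mut self); // end of input
}

/// Factory for endpoints of one transport technology (file, Kafka, ...),
/// typically driven by a per-endpoint configuration section.
pub trait InputTransport {
    fn name(&self) -> &str;
    fn new_endpoint(
        &self,
        config: &serde_yaml::Value,
        consumer: Box<dyn InputConsumer>,
    ) -> Result<Box<dyn InputEndpoint>>;
}

/// A single data connection, e.g., a file or a Kafka topic, with
/// basic flow control: the controller can pause and resume it.
pub trait InputEndpoint: Send {
    fn start(&self) -> Result<()>;
    fn pause(&self) -> Result<()>;
    fn disconnect(&self);
}
```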
Similarly, the format adapter API consists of two traits (sketched below):
* `InputFormat` is a factory trait that creates `Parser` instances.
* `Parser` consumes a raw binary stream and outputs a stream of
  records.
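A corresponding sketch for the format side, again with assumed signatures:

```rust
// Hypothetical sketch of the format adapter traits; signatures are
// illustrative assumptions, not the exact API from this PR.
use anyhow::Result;

/// Factory for parsers of one data format (CSV, JSON, bincode, ...),
/// configured per endpoint.
pub trait InputFormat {
    fn name(&self) -> &str;
    fn new_parser(&self, config: &serde_yaml::Value) -> Result<Box<dyn Parser>>;
}

/// Consumes a raw binary stream, splits it into records, and pushes
/// parsed records to the circuit via its input handle.
pub trait Parser: Send {
    /// Feed a chunk of raw bytes; the chunk need not be aligned with
    /// record boundaries, so the parser may buffer a partial record.
    fn input(&mut self, data: &[u8]) -> Result<()>;

    /// Flush any buffered records into the circuit's input handle.
    fn flush(&mut self);
}
```

Splitting transport from format this way means, e.g., a Kafka endpoint can
carry CSV, JSON, or bincode payloads without the transport code knowing
anything about record boundaries.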
------
This PR introduces a workspace, with the I/O framework implemented as a
separate crate within this workspace.
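For illustration, the workspace manifest could look roughly like the TOML
below; the member crate names are assumptions, not the exact layout from
this PR:

```toml
# Hypothetical top-level Cargo.toml; member names are illustrative.
[workspace]
members = [
    "crates/dbsp",      # the core DBSP crate
    "crates/adapters",  # the new I/O framework crate
]
```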