-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Zyp: A compact transformation engine
A data model and implementation for a compact transformation engine written in Python. - Based on JSON Pointer (RFC 6901), JMESPath, and transon - Implemented using `attrs` and `cattrs` - Includes built-in transformation functions `to_datetime` and `to_unixtime` - Ability to marshal and unmarshal its representation to/from JSON and YAML
- Loading branch information
Showing
36 changed files
with
1,693 additions
and
12 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -104,7 +104,7 @@ jobs: | |
pip install "setuptools>=64" --upgrade | ||
# Install package in editable mode. | ||
pip install --use-pep517 --prefer-binary --editable=.[develop,test,mongodb] | ||
pip install --use-pep517 --prefer-binary --editable=.[mongodb,develop,test] | ||
- name: Run linters and software tests | ||
run: poe check | ||
|
@@ -120,3 +120,65 @@ jobs: | |
env_vars: OS,PYTHON | ||
name: codecov-umbrella | ||
fail_ci_if_error: true | ||
|
||
|
||
test-zyp: | ||
name: " | ||
Zyp: Python ${{ matrix.python-version }} | ||
" | ||
runs-on: ${{ matrix.os }} | ||
strategy: | ||
fail-fast: false | ||
matrix: | ||
os: ['ubuntu-latest'] | ||
python-version: ['3.8', '3.9', '3.12'] | ||
|
||
env: | ||
OS: ${{ matrix.os }} | ||
PYTHON: ${{ matrix.python-version }} | ||
|
||
steps: | ||
|
||
- name: Acquire sources | ||
uses: actions/checkout@v4 | ||
|
||
- name: Set up Python ${{ matrix.python-version }} | ||
uses: actions/setup-python@v5 | ||
with: | ||
python-version: ${{ matrix.python-version }} | ||
architecture: x64 | ||
cache: 'pip' | ||
cache-dependency-path: | ||
pyproject.toml | ||
|
||
- name: Set up project | ||
run: | | ||
# `setuptools 0.64.0` adds support for editable install hooks (PEP 660). | ||
# https://github.com/pypa/setuptools/blob/main/CHANGES.rst#v6400 | ||
pip install "setuptools>=64" --upgrade | ||
# Install package in editable mode. | ||
pip install --use-pep517 --prefer-binary --editable=.[zyp,develop,test] | ||
- name: Set timezone | ||
uses: szenius/[email protected] | ||
with: | ||
timezoneLinux: "Europe/Berlin" | ||
timezoneMacos: "Europe/Berlin" | ||
timezoneWindows: "European Standard Time" | ||
|
||
- name: Run linters and software tests | ||
run: poe check | ||
|
||
# https://github.com/codecov/codecov-action | ||
- name: Upload coverage results to Codecov | ||
uses: codecov/codecov-action@v4 | ||
env: | ||
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }} | ||
with: | ||
files: ./coverage.xml | ||
flags: zyp | ||
env_vars: OS,PYTHON | ||
name: codecov-umbrella | ||
fail_ci_if_error: true |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Zyp Backlog | ||
|
||
## Iteration +1 | ||
- Refactor module namespace to `zyp` | ||
- Documentation | ||
- CLI interface | ||
- Apply to MongoDB Table Loader in CrateDB Toolkit | ||
|
||
## Iteration +2 | ||
Demonstrate! | ||
- math expressions | ||
- omit key (recursively) | ||
- combine keys | ||
- filter on keys and/or values | ||
- Pathological cases like "Not defined" in typed fields like `TIMESTAMP` | ||
- Use simpleeval, like Meltano, and provide the same built-in functions | ||
- https://sdk.meltano.com/en/v0.39.1/stream_maps.html#other-built-in-functions-and-names | ||
- https://github.com/MeltanoLabs/meltano-map-transform/pull/255 | ||
- https://github.com/MeltanoLabs/meltano-map-transform/issues/252 | ||
- Use JSONPath, see https://sdk.meltano.com/en/v0.39.1/code_samples.html#use-a-jsonpath-expression-to-extract-the-next-page-url-from-a-hateoas-response | ||
|
||
## Iteration +3 | ||
- Moksha transformations on Buckets | ||
- Investigate using JSON Schema | ||
- Fluent API interface | ||
- https://github.com/Halvani/alphabetic | ||
- Mappers do not support external API lookups. | ||
To add external API lookups, you can either (a) land all your data and | ||
then joins using a transformation tool like dbt, or (b) create a custom | ||
mapper plugin with inline lookup logic. | ||
=> Example from Luftdatenpumpe, using a reverse geocoder | ||
- [ ] Define schema | ||
https://sdk.meltano.com/en/latest/typing.html | ||
- https://docs.meltano.com/guide/v2-migration/#migrate-to-an-adapter-specific-dbt-transformer | ||
- https://github.com/meltano/sdk/blob/v0.39.1/singer_sdk/mapper.py | ||
|
||
## Fluent API Interface | ||
|
||
```python | ||
|
||
from zyp.model.fluent import FluentTransformation | ||
|
||
transformation = FluentTransformation() | ||
.jmes("records[?starts_with(location, 'B')]") | ||
.rename_fields({"_id": "id"}) | ||
.convert_values({"/id": "int", "/value": "float"}, type="pointer-python") | ||
.jq(".[] |= (.value /= 100)") | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,194 @@ | ||
# Zyp Transformations | ||
|
||
## About | ||
A data model and implementation for a compact transformation engine written | ||
in [Python], based on [JSON Pointer] (RFC 6901), [JMESPath], and [transon], | ||
implemented using [attrs] and [cattrs]. | ||
|
||
## Ideas | ||
:Conciseness: | ||
Define a multistep data refinement process with as little code as possible. | ||
:Low Footprint: | ||
Doesn't need any infrastructure or pipeline framework. It's just a little library. | ||
:Interoperability: | ||
Marshal transformation recipe definition to/from text-only representations (JSON, | ||
YAML), in order to encourage implementations in other languages. | ||
:Performance: | ||
Well, it is written in Python. Fragments can be re-written in Rust, when applicable. | ||
:Immediate: | ||
Other ETL frameworks and concepts often need to first land your data in the target | ||
system before applying subsequent transformations. Zyp is working directly within | ||
the data pipeline, before data is inserted into the target system. | ||
|
||
## Synopsis I | ||
A basic transformation example for individual data records. | ||
|
||
```python | ||
from zyp.model.bucket import BucketTransformation, FieldRenamer, ValueConverter | ||
|
||
# Consider a slightly messy collection of records. | ||
data_in = [ | ||
{"_id": "123", "name": "device-foo", "reading": "42.42"}, | ||
{"_id": "456", "name": "device-bar", "reading": -84.01}, | ||
] | ||
|
||
# Define a transformation that renames the `_id` field to `id`, | ||
# casts its value to `int`, and casts the `reading` field to `float`. | ||
transformation = BucketTransformation( | ||
names=FieldRenamer().add(old="_id", new="id"), | ||
values=ValueConverter() | ||
.add(pointer="/id", transformer="builtins.int") | ||
.add(pointer="/reading", transformer="builtins.float"), | ||
) | ||
|
||
for record in data_in: | ||
print(transformation.apply(record)) | ||
``` | ||
The result is a transformed data collection. | ||
```json | ||
[ | ||
{"id": 123, "name": "device-foo", "reading": 42.42}, | ||
{"id": 456, "name": "device-bar", "reading": -84.01} | ||
] | ||
``` | ||
|
||
## Synopsis II | ||
A more advanced transformation example for a collection of data records. | ||
|
||
Consider a messy collection of input data. | ||
- The actual collection is nested within the top-level `records` item. | ||
- `_id` fields are conveyed in string format. | ||
- `value` fields include both integer and string values. | ||
- `value` fields are fixed-point values, using a scaling factor of `100`. | ||
- The collection includes invalid `null` records. | ||
Those records usually trip processing when, for example, filtering on object items. | ||
```python | ||
data_in = { | ||
"message-source": "system-3000", | ||
"message-type": "eai-warehouse", | ||
"records": [ | ||
{"_id": "12", "meta": {"name": "foo", "location": "B"}, "data": {"value": "4242"}}, | ||
None, | ||
{"_id": "34", "meta": {"name": "bar", "location": "BY"}, "data": {"value": -8401}}, | ||
{"_id": "56", "meta": {"name": "baz", "location": "NI"}, "data": {"value": 2323}}, | ||
{"_id": "78", "meta": {"name": "qux", "location": "NRW"}, "data": {"value": -580}}, | ||
None, | ||
None, | ||
], | ||
} | ||
``` | ||
|
||
Consider after applying a corresponding transformation, the expected outcome is a | ||
collection of valid records, optionally filtered, and values adjusted according | ||
to relevant type hints and other conversions. | ||
```python | ||
data_out = [ | ||
{"id": 12, "meta": {"name": "foo", "location": "B"}, "data": {"value": 42.42}}, | ||
{"id": 34, "meta": {"name": "bar", "location": "BY"}, "data": {"value": -84.01}}, | ||
] | ||
``` | ||
|
||
Let's come up with relevant pre-processing rules to cleanse and mangle the shape of the | ||
input collection. In order to make this example more exciting, let's include two special | ||
needs: | ||
- Filter input collection by value of nested element. | ||
- Rename top-level fields starting with underscore `_`. | ||
|
||
Other than those special rules, the fundamental ones to re-shape the data are: | ||
- Unwrap `records` attribute from container dictionary into actual collection. | ||
- Filter collection, both by omitting invalid/empty records, and by applying query | ||
constrains. | ||
- On each record, rename the top-level `_id` field to `id`. | ||
- On each record, adjust the data types of the `id` and `value` fields. | ||
- Postprocess collection, applying a custom scaling factor to the `value` field. | ||
|
||
Zyp let's you concisely write those rules down, using the Python language. | ||
|
||
```python | ||
from zyp.model.bucket import BucketTransformation, FieldRenamer, ValueConverter | ||
from zyp.model.collection import CollectionTransformation | ||
from zyp.model.moksha import MokshaTransformation | ||
|
||
transformation = CollectionTransformation( | ||
pre=MokshaTransformation().jmes("records[?not_null(meta.location) && !starts_with(meta.location, 'N')]"), | ||
bucket=BucketTransformation( | ||
names=FieldRenamer().add(old="_id", new="id"), | ||
values=ValueConverter() | ||
.add(pointer="/id", transformer="builtins.int") | ||
.add(pointer="/data/value", transformer="builtins.float"), | ||
), | ||
post=MokshaTransformation().jq(".[] |= (.data.value /= 100)"), | ||
) | ||
|
||
data_out = transformation.apply(data_in) | ||
``` | ||
Alternatively, serialize the `zyp-collection` transformation description, | ||
for example into YAML format. | ||
```python | ||
print(transformation.to_yaml()) | ||
``` | ||
```yaml | ||
meta: | ||
version: 1 | ||
type: zyp-collection | ||
pre: | ||
rules: | ||
- expression: records[?not_null(meta.location) && !starts_with(meta.location, 'N')] | ||
type: jmes | ||
bucket: | ||
names: | ||
rules: | ||
- new: id | ||
old: _id | ||
values: | ||
rules: | ||
- args: [] | ||
pointer: /id | ||
transformer: builtins.int | ||
- args: [] | ||
pointer: /data/value | ||
transformer: builtins.float | ||
post: | ||
rules: | ||
- expression: .[] |= (.data.value /= 100) | ||
type: jq | ||
``` | ||
## Prior Art | ||
- [Singer Transformer] | ||
- [PipelineWise Transformations] | ||
- [singer-transform] | ||
- [Meltano Inline Data Mapping] | ||
- [Meltano Inline Stream Maps] | ||
- [AWS DMS source filter rules] | ||
- [AWS DMS table selection and transformation rules] | ||
- ... and many more. Thanks for the inspirations. | ||
## Etymology | ||
With kudos to [Kris Zyp] for conceiving [JSON Pointer]. | ||
## More | ||
```{toctree} | ||
:maxdepth: 1 | ||
|
||
research | ||
backlog | ||
``` | ||
|
||
|
||
|
||
[attrs]: https://www.attrs.org/ | ||
[AWS DMS source filter rules]: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.Filters.html | ||
[AWS DMS table selection and transformation rules]: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.SelectionTransformation.html | ||
[cattrs]: https://catt.rs/ | ||
[Kris Zyp]: https://github.com/kriszyp | ||
[JMESPath]: https://jmespath.org/ | ||
[JSON Pointer]: https://datatracker.ietf.org/doc/html/rfc6901 | ||
[Meltano Inline Data Mapping]: https://docs.meltano.com/guide/mappers/ | ||
[Meltano Inline Stream Maps]: https://sdk.meltano.com/en/latest/stream_maps.html | ||
[PipelineWise Transformations]: https://transferwise.github.io/pipelinewise/user_guide/transformations.html | ||
[Python]: https://en.wikipedia.org/wiki/Python_(programming_language) | ||
[Singer Transformer]: https://github.com/singer-io/singer-python/blob/master/singer/transform.py | ||
[singer-transform]: https://github.com/dkarzon/singer-transform | ||
[transon]: https://transon-org.github.io/ |
Oops, something went wrong.