ref: Type out chains and steps #99


Merged: 27 commits into main on May 2, 2025

Conversation

@ayirr7 (Member) commented Apr 14, 2025

Does the following:

  • Links the types of the different steps of an ExtensibleChain (a usage sketch follows the to-do list below)
  • Provides a simple Message class to carry message metadata through the pipeline. This allows for decoupling certain steps, like parsing/deserialization, from the stream source step.
  • Provides basic message decoders/encoders out of the box in the API (JSON, Protobuf)
  • Adds basic header filtering in the source
  • Adds schema validation, transforming raw bytes into a schema-enforced message type
  • Adds general type checks
  • Cleans up unit tests

Still to do:

  • Improve a bunch of naming (of interfaces, of files, etc.)
  • Provide full support for Protobuf
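
To illustrate the typing above, here is a minimal, self-contained sketch of the idea. The step functions and field names are simplified stand-ins, not this PR's actual API:

import json
from dataclasses import dataclass
from typing import Generic, TypeVar

TPayload = TypeVar("TPayload")

@dataclass
class Message(Generic[TPayload]):
    """Bare-bones wrapper; the payload type is what links chain steps."""
    payload: TPayload

def parse_json(msg: Message[bytes]) -> Message[dict]:
    """A Parser-like step: raw bytes in, parsed payload out."""
    return Message(payload=json.loads(msg.payload))

def serialize_json(msg: Message[dict]) -> bytes:
    """A Serializer-like step: parsed payload in, raw bytes out, ready for a sink."""
    return json.dumps(msg.payload).encode("utf-8")

# A type checker can follow the payload type from step to step:
raw = Message(payload=b'{"name": "spam", "value": 3}')
parsed = parse_json(raw)      # Message[dict]
out = serialize_json(parsed)  # bytes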

@ayirr7 marked this pull request as draft April 14, 2025 21:41
@ayirr7 (Member Author) commented Apr 18, 2025

Got some offline reviews from @untitaker

Currently working on:

  • Adding schema validation to the Parser step and updating an example to use that
  • Seeing if the types for certain steps should be narrower (e.g. what the Serializer outputs)

@ayirr7 changed the title from Riya/chain typing to ref: Type out chains and steps Apr 21, 2025


# a message with a generic payload
class Message(Generic[TIn]):
ayirr7 (Member Author):

I am trying to intentionally keep Message as bare-bones as possible. For example, I don't think the user should have to think about:

  • headers after the source step
  • partition key
  • offset management

However, they can provide a schema for messages and any additional info, e.g. for the generic metrics pipeline, maybe they annotate each message with the metric type.

Do let me know if something important is missing though.

Collaborator:

I agree on hiding offset management and partitions entirely. They have no place in this abstraction.
I would revisit a few decisions though:

  • Headers. They are something the application logic may have to deal with for routing messages before payload parsing. The platform itself will need a way to make sense of a message without parsing the payload. For example, we will want to mark invalid or stale messages.
  • Timestamp. I think a concept of message timestamp is needed. Then the question is whether to introduce the broker timestamp and a second, optional event timestamp.

A few ideas:

  • Separate the concepts of pre-parsing and post-parsing messages. Before parsing you can only access headers and the timestamp. After parsing you can access the parsed payload as well. This can be done with different Message classes or with the payload type. The goal is to discourage people from parsing bytes on their own: if the user wants to access the payload, it has to be a parsed payload.
  • Expose headers and separate them into internal ones, managed and mutable by the platform, and application ones: read-only and required to be present at the source. We will figure out later how to provide a more flexible set of application headers.
  • Add a timestamp field which is the source timestamp. We will figure out the event timestamp another time.
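
A minimal sketch of the first idea, using different classes so the type system enforces the pre-parsing/post-parsing split (all names here are illustrative, not the actual API):

from dataclasses import dataclass, field
from typing import Generic, Mapping, TypeVar

TPayload = TypeVar("TPayload")

@dataclass
class RawMessage:
    """Pre-parsing: only headers and timestamp are accessible."""
    headers: Mapping[str, str]
    timestamp: float
    _raw: bytes = field(repr=False)  # deliberately hidden: no DIY byte parsing

@dataclass
class ParsedMessage(Generic[TPayload]):
    """Post-parsing: the payload is available, already parsed."""
    headers: Mapping[str, str]
    timestamp: float
    payload: TPayload

# Only a Parser step would turn a RawMessage into a ParsedMessage[TPayload],
# so user code never touches raw bytes directly.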

Member:

> Expose headers and separate them into internal ones, managed and mutable by the platform, and application ones: read-only and required to be present at the source. We will figure out later how to provide a more flexible set of application headers.

With tasks, we have headers as a simple mutable map. Separating the headers out seems like additional complexity we could avoid. Our SDKs use message headers to pass trace_id and trace baggage, which I assume you'll want for streaming as well.

Collaborator:

For the time being: add headers as a map, propagate them in all steps, and leave them empty in reduce. Add a timestamp.
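
Roughly, that decision could look like this (a sketch, not the merged code):

from dataclasses import dataclass, field
from typing import Generic, MutableMapping, Optional, TypeVar

TPayload = TypeVar("TPayload")

@dataclass
class Message(Generic[TPayload]):
    payload: TPayload
    # headers as a plain mutable map, propagated through every step
    # (left empty by reduce, which cannot attribute headers to one input)
    headers: MutableMapping[str, str] = field(default_factory=dict)
    # source (broker) timestamp; an event timestamp can come later
    timestamp: Optional[float] = None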


return RoutedValue(
    route=Route(source=self.source, waypoints=[]),
    payload=StreamsMessage(schema=schema, payload=value),
@ayirr7 (Member Author) commented Apr 22, 2025:

This identifies the schema of the messages based on the topic they come from. I chose to wrap the value in this Message and pass Message throughout the pipeline so that the user can do flexible parsing/deserialization of messages. This way, instead of baking parsing into the Source, we can do it anywhere in the pipeline using the Parser() step.

If we could just bake parsing/deserialization into the Source, then none of this would be needed.
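
In other words, the source only attaches the schema, and the Parser step uses it later, wherever it is placed. A sketch, assuming the StreamsMessage constructor shown above and a sentry_kafka_schemas-style codec lookup (raw_bytes and parser_step are illustrative names):

from sentry_kafka_schemas import get_codec

# At the source: attach the schema inferred from the topic; do not parse yet.
wrapped = StreamsMessage(schema=schema, payload=raw_bytes)

# Anywhere later in the pipeline, a Parser() step can decode and validate:
def parser_step(msg):
    codec = get_codec(msg.schema)
    parsed = codec.decode(msg.payload, validate=True)
    return StreamsMessage(schema=msg.schema, payload=parsed)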

"myunbatch",
FlatMap(
function=cast(
Union[Callable[[Message[MutableSequence[IngestMetric]]], Message[IngestMetric]], str],
ayirr7 (Member Author):

Because unbatch is a function with generic arguments, the user unfortunately has to explicitly cast it to the concrete type in their pipeline. This shouldn't be difficult because the incoming ExtensibleChain already has concrete type hints.

Obviously this specific cast() can also be avoided if users write and use a custom unbatcher.
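
For example, a hand-written unbatcher with concrete types, assuming FlatMap accepts a generator-style callable (a sketch, not the example's actual code):

from typing import Iterator, MutableSequence

def unbatch_metrics(
    msg: Message[MutableSequence[IngestMetric]],
) -> Iterator[Message[IngestMetric]]:
    # Concrete parameter/return types mean no cast() at the call site.
    for metric in msg.payload:
        yield Message(payload=metric)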

Collaborator:

I wonder whether it would be easier if we had Batch and Unbatch appliers without having to go through the FlatMap.

Collaborator:

Out of scope for now.

    Serializer(serializer=json_serializer),
)  # ExtensibleChain[bytes]

chain4 = chain3.sink(
@ayirr7 (Member Author) commented Apr 22, 2025:

I split up the pipeline like this in the example so that you can see the inferred type hints of each chain in a code editor.

Collaborator:

Please write that in a comment so whoever reads the code knows.
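
For instance (step-registration method names and sink arguments here are placeholders):

chain3 = chain2.apply(
    "myserializer",
    Serializer(serializer=json_serializer),
)  # ExtensibleChain[bytes]: the inferred type, written down for readers

chain4 = chain3.sink(
    "mysink",
)  # the sink consumes the serialized bytes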

@ayirr7 marked this pull request as ready for review April 22, 2025 00:17
@fpacifici (Collaborator) left a comment:

I think it goes in a good direction.

See the inline comments for details, but my main feedback is to be stricter on the message interface. Specifically:

  • I would separate the concept of a serialized message (pre-parsing and post-serialization, where only the headers are available) from a parsed message, where the payload is available already parsed. We will figure out later whether to relax this constraint. This would also make it clear that serialization has to happen before a sink.
  • Maybe later, introduce a difference in types between serializable messages (where we can break segments) and any other type, where we cannot break segments.

Also, what happens today for invalid messages? We will have to add support for the DLQ. It is ok to do that separately, but please call it out in the parser.

TRoute = TypeVar("TRoute")

TIn = TypeVar("TIn")
TOut = TypeVar("TOut")


# a message with a generic payload
class Message(Generic[TIn]):
Collaborator:

Agreed.
I think you want to preserve the timestamp though. That can be useful.
Re: the schema, I think it may be a good idea.


from sentry_kafka_schemas.codecs import Codec

TIn = TypeVar("TIn")
Collaborator:

Nit: this would be better named TPayload. TIn was used in the chain to represent the input of a function, as opposed to TOut.
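
i.e., something like:

from typing import Generic, TypeVar

# TPayload: the type carried by a Message.
# TIn/TOut remain reserved for the input/output of functions in the chain.
TPayload = TypeVar("TPayload")

class Message(Generic[TPayload]): ...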




"myunbatch",
FlatMap(
function=cast(
Union[Callable[[Message[MutableSequence[IngestMetric]]], Message[IngestMetric]], str],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it would be easier if we had Batch and Unbatch appliers without having to go through the FlatMap.

@@ -0,0 +1,31 @@
from typing import Any

from sentry_streams.pipeline.message import Message
Collaborator:

Please add unit tests ensuring it works for Protobuf as well.

Member:

You'll need a published Protobuf message type to do that. If you need a hand with sentry-protos, let me know.

ayirr7 (Member Author):

I'll come back to this after PTO in another PR. Thanks!
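
For the record, such a test could be a simple encode/decode round trip, along these lines (a pytest-style sketch; the codec names and the Protobuf message type are hypothetical, not this PR's actual classes):

def test_json_codec_roundtrip() -> None:
    codec = JsonCodec()  # hypothetical name for the built-in JSON codec
    payload = {"name": "spam", "value": 3}
    assert codec.decode(codec.encode(payload)) == payload

def test_proto_codec_roundtrip() -> None:
    # Needs a published Protobuf message type, e.g. from sentry-protos.
    msg = ExampleProto(name="spam", value=3)  # hypothetical message type
    codec = ProtoCodec(ExampleProto)  # hypothetical name for the Protobuf codec
    assert codec.decode(codec.encode(msg)) == msg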

Comment on lines 31 to 35

msg = StreamsMessage(message.payload, [], now, None)

routed_msg: Message[RoutedValue] = Message(
    Value(committable=message.value.committable, payload=RoutedValue(self.__route, msg))
)
Collaborator:

This is a good sign that we will not stick with Arroyo in the long run: StreamsMessage is the payload of RoutedValue, which is the payload of the value of Message. That's a lot.

ayirr7 (Member Author):

Yeah, unfortunately we had to alias our top-level API's Message as StreamsMessage in order to not clash with Arroyo's Message.

ayirr7 (Member Author):

I had kind of hoped for a solution where all of it could be managed in the top-level API code, but given that we want to expose some metadata to the user, like processing time, the StreamsMessage type has to be managed in the Arroyo steps as well.
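
Spelling out the layering the thread is referring to (restating the snippet above, innermost to outermost; message, now, and self.__route come from that snippet):

# One Kafka record ends up wrapped four deep:
streams_msg = StreamsMessage(message.payload, [], now, None)  # user-facing message
routed = RoutedValue(self.__route, streams_msg)               # adds routing info
value = Value(committable=message.value.committable, payload=routed)  # Arroyo value
arroyo_msg: Message[RoutedValue] = Message(value)             # Arroyo's own Message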

@@ -90,7 +90,7 @@ def add(self, value: Any) -> Self:
             self.offsets[partition] = max(offsets[partition], self.offsets[partition])

         else:
-            self.offsets.update(offsets)
+            self.offsets[partition] = offsets[partition]
Collaborator:

?

@ayirr7 (Member Author) commented May 2, 2025:

If you expand the code snippet: there was a bug in the else block that got fixed here.
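
A standalone sketch of the fixed logic and why the old line was wrong (hypothetical free-standing version of the add() bookkeeping, assuming both dicts map partitions to offsets):

def merge_offsets(tracked: dict, offsets: dict, partition) -> None:
    if partition in tracked:
        # keep the highest offset seen so far for a known partition
        tracked[partition] = max(offsets[partition], tracked[partition])
    else:
        # fixed: adopt the offset for this partition only.
        # The old tracked.update(offsets) copied every partition from the
        # incoming map, clobbering the max() bookkeeping for partitions
        # that were already tracked.
        tracked[partition] = offsets[partition]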

def test_adapter(
    broker: LocalBroker[KafkaPayload],
    pipeline: Pipeline,
    metric: IngestMetric,
@ayirr7 (Member Author) commented May 2, 2025:

I'll probably stop spamming the IngestMetric type everywhere in tests and examples and use a different one for some of them, especially where the topic name doesn't even match.


@ayirr7 merged commit 800c11e into main May 2, 2025
10 checks passed