Commit e3a21d1

Add some docs (#93)
* Add some content
* Link
1 parent b166a9d commit e3a21d1

10 files changed: +213 additions, -42 deletions
README.md

Lines changed: 22 additions & 23 deletions

@@ -2,37 +2,36 @@
 
 The Sentry Streaming Platform
 
-This repo contains two libraries: `sentry_streams` and `sentry_flink`.
+Sentry Streams is a distributed platform that, like most streaming platforms,
+is designed to handle real-time unbounded data streams.
 
-The first contain all the streaming api and an Arroyo based adapter to run
-the streaming applications on top of Arroyo.
+This is built primarily to allow the creation of Sentry ingestion pipelines,
+though the API provided is fully independent from the Sentry product and can
+be used to build any streaming application.
 
-The second contains the Flink adapter to run streaming applications on
-Apache Flink. This part is in a separate library because, until we will not
-be able to make it run on python 3.13 and produce wheels for python 3.13,
-it will require Java to run even in the dev environment.
+The main features are:
 
-## Quickstart
+- Kafka sources and multiple sinks. Ingestion pipelines take data from Kafka
+  and write enriched data into multiple data stores.
 
-We are going to run a streaming application on top of Arroyo.
+- Dataflow API support. This allows the creation of streaming applications
+  focusing on the application logic and pipeline topology rather than on
+  the underlying dataflow engine.
 
-1. Run `make install-dev`
+- Support for stateful and stateless transformations. The state storage is
+  provided by the platform rather than being part of the application.
 
-2. Go to the sentry_streams directory
+- Distributed execution. The primitives used to build the application can
+  be distributed on multiple nodes by configuration.
 
-3. Activate the virtual environment: `source .venv/bin/activate`
+- Hides the Kafka details, like commit policy and topic partitioning, from
+  the application.
 
-4. Run one of the examples
+- Out-of-the-box support for streaming application best practices:
+  DLQ, monitoring, health checks, etc.
 
-```
-python sentry_streams/runner.py \
-  -n test \
-  -b localhost:9092 \
-  -a arroyo \
-  sentry_streams/examples/transformer.py
-```
+- Support for Rust and Python applications.
 
-This will start an Arroyo consumer that runs the streaming application defined
-in `sentry_streams/examples/transformer.py`.
+- Support for multiple runtimes.
 
-there is a number of examples in the `sentry_streams/examples` directory.
+[Streams Documentation](https://getsentry.github.io/streams/)
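
For a sense of the Dataflow API the new README describes, here is a minimal sketch of a pipeline. It is assembled only from primitives shown in the Getting Started guide below (`streaming_source`, `Map`, `sink`); the step and topic names are illustrative:

```python
from json import dumps, loads

from sentry_streams.pipeline import Map, streaming_source

# Read JSON from the "events" topic, parse and re-serialize each message,
# and produce the result to the "transformed-events" topic.
pipeline = (
    streaming_source(name="myinput", stream_name="events")
    .apply("parser", Map(function=loads))
    .apply("serializer", Map(function=dumps))
    .sink("myoutput", stream_name="transformed-events")
)
```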
sentry_streams/docs/source/build_pipeline.rst

Lines changed: 26 additions & 0 deletions

@@ -0,0 +1,26 @@
Building a Pipeline
===================

Pipelines are defined through a Python DSL (more options will be provided) by
chaining dataflow primitives.

Chaining primitives means sending a message from one operator to the following
one.

Pipelines start with a `StreamingSource`, which represents a Kafka consumer.
They can fork and broadcast messages to multiple branches. Each branch
terminates with a sink (see the sketch below).

As of now only Python operations can be used. Rust operations will be
supported soon.

Distribution is not visible at this level, as the DSL only defines the
topology of the application, which is basically its business logic. The
distribution is defined via the deployment descriptor, so the operators can
be distributed differently in different environments.
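As an illustration of forking, the sketch below routes each message to one of
two branches. It mirrors the example in the ``ExtensibleChain`` docstring in
``chain.py``; the import path of ``segment`` and the ``Routes`` and
``routing_func`` helpers are assumptions made for the example:

.. code-block:: python

   from enum import Enum

   from sentry_streams.pipeline import Map, streaming_source
   from sentry_streams.pipeline.chain import segment  # assumed import path


   class Routes(Enum):
       ROUTE1 = "route1"
       ROUTE2 = "route2"


   def routing_func(msg: str) -> Routes:
       # Hypothetical routing logic: pick a branch per message.
       return Routes.ROUTE1 if "event" in msg else Routes.ROUTE2


   pipeline = (
       streaming_source("myinput", "events")
       .route(
           "route_to_one",
           routing_function=routing_func,
           routes={
               Routes.ROUTE1: segment(name="route1")
               .apply("transform1", Map(lambda msg: msg))
               .sink("myoutput1", "transformed-events-1"),
               Routes.ROUTE2: segment(name="route2")
               .apply("transform2", Map(lambda msg: msg))
               .sink("myoutput2", "transformed-events-2"),
           },
       )
   )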
The DSL operators are in the `chain.py` module.

.. automodule:: sentry_streams.pipeline.chain
   :members:
   :undoc-members:
   :show-inheritance:

sentry_streams/docs/source/conf.py

Lines changed: 11 additions & 1 deletion

@@ -1,3 +1,6 @@
+import os
+import sys
+
 # Configuration file for the Sphinx documentation builder.
 #
 # For the full list of built-in configuration values, see the documentation:
@@ -11,10 +14,17 @@
 author = "blank"
 release = "0.1"
 
+sys.path.insert(0, os.path.abspath("../.."))
+
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
-extensions = ["sphinxcontrib.mermaid"]
+extensions = [
+    "sphinxcontrib.mermaid",
+    "sphinx.ext.autodoc",
+]
+
+always_document_param_types = True
 
 templates_path = ["_templates"]
 exclude_patterns = ["build"]
sentry_streams/docs/source/configure_pipeline.rst

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
Runner Configuration
====================
sentry_streams/docs/source/deployment.rst

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
Deploying on Kubernetes
=======================

sentry_streams/docs/source/index.rst

Lines changed: 5 additions & 0 deletions

@@ -6,4 +6,9 @@
 .. toctree::
    :maxdepth: 2
 
+   what_for
    architecture
+   build_pipeline
+   configure_pipeline
+   runtime/arroyo
+   deployment

sentry_streams/docs/source/intro.rst

Lines changed: 123 additions & 1 deletion

@@ -1 +1,123 @@
-This is Sentry sterams

The new content of the file:

Sentry Streams is a distributed platform that, like most streaming platforms,
is designed to handle real-time unbounded data streams.

This is built primarily to allow the creation of Sentry ingestion pipelines,
though the API provided is fully independent from the Sentry product and can
be used to build any streaming application.

The main features are:

* Kafka sources and multiple sinks. Ingestion pipelines take data from Kafka
  and write enriched data into multiple data stores.

* Dataflow API support. This allows the creation of streaming applications
  focusing on the application logic and pipeline topology rather than on
  the underlying dataflow engine.

* Support for stateful and stateless transformations. The state storage is
  provided by the platform rather than being part of the application.

* Distributed execution. The primitives used to build the application can
  be distributed on multiple nodes by configuration.

* Hides the Kafka details, like commit policy and topic partitioning, from
  the application.

* Out-of-the-box support for streaming application best practices:
  DLQ, monitoring, health checks, etc.

* Support for Rust and Python applications.

* Support for multiple runtimes.

Design principles
=================

This streaming platform, in the context of Sentry ingestion, is designed
with a few principles in mind:

* Fully self-service, to speed up the time to reach production when building
  pipelines.
* Abstract the infrastructure aspects away (Kafka, delivery guarantees,
  schemas, scale, etc.) to improve stability and scale.
* Opinionated in the abstractions provided to build ingestion, to push for
  best practices and to hide the inner workings of streaming applications.
* Pipeline as a system for tuning, capacity management and architecture
  understanding.

Getting Started
===============

In order to build a streaming application and run it on top of the Sentry
Arroyo runtime, follow these steps:

1. Run a Kafka broker locally.

2. Create a new Python project and a dev environment.

3. Install `sentry_streams`:

.. code-block::

   pip install sentry_streams

4. Create a new Python module for your streaming application:

.. code-block:: python
   :linenos:

   from json import JSONDecodeError, dumps, loads
   from typing import Any, Mapping, cast

   from sentry_streams.pipeline import Filter, Map, streaming_source


   def parse(msg: str) -> Mapping[str, Any]:
       try:
           parsed = loads(msg)
       except JSONDecodeError:
           return {"type": "invalid"}

       return cast(Mapping[str, Any], parsed)


   def filter_not_event(msg: Mapping[str, Any]) -> bool:
       return bool(msg["type"] == "event")


   pipeline = (
       streaming_source(
           name="myinput",
           stream_name="events",
       )
       .apply("mymap", Map(function=parse))
       .apply("myfilter", Filter(function=filter_not_event))
       .apply("serializer", Map(function=lambda msg: dumps(msg)))
       .sink(
           "myoutput",
           stream_name="transformed-events",
       )
   )

This is a simple pipeline that takes a stream of JSON messages, parses them,
filters out the ones that are not events, serializes them back to JSON, and
produces the result to another topic.

5. Run the pipeline:

.. code-block::

   python -m sentry_streams.runner \
       -n Batch \
       --broker localhost:9092 \
       --adapter arroyo \
       <YOUR PIPELINE FILE>

6. Produce events on the `events` topic and consume them from the
   `transformed-events` topic:

.. code-block::

   echo '{"type": "event", "data": {"foo": "bar"}}' | kcat -b localhost:9092 -P -t events

.. code-block::

   kcat -b localhost:9092 -G test transformed-events
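
If everything is wired correctly, the consumer should print the re-serialized
event. For the sample message above, the expected output (assuming Python's
default `json.dumps` formatting, which the example pipeline uses) would be:

.. code-block::

   {"type": "event", "data": {"foo": "bar"}}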

7. Look for more examples in the `sentry_streams/examples` folder of the
   repository.
sentry_streams/docs/source/runtime/arroyo.rst

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
Arroyo Runtime
==============
sentry_streams/docs/source/what_for.rst

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
The rationale
=============

sentry_streams/sentry_streams/pipeline/chain.py

Lines changed: 18 additions & 17 deletions

@@ -151,24 +151,25 @@ class ExtensibleChain(Chain):
     Other steps manage the pipeline topology: sink, broadcast, route.
 
     Example:
-    ```
-    pipeline = (
-        streaming_source("myinput", "events") # Starts the pipeline
-        .apply("transform1", Map(lambda msg: msg)) # Performs an operation
-        .route( # Branches the pipeline
-            "route_to_one",
-            routing_function=routing_func,
-            routes={
-                Routes.ROUTE1: segment(name="route1") # Creates a branch
-                .apply("transform2", Map(lambda msg: msg))
-                .sink("myoutput1", "transformed-events-2"),
-                Routes.ROUTE2: segment(name="route2")
-                .apply("transform3", Map(lambda msg: msg))
-                .sink("myoutput2", "transformed-events3"),
-            },
-        )
-    )
-    ```
+
+    .. code-block:: python
+
+        pipeline = (
+            streaming_source("myinput", "events")  # Starts the pipeline
+            .apply("transform1", Map(lambda msg: msg))  # Performs an operation
+            .route(  # Branches the pipeline
+                "route_to_one",
+                routing_function=routing_func,
+                routes={
+                    Routes.ROUTE1: segment(name="route1")  # Creates a branch
+                    .apply("transform2", Map(lambda msg: msg))
+                    .sink("myoutput1", "transformed-events-2"),
+                    Routes.ROUTE2: segment(name="route2")
+                    .apply("transform3", Map(lambda msg: msg))
+                    .sink("myoutput2", "transformed-events3"),
+                },
+            )
+        )
+
     """
 
     def __init__(self, name: str) -> None:
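
The docstring example references `routing_func` and `Routes`, which are
user-defined rather than part of the library. A minimal sketch of what they
could look like (hypothetical names and logic, matching the assumptions in the
Building a Pipeline sketch above):

```python
from enum import Enum


class Routes(Enum):
    ROUTE1 = "route1"
    ROUTE2 = "route2"


def routing_func(msg: str) -> Routes:
    # Hypothetical: any per-message predicate can drive the routing.
    return Routes.ROUTE1 if "event" in msg else Routes.ROUTE2
```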
