From e9161a5735cafddcb970d11dcf46218dafb77db2 Mon Sep 17 00:00:00 2001 From: Harold Dost Date: Tue, 31 Dec 2024 00:34:51 +0100 Subject: [PATCH] Update RFC to include more of the model. --- rfcs/2024-03-22-20170-trace-data-model.md | 273 +++++++++++++++++----- 1 file changed, 211 insertions(+), 62 deletions(-) diff --git a/rfcs/2024-03-22-20170-trace-data-model.md b/rfcs/2024-03-22-20170-trace-data-model.md index 7c7703df67ac8f..f3f315f2057b70 100644 --- a/rfcs/2024-03-22-20170-trace-data-model.md +++ b/rfcs/2024-03-22-20170-trace-data-model.md @@ -1,7 +1,7 @@ # RFC 20170 - 2024-03-22 - Establish an Internal Trace Model -As of today there is an already existing Log and Metric event model, the purpose is to establish an -internal model for describing a trace event. +As of today there is an already existing Log and Metric event model, I propose to establish a +canonical internal model for describing a trace event in Vector. ## Context @@ -13,9 +13,7 @@ to establish that data model. ## Cross cutting concerns -- This is part of the effort to support [sending OpenTelemetry traces], it would also support - ingesting OpenTelemetry. As well as to provide a path for a canonical model which traces can be - translated to/from ideally without loss of fidelity. +This is part of the effort to support [sending and receiving Opentelemetry signals][otel-signals-pr] ## Scope @@ -29,7 +27,6 @@ to establish that data model. ### Out of scope -- Change the `TraceEvent` type internally from a re-typing of `LogEvent`. - Define the settings and configuration for sinks and sources using the data model. ## Pain @@ -47,94 +44,229 @@ to establish that data model. ### Implementation -For the initial Implementation, the data model will be [v1.1.0][otel-proto-110] of the OpenTelemetry Proto. - -Therefore the only necessary mapping will be how the Proto data model maps to the `TraceEvent`. +For the initial Implementation, the data model will be [v1.5.0][otel-proto-150] of the OpenTelemetry Proto. +Specifically, the `opentelemetry/proto/trace/v1/trace.proto` + +The below data models are meant to represent close to the final internal structures, but there may +be some variations to improve performance made in the final implementation; `Inner`, `Arc`, etc. + +#### TraceEvent + +```rust +pub struct TraceEvent { + pub resource: Resource, + pub scope_spans: Vec, + pub schema_url: Option, +} +``` + +#### ScopeSpan + +Leaving the scope optional since not everything is guaranteed to implement this. + +```rust +pub struct ScopeSpans { + pub scope: Option, + pub spans: Vec, + pub schema_url: Option, +} +``` + +#### Resource + +```rust +pub struct Resource { + pub attributes: ObjectMap, + pub dropped_attributes_count: u32, +} +``` + +#### InstrumentationScope + +```rust +pub struct InstrumentationScope { + pub name: String, + pub version: String, + pub attributes: ObjectMap, + pub dropped_attributes_count: u32, +} +``` + +#### Span + +```rust +pub struct Span { + pub name: String, + pub is_remote: bool, + pub trace_id: TraceId(u128), + pub span_id: SpanId(u64), + pub trace_state: TraceState, + pub span_kind: SpanKind, + pub flags: TraceFlags(u8), + pub start_time: u64, + pub end_time: u64, + pub dropped_attributes_count: u32, + pub dropped_events_count: u32, + pub dropped_links_count: u32, + pub attributes: ObjectMap, + pub events: Vec, + pub links: Vec, + pub parent_span_id: Option, + pub status: Option, +} +``` + +#### SpanStatus + +The message MUST be ignored for `SpanStatusCode::Unset` and `SpanStatusCode::Ok`, since +it is intended to be used as an error message. + +```rust +pub struct SpanStatus { + pub code: SpanStatusCode, + pub message: Option, +} +``` + +#### SpanStatusCode + +Based on the [OTel semantics][otel-span-status-semantics] + +```rust +pub enum SpanStatusCode { + Unset, + Ok, + Error, +} +``` + +#### SpanKind + +```rust +pub enum SpanKind{ + Unspecified, + Internal, + Server, + Client, + Producer, + Consumer, +} +``` + +#### SpanEvent + +```rust +pub struct SpanEvent { + pub name: String, + pub time: u64, + pub attributes: ObjectMap, + pub dropped_attributes_count: u32, +} +``` + +#### TraceLink + +```rust +pub struct TraceLink { + pub trace_id: TraceId(u128), + pub span_id: SpanId(u64), + pub trace_state: TraceState, + pub dropped_attributes_count: u32, + pub attributes: ObjectMap, + pub flags: TraceFlags(u8), +} +``` + +#### TraceFlags + +Follows the [Trace Flags W3C Spec][w3c-trace-context-trace-flags]. + +#### TraceState + +Follows the [TraceState Header W3C Spec][w3c-trace-context-tracestate-header] + +```rust +pub struct TraceState(Option>) +``` #### Mapping OpenTelemetry Tracing to Vector Data Model This section reflects similarly to how the current Log `source` for `opentelemetry` works. +As `TraceEvent` is a representation of an OTLP `ResourceSpans` message. In Datadog the ingestion +format expects an Map of Array of Spans. The map key is the `trace_id`. + As OpenTelemetry is the primary model the this will only describe the data types. The top level `TraceEvent` will represent a top level proto `message ResourceSpans`. `Value::Object`. The remaining proto to mappings can be found: -| Proto Type | Vector Type | -|------------|-------------| -| `message AnyValue` | `Value` of the inner | -| `message ArrayValue` | `Value::Array` with elements in `values` | -| `message KeyValueList` | `Value::Array` with elements in `values` | -| Any unspecified `message` types | `Value::Object` | -| All `enum` types | `Value::Integer` | -| `string` | `Value::Bytes` | -| `fixed64` for datetime values | `Value::Timestamp` | -| `fixed64`,`fixed32`,`uint32` for generic values | `Value::Integer` | -| `bytes` | `Value::Bytes` | -| `repeated` messages | `Value::Array` of `Value::Object` | - -The keys for `message` come the field names for `Value::Object`, and MUST follow a "Snake Case" format. +| Proto Type | Vector Type | +|---------------------------------|------------------------------------------| +| `message AnyValue` | `Value` of the inner | +| `message ArrayValue` | `Value::Array` with elements in `values` | +| `message KeyValueList` | `Value::Array` with elements in `values` | +| Any unspecified `message` types | `Value::Object` | +| All `enum` types | `Value::Integer` | +| `string` | `Value::Bytes` | +| `fixed64` for datetime values | `Value::Timestamp` | +| `fixed64`,`fixed32`,`uint32` \ | `Value::Integer` | +| for generic values | | +| `bytes` | `Value::Bytes` | +| `repeated` messages | `Value::Array` of `Value::Object` | + +The keys for `message` come the field names for `Value::Object`, and MUST follow a "snake case" format. As an example, instead of `traceId` and `droppedEventCount`, use `trace_id` and `dropped_event_count`. #### Mapping Vector Tracing to the Datadog Data Model The basis of Mapping between the Internal Trace format and the Datadog format can be based on the -[Datadog agent][datadog-agent-otlp-ingest-09]. +[Datadog agent][datadog-agent-otlp-ingest-7601]. ##### Span Mapping Vector fields are shown using VRL path notation. | Datadog Field | Vector Field | Comments | -|---------------|--------------|----------| -| Name | X | See "Span Name Mapping" | -| TraceID | `.["scope_spans"][i]["spans"][j]["trace_id"]` | | -| SpanID | `.["scope_spans"][i]["spans"][j]["span_id"]` | | -| ParentID | `.["scope_spans"][i]["spans"][j]["parent_span_id"]` | | -| Start | `.["scope_spans"][i]["spans"][j]["trace_id"]` | | -| Duration | `.["scope_spans"][i]["spans"][j]["trace_id"]` | | -| Service | `.["resource"]["attributes"]["service.name"]` | | -| Resource | `.["scope_spans"][i]["spans"][j]["name"]` | | -| Meta | X | See "Attribute Handling" | -| Metrics | `.["scope_spans"][i]["spans"][j]["attributes"][k]` | Metrics[k] is assigned, for Integer or Double type values. | +|---------------|---------------------------------------------|--------------------------------| +| Name | See [Span Name Mapping][#span-name-mapping] | | +| TraceID | `.scope_spans[i].spans[j].trace_id` | | +| SpanID | `.scope_spans[i].spans[j].span_id` | | +| ParentID | `.scope_spans[i].spans[j].parent_span_id` | | +| Start | `.scope_spans[i].spans[j].start_time` | | +| Duration | `.scope_spans[i].spans[j].duration` | | +| Service | `.resource.attributes["service.name"]` | | +| Resource | `.scope_spans[i].spans[j].name` | | +| Meta | X | See "Attribute Handling" | +| Metrics | `.scope_spans[i].spans[j].attributes[k]` | Metrics[k] is assigned, for \ | +| | | Integer or Double type values. | ###### Attribute Handling For Mapping from Vector to Datadog the scope of ResourceSpan attributes -> ScopeSpan attributes and Span attributes are merged. The reverse it will not be expanded. -To begin `.["resource"]["attributes"][*]` are placed in the `Meta` map. +To begin `.resource["attributes"][*]` are placed in the `Meta` map. -After for any `.["scope_spans"][i]["spans"][j]["attributes"][k]` Meta[k] for any non-Integer and +After for any `.scope_spans[i].spans"][j]["attributes"][k]` Meta[k] for any non-Integer and non-Double type values. Transcoding the values to Strings. There are a number of Datadog specific transcodings, but it generally has to do with translating -semantic conventions to [Datadog conventions][datadog-agent-otlp-ingest-09]. +semantic conventions to [Datadog conventions][datadog-agent-otlp-ingest-7601]. ###### Span Name Mapping -Follow [Datadog agent][datadog-agent-otlp-ingest-09] for OTLP -> Datadog. For Datadog -> OTLP Span +Follow [Datadog agent][datadog-agent-otlp-ingest-7601] for OTLP -> Datadog. For Datadog -> OTLP Span Name. ###### Span Kinds and Types | Vector Span Kind | Datadog Span Type | Comments | |------------------|-------------------|----------| -| Server | "web" | | -| Client | "cache" | If the "db.system" exists in any of the Attribute scopes AND value matches -"redis" or "memcached" | -| Client | "db" | If the "db.system" exists in any of the Attribute scopes | -| Client | "http" | If the above conditions are not true. | -| * | "custom" | | - - -##### From Vector -> Datadog - -As `TraceEvent` is a representation of a OTLP `ResourceSpans` message. In Datadog the ingestion -format expects an Map of Array of Spans. The map key is the `trace_id`. - -##### From Datadog -> Vector - -There's not a known advantage to providing this. +| Server | "web" | | +| Client | "cache" | If the "db.system" exists in any of the Attribute scopes AND value matches "redis" or "memcached" | +| Client | "db" | If the "db.system" exists in any of the Attribute scopes | +| Client | "http" | If the above conditions are not true. | +| * | "custom" | | ## Rationale @@ -142,10 +274,9 @@ As of right now there is no alternatives to this, and providing this will allow between trace formats. Not doing this will prevent the ["must have" OpenTelemetry issue][github-top-level-otel-issue] from being able to be completed. - ## Drawbacks -- TBD +None found so far. ## Prior Art @@ -157,10 +288,23 @@ able to be completed. ## Alternatives -### Using OpenTelemetry-Inspired Model +### Continue with existing log-based model + +This requires users to navigate an imprecise model for traces which essentially works as a btree +which can be defined be every input and output. Transforms could be provided, but this would still +require a standard model so that 1-1 transforms would not need to be provided for each input/output. -Rather than needing to respecify the definitions and re-create work done by the wider OpenTelemetry -community we can re-use the language and guidelines used in the OpenTelemetry product. +Furthermore, it was already determined at an earlier date that an [internal trace model] was +required. + +### Using OpenTelemetry-inspired model on the existing log model + +A logical model built on top of the existing log model would allow for a quick implemention, but +lacks some of the benefits of having a more regular typing. It still has advantage of rather than +needing to respecify the definitions and re-create work done by the wider OpenTelemetry +community we can re-use the language and guidelines used in the OpenTelemetry product. However, it +lacks opportunities for optimization of data processing. Additionally, refactoring to mappings +in the model in the future is more difficult as with will require reliance on convention. ## Outstanding Questions @@ -190,6 +334,11 @@ Incremental steps to execute this change. These will be converted to issues afte [`datadog_traces` sink]:https://vector.dev/docs/reference/configuration/sinks/datadog_traces/ [validating schemas]:https://github.com/vectordotdev/vector/pull/9388 [sending OpenTelemetry traces]:https://github.com/vectordotdev/vector/issues/17308 -[otel-proto-110]:https://github.com/open-telemetry/opentelemetry-proto/releases/tag/v1.1.0 -[datadog-agent-otlp-ingest-09]:https://github.com/DataDog/datadog-agent/blob/v0.9.0/pkg/trace/api/otlp.go#L307 +[otel-proto-150]:https://github.com/open-telemetry/opentelemetry-proto/releases/tag/v1.5.0 +[datadog-agent-otlp-ingest-7601]:https://github.com/DataDog/datadog-agent/blob/7.60.1/pkg/trace/api/otlp.go#L489 [github-top-level-otel-issue]:https://github.com/vectordotdev/vector/issues/1444 +[w3c-trace-context-tracestate-header]:https://www.w3.org/TR/trace-context/#tracestate-header +[w3c-trace-context-trace-flags]: https://www.w3.org/TR/trace-context/#trace-flags +[otel-span-status-semantics]:https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status +[otel-signals-pr]:https://github.com/vectordotdev/vector/issues/1444 +