Skip to content

Commit

Permalink
Update RFC to include more of the model.
Browse files Browse the repository at this point in the history
  • Loading branch information
hdost committed Dec 30, 2024
1 parent 6784504 commit e9161a5
Showing 1 changed file with 211 additions and 62 deletions.
273 changes: 211 additions & 62 deletions rfcs/2024-03-22-20170-trace-data-model.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# RFC 20170 - 2024-03-22 - Establish an Internal Trace Model

As of today there is an already existing Log and Metric event model, the purpose is to establish an
internal model for describing a trace event.
As of today there is an already existing Log and Metric event model, I propose to establish a
canonical internal model for describing a trace event in Vector.

## Context

Expand All @@ -13,9 +13,7 @@ to establish that data model.

## Cross cutting concerns

- This is part of the effort to support [sending OpenTelemetry traces], it would also support
ingesting OpenTelemetry. As well as to provide a path for a canonical model which traces can be
translated to/from ideally without loss of fidelity.
This is part of the effort to support [sending and receiving Opentelemetry signals][otel-signals-pr]

## Scope

Expand All @@ -29,7 +27,6 @@ to establish that data model.

### Out of scope

- Change the `TraceEvent` type internally from a re-typing of `LogEvent`.
- Define the settings and configuration for sinks and sources using the data model.

## Pain
Expand All @@ -47,105 +44,239 @@ to establish that data model.

### Implementation

For the initial Implementation, the data model will be [v1.1.0][otel-proto-110] of the OpenTelemetry Proto.

Therefore the only necessary mapping will be how the Proto data model maps to the `TraceEvent`.
For the initial Implementation, the data model will be [v1.5.0][otel-proto-150] of the OpenTelemetry Proto.
Specifically, the `opentelemetry/proto/trace/v1/trace.proto`

The below data models are meant to represent close to the final internal structures, but there may
be some variations to improve performance made in the final implementation; `Inner`, `Arc`, etc.

#### TraceEvent

```rust
pub struct TraceEvent {
pub resource: Resource,
pub scope_spans: Vec<ScopeSpan>,
pub schema_url: Option<String>,
}
```

#### ScopeSpan

Leaving the scope optional since not everything is guaranteed to implement this.

```rust
pub struct ScopeSpans {
pub scope: Option<InstrumentationScope>,
pub spans: Vec<Span>,
pub schema_url: Option<String>,
}
```

#### Resource

```rust
pub struct Resource {
pub attributes: ObjectMap,
pub dropped_attributes_count: u32,
}
```

#### InstrumentationScope

```rust
pub struct InstrumentationScope {
pub name: String,
pub version: String,
pub attributes: ObjectMap,
pub dropped_attributes_count: u32,
}
```

#### Span

```rust
pub struct Span {
pub name: String,
pub is_remote: bool,
pub trace_id: TraceId(u128),
pub span_id: SpanId(u64),
pub trace_state: TraceState,
pub span_kind: SpanKind,
pub flags: TraceFlags(u8),
pub start_time: u64,
pub end_time: u64,
pub dropped_attributes_count: u32,
pub dropped_events_count: u32,
pub dropped_links_count: u32,
pub attributes: ObjectMap,
pub events: Vec<SpanEvent>,
pub links: Vec<TraceLink>,
pub parent_span_id: Option<SpanId>,
pub status: Option<SpanStatus>,
}
```

#### SpanStatus

The message MUST be ignored for `SpanStatusCode::Unset` and `SpanStatusCode::Ok`, since
it is intended to be used as an error message.

```rust
pub struct SpanStatus {
pub code: SpanStatusCode,
pub message: Option<String>,
}
```

#### SpanStatusCode

Based on the [OTel semantics][otel-span-status-semantics]

```rust
pub enum SpanStatusCode {
Unset,
Ok,
Error,
}
```

#### SpanKind

```rust
pub enum SpanKind{
Unspecified,
Internal,
Server,
Client,
Producer,
Consumer,
}
```

#### SpanEvent

```rust
pub struct SpanEvent {
pub name: String,
pub time: u64,
pub attributes: ObjectMap,
pub dropped_attributes_count: u32,
}
```

#### TraceLink

```rust
pub struct TraceLink {
pub trace_id: TraceId(u128),
pub span_id: SpanId(u64),
pub trace_state: TraceState,
pub dropped_attributes_count: u32,
pub attributes: ObjectMap,
pub flags: TraceFlags(u8),
}
```

#### TraceFlags

Follows the [Trace Flags W3C Spec][w3c-trace-context-trace-flags].

#### TraceState

Follows the [TraceState Header W3C Spec][w3c-trace-context-tracestate-header]

Check warning on line 186 in rfcs/2024-03-22-20170-trace-data-model.md

View workflow job for this annotation

GitHub Actions / Check Spelling

`tracestate` is not a recognized word. (unrecognized-spelling)

```rust
pub struct TraceState(Option<VecDeque<(String,String)>>)
```

#### Mapping OpenTelemetry Tracing to Vector Data Model

This section reflects similarly to how the current Log `source` for `opentelemetry` works.

As `TraceEvent` is a representation of an OTLP `ResourceSpans` message. In Datadog the ingestion
format expects an Map of Array of Spans. The map key is the `trace_id`.

As OpenTelemetry is the primary model the this will only describe the data types. The top level
`TraceEvent` will represent a top level proto `message ResourceSpans`.
`Value::Object`. The remaining proto to mappings can be found:

| Proto Type | Vector Type |
|------------|-------------|
| `message AnyValue` | `Value` of the inner |
| `message ArrayValue` | `Value::Array` with elements in `values` |
| `message KeyValueList` | `Value::Array` with elements in `values` |
| Any unspecified `message` types | `Value::Object` |
| All `enum` types | `Value::Integer` |
| `string` | `Value::Bytes` |
| `fixed64` for datetime values | `Value::Timestamp` |
| `fixed64`,`fixed32`,`uint32` for generic values | `Value::Integer` |
| `bytes` | `Value::Bytes` |
| `repeated` messages | `Value::Array` of `Value::Object` |

The keys for `message` come the field names for `Value::Object`, and MUST follow a "Snake Case" format.
| Proto Type | Vector Type |
|---------------------------------|------------------------------------------|
| `message AnyValue` | `Value` of the inner |
| `message ArrayValue` | `Value::Array` with elements in `values` |
| `message KeyValueList` | `Value::Array` with elements in `values` |
| Any unspecified `message` types | `Value::Object` |
| All `enum` types | `Value::Integer` |
| `string` | `Value::Bytes` |
| `fixed64` for datetime values | `Value::Timestamp` |
| `fixed64`,`fixed32`,`uint32` \ | `Value::Integer` |
| for generic values | |
| `bytes` | `Value::Bytes` |
| `repeated` messages | `Value::Array` of `Value::Object` |

The keys for `message` come the field names for `Value::Object`, and MUST follow a "snake case" format.
As an example, instead of `traceId` and `droppedEventCount`, use `trace_id` and `dropped_event_count`.

#### Mapping Vector Tracing to the Datadog Data Model

The basis of Mapping between the Internal Trace format and the Datadog format can be based on the
[Datadog agent][datadog-agent-otlp-ingest-09].
[Datadog agent][datadog-agent-otlp-ingest-7601].

##### Span Mapping

Vector fields are shown using VRL path notation.

| Datadog Field | Vector Field | Comments |
|---------------|--------------|----------|
| Name | X | See "Span Name Mapping" |
| TraceID | `.["scope_spans"][i]["spans"][j]["trace_id"]` | |
| SpanID | `.["scope_spans"][i]["spans"][j]["span_id"]` | |
| ParentID | `.["scope_spans"][i]["spans"][j]["parent_span_id"]` | |
| Start | `.["scope_spans"][i]["spans"][j]["trace_id"]` | |
| Duration | `.["scope_spans"][i]["spans"][j]["trace_id"]` | |
| Service | `.["resource"]["attributes"]["service.name"]` | |
| Resource | `.["scope_spans"][i]["spans"][j]["name"]` | |
| Meta | X | See "Attribute Handling" |
| Metrics | `.["scope_spans"][i]["spans"][j]["attributes"][k]` | Metrics[k] is assigned, for Integer or Double type values. |
|---------------|---------------------------------------------|--------------------------------|
| Name | See [Span Name Mapping][#span-name-mapping] | |
| TraceID | `.scope_spans[i].spans[j].trace_id` | |
| SpanID | `.scope_spans[i].spans[j].span_id` | |
| ParentID | `.scope_spans[i].spans[j].parent_span_id` | |
| Start | `.scope_spans[i].spans[j].start_time` | |
| Duration | `.scope_spans[i].spans[j].duration` | |
| Service | `.resource.attributes["service.name"]` | |
| Resource | `.scope_spans[i].spans[j].name` | |
| Meta | X | See "Attribute Handling" |
| Metrics | `.scope_spans[i].spans[j].attributes[k]` | Metrics[k] is assigned, for \ |
| | | Integer or Double type values. |

###### Attribute Handling

For Mapping from Vector to Datadog the scope of ResourceSpan attributes -> ScopeSpan attributes and
Span attributes are merged. The reverse it will not be expanded.

To begin `.["resource"]["attributes"][*]` are placed in the `Meta` map.
To begin `.resource["attributes"][*]` are placed in the `Meta` map.

After for any `.["scope_spans"][i]["spans"][j]["attributes"][k]` Meta[k] for any non-Integer and
After for any `.scope_spans[i].spans"][j]["attributes"][k]` Meta[k] for any non-Integer and
non-Double type values. Transcoding the values to Strings.

There are a number of Datadog specific transcodings, but it generally has to do with translating
semantic conventions to [Datadog conventions][datadog-agent-otlp-ingest-09].
semantic conventions to [Datadog conventions][datadog-agent-otlp-ingest-7601].

###### Span Name Mapping

Follow [Datadog agent][datadog-agent-otlp-ingest-09] for OTLP -> Datadog. For Datadog -> OTLP Span
Follow [Datadog agent][datadog-agent-otlp-ingest-7601] for OTLP -> Datadog. For Datadog -> OTLP Span
Name.

###### Span Kinds and Types

| Vector Span Kind | Datadog Span Type | Comments |
|------------------|-------------------|----------|
| Server | "web" | |
| Client | "cache" | If the "db.system" exists in any of the Attribute scopes AND value matches
"redis" or "memcached" |
| Client | "db" | If the "db.system" exists in any of the Attribute scopes |
| Client | "http" | If the above conditions are not true. |
| * | "custom" | |


##### From Vector -> Datadog

As `TraceEvent` is a representation of a OTLP `ResourceSpans` message. In Datadog the ingestion
format expects an Map of Array of Spans. The map key is the `trace_id`.

##### From Datadog -> Vector

There's not a known advantage to providing this.
| Server | "web" | |
| Client | "cache" | If the "db.system" exists in any of the Attribute scopes AND value matches "redis" or "memcached" |
| Client | "db" | If the "db.system" exists in any of the Attribute scopes |
| Client | "http" | If the above conditions are not true. |
| * | "custom" | |

## Rationale

As of right now there is no alternatives to this, and providing this will allow for interoperability
between trace formats. Not doing this will prevent the ["must have" OpenTelemetry issue][github-top-level-otel-issue] from being
able to be completed.


## Drawbacks

- TBD
None found so far.

## Prior Art

Expand All @@ -157,10 +288,23 @@ able to be completed.

## Alternatives

### Using OpenTelemetry-Inspired Model
### Continue with existing log-based model

This requires users to navigate an imprecise model for traces which essentially works as a btree
which can be defined be every input and output. Transforms could be provided, but this would still
require a standard model so that 1-1 transforms would not need to be provided for each input/output.

Rather than needing to respecify the definitions and re-create work done by the wider OpenTelemetry
community we can re-use the language and guidelines used in the OpenTelemetry product.
Furthermore, it was already determined at an earlier date that an [internal trace model] was
required.

### Using OpenTelemetry-inspired model on the existing log model

A logical model built on top of the existing log model would allow for a quick implemention, but

Check warning on line 302 in rfcs/2024-03-22-20170-trace-data-model.md

View workflow job for this annotation

GitHub Actions / Check Spelling

`implemention` is not a recognized word. (unrecognized-spelling)
lacks some of the benefits of having a more regular typing. It still has advantage of rather than
needing to respecify the definitions and re-create work done by the wider OpenTelemetry
community we can re-use the language and guidelines used in the OpenTelemetry product. However, it
lacks opportunities for optimization of data processing. Additionally, refactoring to mappings
in the model in the future is more difficult as with will require reliance on convention.

## Outstanding Questions

Expand Down Expand Up @@ -190,6 +334,11 @@ Incremental steps to execute this change. These will be converted to issues afte
[`datadog_traces` sink]:https://vector.dev/docs/reference/configuration/sinks/datadog_traces/
[validating schemas]:https://github.com/vectordotdev/vector/pull/9388
[sending OpenTelemetry traces]:https://github.com/vectordotdev/vector/issues/17308
[otel-proto-110]:https://github.com/open-telemetry/opentelemetry-proto/releases/tag/v1.1.0
[datadog-agent-otlp-ingest-09]:https://github.com/DataDog/datadog-agent/blob/v0.9.0/pkg/trace/api/otlp.go#L307
[otel-proto-150]:https://github.com/open-telemetry/opentelemetry-proto/releases/tag/v1.5.0
[datadog-agent-otlp-ingest-7601]:https://github.com/DataDog/datadog-agent/blob/7.60.1/pkg/trace/api/otlp.go#L489
[github-top-level-otel-issue]:https://github.com/vectordotdev/vector/issues/1444
[w3c-trace-context-tracestate-header]:https://www.w3.org/TR/trace-context/#tracestate-header

Check warning on line 340 in rfcs/2024-03-22-20170-trace-data-model.md

View workflow job for this annotation

GitHub Actions / Check Spelling

`tracestate` is not a recognized word. (unrecognized-spelling)

Check warning on line 340 in rfcs/2024-03-22-20170-trace-data-model.md

View workflow job for this annotation

GitHub Actions / Check Spelling

`tracestate` is not a recognized word. (unrecognized-spelling)
[w3c-trace-context-trace-flags]: https://www.w3.org/TR/trace-context/#trace-flags
[otel-span-status-semantics]:https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status
[otel-signals-pr]:https://github.com/vectordotdev/vector/issues/1444

0 comments on commit e9161a5

Please sign in to comment.