Support validating a message with JSON schema #19386

Open
blake-mealey opened this issue Dec 14, 2023 · 3 comments
Labels
  • domain: processing (anything related to processing Vector's events: parsing, merging, reducing, etc.)
  • domain: vrl (anything related to the Vector Remap Language)
  • type: feature (a value-adding code addition that introduces new functionality)

Comments

@blake-mealey
Contributor

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

We're using Vector as a usage event pipeline. We currently define a JSON schema for each of our event types, which we use to validate events at write time. However, we've been considering using Vector as the entry point for some new event types (creating events from an S3 bucket that uses a different format). In that case, it would be nice to transform the data into the correct shape and then have the pipeline validate it against a schema to confirm it is correct.

Attempted Solutions

So far we haven't attempted anything, but if first-class support is not added, I think I will attempt to write a tool that generates a VRL script to validate an event against a JSON schema.

Proposal

  1. A new global configuration option that defines where to load JSON schemas from (similar to enrichment_tables)
  2. A new VRL function to validate a value against a JSON schema by name

For example, the global configuration may look like:

schemas:
  my_schema:
    type: json_schema
    file_path: /vector-config/schemas/my_schema.json

And the VRL function usage might look like:

is_valid, err = validate_schema(., "my_schema")

References

No response

Version

No response

@blake-mealey blake-mealey added the type: feature A value-adding code addition that introduce new functionality. label Dec 14, 2023
@dsmith3197 dsmith3197 added domain: processing Anything related to processing Vector's events (parsing, merging, reducing, etc.) domain: vrl Anything related to the Vector Remap Language labels Dec 15, 2023
@Freakin

Freakin commented May 17, 2024

This would be extremely helpful. I have the same use case.

@movinfinex
Contributor

Instead of a VRL function, we could easily add a new json_schema condition type. That would cover the common use cases:

  • Conditions can be used in tests. (Most of my test conditions could be replaced with JSON schemas.)
  • Conditions can be used in transforms. In particular, the route transform allows you to handle unmatched events and has out-of-the-box metrics for matched/unmatched events, too.
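To make the test use case concrete, here is a rough sketch of a Vector unit test. The first condition is a `vrl` condition of the kind that works today; the second is the hypothetical `json_schema` condition, whose `type` value and inline `schema` field are assumptions and not an existing Vector feature. The transform name `shape_event` and the field `alpha` are placeholders.

```yaml
tests:
  - name: shaped event matches the expected schema
    inputs:
      - insert_at: shape_event        # placeholder transform name
        type: log
        log_fields:
          alpha: "hello"
    outputs:
      - extract_from: shape_event
        conditions:
          # Works today: a hand-written VRL assertion per field.
          - type: vrl
            source: |
              assert!(is_string(.alpha), "alpha must be a string")
          # Hypothetical: the proposed condition type (not implemented).
          - type: json_schema
            schema:
              type: object
              properties:
                alpha:
                  type: string
```

With a schema-based condition, the per-field assertions would collapse into one declarative block that could be shared with whatever validates events at write time.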

@blake-mealey
Contributor Author

blake-mealey commented Oct 10, 2024

I agree that would be a good solution. I'm picturing the syntax as something like:

my_condition:
  type: json_schema
  # Property path to validate. Optional, defaults to `.`
  path: .nested.property
  # The JSON schema
  schema:
    type: object
    properties:
      alpha:
        type: string

That said, one advantage of supporting this in VRL is that it would significantly reduce the number of transforms needed for a pipeline like mine. With the VRL approach, a single remap transform could check each event's type and then validate the event against the matching schema. With the condition approach, I would need an initial route transform that checks the event type and fans out to a separate validation route transform for each event type.

Using a VRL JSON schema check:

flowchart TD
    source --> verify_all_event_types -->|fail| verify_all_event_types._unmatched
    verify_all_event_types._unmatched --> dlq_sink
    verify_all_event_types --->|pass| valid_sink_1 & valid_sink_2
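
As a rough sketch, the VRL variant could be configured along these lines. `validate_schema` is the function proposed in this issue and does not exist in VRL today, so this config would not load as written. The schema naming scheme (`event_type_<n>`), the `.event_type` field, and the console sinks are assumptions for illustration. Note that a remap transform exposes rejected events on a `dropped` output rather than the `_unmatched` output shown in the diagram.

```yaml
transforms:
  verify_all_event_types:
    type: remap
    inputs: [source]
    # Reroute failing events to `verify_all_event_types.dropped`
    # instead of discarding them.
    drop_on_error: true
    drop_on_abort: true
    reroute_dropped: true
    source: |
      # Assumption: schemas are registered as "event_type_<n>".
      event_type = to_string(.event_type) ?? "unknown"
      # Hypothetical function from this proposal (not implemented).
      is_valid, err = validate_schema(., "event_type_" + event_type)
      if err != null || !is_valid {
        abort
      }

sinks:
  valid_sink_1:
    type: console                     # placeholder sink type
    inputs: [verify_all_event_types]
    encoding:
      codec: json
  dlq_sink:
    type: console                     # placeholder sink type
    inputs: [verify_all_event_types.dropped]
    encoding:
      codec: json
```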

Using a condition JSON schema check:

flowchart TD
    source --> route_by_event_type
    route_by_event_type -->|event_type is 1| verify_event_type_1 -->|fail| verify_event_type_1._unmatched
    route_by_event_type -->|event_type is 2| verify_event_type_2 -->|fail| verify_event_type_2._unmatched
    route_by_event_type -->|event_type is 3| verify_event_type_3 -->|fail| verify_event_type_3._unmatched
    verify_event_type_1._unmatched --> dlq_sink
    verify_event_type_2._unmatched --> dlq_sink
    verify_event_type_3._unmatched --> dlq_sink
    verify_event_type_1 --->|pass| valid_sink_1 & valid_sink_2
    verify_event_type_2 --->|pass| valid_sink_1 & valid_sink_2
    verify_event_type_3 --->|pass| valid_sink_1 & valid_sink_2
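
For comparison, a sketch of the condition variant, showing only one of the three event types. The `route` transform and its `_unmatched` output exist today; the `json_schema` condition type is hypothetical and follows the syntax pictured earlier in this comment. The `.event_type == 1` check and the console sinks are assumptions for illustration.

```yaml
transforms:
  route_by_event_type:
    type: route
    inputs: [source]
    route:
      event_type_1: '.event_type == 1'
      # event_type_2 and event_type_3 would follow the same pattern.

  verify_event_type_1:
    type: route
    inputs: [route_by_event_type.event_type_1]
    route:
      pass:
        type: json_schema             # hypothetical condition type
        schema:
          type: object
          properties:
            alpha:
              type: string

sinks:
  valid_sink_1:
    type: console                     # placeholder sink type
    inputs: [verify_event_type_1.pass]
    encoding:
      codec: json
  dlq_sink:
    type: console                     # placeholder sink type
    inputs: [verify_event_type_1._unmatched]
    encoding:
      codec: json
```

Each additional event type adds another validation transform and another pair of entries in the sink inputs, which is the fan-out the diagram illustrates.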

It does sound like the condition would be easier to implement, though. Maybe we could start with that and consider implementing the VRL check later?


4 participants