
BaseModel documents and change to schema generation #337

Open
evalott100 opened this issue Nov 21, 2024 · 5 comments · May be fixed by #343
evalott100 commented Nov 21, 2024

There's been some demand for pydantic BaseModel versions of the documents.

I propose we change event-model document generation to allow for these, in a backwards compatible way.

1: Converting the current jsonschema to pydantic models

Most TypedDict document definitions only add {"additionalProperties": False} to the outputted schema, which is implicit in pydantic models, so most pydantic documents will be identical to the current ones with TypedDict swapped out for BaseModel. In a few places, though, we add more complex logic to the schema.

Run Stop

In run-stop we have the following extra schema:

RUN_STOP_EXTRA_SCHEMA = {
    "patternProperties": {"^([^./]+)$": {"$ref": "#/$defs/DataType"}},
    "additionalProperties": False,
}

Which we can represent in pydantic as:

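from pydantic import BaseModel, ValidationError, root_validator

class RunStop(BaseModel):
    data_type: DataType
    # ... other non `DataType` fields

    class Config:
        extra = 'allow'

    @root_validator(pre=True)
    def validate_additional_fields(cls, values):
        for key, value in values.items():
            if key in cls.__fields__:
                continue
            # Extra keys must match "^([^./]+)$" and hold a valid DataType.
            if not key or '.' in key or '/' in key:
                raise ValueError(f"Extra key {key} is not allowed.")
            try:
                # parse_obj takes an object; parse_raw would expect a JSON string.
                DataType.parse_obj(value)
            except ValidationError as err:
                raise ValueError(f"Extra non-datatype {key} received.") from err
        return values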
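
To make the key rule in RUN_STOP_EXTRA_SCHEMA concrete, here is a minimal standard-library sketch of which top-level keys the schema lets through. The helper name is_allowed_key and the set of declared field names are made up for illustration; only the pattern comes from the schema above.

```python
import re

# Pattern copied from "patternProperties" in RUN_STOP_EXTRA_SCHEMA.
EXTRA_KEY_PATTERN = re.compile(r"^([^./]+)$")


def is_allowed_key(key: str, declared_fields: set) -> bool:
    """Return True if the schema permits `key` at the top level.

    With "additionalProperties": False, a key is only permitted when it is
    declared under "properties" or matches a "patternProperties" pattern.
    (Hypothetical helper, not part of event-model.)
    """
    return key in declared_fields or EXTRA_KEY_PATTERN.match(key) is not None


# Hypothetical subset of declared run-stop fields.
declared = {"uid", "time", "exit_status"}

for candidate in ["uid", "my_extra", "a.b", "a/b", ""]:
    print(repr(candidate), is_allowed_key(candidate, declared))
```

Keys containing either "." or "/" are rejected, as is the empty string, so a validator that mirrors the schema needs to check for both characters.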

Event Descriptor

In event-descriptor we have the following extra schema:

EVENT_DESCRIPTOR_EXTRA_SCHEMA = {
    "patternProperties": {"^([^./]+)$": {"$ref": "#/$defs/DataType"}},
    "$defs": {
        "DataType": {
            "title": "DataType",
            "patternProperties": {"^([^./]+)$": {"$ref": "#/$defs/DataType"}},
            "additionalProperties": False,
        },
    },
    "additionalProperties": False,
}

Which we can represent in pydantic the same way as above.

Run Start

The run-start additional schema is substantially more complicated:

RUN_START_EXTRA_SCHEMA = {
    "$defs": {
        "DataType": {
            "patternProperties": {"^([^./]+)$": {"$ref": "#/$defs/DataType"}},
            "additionalProperties": False,
        },
        "Projection": {
            "allOf": [
                {
                    "if": {
                        "allOf": [
                            {"properties": {"location": {"enum": ["configuration"]}}},
                            {"properties": {"type": {"enum": ["linked"]}}},
                        ]
                    },
                    "then": {
                        "required": [
                            "type",
                            "location",
                            "config_index",
                            "config_device",
                            "field",
                            "stream",
                        ]
                    },
                },
                {
                    "if": {
                        "allOf": [
                            {"properties": {"location": {"enum": ["event"]}}},
                            {"properties": {"type": {"enum": ["linked"]}}},
                        ]
                    },
                    "then": {"required": ["type", "location", "field", "stream"]},
                },
                {
                    "if": {
                        "allOf": [
                            {"properties": {"location": {"enum": ["event"]}}},
                            {"properties": {"type": {"enum": ["calculated"]}}},
                        ]
                    },
                    "then": {"required": ["type", "field", "stream", "calculation"]},
                },
                {
                    "if": {"properties": {"type": {"enum": ["static"]}}},
                    "then": {"required": ["type", "value"]},
                },
            ],
        },
    },
    "properties": {
        "hints": {
            "additionalProperties": False,
            "patternProperties": {"^([^.]+)$": {"$ref": "#/$defs/DataType"}},
        },
    },
    "patternProperties": {"^([^./]+)$": {"$ref": "#/$defs/DataType"}},
    "additionalProperties": False,
}
  1. The DataType root_validator can be added to the Hints and RunStart as above.

  2. For Projections, the sanest way to adapt what we have currently would be to create a new model for each projection type and then add them as a union in RunStart. This would define a few different Projection types in the outputted schema, though it wouldn't be a breaking change. Alternatively, there's the following method:

from typing import Literal, Optional

from pydantic import BaseModel, root_validator

class Projection(BaseModel):
    type: Literal['linked', 'calculated', 'static']
    location: Optional[Literal['configuration', 'event']] = None
    config_index: Optional[int] = None
    config_device: Optional[str] = None
    field: Optional[str] = None
    stream: Optional[str] = None
    calculation: Optional[str] = None
    value: Optional[str] = None

    @root_validator(pre=True)
    def check_required_fields(cls, values):
        type_ = values.get('type')
        location = values.get('location')

        if type_ == 'linked' and location == 'configuration':
            required_fields = ['type', 'location', 'config_index', 'config_device', 'field', 'stream']
        elif type_ == 'linked' and location == 'event':
            required_fields = ['type', 'location', 'field', 'stream']
        elif type_ == 'calculated' and location == 'event':
            required_fields = ['type', 'field', 'stream', 'calculation']
        elif type_ == 'static':
            required_fields = ['type', 'value']
        else:
            required_fields = []

        for field in required_fields:
            if values.get(field) is None:
                raise ValueError(f'{field} is required for type {type_} and location {location}')

        return values
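
For comparison, the first option (one model per projection type, combined in a union) might look roughly like the sketch below. The class names and the ProjectionsSketch container are hypothetical, and the field types are copied from the single-model version above rather than from event-model itself. It is also slightly stricter than the JSON schema, since each variant requires its discriminating fields to be present.

```python
from typing import Dict, Literal, Union

from pydantic import BaseModel, ValidationError


class LinkedConfigurationProjection(BaseModel):
    type: Literal["linked"]
    location: Literal["configuration"]
    config_index: int
    config_device: str
    field: str
    stream: str


class LinkedEventProjection(BaseModel):
    type: Literal["linked"]
    location: Literal["event"]
    field: str
    stream: str


class CalculatedEventProjection(BaseModel):
    type: Literal["calculated"]
    location: Literal["event"]
    field: str
    stream: str
    calculation: str


class StaticProjection(BaseModel):
    type: Literal["static"]
    value: str


# Each variant carries its own required fields, so no validator is needed.
Projection = Union[
    LinkedConfigurationProjection,
    LinkedEventProjection,
    CalculatedEventProjection,
    StaticProjection,
]


class ProjectionsSketch(BaseModel):
    """Hypothetical container standing in for the real parent document."""

    projection: Dict[str, Projection]


doc = ProjectionsSketch(
    projection={
        "a": {"type": "static", "value": "x"},
        "b": {"type": "linked", "location": "event", "field": "f", "stream": "primary"},
    }
)
print(type(doc.projection["a"]).__name__, type(doc.projection["b"]).__name__)
```

The trade-off is the one noted above: the generated schema gains several Projection definitions in place of one, but the required-field rules live in the type definitions instead of in validator code.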

2: Updating the schema generation

Currently, we generate the jsonschema from the TypedDict definitions with pydantic, and add the EXTRA_SCHEMA dictionaries.

Instead, we'll define the pydantic models, package the schema representation of the root_validators within them and then generate both the jsonschema and the TypedDicts from the pydantic models (statically).
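
For context, the current step of adding the EXTRA_SCHEMA dictionaries can be pictured as a recursive dictionary merge. This is only a sketch under that assumption: merge_extra_schema and the cut-down generated schema are hypothetical and are not event-model's actual code.

```python
import copy

RUN_STOP_EXTRA_SCHEMA = {
    "patternProperties": {"^([^./]+)$": {"$ref": "#/$defs/DataType"}},
    "additionalProperties": False,
}


def merge_extra_schema(generated: dict, extra: dict) -> dict:
    """Return a copy of `generated` with `extra` merged in.

    Nested dictionaries (for example "$defs" or "properties") are merged
    key by key; any other value in `extra` overrides the generated one.
    """
    merged = copy.deepcopy(generated)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_extra_schema(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical, heavily cut-down stand-in for a generated run-stop schema.
generated = {
    "title": "RunStop",
    "type": "object",
    "properties": {"uid": {"type": "string"}},
    "$defs": {"DataType": {"title": "DataType"}},
}

schema = merge_extra_schema(generated, RUN_STOP_EXTRA_SCHEMA)
print(sorted(schema))
```

Under the proposal, this merge step would go away because the extra constraints would be carried by the pydantic models themselves.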

3: Optional fields

Pydantic doesn't allow fields to be NotRequired: a field which is NotRequired in the TypedDict would have to default to None in the pydantic model. For this reason, we will forbid Optional from having a different meaning to NotRequired.

Fields which are Optional with default None in the BaseModel will be NotRequired[Optional[...]] in the TypedDict.
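
On the TypedDict side, that mapping could look like the following sketch. RunStopDict, its fields and get_reason are hypothetical names used only to show the shape; the point is that a missing key and an explicit None are treated as the same thing.

```python
from typing import Optional

try:
    from typing import NotRequired, TypedDict
except ImportError:  # Python < 3.11
    from typing_extensions import NotRequired, TypedDict


class RunStopDict(TypedDict):
    """Hypothetical TypedDict generated from a BaseModel."""

    uid: str
    # `reason: Optional[str] = None` in the BaseModel becomes:
    reason: NotRequired[Optional[str]]


def get_reason(doc: RunStopDict) -> Optional[str]:
    # An absent key and an explicit None mean the same thing.
    return doc.get("reason")


absent: RunStopDict = {"uid": "abc"}
explicit_none: RunStopDict = {"uid": "abc", "reason": None}

print(RunStopDict.__required_keys__, RunStopDict.__optional_keys__)
print(get_reason(absent) == get_reason(explicit_none))
```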

evalott100 self-assigned this Nov 21, 2024
@evalott100

@danielballan @coretl

@jacopoabramo

I'm just stumbling by pure chance on this issue and I just wanted to mention if you would like to also consider using msgspec instead of/together with pydantic.

I'm mentioning it mostly for performance reasons: msgspec benchmarks strongly against pydantic (in terms of both speed and library size). I imagine that documents are something that should be produced and consumed as quickly as possible, so I'm just throwing out this extra possibility in case it's worth considering.

@evalott100

> if you would like to also consider using msgspec instead of/together with pydantic.

Thanks very much for the suggestion! The converter I'm using in the draft also supports jsonschema -> msgspec.Struct so if we wanted to implement this then it would be a fairly trivial change.

@jacopoabramo

@evalott100 once the mentioned PR is complete I can probably take a crack at it - I don't want to mix them up. Are you using this tool by any chance?

@evalott100

@jacopoabramo

Yup, it would just mean making a new

def generate_typeddict(jsonschema_path: Path, documents_path=DOCUMENTS):
    output_path = documents_path / f"{jsonschema_path.stem}.py"
    datamodel_code_generator.generate(
        input_=jsonschema_path,
        input_file_type=datamodel_code_generator.InputFileType.JsonSchema,
        output=output_path,
        output_model_type=datamodel_code_generator.DataModelType.TypingTypedDict,
        use_schema_description=True,
        use_field_description=True,
        use_annotated=True,
        field_constraints=True,
        wrap_string_literal=True,
    )
    with output_path.open("r+") as f:
        content = f.read()
        f.seek(0, 0)
        f.write("# ruff: noqa\n" + content)

swapping the output file type and directory.

evalott100 linked a pull request Jan 10, 2025 that will close this issue