Commit

build out doc, add fields_of/props_of
majensen committed Jan 8, 2025
1 parent acbfd3e commit b67d976
Showing 2 changed files with 175 additions and 18 deletions.
182 changes: 165 additions & 17 deletions python/docs/data_val.md
@@ -2,7 +2,9 @@

The MDF [PropDefinitions](#property-definitions) section describes properties (slots or variables), along with the data types that constitute valid values for those properties. Using this information, one can validate data that are meant to comply with these conditions.

The `MDFDataValidator` class uses the [Pydantic](https://docs.pydantic.dev/latest/) data validation library to interpret MDF nodes and properties as Python classes having attributes whose values are automatically validated. This provides several options for performing data validation against an MDF model. Data simply needs to be expressed as a Python dict or as JSON. Suppose you have defined a node `sample` in MDF, with properties `sample_type` and `amount`:
The `MDFDataValidator` class uses the [Pydantic](https://docs.pydantic.dev/latest/) data validation library to interpret MDF nodes and properties as Python classes which have attributes whose values are automatically validated. This provides several options for performing data validation against an MDF model. Data to be validated simply needs to be expressed as a Python dict or as JSON.

Example: Suppose you have defined a node called `sample` in MDF, with properties `sample_type` and `amount`:

```yaml
# sample-model.yml
@@ -24,37 +26,183 @@ PropDefinitions:
value_type: number
```
Then you can validate a list of dicts of `sample` data:
You can then validate a list of dicts of `sample` data using `validate()` as follows:

```python
from bento_mdf import MDFReader, MDFDataValidator
mdf = MDFReader("sample-model.yml")
val = MDFDataValidator(mdf)
result = val.validate('Sample',
[{"sample_type": "normal", "amount": 0.50},
{"sample_type": "tumor", "amount": 1.0},
{"sample_type": "wrong", "amount": "fred"}])
assert result is False # at least one record was invalid
assert val.last_validation_errors[2] # the last record has error info
mdf = MDFReader("sample-model.yml")
val = MDFDataValidator(mdf)
result = val.validate('Sample',
[{"sample_type": "normal", "amount": 0.50},
{"sample_type": "tumor", "amount": 1.0},
{"sample_type": "wrong", "amount": "fred"}])
assert result is False # at least one record was invalid
assert val.last_validation_errors[2] # the last record has error info
```

## Available validation classes and data fields

The first argument of `validate` is a string, the class name, that represents a particular model node. Class names are created by CamelCasing the Node handles that appear in the MDF. Properties for Nodes become data fields within the node validation class. These are snake_case strings given by the MDF property handles.

Available node class names are found in the MDFDataValidator `node_classes` attribute. Available field (property) names for a node class can be retrieved with the `fields_of()` or `props_of()` method.

For example, using [test-model.yml](/python/tests/samples/test-model.yml):

```python
mdf = MDFReader("tests/samples/test-model.yml")
val = MDFDataValidator(mdf)
print( val.node_classes )
# ['Case', 'Diagnosis', 'File', 'Sample']
print( val.fields_of('Sample') )
# ['sample_type', 'amount']
```
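
The naming convention described above can be illustrated with a small helper. This is only a sketch of one plausible CamelCase rule (split on underscores, capitalize each part). `to_class_name` is a hypothetical name, and the rule `bento_mdf` actually applies to multi-word handles may differ.

```python
def to_class_name(handle: str) -> str:
    """Hypothetical helper: turn a snake_case node handle into a CamelCase name."""
    # Split on underscores, drop empty parts, and upper-case the first letter of each part.
    parts = [p for p in handle.split("_") if p]
    return "".join(p[0].upper() + p[1:] for p in parts)


print(to_class_name("sample"))     # Sample
print(to_class_name("study_arm"))  # StudyArm
```

When in doubt, the names listed in `node_classes` are the authoritative ones; a helper like this is useful only for guessing a name before checking it against that list.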

The second argument to `validate()` is the data to be validated against the given class. It is a dict or a list of dicts, whose keys are names of properties defined in the MDF for the given node, and whose values are actual data values to be validated. If all data records in the list are valid, `validate()` returns True; otherwise, it returns False.

```python
if val.validate('Sample', {'sample_type': 'normal', 'amount': 1.0}):
    print("Valid!")
else:
    print("Invalid.")
```
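
The return convention described above (one boolean for the whole batch, with a single dict treated like a one-element list) can be sketched without the library. Everything here is a stand-in: `check_record` is a hypothetical rule rather than an MDF-derived validator, and `validate_records` is not part of `bento_mdf`.

```python
from typing import Callable, Dict, List, Tuple, Union

Record = Dict[str, object]


def check_record(record: Record) -> List[str]:
    """Hypothetical rule: 'amount' must be a number (bools are rejected)."""
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return [f"amount: expected a number, got {amount!r}"]
    return []


def validate_records(
    data: Union[Record, List[Record]],
    check: Callable[[Record], List[str]] = check_record,
) -> Tuple[bool, Dict[int, List[str]]]:
    """Return (all_valid, errors keyed by the index of each failing record)."""
    # A single dict is handled as a list containing one record.
    records = [data] if isinstance(data, dict) else list(data)
    errors: Dict[int, List[str]] = {}
    for i, rec in enumerate(records):
        msgs = check(rec)
        if msgs:
            errors[i] = msgs
    return (not errors, errors)


ok, errs = validate_records([{"amount": 0.5}, {"amount": "fred"}])
print(ok)            # False
print(sorted(errs))  # [1]
```

Keying the errors by record index is the same shape the document describes for `last_validation_errors`, which makes it easy to map a failure back to the offending input row.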

## The "Model Class"

An additional validation class is created that aggregates all Node classes. This can be used to validate a dict containing a data record for all model Nodes. The model class is named by appending 'Data' to the model handle. For example, [test-model.yml](/python/tests/samples/test-model.yml) has handle `test` and its model class is named `testData`. An example validation:

```python
data = {
"case": {"case_id": "CASE-22"},
"diagnosis": {
"disease": "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C102872",
"date_of_dx": "1965-05-04T00:00:00",
},
"file": {"file_size": 150342, "md5sum": "9d4cf66a8472f2f97c4594758a06fbd0"},
"sample": {"amount": 4.0, "sample_type": "normal"},
}
assert val.validate('testData', data)
```

Note that the Node keys for data are in lower case.

## Inspecting validation errors

If `validate()` returns False, the attribute `last_validation_errors` will contain a dict of error lists emitted by Pydantic. The keys of the dict are the indexes in the data list of the records that errored; the values are a list of Pydantic [ValidationError](https://docs.pydantic.dev/latest/api/pydantic_core/#pydantic_core.ValidationError) objects detailing the nature of the errors.

```python
import json

data = [
    {"md5sum": "9d4cf66a8472f2f97c4594758a06fbd0",
     "file_name": "grelf.txt",
     "file_size": 50},
    {"md5sum": "9d4cf66a8472f2f97c4594758a06Fbd0",
     "file_name": "grolf.txt",
     "file_size": 50.0},
    {"md5sum": "9d4cf66a8472f2f97c4594758a06Fbd0",
     "file_name": "grilf.txt",
     "file_size": "big"}
]
val.validate('File', data)
print(json.dumps(val.last_validation_errors, indent=4))
```

```json
{
"1": [
{
"type": "predicate_failed",
"loc": [
"md5sum"
],
"msg": "Predicate Pattern.fullmatch failed",
"input": "9d4cf66a8472f2f97c4594758a06Fbd0"
}
],
"2": [
{
"type": "predicate_failed",
"loc": [
"md5sum"
],
"msg": "Predicate Pattern.fullmatch failed",
"input": "9d4cf66a8472f2f97c4594758a06Fbd0"
},
{
"type": "int_parsing",
"loc": [
"file_size",
"int"
],
"msg": "Input should be a valid integer, unable to parse string as an integer",
"input": "big",
"url": "https://errors.pydantic.dev/2.10/v/int_parsing"
}
]
}
```
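
For reporting, the error structure shown above can be flattened into one line per error. The sketch below assumes each entry is a plain dict with `loc` and `msg` keys, as in the JSON output; `summarize_errors` is a hypothetical helper, not part of `bento_mdf`. If the entries are `ValidationError` objects instead, they would first need converting to dicts (for example with the object's `errors()` method).

```python
from typing import Dict, List


def summarize_errors(errors: Dict[object, List[dict]]) -> List[str]:
    """Flatten {record index: [error dict, ...]} into one readable line per error."""
    lines = []
    # Keys may be ints (live object) or strings (after a JSON round trip).
    for index in sorted(errors, key=int):
        for err in errors[index]:
            # "loc" is a list of path components; join them into a dotted path.
            field = ".".join(str(part) for part in err.get("loc", []))
            lines.append(f"record {index}: {field}: {err.get('msg', '')}")
    return lines


# Abbreviated copy of the output shown above.
errors = {
    "1": [{"type": "predicate_failed", "loc": ["md5sum"],
           "msg": "Predicate Pattern.fullmatch failed"}],
    "2": [{"type": "int_parsing", "loc": ["file_size", "int"],
           "msg": "Input should be a valid integer, unable to parse string as an integer"}],
}
for line in summarize_errors(errors):
    print(line)
```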

## Generated Validation Classes

`MDFDataValidator` generates a Python module containing Pydantic classes (known as "models"). The module code is contained in `v.data_model`; it can be printed to a file and used independently. The validator object creates it and imports it dynamically; there is no need to deal directly with it in the simplest case of data validation (above).
`MDFDataValidator` generates a Python module containing Pydantic classes (generally known as "[models](https://docs.pydantic.dev/latest/concepts/models/)"). The module code is contained in `val.data_model`; it can be printed to a file and used as an independent package in other applications.

The Pydantic classes themselves, however, can be accessed using `model_of()`:
The validator object creates this code using a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template and imports it back dynamically with [`importlib`](https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly).

There is no need to deal directly with this machinery in the simplest case of data validation (above). However, you can take advantage of Pydantic features available to these classes by accessing them using `val.model_of()`.

```python
# instantiate a validated object:
sample1 = val.model_of('Sample')(sample_type="normal", amount="1.0")
# get more detail on field types and validations (Pydantic BaseModel attributes);
# model_fields is a class attribute (a dict keyed by field name), not a method
pydantic_fields = val.model_of('Sample').model_fields
```
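
The generate-then-import step described above can be sketched with the standard library alone. This is not the validator's actual code: `string.Template` stands in for the Jinja template to keep the example dependency-free, and `MODULE_TEMPLATE`, `build_and_import`, and the generated class body are all hypothetical. The import itself follows the `importlib` recipe for importing a source file directly.

```python
import importlib.util
import os
import tempfile
from string import Template

# Stand-in for a Jinja template (assumption: the real template is far richer).
MODULE_TEMPLATE = Template(
    "class $cls_name:\n"
    "    fields = $fields\n"
)


def build_and_import(cls_name, fields, module_name="generated_model"):
    """Render module source, write it to a temp file, and import it from that path."""
    source = MODULE_TEMPLATE.substitute(cls_name=cls_name, fields=repr(fields))
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, module_name + ".py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(source)
        # Build a module spec from the file path, then execute the module code.
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    return module, source


module, source = build_and_import("Sample", ["sample_type", "amount"])
print(module.Sample.fields)  # ['sample_type', 'amount']
```

Returning the rendered source alongside the module mirrors the idea that the generated code can be written to a file and reused elsewhere.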

The class names are generally camelCase versions of MDF Nodes, and their attributes are Property handles. The class names are available on the MDFDataValidator object:
## JSON Schema Representations

Pydantic has extensive JSON Schema generation facilities. For any validation class, a JSON Schema representation can be created that may be used for data validation across many programming environments and languages, including Python and JavaScript. For example, data validation schemas can be stored alongside MDF models in their repos, and general tools using JSON Schema can be developed to enable external submitters to validate their data prior to submission.

JSON Schema for any available validation class can be generated with the `json_schema()` method:

```python
pymodel = v.model_of( v.model_class )
pynodes = {cls : v.model_of(cls) for cls in v.node_classes}
pyenums = {cls : v.model_of(cls) for cls in v.enum_classes}
import json
print(json.dumps(val.json_schema('Diagnosis'), indent=4))
```
```json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"properties": {
"disease": {
"anyOf": [
{
"format": "uri",
"minLength": 1,
"type": "string"
},
{
"type": "null"
}
],
"title": "Disease"
},
"date_of_dx": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"title": "Date Of Dx"
}
},
"required": [
"disease",
"date_of_dx"
],
"title": "Diagnosis",
"type": "object"
}
```
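
To give a sense of how such a schema can be consumed outside of Pydantic, the sketch below checks a record against the `required` list only. It is deliberately not a JSON Schema validator: formats, types, and `anyOf` are ignored, and a real tool would use a full JSON Schema implementation. `quick_check` is a hypothetical helper and `SCHEMA` is an abbreviated copy of the schema above.

```python
import json

# Abbreviated copy of the Diagnosis schema shown above (property details omitted).
SCHEMA = json.loads("""
{
  "properties": {"disease": {}, "date_of_dx": {}},
  "required": ["disease", "date_of_dx"],
  "title": "Diagnosis",
  "type": "object"
}
""")


def quick_check(record: dict, schema: dict) -> list:
    """Report required properties that are missing from the record."""
    return [
        f"missing required property '{key}'"
        for key in schema.get("required", [])
        if key not in record
    ]


print(quick_check({"disease": "http://example.org/C1"}, SCHEMA))
# ["missing required property 'date_of_dx'"]
```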

## JSON Schema Representations


11 changes: 10 additions & 1 deletion python/src/bento_mdf/mdf/validator.py
@@ -149,7 +149,16 @@ def model_of(self, clsname : str):
        if clsname != self.model_class and clsname not in self.node_classes and clsname not in self.enum_classes:
            raise RuntimeError(f"Validation model does not contain class '{clsname}'")
        return eval("self.module.{}".format(clsname))

    @cache
    def fields_of(self, clsname : str) -> List[str]:
        if clsname != self.model_class and clsname not in self.node_classes:
            raise RuntimeError(f"Validation model does not contain node class '{clsname}'")
        return [x for x in self.model_of(clsname).model_fields]

    def props_of(self, clsname : str) -> List[str]:
        return self.fields_of(clsname)

    @cache
    def validator(self, clsname : str):
        """