[discussion]: Brainstorming for the different check functions to create #834

Open · lwjohnst86 opened this issue Nov 2, 2024 · 7 comments · May be fixed by #874
lwjohnst86 (Member) commented Nov 2, 2024

What would you like to discuss?

As per #826, since we won't rely on frictionless-py's validate() function, we'll need to set up our own. As with good design, keeping the functions small and targeted is a good aim, so here are some ideas for checks to have. Add more if you think of any.

This list includes both verifying and validating checks, but that doesn't mean we will implement them all right away.

Against data package itself

  • check that (our and Frictionless) required fields are filled in
  • check that fields are the correct type (e.g. str vs int)

Against data resources

  • check that (our and Frictionless) required fields are filled in
  • check that the number of columns is equal between the data file and the properties (a sketch of this check follows the list)
  • check that the number of rows is the same (only between the Parquet file and the properties, since raw files will not always match)
  • check that the data types of columns are identical between the data and the properties
  • check that the data in a column match the constraints, if the field is filled in
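A minimal sketch of what one such targeted check could look like, assuming polars for reading the Parquet file and a resource properties dict with a Table Schema-style schema.fields list (the function name and layout are illustrative, not the actual Sprout API):

from pathlib import Path

import polars as pl


def check_column_count_matches(data_path: Path, properties: dict) -> None:
    """Check that the data file and the resource properties agree on the
    number of columns. Raises ValueError if they don't (illustrative)."""
    # read_parquet_schema reads only the file's schema, not the data.
    data_columns = len(pl.read_parquet_schema(data_path))
    property_columns = len(properties.get("schema", {}).get("fields", []))
    if data_columns != property_columns:
        raise ValueError(
            f"Data file has {data_columns} columns, "
            f"but the properties describe {property_columns}."
        )
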
signekb (Member) commented Nov 4, 2024

  • check that column names are the same in data and properties
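Under the same assumptions as the sketch above (polars, a Table Schema-style properties dict; names are illustrative), the column-name check could compare the two sets of names and report the differences:

import polars as pl


def check_column_names_match(data_path, properties: dict) -> None:
    """Check that the data file and the properties declare the same
    column names (illustrative sketch)."""
    data_names = set(pl.read_parquet_schema(data_path))
    property_names = {
        field["name"] for field in properties.get("schema", {}).get("fields", [])
    }
    if data_names != property_names:
        raise ValueError(
            f"Columns only in data: {data_names - property_names}; "
            f"columns only in properties: {property_names - data_names}"
        )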

signekb (Member) commented Nov 13, 2024

  • if foreign or primary keys are given, check that the referenced column(s) exist (a sketch for the primary key case follows below)
  • check name uniqueness (i.e., that another package with the same name doesn't exist, or that another resource in the same package doesn't have the same name)
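The key-existence check could be a small pure-Python function. This sketch covers only the primary key and assumes a Table Schema-style properties dict where primaryKey may be given as a string or a list of strings (illustrative only, not the actual Sprout API):

def check_primary_key_fields_exist(properties: dict) -> None:
    """Check that every column named in primaryKey is a declared field
    (illustrative sketch)."""
    schema = properties.get("schema", {})
    primary_key = schema.get("primaryKey", [])
    # The standard has allowed both a single string and a list of strings.
    if isinstance(primary_key, str):
        primary_key = [primary_key]
    field_names = {field["name"] for field in schema.get("fields", [])}
    missing = [name for name in primary_key if name not in field_names]
    if missing:
        raise ValueError(f"Primary key references unknown fields: {missing}")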

martonvago (Contributor) commented

Against *Properties objects:
These can all be enforced by jsonschema (a toy example follows this list).

  • Check that any format or pattern requirements encoded in the schema are met (e.g. for email or path fields).
  • Check that any requirements specific to array types are met (e.g. minItems, uniqueItems).
  • Check that any enum constraints are met (e.g. for fieldsMatch).
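For illustration, a toy schema exercising all three kinds of constraint. This is not the actual Data Package schema, and the property names and enum values are only loosely based on the standard. Note that some of jsonschema's format checkers (e.g. for uri) need optional dependencies installed; without them, that format is silently skipped:

from jsonschema import Draft7Validator, FormatChecker

toy_schema = {
    "type": "object",
    "properties": {
        "homepage": {"type": "string", "format": "uri"},
        "keywords": {"type": "array", "minItems": 1, "uniqueItems": True},
        "fieldsMatch": {"enum": ["exact", "subset", "superset"]},
    },
}

validator = Draft7Validator(toy_schema, format_checker=FormatChecker())
# iter_errors yields one ValidationError per violated keyword.
for error in validator.iter_errors(
    {"homepage": "not a uri", "keywords": [], "fieldsMatch": "something-else"}
):
    print(f"{error.json_path}: {error.message}")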

Against data:

  • Support checking data items against any format specified in format.
  • Support checking data items against any given categories (sketched after this list).
  • Add support for recognising when a value is marked as missing.
  • Add support for checking unique keys.
  • Add support for handling the fields_match field.
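As a rough sketch of the categories and missing-values checks, in plain Python over a list of values (the real version would likely operate on a data frame column; all names here are illustrative):

def check_values_in_categories(
    values: list, categories: list, missing_values: tuple = ("",)
) -> list:
    """Return the values that are neither an allowed category nor marked
    as missing (mirrors the Table Schema categories/missingValues ideas)."""
    allowed = set(categories) | set(missing_values)
    return [value for value in values if value not in allowed]


# "n/a" is a declared missing-value marker, so only "maybe" is flagged.
print(check_values_in_categories(
    ["yes", "no", "n/a", "maybe"], ["yes", "no"], missing_values=("", "n/a")
))  # ['maybe']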

martonvago (Contributor) commented

Miscellaneous questions:

  • Do we want to support recommendations? Which ones? Maybe initially no, to make things simpler.
  • When in the metadata creation and update flows do we want to check the metadata? So far, we have been checking metadata at various points in these flows: when package properties are loaded from datapackage.json, when new values are supplied by the user, when new values are merged with old values, etc. This resulted in some pretty complex logic because, on the one hand, the datapackage.json is not yet technically correct/complete while it’s being built and, on the other hand, user-supplied values can represent partial metadata objects (if e.g. only one field is being updated), which by themselves are not correct/complete. Plus, the standard allows for custom properties, so even restricting the range of allowed fields on metadata wouldn’t be a check that is always appropriate.
    Maybe there is an argument for only running checks on metadata once during these flows, right before saving metadata to the datapackage.json? At this point we would usually be guaranteed to have a complete and correct set of properties. One exception is saving the package properties for the first time after creating the package (before creating a resource), but that is a true exception and could be handled as such.
  • How do we want to handle blank values? For fields that have extra constraints on them in the Data Package standard, such as string fields with format requirements or lists that must have at least 1 element, a blank value will fail the check against the schema. So far, with frictionless-py, these were accepted, so we included them in our default properties. As jsonschema will check exactly against the schema, these will not be accepted by default. Simply not including these in the default properties would be very clean and easy, but then they won’t be there in a newly created datapackage.json to “guide” the user. Is that okay?
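To make the blank-value point concrete, here is a minimal example (not the actual Data Package schema) showing jsonschema rejecting blank defaults of the kind frictionless-py used to accept:

from jsonschema import Draft7Validator, FormatChecker

schema = {
    "type": "object",
    "properties": {
        # jsonschema's built-in email check only requires an "@".
        "email": {"type": "string", "format": "email"},
        "contributors": {"type": "array", "minItems": 1},
    },
}

validator = Draft7Validator(schema, format_checker=FormatChecker())
# Both blank defaults produce an error, e.g. "'' is not a 'email'".
for error in validator.iter_errors({"email": "", "contributors": []}):
    print(error.message)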

lwjohnst86 (Member, Author) commented

  • Not sure about this first point. What do you mean "support" them?
  • I think a lot of these questions could be solved by having very targeted checks, e.g. not running all the checks whenever we check the properties. I also think the second set of questions will be easier to answer after the split into core and lib (or whatever name we use) described in docs: 📝 describe split between lib and core #872
  • Hmm, it's really hard to visualize how this would actually play out in practice, also given how little I know of jsonschema. But it's good to have these questions in mind as we go along.

lwjohnst86 linked a pull request Nov 13, 2024 that will close this issue

martonvago (Contributor) commented

  • By "support" I just mean "implement". E.g. when the Data Package standard says "name SHOULD be human-readable and consist only of lowercase English alphanumeric characters plus ., - and _.", then there is the question of whether we want to add a check for this or not.
  • Hmm, I kinda thought that the advantage of using jsonschema was that it would just run all the checks required by the standard with very little customisation from us. But I think it would be fairly easy to make it run only a subset of these checks (see the sketch after this list).
  • Yeah, I just put them here because these were things that we made decisions about in the past based on how frictionless-py worked, so we may need/want to revisit them now.
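A sketch of how running a subset of checks could work: iter_errors() yields one error per violated schema keyword, so the result can be filtered down to the keywords we care about (the wanted_checks parameter is illustrative, not an existing API):

from jsonschema import Draft7Validator, FormatChecker


def check_properties(properties: dict, schema: dict, wanted_checks: set) -> list:
    """Run all schema checks but keep only the errors raised by the given
    jsonschema keywords (e.g. {"required", "format"}). Illustrative sketch."""
    validator = Draft7Validator(schema, format_checker=FormatChecker())
    return [
        error
        for error in validator.iter_errors(properties)
        if error.validator in wanted_checks
    ]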

martonvago (Contributor) commented

So JSON Schema is a specification for checking the structure and content of JSON objects: it defines the expected properties, data types, constraints (like minimum values or string patterns), and formats.
It’s a bit like Pydantic or TypedDict for Python objects, or TypeScript types for JS objects, but for JSON.
jsonschema is a Python implementation of the specification (i.e. it actually runs the checks defined in a schema against a given object).

Here’s a draft of how it would look:

from pathlib import Path

from jsonschema import Draft7Validator, FormatChecker, ValidationError

from seedcase_sprout.core.read_json import read_json

resource_properties = {
    "name": "my-resource",
    "title": "My Resource",
    "path": "data.csv",
}

package_properties = {
    "name": "my-package",
    "title": "My Package",
    "resources": [resource_properties],
    "created": "2024-05-14T05:00:01+00:00",
}

# Load the Data Package schema and build a validator for it. The
# FormatChecker enables checks on `format` fields (e.g. date-time).
dp_schema = read_json(Path("schema.json"))
validator = Draft7Validator(dp_schema, format_checker=FormatChecker())

try:
    validator.validate(package_properties)
except ValidationError as e:
    # `json_path` points at the offending property; `validator` names the
    # schema keyword (e.g. format, required) that failed.
    print(
        f"Error at `{e.json_path}` caused by validator `{e.validator}`: {e.message}.",
        f"\nFurther context: {e.context}.",
    )

The error object is pretty big, but here is what a bad date would print:

Error at `$.created` caused by validator `format`: 'bad date' is not a 'date-time'. 
Further context: [].

(Draft7Validator implements draft 7 of the JSON Schema specification; versions of the spec are called drafts.)
