Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make validation more explicit #125

Closed
loleg opened this issue Mar 20, 2023 · 6 comments
Closed

Make validation more explicit #125

loleg opened this issue Mar 20, 2023 · 6 comments
Milestone

Comments

@loleg
Copy link

loleg commented Mar 20, 2023

None of the documentation mentions when validation happens, though it's a key value add of this package. I would also suggest adding a native call to show the issues from the validator. At least a code snippet like this could be a useful example to tell people how to use readr to check their Data Package:

library(frictionless) # https://docs.ropensci.org/frictionless/ 
library(readr) # https://readr.tidyverse.org/reference/problems.html

# Read a Data Package
my_package <- read_package("datapackage.json")

# Read and validate the resource
my_resource = read_resource(my_package, "my-package-csv")

# Use tidyverse/readr to explain the problem with the data
problems(my_resource)
@peterdesmet
Copy link
Member

Thanks for the suggestion. Can you clarify what you mean with validation, since this R package does not offer full validation like frictionless-py.

That said, it is possible to get warnings when reading a resource that doesn’t match the provided table schema. Note that:

Column names are taken from the provided Table Schema (schema), not from the header in the CSV file(s).

For example, a schema might have 3 fields defined, while the data has only 2 columns. This mismatch can lead to readr (and therefore this package) to return a warning. Is this the validation you refer to?

@loleg
Copy link
Author

loleg commented Mar 20, 2023

Yes, exactly. Thanks. It would be good to also point out the difference, and any reasons (if it's by design) why a validate function is not available.

@peterdesmet
Copy link
Member

I assume that with “point out the difference”, you mean the difference between this R package and frictionless-py.

  1. You are right that it (“this package does not offer validation”) could be clarified in the package description.
  2. The reason I didn’t implement a validate function is because it is daunting. 😊 It covers a lot of aspects and requires skills I do not have to make it performant. But if someone is willing, it could be implemented with frictionless-py under the hood.

Regarding clarifying the cause of sometimes obscure warnings returned by readr:

  1. We could show users an additional helpful message. E.g. tell them it might be caused by a mismatch between schema and data, and that one can inspect with problems().
  2. Actually inspecting if schema and data align before reading would require more work, because then we have to read the data twice: once to check alignment between header and schema and once actually reading the data (ignoring the header).

I’m in favour of implementing 1 and 3, would that be sufficient?

Ping @PietrH

@peterdesmet peterdesmet added this to the 1.1.0 milestone Mar 21, 2023
@peterdesmet
Copy link
Member

@loleg I have discussed this in person with @PietrH. We decided the following:

  1. Clarify in package description that validation is not included (Indicate in package description that it does not support validation #128)

  2. We won't offer a validate function

  3. We will leave the default message that readr returns on parsing issues; without clarifying it further, i.e.

    One or more parsing issues, call problems() on your data frame for details

    We will include problems() in the namespace, so that the user can use it without loading readr (Include problems() as part of namespace #129)

  4. We will explicitly compare schema and header and return an error on mismatches (Compare header and schema #127). That way parsing issues will be restricted to data type incompatibilities, not schema/header mismatches.

@loleg
Copy link
Author

loleg commented May 3, 2023

Sorry for the slow reply. That looks to me like a more than reasonable proposal! Anything I could help with?

@peterdesmet
Copy link
Member

peterdesmet commented May 4, 2023

Thanks, I will close this overarching issue. The remaining subtasks are part of https://github.com/frictionlessdata/frictionless-r/milestone/4 and will (with the already implemented changes) be included in version 1.1.0.

@PietrH and I will probably get to that during the summer, but you are welcome to tackle #127 if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants