Frictionless silently adds columns not defined in the schema (add `schema_sync`) #150

peterdesmet · 2023-09-13T15:16:41Z

example.zip contains data.csv with 3 columns. Only the first two are defined in datapackage.json.

- `name` (string)
- `identifier` (integer)
- `include` (boolean)

Frictionless will silently include the extra column `include` and name it `X3`:

> p <- read_package("example/datapackage.json")
Please make sure you have the right to access data from this Data Package for your intended use.
Follow applicable norms or requirements to credit the dataset and its authors.
> d <- read_resource(p, "data")
> d
# A tibble: 4 × 3
  name     identifier X3   
  <chr>         <dbl> <lgl>
1 oconnell          1 TRUE 
2 rovero            2 TRUE 
3 cadman            3 FALSE
4 burton            4 FALSE

However, if the extra column is in the middle (`name`, `include`, `identifier`, see example2.zip), frictionless will throw a parsing issue for the second column, which it reads as `identifier`, and still add the last column as `X3`:

> p <- read_package("example/datapackage.json")
Please make sure you have the right to access data from this Data Package for your intended use.
Follow applicable norms or requirements to credit the dataset and its authors.
> d <- read_resource(p, "data")
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> d
# A tibble: 4 × 3
  name     identifier    X3
  <chr>         <dbl> <dbl>
1 oconnell         NA     1
2 rovero           NA     2
3 cadman           NA     3
4 burton           NA     4

Using `col_select` does not circumvent this issue:

> d <- read_resource(p, "data", col_select = c("name", "identifier"))
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
> d
# A tibble: 4 × 2
  name     identifier
  <chr>         <dbl>
1 oconnell         NA
2 rovero           NA
3 cadman           NA
4 burton           NA
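
Until such a check exists, one possible workaround is to compare the CSV header with the schema's field names before calling `read_resource()`. The sketch below uses base R only. The file contents are a reconstruction of example2 (assumed from the printed output above), and `schema_fields` is a hard-coded stand-in for the field names defined in datapackage.json.

```r
# Recreate a file shaped like example2: `include` sits between the two
# schema-defined columns (contents assumed from the printed output above).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "name,include,identifier",
    "oconnell,TRUE,1",
    "rovero,TRUE,2",
    "cadman,FALSE,3",
    "burton,FALSE,4"
  ),
  csv_path
)

# Field names as defined in the schema (hard-coded here for illustration).
schema_fields <- c("name", "identifier")

# Read only the header row of the CSV.
header <- scan(csv_path, what = character(), sep = ",", nlines = 1, quiet = TRUE)

# Columns present in the data but missing from the schema.
undefined <- setdiff(header, schema_fields)
undefined
# "include"

# TRUE only if the header matches the schema exactly, in names and order.
header_matches <- identical(header, schema_fields)
```

If `undefined` is not empty or `header_matches` is `FALSE`, the column types from the schema will be applied to the wrong columns, as in the output above.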

I think this behaviour should be documented better. Ideally, we would have a `schema_sync` parameter in `read_resource()` that compares the column headers with the schema (#127) and behaves as follows:

If false (default):

- Warn for name mismatch
- Error for order mismatch
- Error for extra columns (different from current behavior)

If true:

- Return all columns, in the order of the header in the data (cf. Frictionless Framework)
- Apply type, enum, etc. if a matching column is found in the schema
- Guess type if a column is not defined in the schema

`schema_sync = true` then allows a looser reading of data, something that would be beneficial to e.g. the bioRad package.
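
The proposed rules could be sketched roughly as follows. This is only an illustration of the header comparison, not how frictionless implements or would implement it: `check_header()` is a hypothetical helper, it works on plain character vectors rather than a Data Package, and it uses R's `TRUE`/`FALSE` where the proposal says true/false. A header with fewer columns than the schema is not covered by the proposal; here it falls through to the name-mismatch warning.

```r
# Hypothetical helper (not part of frictionless): apply the proposed rules to
# a CSV header and the field names from the schema.
check_header <- function(header, fields, schema_sync = FALSE) {
  if (schema_sync) {
    # Lenient mode: keep every column in header order and record whether the
    # schema defines it (the type would be guessed where it does not).
    return(data.frame(
      column = header,
      in_schema = header %in% fields,
      stringsAsFactors = FALSE
    ))
  }

  # Strict mode, rule 1: error for extra columns.
  if (length(header) > length(fields)) {
    stop(
      "Column(s) not defined in the schema: ",
      paste(setdiff(header, fields), collapse = ", "),
      call. = FALSE
    )
  }

  # Strict mode, rule 2: error when the same names appear in another order.
  if (setequal(header, fields) && !identical(header, fields)) {
    stop("Columns are in a different order than in the schema.", call. = FALSE)
  }

  # Strict mode, rule 3: warn when names differ from the schema.
  if (!identical(header, fields)) {
    warning("Column names do not match the schema.", call. = FALSE)
  }

  invisible(fields)
}

# Lenient mode on the first example: all three columns are kept.
check_header(
  c("name", "identifier", "include"),
  c("name", "identifier"),
  schema_sync = TRUE
)
```

With the default `schema_sync = FALSE`, both example files would stop with an error instead of returning a misaligned tibble.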


peterdesmet added the enhancement and function:read_resource labels Sep 13, 2023

peterdesmet added this to the 1.2.0 milestone Mar 27, 2024

ElsLommelen mentioned this issue Apr 12, 2024

Clarify overwrite behaviour in documentation of write_package #144

Open

peterdesmet removed this from the 1.2.0 milestone Jul 8, 2024
