Better messaging when arrow::open_dataset throws error within connect_hub #7

Open
annakrystalli opened this issue Jun 13, 2023 · 1 comment

Comments

@annakrystalli
Member

Errors produced during the arrow::open_dataset() call all result in the same wildly uninformative message, whatever the underlying problem. Causes range from columns in the supplied schema not matching columns in the data (e.g. trying to open data that still had type and type_id columns after we changed to output_type and output_type_id) to mis-specification of a field's data type (e.g. trying to cast a double or character column to integer).

For example, here I try to cast character field output_type as int32 data type.

library(hubUtils)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

model_output_schema <- schema(
  origin_date = date32(),
  target = string(),
  horizon = int32(),
  location = string(),
  output_type = int32(),
  output_type_id = string(),
  value = int32(),
  model_id = string()
)

model_output_dir <- system.file("testhubs/simple/model-output", package = "hubUtils")
mod_out_con <- connect_model_output(model_output_dir, file_format = "csv",
                                    schema = model_output_schema)
#> Error in `arrow::open_dataset()` at hubUtils/R/connect_model_output.R:32:8:
#> ! Invalid: No non-null segments were available for field 'model_id'; couldn't infer type
#> Backtrace:
#>     ▆
#>  1. ├─hubUtils::connect_model_output(...)
#>  2. └─hubUtils:::connect_model_output.default(...) at hubUtils/R/connect_model_output.R:17:4
#>  3.   └─arrow::open_dataset(...) at hubUtils/R/connect_model_output.R:32:8
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, format = format)
#>  9.               └─rlang::abort(msg, call = call)

Created on 2023-06-13 with reprex v2.0.2

The error thrown:

#> Error in `arrow::open_dataset()` at hubUtils/R/connect_model_output.R:32:8:
#> ! Invalid: No non-null segments were available for field 'model_id'; couldn't infer type

has sent us on many a wild goose chase without giving any useful pointers to the actual problem, and will likely be even more confusing to downstream hub users.

Our options are:

  1. Report the poor error handling to arrow and wait for a resolution in the package itself.
  2. Try to capture, analyse and produce our own messages within hubUtils.

I feel we should definitely report the behaviour whatever else we decide.
While I'm leaning towards 2, on the principle that our functions currently produce really unhelpful error messages, it may not be that straightforward to implement.
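As a rough illustration of what option 2 could look like, here is a minimal sketch that wraps the open_dataset call in tryCatch() and re-throws the error with some context. The helper name open_dataset_with_hint and its open_fn argument are hypothetical and are not part of hubUtils or arrow; the hint text is only a guess at what might help users, based on the two failure modes described above.

```r
# Hypothetical helper (not part of hubUtils or arrow).
# `open_fn` defaults to arrow::open_dataset; it is an argument only so the
# wrapper can be exercised without arrow installed (an assumption of this sketch).
open_dataset_with_hint <- function(path, schema, format = "csv",
                                   open_fn = arrow::open_dataset) {
  tryCatch(
    open_fn(path, schema = schema, format = format),
    error = function(e) {
      # Keep the original arrow message, then add pointers to likely causes.
      msg <- paste0(
        "Could not open dataset at '", path, "'.\n",
        "Original error: ", conditionMessage(e), "\n",
        "The field named above may not be the actual cause. Check that:\n",
        "  * schema fields match the column names in the data ",
        "(schema fields: ", paste(names(schema), collapse = ", "), ")\n",
        "  * each field's data type is compatible with the column contents."
      )
      stop(msg, call. = FALSE)
    }
  )
}
```

Inside connect_model_output() this could stand in for the direct arrow::open_dataset() call. It does not diagnose the problem, it only lists the schema fields and the usual suspects next to the original message, so a proper fix upstream (option 1) would still be worthwhile.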

@elray1
Contributor

elray1 commented Jun 14, 2023

I like the idea of 1 as a temporary solution, with a goal to do 2 if still necessary later on, once we've got hubValidations in place

@annakrystalli annakrystalli transferred this issue from hubverse-org/hubUtils Feb 29, 2024