DRILL-8474: Add Daffodil Format Plugin #2836
Do we need an option here to convert validation errors to fatal?
Will logger.warn be seen by a query user, or is that just for someone dealing with the logs?
Validation errors either should be escalated to fatal, OR they should be visible in the query output display to a user somehow.
Either way, users will need a mechanism to suppress validation errors that prove to be unavoidable, since they could be commonplace. Nobody wants thousands of warnings about something they can't avoid that doesn't stop parsing and querying the data.
@mbeckerle The question I'd have is whether the query can proceed if validation fails. (I don't know the answer)
If the answer is no, then we need to halt execution ASAP and throw an exception. If the answer is that it can proceed, but the data might be less than ideal, maybe we add a configuration option that lets the user decide the behavior on a validation failure.
I could imagine situations where Drill is unable to read a huge file because someone fat-fingered a quotation mark somewhere or something like that. In a situation like that, sometimes you might just want to say "I'll accept a row or two of bad data just so I can read the whole file."
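To make the configuration idea concrete, here is a rough sketch of what a user-selectable policy could look like. Everything in it is hypothetical: `ValidationFailurePolicy`, `ValidationHandler`, and `onValidationErrors` are illustrative names, not existing Drill or Daffodil APIs, and the list of message strings stands in for whatever diagnostics the parser actually reports.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical option values; not an existing Drill or Daffodil setting.
enum ValidationFailurePolicy { FAIL, WARN, IGNORE }

// Hypothetical helper that a reader could call after each parse.
final class ValidationHandler {
  private final ValidationFailurePolicy policy;
  private final List<String> warnings = new ArrayList<>();

  ValidationHandler(ValidationFailurePolicy policy) {
    this.policy = policy;
  }

  // Applies the configured policy to the validation messages from one parse.
  void onValidationErrors(List<String> messages) {
    if (messages.isEmpty()) {
      return;
    }
    switch (policy) {
      case FAIL:
        // Escalate to fatal: halt the query as soon as possible.
        throw new IllegalStateException("Validation failed: " + messages.get(0));
      case WARN:
        // Keep going, but remember the messages so they can be surfaced later.
        warnings.addAll(messages);
        break;
      case IGNORE:
        // The user has chosen to suppress unavoidable validation errors.
        break;
    }
  }

  List<String> warnings() {
    return warnings;
  }
}
```

With `FAIL` the query stops on the first invalid record, with `WARN` the messages are collected for reporting, and with `IGNORE` they are dropped, which covers the "accept a row or two of bad data" case.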
Agree.
We draw a distinction between "well-formed" and "invalid" data, and whether one does validation seems like the right switch in Daffodil to use.
If data is malformed, that means you can't successfully parse it. If it is invalid, that just means values are unexpected. Example: a 3-digit number representing a percentage, 0 to 100. -1 is invalid, ABC is malformed.
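The percentage example can be sketched in plain Java. This is only an illustration of the distinction, not how Daffodil implements it: `ParseOutcome` and `PercentCheck.classify` are made-up names, and `Integer.parseInt` stands in for the parse step while the range check stands in for validation.

```java
// Hypothetical three-way outcome for a single field.
enum ParseOutcome { VALID, INVALID, MALFORMED }

final class PercentCheck {
  // Classifies a text field that is supposed to hold a percentage from 0 to 100.
  static ParseOutcome classify(String field) {
    int value;
    try {
      // "Parse" step: can the text be converted to the expected type at all?
      value = Integer.parseInt(field.trim());
    } catch (NumberFormatException e) {
      // Not convertible to an integer, so the data is malformed.
      return ParseOutcome.MALFORMED;
    }
    // "Validation" step: the value parsed, but is it in the expected range?
    return (value >= 0 && value <= 100) ? ParseOutcome.VALID : ParseOutcome.INVALID;
  }
}
```

Here `classify("-1")` yields `INVALID` because the text parses to an integer outside the range, while `classify("ABC")` yields `MALFORMED` because no integer can be produced at all.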
If data is not well formed, you really cannot continue parsing it, as you cannot convert it to the type expected. But, if you are able to determine at least how big it is, it's possible to capture that length of data into a dummy "badData" element which is always invalid (so isn't a "false positive" parse). This capability has to be designed into the DFDL schema, but it is something we've been doing more and more.
Hence, one can tolerate even some malformed data. If it is malformed to where you cannot determine the length, then continuing is impossible.
We will see if more than this is needed. Options like "use all strings/varchar" or "all numbers are float", which you have for tolerating situations with other data connectors, may prove useful, particularly while a DFDL schema is in development and you are really just testing it (and the corresponding data) using Drill.