-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TIFF: Add ASCII validation for ASCII fields #487
Open
david-russo
wants to merge
4
commits into
openpreserve:integration
Choose a base branch
from
bl-dpt:tiff-ascii-validation
base: integration
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
49be012
TIFF: Refactor ASCII reading to remove repeated code
david-russo 28a827f
TIFF: Add ASCII validation for ASCII fields
david-russo 22cd868
Merge branch 'integration' into tiff-ascii-validation
carlwilson 0a3982a
Merge branch 'integration' into tiff-ascii-validation
carlwilson File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi I have mixed feelings with declaring invalid TIFF files just because of non ASCII (but ISO-8859-1) characters.
Indeed, I know, because this control doesn't exist, that we have thousand of images that have "Bibliothèque nationale de France" in comment but are otherwise perfectly fine.
It would be nice if we could make this configurable like the
byteoffset=true
parameter to allow it to simply be a warning.A parameter like invalidNonAsciiCharacters with the default value true will be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tledoux I've been working on a more generic way of ignoring / downgrading / upgrading errors by id using a configuration file passed at run-time. It might prove a useful addition as it makes reporting more flexible. Will do my best to get the PR out for the first part of next week for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, yeah, we ourselves have quite a few (somewhat questionable) "©" symbols peppered about our TIFF collections. That was actually the motivator for the addition of this check. If JHOVE had had it earlier we would have picked up on this issue sooner and been able to remedy it with a better alternative.
The TIFF 5 and 6 specs are very clear that these fields should only contain "7-bit ASCII". Rendering software has no way to know what character encoding is being used in these fields if not ASCII, and so no characters outside of the ASCII range can be relied upon to render predictably outside of the software which wrote them. Given that, I think marking such TIFFs as invalid is actually fairly desirable, from a preservation perspective, though I certainly sympathize about now needing to deal with the problems our past mistakes have made for us.
The optional parameter isn't a bad idea, if you'd like to implement it, but it does make me think how much better a more general solution would be, allowing us to selectively suppress any messages using our cool newfangled message IDs! Alas... we currently set our validity statuses separately from our messages, so until message objects carry their own status modifiers, parameters may be the way to go.
Then again, a part of me also can't help but feel that by providing an explicit option for this check we may be implying that it's safe or reasonable to ignore this incompatibility under certain circumstances, which I don't believe is necessarily true. So I'm torn... but feel free to add what options you like :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlwilson Just now saw your reply. Of course you're already working on it :)