Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TIFF: Add ASCII validation for ASCII fields #487

Open
wants to merge 4 commits into
base: integration
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -560,78 +560,69 @@ protected Property propertyHeader(String type, List entries)
}

/**
* Reads a string value from the TIFF file.
* If there are non-ASCII characters, they're escaped as %XX
* @param count Length of string
* @param value Offset of string
* Reads a single null-terminated ASCII string from the TIFF file.
* If there are non-ASCII characters, they're converted to hex as %XX.
*
* All ASCII fields may contain multiple strings but this method
* will only return the first. For that reason, it's recommended
* to instead use {@link #readASCIIArray} and handle any additional
* strings appropriately.
*
* @param count Length of ASCII field
* @param value Offset of ASCII field
*/
protected String readASCII(long count, long value)
throws IOException
{
_raf.seek(value);

byte [] buffer = new byte [(int) count];
_raf.read(buffer);

StringBuffer sb = new StringBuffer();
for (int i=0; i<count; i++) {
byte c = buffer[i];
if (c == 0) {
break;
}
// Escape characters that aren't ASCII. There really shouldn't
// be any, and if there are, we don't know how they're encoded.
if (c < 32 || c > 127) {
sb.append(byteToHex(c));
}
else {
sb.append((char) c);
}
}
return sb.toString();
String [] asciiStrings = readASCIIArray(count, value);
return asciiStrings.length > 0 ? asciiStrings[0] : "";
}

/** Reads an array of strings from the TIFF file.
/**
* Reads an array of null-terminated ASCII strings from the TIFF file.
* If there are non-ASCII characters, they're converted to hex as %XX.
*
* @param count Number of strings to read
* @param value Offset from which to read
*
* @param count Length of ASCII field
* @param value Offset of ASCII field
*/
protected String [] readASCIIArray(long count, long value)
throws IOException
{
_raf.seek(value);
final int LOWEST_PRINTABLE_ASCII = 32;
final int HIGHEST_PRINTABLE_ASCII = 126;
final int HIGHEST_VALID_ASCII = 127;

int nstrs = 0;
List<String> list = new LinkedList<>();
byte[] buf = new byte[(int) count];
_raf.seek(value);
_raf.read(buf);
StringBuffer strbuf = new StringBuffer();
for (int i=0; i<count; i++) {
int b = buf[i];

boolean invalidAscii = false;
List<String> list = new LinkedList<>();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < count; i++) {
int b = Byte.toUnsignedInt(buf[i]);
if (b == 0) {
list.add(strbuf.toString());
strbuf.setLength(0);
list.add(sb.toString());
sb.setLength(0);
}
else {
// Escape characters that aren't ASCII. There really shouldn't
// be any, and if there are, we don't know how they're encoded.
if (b < 32 || b > 127) {
strbuf.append(byteToHex((byte) b));
if (b < LOWEST_PRINTABLE_ASCII || b > HIGHEST_PRINTABLE_ASCII) {
sb.append(byteToHex((byte) b));
if (b > HIGHEST_VALID_ASCII) invalidAscii = true;
}
else {
strbuf.append((char) b);
sb.append((char) b);
}
}
}
/* We can't use ArrayList.toArray because that returns an
Object[], not a String[] ... sigh. */
String [] strs = new String[nstrs];
ListIterator<String> iter = list.listIterator();
for (int i=0; i<nstrs; i++) {
strs[i] = iter.next();

if (invalidAscii) {
_info.setMessage(new ErrorMessage(
MessageConstants.TIFF_HUL_72, value));
_info.setValid(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi I have mixed feelings with declaring invalid TIFF files just because of non ASCII (but ISO-8859-1) characters.
Indeed, I know, because this control doesn't exist, that we have thousand of images that have "Bibliothèque nationale de France" in comment but are otherwise perfectly fine.

It would be nice if we could make this configurable like the byteoffset=true parameter to allow it to simply be a warning.
A parameter like invalidNonAsciiCharacters with the default value true will be great.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tledoux I've been working on a more generic way of ignoring / downgrading / upgrading errors by id using a configuration file passed at run-time. It might prove a useful addition as it makes reporting more flexible. Will do my best to get the PR out for the first part of next week for review.

Copy link
Member Author

@david-russo david-russo Oct 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, yeah, we ourselves have quite a few (somewhat questionable) "©" symbols peppered about our TIFF collections. That was actually the motivator for the addition of this check. If JHOVE had had it earlier we would have picked up on this issue sooner and been able to remedy it with a better alternative.

The TIFF 5 and 6 specs are very clear that these fields should only contain "7-bit ASCII". Rendering software has no way to know what character encoding is being used in these fields if not ASCII, and so no characters outside of the ASCII range can be relied upon to render predictably outside of the software which wrote them. Given that, I think marking such TIFFs as invalid is actually fairly desirable, from a preservation perspective, though I certainly sympathize about now needing to deal with the problems our past mistakes have made for us.

The optional parameter isn't a bad idea, if you'd like to implement it, but it does make me think how much better a more general solution would be, allowing us to selectively suppress any messages using our cool newfangled message IDs! Alas... we currently set our validity statuses separately from our messages, so until message objects carry their own status modifiers, parameters may be the way to go.

Then again, a part of me also can't help but feel that by providing an explicit option for this check we may be implying that it's safe or reasonable to ignore this incompatibility under certain circumstances, which I don't believe is necessarily true. So I'm torn... but feel free to add what options you like :)

Copy link
Member Author

@david-russo david-russo Oct 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carlwilson Just now saw your reply. Of course you're already working on it :)

}
return strs;

return list.toArray(new String[list.size()]);
}

/** Reads and returns a single unsigned 8-bit integer value.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -112,87 +112,5 @@ public enum MessageConstants {
public static final JhoveMessage TIFF_HUL_69 = messageFactory.getMessage("TIFF-HUL-69");
public static final JhoveMessage TIFF_HUL_70 = messageFactory.getMessage("TIFF-HUL-70");
public static final JhoveMessage TIFF_HUL_71 = messageFactory.getMessage("TIFF-HUL-71");

// private static final String tagMismatch = " mismatch for tag ";
//
// /**
// * Error "fill in" strings
// */
// public static final String MISMATCH_SUB_1 = "; expecting ";
// public static final String MISMATCH_SUB_2 = ", saw ";
// public static final String MISMATCH_SUB_3 = " or ";
// public static final String TAG_SUB_MESS = "Tag = ";
// public static final String MESSAGE_SUB_MESS = "; message ";
//
// /**
// * Information messages
// */
// public static final String INF_COMP_SCH_6_DEPR = "TIFF compression scheme 6 is deprecated";
// public static final String INF_IFD_TAG_UNK = "Unknown TIFF IFD tag: ";
// public static final String INF_IMAGE_LEN_NO_DEF = "ImageLength not defined";
// public static final String INF_IMAGE_WID_NO_DEF = "ImageWidth not defined";
// public static final String INF_PHO_NO_DEF = "PhotometricInterpretation not defined";
// public static final String INF_SHADOW_SCALE = "ShadowScale (50739)";
// public static final String INF_STR_AND_TILE_TOGETHER = "Strips and tiles defined together";
// public static final String INF_STR_AND_TILE_NO_DEF = "Neither strips nor tiles defined";
// public static final String INF_STR_BYTE_COUNT_NO_DEF = "StripByteCounts not defined";
// public static final String INF_STR_OFF_BYTE_COUNT_INCONS = "StripOffsets inconsistent with StripByteCounts: ";
// public static final String INF_STR_OFF_INVALID = "Invalid strip offset";
// public static final String INF_STR_OFF_NO_DEF = "StripOffsets not defined";
// public static final String INF_TIFF_TAG_UNDOC = "Undocumented TIFF tag";
// public static final String INF_VALOFF_NOT_WORD_ALIGN = "Value offset not word-aligned: ";
//
// /**
// * Error messages
// */
// public static final String ERR_CELL_LEN_NOT_ALLWD = "CellLength tag not permitted when Threshholding not 2";
// public static final String ERR_CIELAB_BPS_NOT_8_OR_16 = "BitsPerSample not 8 or 16 for CIE L*a*b*";
// public static final String ERR_COL_MAP_NOT_DEF = "ColorMap not defined for palette-color";
// public static final String ERR_COL_MAP_MISS_VALS = "Insufficient ColorMap values for palette-color: ";
// public static final String ERR_DATE_TIME_LEN_INV = "Invalid DateTime length: ";
// public static final String ERR_DATE_TIME_SEP_INV = "Invalid DateTime separator: ";
// public static final String ERR_DATE_TIME_DIG_INV = "Invalid DateTime digit: ";
// public static final String ERR_DOT_RANGE_BPS = "DotRange out of range specified by BitsPerSample";
// public static final String ERR_EXIF_BLOCK_TOO_SHORT = "Embedded Exif block is too short";
// public static final String ERR_FILE_TOO_SHORT = "File is too short";
// public static final String ERR_GEO_KEY_DIRECT_INVALID = "Invalid GeoKeyDirectory tag";
// public static final String ERR_GEO_KEY_OUT_SEQ = "GeoKey ";
// public static final String ERR_GEO_KEY_OUT_SEQ_2 = " out of sequence";
// public static final String ERR_GPS_IFD_TAG_UNK = "Unknown GPSInfo IFD tag";
// public static final String ERR_IFD_MISSING = "No IFD in file ";
// public static final String ERR_IFD_OFF_MISALIGN = "IFD offset not word-aligned: ";
// public static final String ERR_IFD_MAX_EXCEEDED = "More than 50 IFDs in chain, probably an infinite loop";
// public static final String ERR_IO_READ = "Read error";
// public static final String ERR_JPEGPROC_NO_DEF = "JPEGProc not defined for JPEG compression";
// public static final String ERR_PAL_COL_SPP_NE_1 = "For palette-color SamplesPerPixel must be 1: ";
// public static final String ERR_PHO_AND_NEW_SUBFILE_INCONSISTENT = "PhotometricInterpretation and NewSubfileType must agree on transparency mask";
// public static final String ERR_PHO_INT_SPP_GT_1 = "For PhotometricInterpretation, SamplesPerPixel must be >= 1, equals: ";
// public static final String ERR_PHO_INT_SPP_GT_3 = "For PhotometricInterpretation, SamplesPerPixel must be >= 3, equals: ";
// public static final String ERR_SPP_EXTRA_NT_1_OR_3 = "SamplesPerPixel-ExtraSamples not 1 or 3:";
// public static final String ERR_TAG_COUNT_MISMATCH = "Count" + tagMismatch;
// public static final String ERR_TAG_IO_READ = ERR_IO_READ + " for tag ";
// public static final String ERR_TAG_ICCPROFILE_BAD = "Bad ICCProfile in tag ";
// public static final String ERR_TAG_OUT_OF_SEQ_1 = "Tag ";
// public static final String ERR_TAG_OUT_OF_SEQ_2 = ERR_GEO_KEY_OUT_SEQ_2;
// public static final String ERR_TAG_TYPE_MISMATCH = "Type" + tagMismatch;
// public static final String ERR_TIFF_HEADER_MISSING = "No TIFF header: ";
// public static final String ERR_TIFF_MAGIC_NUM_MISSING = "No TIFF magic number: ";
// public static final String ERR_TIFF_PREM_EOF = "Premature EOF";
// public static final String ERR_TRANS_MASK_BPS = "For transparency mask BitsPerSample must be 1";
// public static final String ERR_UNK_DATA_TYPE = "Unknown data type";
// public static final String ERR_UNK_DATA_TYPE_SUB_1 = "Type = ";
// public static final String ERR_UNK_DATA_TYPE_SUB_2 = ", Tag = ";
// public static final String ERR_EXIF_INTER_IFD_UNK = "Unknown Exif Interoperability IFD tag";
// public static final String ERR_VAL_OUT_OF_RANGE = " value out of range: ";
// public static final String ERR_XMP_INVALID = "Invalid or ill-formed XMP metadata";
// public static final String ERR_TILE_WID_NO_DEF = "TileWidth not defined";
// public static final String ERR_TILE_LEN_NO_DEF = "TileLength not defined";
// public static final String ERR_TILE_OFF_NO_DEF = "TileOffsets not defined";
// public static final String ERR_TILE_COUNT_NO_DEF = "TileByteCounts not defined";
// public static final String ERR_TILE_WID_NOT_DIV_16 = "TileWidth not a multiple of 16: ";
// public static final String ERR_TILE_LEN_NOT_DIV_16 = "TileLength not a multiple of 16: ";
// public static final String ERR_TILE_OFF_MISS_VALS = "Insufficient values for TileOffsets: ";
// public static final String ERR_TILE_COUNT_MISS_VALS = "Insufficient values for TileByteCounts: ";
// public static final String ERR_XCLIP_PATH_NO_DEF = "XClipPathUnits not defined for ClipPath";

public static final JhoveMessage TIFF_HUL_72 = messageFactory.getMessage("TIFF-HUL-72");
}
Original file line number Diff line number Diff line change
Expand Up @@ -52,15 +52,15 @@ TIFF-HUL-46 = Insufficient ColorMap values for palette-color: {0,number,#} < {1,
TIFF-HUL-47 = CellLength tag not permitted when Threshholding not 2
TIFF-HUL-48 = DotRange out of range specified by BitsPerSample
TIFF-HUL-49 = JPEGProc not defined for JPEG compression
TIFF-HUL-50 = SamplesPerPixel-ExtraSamples not 1 or 3:{0,number,#}-{1,number,#}
TIFF-HUL-50 = SamplesPerPixel-ExtraSamples not 1 or 3: {0,number,#}-{1,number,#}
TIFF-HUL-51 = BitsPerSample not 8 or 16 for CIE L*a*b*
TIFF-HUL-52 = XClipPathUnits not defined for ClipPath
TIFF-HUL-53 = Invalid DateTime length: {0}
TIFF-HUL-54 = Invalid DateTime separator: {0}
TIFF-HUL-55 = Invalid DateTime digit: {0}
TIFF-HUL-56 = Invalid DateTime digit: {0}
TIFF-HUL-57 = Premature EOF
TIFF-HUL-58 = No IFD in file
TIFF-HUL-58 = No IFD in file
TIFF-HUL-59 = IFD offset not word-aligned: {0,number,#}
TIFF-HUL-60 = More than 50 IFDs in chain, probably an infinite loop
TIFF-HUL-61 = TIFF compression scheme 6 is deprecated
Expand All @@ -74,4 +74,4 @@ TIFF-HUL-68 = Unexpected Exception: {0}
TIFF-HUL-69 = Unexpected Exception: {0}
TIFF-HUL-70 = Embedded Exif block is too short
TIFF-HUL-71 = Bad ICCProfile in tag {0,number,#}; message {1}
TIFF-HUL-72 = ASCII field contains non-ASCII value(s)