The term "file format" as it's used in AIPscan might be a bit misleading #96

Open
tw4l opened this issue Nov 26, 2020 · 1 comment
Labels
help wanted (Extra attention is needed), Request: discussion (Issues to talk about...)

Comments

@tw4l
Contributor

tw4l commented Nov 26, 2020

Many of the reports in AIPscan, such as "File format count" as well as the "File format" and "File format version" reports introduced in #76, refer to "file formats". The file format names found in AIPscan are aggregated from the format names in Archivematica METS files and ultimately reflect the Archivematica FPR data model (and, one degree further, the PRONOM data model). These names sometimes look a bit different from what one would expect. In the words of @ross-spencer, "they're not quite distinct file formats, and they're not quite format families either." Ross has suggested that a better name for these might be "format naming group".

To give a few examples:

Most end users would probably consider PDF to be a file format, and variations of it to be file format versions. In Archivematica/AIPscan, "Acrobat PDF 1.4 - Portable Document Format" and "Acrobat PDF 1.5 - Portable Document Format" are considered to be different file formats, not different versions of the same format. In the Archivematica FPR, these are aggregated into a "Portable Document Format" format group, but that aspect of the FPR data model has not made its way down to AIPscan yet.
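To make the PDF case concrete, here is a minimal sketch of how versioned names like the two above could be split into a family and a version. It is not how AIPscan or the FPR does it: the regular expression, `split_format_name`, and `group_by_family` are hypothetical, and the pattern assumes names shaped like "<label> <version> - <family>", which only holds for some format names.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "<label> <version> - <family>". It matches names shaped
# like the two PDF examples and nothing else.
VERSIONED_NAME = re.compile(
    r"^(?P<label>.+?) (?P<version>\d+(?:\.\d+)*) - (?P<family>.+)$"
)


def split_format_name(name):
    """Return (family, version) for a versioned name, else (name, None)."""
    match = VERSIONED_NAME.match(name)
    if match is None:
        return name, None
    return match.group("family"), match.group("version")


def group_by_family(names):
    """Collect the versions seen for each family parsed out of the names."""
    groups = defaultdict(list)
    for name in names:
        family, version = split_format_name(name)
        groups[family].append(version)
    return dict(groups)
```

Names that do not follow the pattern, such as "JPEG", fall through unchanged, which is exactly why string parsing alone would not be a complete answer here.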

Similarly, most end users would consider JPEG to be a file format. In PRONOM, valid files with a .jpg/.jpeg file extension have the following file format names, among others, each with one or more associated PUIDs:

  • "Raw JPEG Stream"
  • "JPEG File Interchange Format"
  • "Exchangeable Image File Format (Compressed)"

By the time we get to AIPscan, reading format names from the METS files, we seem to have all of the above as well as "JPEG" and "Generic JPEG". This makes it really difficult for an end user to see all of the files they would consider to be in the JPEG format. And in this instance, Archivematica's format groups likely wouldn't help us, as the nearest format group is "Image (Raster)".
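One way to picture what an end user is asking for is a hand-maintained lookup that folds all of these names into a single "JPEG" bucket before counting. This is only a sketch under that assumption: `JPEG_FAMILY_NAMES`, `family_for`, and `count_by_family` are hypothetical names, and the lookup contains just the five names mentioned in this issue.

```python
from collections import Counter

# Hypothetical, hand-maintained set of METS-derived names that an end user
# would likely think of as "JPEG". Limited to the names mentioned in this issue.
JPEG_FAMILY_NAMES = {
    "Raw JPEG Stream",
    "JPEG File Interchange Format",
    "Exchangeable Image File Format (Compressed)",
    "JPEG",
    "Generic JPEG",
}


def family_for(name):
    """Map a format name to a user-facing family, defaulting to the name itself."""
    return "JPEG" if name in JPEG_FAMILY_NAMES else name


def count_by_family(file_format_names):
    """Count files per user-facing family, given one format name per file."""
    return Counter(family_for(name) for name in file_format_names)
```

A lookup like this would have to be curated by someone, which is part of the open question of where that data should come from.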

I'm not sure what the solution to this looks like at this point. It might be useful to do some thinking about whether or how to communicate some of these subtleties through the UI, as well as what might become possible by bringing additional data sources into AIPscan.

@tw4l
Contributor Author

tw4l commented Nov 27, 2020

One stopgap solution to consider that was suggested by @ross-spencer is to list the related PUIDs alongside the format name wherever possible in the UI, e.g.:

[image]
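As a rough idea of what that stopgap could look like in code, here is a minimal sketch of a label helper that appends any known PUIDs to a format name. `format_label` is a hypothetical function, not part of AIPscan, and the PUIDs in the usage note are for illustration only and should be checked against PRONOM.

```python
def format_label(name, puids):
    """Render 'name (puid, puid)', or just the name when no PUIDs are known."""
    # Drop empty values and duplicates, then sort for a stable display order.
    unique = sorted({puid for puid in puids if puid})
    if not unique:
        return name
    return f"{name} ({', '.join(unique)})"
```

For example, `format_label("JPEG File Interchange Format", ["fmt/43", "fmt/44"])` would give a label with both PUIDs in parentheses after the name, while a name with no recorded PUID is shown as-is.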

tw4l added the help wanted and Request: discussion labels Nov 27, 2020