Some filesets are not getting characterized #2531
Comments
I re-ran the Solr query to get updated not-characterized counts, and one PID in the CSV file is now characterized: xs55mm96c. There are 8 new filesets that showed up in the results. I'm not adding them to the list to be re-characterized at this time, because they are all recently created and I'm curious to see if they might resolve on their own over the next couple of weeks, like the one above. I'm going to try to re-run this query more regularly to keep an eye on the fluctuations.
Further work required to remediate the above filesets.
Per Corey's most recent comment, this is not complete until outstanding works are fully characterized. I just re-ran the Solr query to verify. The extras mentioned in my November comment plus a few more from a December run are still not fully characterized, so I'm adding them to the list. No new objects showed up since the December run, which seems like a good sign. This is the current list of 43 fileset PIDs that are not fully characterized.
These fileset objects must have characterization information, including original checksum and filesize, before this ticket passes QA. (The 9 zombies from #2530 also show up in the Solr results, but I cut them from the list.)
Doing the monthly fixity + characterization check. One fileset from the list above,
I tested that one yesterday, checking to find the right command. It seemed to work, so I'll go ahead and run it in a little bit.
All known uncharacterized works have been resolved. No new ones have appeared since November.
Descriptive summary
The request is twofold:
Documentation
A number of problem fileset objects were identified in the preservation assessment format inventory. One subset of problem filesets was described as "Not fully characterized", presenting these characteristics:
I tried to re-save both the files and their parent works to see if it would trigger the characterization process, but it did not. There are currently 31 fileset objects showing these characteristics. I'll add a CSV list so we can identify a way to trigger full characterization.
I've been looking for patterns, and found that these 31 have just 3 MIME types: application/pdf, image/tiff, image/png.
For the TIFFs and PNGs, it looks like every TIFF and PNG file deposited since a certain point in time (roughly March 2022?) has failed to be fully characterized, though sample sizes are small:
For the PDFs, all of the affected, not-fully-characterized filesets have a file_format_tesim value of "pdf", with no other qualifiers. These are the only PDFs with that value; others have "pdf (Portable Document Format)" or "pdf (PDF/A)" or several other variations. Does this mean that the characterization fails before completing the format evaluation, or that there's something about these particular PDF files (embedded media files?) that prevents the characterization process from completing? Also worth mentioning: one of the affected PDF filesets was created in 2017, but the other 14 all have creation dates of 2022-03-21 and later, which matches the cutoff point for the images above.
It's not clear whether this issue is contributing to the slight mismatch in the total number of filesets in SA versus the number reported in the monthly fixity checks, since they lack a checksum but do have a value in the fixity check field.
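The PDF pattern described above could be checked against an exported set of Solr documents with something like the following. This is only a rough sketch: the helper names are hypothetical, and it assumes each document is a dict in which file_format_tesim is either a list of strings or a single string.

```python
from collections import Counter


def format_values(doc):
    """Return the file_format_tesim values of a Solr doc as a list of strings."""
    value = doc.get("file_format_tesim", [])
    # Assumption: the field may arrive as a single string or as a list.
    return [value] if isinstance(value, str) else list(value)


def has_bare_pdf_format(doc):
    """True when the only format value is the unqualified string "pdf"."""
    return format_values(doc) == ["pdf"]


def tally_formats(docs):
    """Count how often each file_format_tesim value occurs across docs."""
    return Counter(v for doc in docs for v in format_values(doc))
```

Filtering the PDF filesets with has_bare_pdf_format and comparing the result to the not-characterized list would show whether the two sets really coincide.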
Expected behavior
Files are automatically characterized upon deposit, and that characterization metadata is stored in Solr.
Actual behavior
A small number of files are not getting characterized.
Steps to reproduce
Solr dashboard query: q: has_model_ssim:"FileSet" ; fq: -original_checksum_tesim:*
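For re-running this check outside the dashboard, the same query can be expressed as a request URL. The sketch below is an assumption-laden illustration, not the project's tooling: the endpoint URL and core name are placeholders, the function names are hypothetical, and it assumes the fileset PID is stored in the Solr id field. It only builds the URL and parses a response, so fetching is left to whatever HTTP client is at hand.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Placeholder endpoint (assumption); substitute the real Solr host and core.
SOLR_SELECT_URL = "http://localhost:8983/solr/hypothetical-core/select"


def build_uncharacterized_query(rows=100, base_url=SOLR_SELECT_URL):
    """Build a select URL for FileSets that have no original_checksum_tesim."""
    params = [
        ("q", 'has_model_ssim:"FileSet"'),     # same q as the dashboard query
        ("fq", "-original_checksum_tesim:*"),  # same fq: checksum field absent
        ("fl", "id"),                          # assumption: PID lives in "id"
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


def extract_ids(solr_response):
    """Pull the id of each doc out of a parsed Solr JSON response."""
    docs = solr_response.get("response", {}).get("docs", [])
    return [doc["id"] for doc in docs]
```

Saving the output of extract_ids from each run would make it easier to compare the list month to month and spot filesets that resolve on their own.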
Related work
This is an offshoot of #2491.