Some filesets are not getting characterized #2531

Closed
carakey opened this issue Sep 15, 2023 · 7 comments · Fixed by #2562 or #2567

carakey commented Sep 15, 2023

Descriptive summary

The request is twofold:

  1. Investigate this as a potential bug, given a suspicion that only certain file formats are affected.
  2. Re-characterize the 31 fileset objects whose PIDs are listed in the attached CSV.

Documentation

A number of problem fileset objects were identified in the preservation assessment format inventory. One subset of problem filesets was described as "Not fully characterized," with these characteristics:

  • Fileset Solr record does not include file_size_lts or original_checksum_tesim
  • Fileset page loads at https://ir.library.oregonstate.edu/concern/file_sets/{pid}
  • Fileset page includes metadata - title, depositor, date uploaded
  • Fixity check field displays results of check on 5/16/23
  • Characterization field displays file format and mime type (but not file size or original checksum)
  • File is downloadable and usable with either the download button or direct link, https://ir.library.oregonstate.edu/downloads/{pid}
  • A link to a parent work is present and functional and the work is deposited
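The two Solr-side symptoms in the list above lend themselves to a simple check. This is a minimal sketch, assuming each Solr record is available as a Python dict. The function name and the sample records are hypothetical; only the two field names come from this report.

```python
# Fields that a fully characterized fileset is expected to carry (from this report).
REQUIRED_FIELDS = ("file_size_lts", "original_checksum_tesim")


def missing_characterization_fields(solr_doc):
    """Return the required characterization fields that are absent or empty."""
    return [
        field
        for field in REQUIRED_FIELDS
        if solr_doc.get(field) in (None, "", [])
    ]


# Hypothetical records for illustration only.
partial = {"id": "example-pid-1", "file_format_tesim": ["pdf"]}
complete = {
    "id": "example-pid-2",
    "file_size_lts": 1024,
    "original_checksum_tesim": ["abc123"],
}

print(missing_characterization_fields(partial))
print(missing_characterization_fields(complete))
```

An empty result would mean the record has both fields; anything else names what is still missing.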

I tried to re-save both the files and their parent works to see if it would trigger the characterization process, but it did not. There are currently 31 fileset objects showing these characteristics. I'll add a CSV list so we can identify a way to trigger full characterization.

I've been looking for patterns, and found that these 31 have just 3 MIME types: application/pdf, image/tiff, image/png.

For the TIFFs and PNGs, it looks like every TIFF and PNG file deposited since a certain point in time (roughly March 2022?) has failed to be fully characterized, though sample sizes are small:

  • Every PNG fileset created 2022-02-18 and earlier is characterized, while every PNG created 2023-02-21 and later is not (granted, the sample size of the failed ones is 13 files in 3 parent works).
  • Every TIFF fileset created 2021-08-31 and earlier is characterized, while the 3 TIFFs created on 2022-03-23 are not (these are the 3 most recent TIFF files, and all are in the same parent work).
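The date comparison above could be repeated on a fresh export with something like the following. This is only a sketch: the function name and the tuple layout (MIME type, creation date, characterized flag) are assumptions, and the sample rows are illustrative, built from the four dates quoted in the bullets rather than from real records.

```python
from datetime import date


def cutoff_window(records):
    """Map each MIME type to (latest characterized date, earliest uncharacterized date)."""
    windows = {}
    for mime, created, characterized in records:
        latest_ok, earliest_bad = windows.get(mime, (None, None))
        if characterized:
            if latest_ok is None or created > latest_ok:
                latest_ok = created
        else:
            if earliest_bad is None or created < earliest_bad:
                earliest_bad = created
        windows[mime] = (latest_ok, earliest_bad)
    return windows


# Illustrative rows only, using the dates mentioned above.
sample = [
    ("image/png", date(2022, 2, 18), True),
    ("image/png", date(2023, 2, 21), False),
    ("image/tiff", date(2021, 8, 31), True),
    ("image/tiff", date(2022, 3, 23), False),
]

for mime, (latest_ok, earliest_bad) in cutoff_window(sample).items():
    print(mime, latest_ok, earliest_bad)
```

If the latest characterized date falls before the earliest uncharacterized date for a format, that is consistent with a cutoff somewhere in between, though it does not prove one.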

For the PDFs, all of the affected, not-fully-characterized filesets have a file_format_tesim value of "pdf", with no other qualifiers. These are the only PDFs with that value -- others have "pdf (Portable Document Format)" or "pdf (PDF/A)" or several other variations. Does this mean that the characterization fails before completing the format evaluation, or that there's something about these particular PDF files (embedded media files?) that prevents the characterization process from completing? Also worth mentioning: one of the affected PDF filesets was created in 2017, but the other 14 all have creation dates of 2022-03-21 and later, which matches the cutoff point for the images above.

It's not clear whether this issue is contributing to the slight mismatch between the total number of filesets in SA and the number reported in the monthly fixity checks, since these filesets lack a checksum but do have a value in the fixity check field.

Expected behavior

Files are automatically characterized upon deposit, and that characterization metadata is stored in Solr.

Actual behavior

A small number of files are not getting characterized.

Steps to reproduce

Solr dashboard query: q: has_model_ssim:"FileSet" ; fq: -original_checksum_tesim:*
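For anyone who would rather script this than use the dashboard, the same query can be expressed as a request URL. This is a rough sketch: the core URL is a placeholder, the function name is hypothetical, and the `fl`, `rows`, and `wt` values are my additions rather than part of the original query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def uncharacterized_filesets_url(solr_core_url, rows=100):
    """Build a Solr select URL for filesets that have no original checksum."""
    params = {
        "q": 'has_model_ssim:"FileSet"',
        "fq": "-original_checksum_tesim:*",
        # Assumption: a handful of fields is enough for triage.
        "fl": "id,file_format_tesim,file_size_lts",
        "rows": rows,
        "wt": "json",
    }
    return f"{solr_core_url.rstrip('/')}/select?{urlencode(params)}"


# Placeholder core URL; substitute the real Solr core.
url = uncharacterized_filesets_url("http://localhost:8983/solr/example-core")
print(url)
```

The negative filter query keeps only records where `original_checksum_tesim` has no value at all, which matches the dashboard query above.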

Related work

This is an offshoot of #2491.

carakey commented Sep 15, 2023

carakey commented Nov 3, 2023

I re-ran the Solr query to get updated not-characterized counts, and one PID in the CSV file is now characterized - xs55mm96c.

There are 8 new filesets that showed up in the results. I'm not adding them to the list to be re-characterized at this time, because they are all recently created and I'm curious to see if they might resolve on their own over the next couple of weeks, like the one above. I'm going to try to re-run this query more regularly to keep an eye on the fluctuations.
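To track those fluctuations between runs, the PID lists from two query runs can be compared as sets. This is a minimal sketch with a hypothetical function name; apart from xs55mm96c, the PIDs below are placeholders and not real identifiers.

```python
def compare_runs(previous, current):
    """Compare two lists of uncharacterized PIDs from successive query runs."""
    previous, current = set(previous), set(current)
    return {
        "resolved": sorted(previous - current),   # no longer in the results
        "new": sorted(current - previous),        # appeared since the last run
        "still_open": sorted(previous & current), # present in both runs
    }


# Placeholder PIDs, except xs55mm96c, which resolved between runs.
september = ["xs55mm96c", "placeholder-a", "placeholder-b"]
november = ["placeholder-a", "placeholder-b", "placeholder-c"]

print(compare_runs(september, november))
```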

CGillen commented Jan 24, 2024

Further work is required to remediate the file sets above.

carakey commented Mar 13, 2024

Per Corey's most recent comment, this is not complete until outstanding works are fully characterized.

I just re-ran the Solr query to verify. The extras mentioned in my November comment plus a few more from a December run are still not fully characterized, so I'm adding them to the list. No new objects have shown up since the December run, which seems like a good sign.

This is the current list of 43 fileset PIDs that are not fully characterized.

1831ct33f
3t946015w
4x51hs18f
6682xb950
6682xc069
6682xc23z
6h441268f
6q182v24z
6t053q84f
70795h27k
7w62fj16c
8k71nr403
8s45qh957
8s45qj261
9c67ww47d
9w032b69v
bc386s629
c247f146z
dv140248c
fq978333b
ft848z204
g158bs272
gt54kw49z
jq085t80v
mc87pz79k
mp48sm98k
n8710002m
ng451s069
nz806734q
pg15bp255
qr46r7630
rx913z328
s1784v26p
sn00b578r
st74d0144
vx021p319
w0892b809
ws859p66s
ww72bk84z
x920g518n
xd07h210z
zp38wm33h
zw12zd71d

These fileset objects must have characterization information, including original checksum and file size, before this ticket passes QA. (The 9 zombies from #2530 also show up in the Solr results, but I cut them from the list.)

carakey commented Apr 16, 2024

Doing the monthly fixity + characterization check. One fileset from the list above, 3t946015w, appears to be resolved (now has original checksum and filesize listed), and otherwise the list remains identical. It is unclear what happened with this one fileset. It does show Date Modified: 2024-04-15 but doesn't indicate any new User Activity.

CGillen commented Apr 16, 2024

I tested that one yesterday while checking to find the right command. It seemed to work, so I'll go ahead and run it on the rest in a little bit.

carakey commented May 1, 2024

All known uncharacterized works have been resolved. No new ones have appeared since November.

carakey closed this as completed May 1, 2024