Speed up the sdr-ingest-transfer step in the accessionWF #4594
Comments
I'd suggest a more general inquiry into why the sdr-ingest-transfer step is slow, which may or may not have something to do with fixity generation.
Good idea. Consider it done.
https://github.com/sul-dlss/dor-services-app/blob/main/app/services/preservation_ingest_service.rb#L32 forces a regeneration of fixities, since preservation requires md5, sha1, and sha256 but only md5 and sha1 are present.
Is there a place earlier in accessioning where we are computing md5 and sha1 where we could add the sha256, so preservation wouldn't be re-reading all the files? @andrewjbtw
There are multiple places, including the SDR API, plus the Cocina model would need the new hash added. It's still possible that there is a benefit to sdr-ingest-transfer checking fixity at this step and the problem is that its method for checking is slower than it needs to be. It's also possible that checking fixity at this step is redundant because bag validation will cover that need.
The method for checking is not slower than it needs to be. I defer to your judgment on whether there is value in performing a fixity check at this step.
Also, while sha256 would need to be added in multiple places, each change should be minor.
@andrewjbtw Based on reading the code, the fixities are re-generated when the number of algorithms != 3. (That is, it does NOT check the fixities and then re-generate only if they don't match.)
FYI, code from moab-versioning (https://github.com/sul-dlss/moab-versioning/blob/main/lib/moab/file_signature.rb#L77):

def self.from_file(pathname, algos_to_use = active_algos)
  raise(MoabRuntimeError, 'Unrecognized algorithm requested') unless algos_to_use.all? { |a| KNOWN_ALGOS.include?(a) }

  signatures = algos_to_use.to_h { |k| [k, KNOWN_ALGOS[k].call] }
  pathname.open('r') do |stream|
    while (buffer = stream.read(8192))
      signatures.each_value { |digest| digest.update(buffer) }
    end
  end
  new(signatures.transform_values(&:hexdigest).merge(size: pathname.size))
end

And the code from common-accessioning:

def generate_checksums(filepath)
  md5 = Digest::MD5.new
  sha1 = Digest::SHA1.new
  File.open(filepath, 'r') do |stream|
    while (buffer = stream.read(8192))
      md5.update(buffer)
      sha1.update(buffer)
    end
  end
  { md5: md5.hexdigest, sha1: sha1.hexdigest }
end

Spitballing: Differences I see (that may have no bearing)
Differences that could be hidden to us
Also: Is there any software that re-reads the files when creating the bag?
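Picking up the earlier suggestion of adding sha256 where md5 and sha1 are already computed: a minimal sketch (not the current code in either repo) of a generate_checksums that produces all three digests that preservation expects in a single read, so preservation would not need to re-read the files. The Cocina model and SDR API changes mentioned above would still be needed to carry the extra value forward.

require 'digest'

# Hypothetical extension of common-accessioning's generate_checksums: one pass
# over the file feeds all three digests.
def generate_checksums(filepath)
  digests = { md5: Digest::MD5.new, sha1: Digest::SHA1.new, sha256: Digest::SHA256.new }
  File.open(filepath, 'rb') do |stream|
    while (buffer = stream.read(8192))
      digests.each_value { |digest| digest.update(buffer) }
    end
  end
  digests.transform_values(&:hexdigest)
end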
checksum-compute does the computation on the common-accessioning boxes. sdr-ingest-transfer calls DSA -- would we be comparing the DSA hardware to the common-accessioning hardware?
Late-breaking thought -- I think that we might be re-reading files in order to compute file size: https://github.com/sul-dlss/moab-versioning/blob/main/lib/moab/file_inventory.rb
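For what it's worth, obtaining a file's size shouldn't require reading its content: File.size is a stat call, and the size can also be accumulated during the digest pass. A minimal sketch with a hypothetical helper (not code from file_inventory.rb):

require 'digest'

# Hypothetical helper: compute a digest and the byte count in the same pass,
# so no separate read is needed for the size.
def sha256_and_size(filepath)
  digest = Digest::SHA256.new
  bytes = 0
  File.open(filepath, 'rb') do |stream|
    while (buffer = stream.read(8192))
      digest.update(buffer)
      bytes += buffer.bytesize
    end
  end
  # File.size(filepath) would also work; it stats the file without reading it.
  { sha256: digest.hexdigest, size: bytes }
end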
Network bandwidth is another thing that could be limiting the speed. It's definitely possible that I'm overestimating how we've resourced the accessioning system, but I'm not ready to conclude that yet.
I think common-accessioning ends up using the dor-services-app hardware, because that's where the actual checksum computations are happening (via the moab-versioning gem called by DSA).
JLitt will run a test to see if there are differences between different checksum algorithms. Can the test be run on the dor-services-app VM vs. the common-accessioning VM? Also discussed: technical-metadata computing checksums; techmd may parallelize across an object's files.
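A rough sketch of the per-file parallelization idea (hypothetical; how technical-metadata actually does it isn't shown in this thread). In MRI, threads mainly help by overlapping I/O waits; whether the hashing itself runs in parallel depends on the Ruby/digest implementation and the GVL, so processes per file might be needed if hashing is CPU-bound.

require 'digest'

# Hypothetical sketch: hash each of an object's files in its own thread and
# return a path => md5 hash.
def md5_for_files(filepaths)
  filepaths.map do |path|
    Thread.new do
      digest = Digest::MD5.new
      File.open(path, 'rb') do |stream|
        while (buffer = stream.read(8192))
          digest.update(buffer)
        end
      end
      [path, digest.hexdigest]
    end
  end.map(&:value).to_h
end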
I believe this was: It's possible that technical-metadata is generating checksums using a faster method. One possibility is that it's parallelizing the reads rather than reading files more than once. technical-metadata is only doing MD5, so the algorithm could also make a difference. For what it's worth, this was my experience with checksums on large files (100+ GB) at the Computer History Museum:
Last night, I ran sha256sum on the content in https://argo.stanford.edu/view/ch776mh8649 (about 880 GB), using dor-services-app-prod-a with the content in a folder on the /dor mount. It took about 6 hours, which is around what I'd expect. The sdr-ingest-transfer step took 19 hours when this was accessioned.

$ time find . -type f -exec sha256sum {} \;
b4ecf6f5cee3b997cdb9f001ad1db311ae9a62570953ac4241fd7a23e7157e2c ./ch776mh8649/ch776mh8649_em_sh.mov.md5
d65165279105ca6773180500688df4bdc69a2c7b771752f0a46ef120b7fd8ec3 ./ch776mh8649/.DS_Store
77a2511522a62945093e23eb0c9d89bcc9dea8c9304092f768ccd165fcc8d4c8 ./ch776mh8649/._.DS_Store
a04b70fe8bc1336e5f816a5479acc5b9899d552e3394810be719907bf95113af ./ch776mh8649/ch776mh8649_em_sh.mov
c182e18aa773adec7ced87a102f3f5c1ad69afe97a4bc752ca26ce0ea042af65 ./ch776mh8649/ch776mh8649_em_sl.mp4.md5
c96a8be954ab6d2a3cfddf6e341f1d2e891db068ebaf1a0d8415edc1577cc295 ./ch776mh8649/ch776mh8649_em_sl.mp4
2b663825455d29cf83c00bf8bbeef06603e4eb148f9500ad2ceb9fdb6dc82f3f ./ch776mh8649/ch776mh8649_thumb.jp2.md5
f4114abce3e147dc10f5e156f5162d1e9245c8592ce7cc8a9e8495fd66d7fe26 ./ch776mh8649/ch776mh8649_md.csv
9fac75cd068d6db516b3ba9884c404cffcc8c7738e82257d0ce97ba04a231f58 ./ch776mh8649/ch776mh8649_md.csv.md5
415d066052675a87c31042ed0d39319e7f7d4f14977e5229a9d4f6c20b9d67c8 ./ch776mh8649/ch776mh8649_pm.tar
79259dba1f41a146f7f483b2d272b38fb20a2240a5d15d118f5231b62511724b ./ch776mh8649/ch776mh8649_pm.tar.md5
dae00215cff24a47766417530c2866727a198fd630c415b6f6d473b0156078a8 ./ch776mh8649/ch776mh8649_thumb.jp2
real 365m2.198s
user 74m12.377s
sys 17m24.700s
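For comparison against the sha256sum run above, a small Ruby sketch (directory argument and chunking are arbitrary) that times Digest::SHA256 over the same tree could help isolate whether Ruby itself is the slow part:

require 'benchmark'
require 'digest'

# Walk a directory tree and sha256 every file, roughly mirroring the
# `find ... -exec sha256sum` test above.
dir = ARGV.fetch(0, '.')
elapsed = Benchmark.realtime do
  Dir.glob(File.join(dir, '**', '*'), File::FNM_DOTMATCH).each do |path|
    next unless File.file?(path)

    puts "#{Digest::SHA256.file(path).hexdigest}  #{path}"
  end
end
warn format('total: %.1f seconds', elapsed)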
I ran an experiment generating fixities in production on a 386 GB file, using Ruby code that resembled the current code:

The results were:
Thanks for the further analysis. It looks to me like the Ruby approach may be slow for some reason. I couldn't find the file you used in the workspace, but I did use a 280 GB file. Since it's smaller, I would expect it to take less time for checksums. But the results I got on dor-services-app are all less than 1/3 the time of the results with the Ruby test:
It does look like I was wrong about sdr-ingest-transfer, and it is not reading files multiple times. I checked when we updated checksum-compute to stop doing multiple reads, and I think that was deployed before the latest set of large content was accessioned. So if checksum-compute and sdr-ingest-transfer take close to the same amount of time, both must be doing the same number of reads.
This is waiting for evaluation from @andrewjbtw about whether to move forward with this now, or put it in the backlog.
It may be due to checksumming: the accessionWF has a step, sdr-ingest-transfer, that computes checksums; Andrew has observed this step throwing errors when computing checksums.

The action(s) for this ticket:

Find where the checksums are being computed. I think this is the code path, and Andrew and I surmise that the Moab::Bagger is ultimately triggering checksum creation:

But here my spelunking stopped.
Determine if the checksum computation(s) can be optimized in some way. For example, "Change compute checksums to digest in single pass" (common-accessioning#1100) combines the file reads for checksum computations. Another possible improvement has to do with the block size used for the reads; I'm not sure whether the optimal block size depends on hardware, and I have no idea what our hardware is (see the sketch after this list).
Note that if you find a change that should be made in MoabVersioning, this ticket might also be relevant to the work: sul-dlss/moab-versioning/issues/144
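A quick benchmark sketch for the block-size question above (the sizes and the file argument are arbitrary, and the OS page cache will flatter repeated runs over the same file):

require 'benchmark'
require 'digest'

# Time a single-digest read of one file at several read sizes. Drop the page
# cache between runs (or use different files) for honest numbers.
path = ARGV.fetch(0)
[8 * 1024, 64 * 1024, 1024 * 1024, 4 * 1024 * 1024].each do |block_size|
  seconds = Benchmark.realtime do
    digest = Digest::SHA256.new
    File.open(path, 'rb') do |stream|
      while (buffer = stream.read(block_size))
        digest.update(buffer)
      end
    end
  end
  puts format('%9d-byte reads: %.1f s', block_size, seconds)
end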
From a wiki page that has been removed, but was locally pulled: