Commit
* DEV-1335 Make hathifiles_database Date Independent
- Add `Log` class for recording file + timestamp in `hf_log` table.
- Add `Hathifiles` class for producing agenda of files to load.
- Add `exe/hathifiles_database_full_update` script for bringing database up to date.
- Update README with `exe/` inventory and notes.
- The existing code was getting too many false changes on ISSNs in monthly delta
- Loosened restriction on input file format to accommodate more database-like values (allow 0/1 for `access`)
- Add more tests for round-trip data fidelity -- one should be able to load any hathifile, and the delta with itself should be empty.
- Address Dependabot #10 REXML denial of service vulnerability
- TIDY
- Remove dead code after __END__ blocks
- Address issue #11 Remove wait-for and use healthchecks
- Address #8 add prometheus / pushgateway
- Batch up the calls to milemarker instead of calling for each INSERT
- Monthly update bucket chain must `sort` after `cut` to keep `comm` happy.
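The last item in the commit message, and the round-trip note above it, can be illustrated with a small sketch. This is not the commit's code: the sample rows and the `unique_lines` helper are hypothetical, and the helper only imitates the in-step walk that `comm` performs over two sorted inputs. It shows that lines sorted as whole records can lose their order once a single field is cut out, and that an unsorted input makes such a walk report differences that are not real.

```ruby
# Hypothetical sample rows: an identifier, a tab, then a value.
rows = ["id1\tzebra", "id2\tapple", "id3\tmango"]

# The rows are sorted as whole lines...
sorted_rows = rows.sort

# ...but keeping only the second field (roughly what `cut -f2` does)
# does not preserve that order, so a second sort is needed.
cut_only = sorted_rows.map { |row| row.split("\t")[1] }
cut_then_sort = cut_only.sort

# A comm-like comparison (assumption: a simplified stand-in for `comm`).
# Walks two lists in step and returns [only_in_left, only_in_right].
# Like `comm`, it is only correct when both lists are sorted.
def unique_lines(left, right)
  only_left = []
  only_right = []
  i = 0
  j = 0
  while i < left.size && j < right.size
    case left[i] <=> right[j]
    when 0
      i += 1
      j += 1
    when -1
      only_left << left[i]
      i += 1
    else
      only_right << right[j]
      j += 1
    end
  end
  [only_left + left[i..-1], only_right + right[j..-1]]
end

# Sorted input compared with itself: the delta is empty.
self_delta = unique_lines(cut_then_sort, cut_then_sort)

# Same values, but the left side was cut without re-sorting:
# the walk reports differences that do not exist.
false_delta = unique_lines(cut_only, cut_then_sort)

puts self_delta.inspect
puts false_delta.inspect
```

Under these assumptions, the second comparison is the kind of spurious change that sorting after the cut avoids.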
Showing 27 changed files with 559 additions and 186 deletions.
@@ -1,12 +1,8 @@
FROM ruby:3.2

# bin/wait-for depends on netcat
RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends \
  netcat-traditional \
  mariadb-client

WORKDIR /usr/src/app
ENV BUNDLE_PATH /gems
RUN gem install bundler

RUN wget -O /usr/local/bin/wait-for https://github.com/eficode/wait-for/releases/download/v2.2.3/wait-for; chmod +x /usr/local/bin/wait-for
@@ -4,5 +4,3 @@ IFS=$'\n\t'
set -vx

bundle install

# Do any other automated setup that you need to do here
@@ -0,0 +1,63 @@
#!/usr/bin/env ruby

# This is the preferred, date-independent way of bringing the database
# completely up to date with the hathifiles inventory using the hathifiles.hf_log
# database table.

$LOAD_PATH.unshift "../lib"

require "dotenv"
require "pathname"
require "push_metrics"
require "tmpdir"

require "hathifiles_database"

envfile = Pathname.new(__dir__).parent + ".env"
Dotenv.load(envfile)

connection = HathifilesDatabase.new(ENV["HATHIFILES_MYSQL_CONNECTION"])
hathifiles = HathifilesDatabase::Hathifiles.new(
  hathifiles_directory: ENV["HATHIFILES_DIR"],
  connection: connection
)

tracker = PushMetrics.new(
  # batch_size could be put in ENV but care would have to be taken with the integer conversion.
  batch_size: 10_000,
  job_name: ENV.fetch("HATHIFILES_DATABASE_JOB_NAME", "hathifiles_database"),
  logger: connection.logger
)

Dir.mktmpdir do |tempdir|
  # `missing_full_hathifiles` returns an Array with zero or one element
  # since only the most recent monthly file (if any) is of interest.
  #
  # We always process the full file first, then any updates.
  # Whether or not this is strictly necessary (the update released
  # on the same day as the full file may be superfluous), this is how
  # `hathitrust_catalog_indexer` does it.
  connection.logger.info "full hathifiles: #{hathifiles.missing_full_hathifiles}"
  if hathifiles.missing_full_hathifiles.any?
    hathifile = File.join(ENV["HATHIFILES_DIR"], hathifiles.missing_full_hathifiles.first)
    connection.logger.info "processing monthly #{hathifile}"
    HathifilesDatabase::MonthlyUpdate.new(
      connection: connection,
      hathifile: hathifile,
      output_directory: tempdir
    ).run do |records_inserted|
      tracker.increment records_inserted
      tracker.on_batch { |_t| connection.logger.info tracker.batch_line }
    end
  end
  connection.logger.info "updates: #{hathifiles.missing_update_hathifiles}"
  hathifiles.missing_update_hathifiles.each do |hathifile|
    hathifile = File.join(ENV["HATHIFILES_DIR"], hathifile)
    connection.logger.info "processing update #{hathifile}"
    connection.update_from_file(hathifile) do |records_inserted|
      tracker.increment records_inserted
      tracker.on_batch { |_t| connection.logger.info tracker.batch_line }
    end
  end
end
tracker.log_final_line
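The script above relies on `missing_full_hathifiles` and `missing_update_hathifiles`, whose implementation is not shown in this part of the diff. As a rough sketch of the idea only, the agenda could be derived by comparing the file names available on disk against the names already recorded in the log. Everything here is an assumption for illustration: the helper names `missing_full` and `missing_updates`, the `hathi_full_YYYYMMDD.txt.gz` / `hathi_upd_YYYYMMDD.txt.gz` naming patterns, and the use of plain arrays in place of a directory listing and the `hf_log` table.

```ruby
# Hypothetical helpers, not the commit's Hathifiles class.
# Assumed naming patterns for full (monthly) and update (daily) files.
FULL_PATTERN = /\Ahathi_full_(\d{8})\.txt\.gz\z/
UPDATE_PATTERN = /\Ahathi_upd_(\d{8})\.txt\.gz\z/

# Returns an Array with zero or one element: the most recent full file,
# unless it has already been logged. Fixed-width dates make the string
# maximum the chronologically latest name.
def missing_full(available, logged)
  latest = available.grep(FULL_PATTERN).max
  return [] if latest.nil? || logged.include?(latest)
  [latest]
end

# Returns the update files that have not been logged, oldest first.
def missing_updates(available, logged)
  available.grep(UPDATE_PATTERN).reject { |name| logged.include?(name) }.sort
end

# Stand-ins for a directory listing and for the logged file names.
available = [
  "hathi_full_20240101.txt.gz",
  "hathi_full_20240201.txt.gz",
  "hathi_upd_20240201.txt.gz",
  "hathi_upd_20240202.txt.gz",
  "README"
]
logged = ["hathi_full_20240101.txt.gz", "hathi_upd_20240201.txt.gz"]

puts missing_full(available, logged).inspect
puts missing_updates(available, logged).inspect
```

Because the agenda depends only on what is present and what has been logged, running it on any day gives the same answer, which is the date independence the commit title refers to. The real class may apply further rules, for example about which updates are still relevant after a full load.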