Dev 1335 date independence (#18)
* DEV-1335 Make hathifiles_database Date Independent
- Add `Log` class for recording file + timestamp in `hf_log` table.
- Add `Hathifiles` class for producing agenda of files to load.
- Add `exe/hathifiles_database_full_update` script for bringing database up to date.
- Update README with `exe/` inventory and notes.
- The existing code was producing too many false changes on ISSNs in the monthly delta
  - Loosened restriction on input file format to accommodate more database-like values (allow 0/1 for `access`)
  - Add more tests for round-trip data fidelity -- one should be able to load any hathifile, and the delta with itself should be empty.
- Address Dependabot #10 REXML denial of service vulnerability
- TIDY
  - Remove dead code after __END__ blocks
  - Address issue #11 Remove wait-for and use healthchecks
- Address #8 add prometheus / pushgateway
  - Batch up the calls to milemarker instead of calling for each INSERT
- Monthly update bucket chain must `sort` after `cut` to keep `comm` happy.
moseshll authored Sep 6, 2024
1 parent 87f3076 commit 4ef68ac
Showing 27 changed files with 559 additions and 186 deletions.
4 changes: 0 additions & 4 deletions Dockerfile
@@ -1,12 +1,8 @@
FROM ruby:3.2

# bin/wait-for depends on netcat
RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends \
netcat-traditional \
mariadb-client

WORKDIR /usr/src/app
ENV BUNDLE_PATH /gems
RUN gem install bundler

RUN wget -O /usr/local/bin/wait-for https://github.com/eficode/wait-for/releases/download/v2.2.3/wait-for; chmod +x /usr/local/bin/wait-for
2 changes: 2 additions & 0 deletions Gemfile
@@ -4,6 +4,8 @@ source "https://rubygems.org"
# Specify your gem's dependencies in hathifiles_database.gemspec
gemspec

gem "push_metrics", git: "https://github.com/hathitrust/push_metrics.git", tag: "v0.9.1"

group :development, :test do
gem "simplecov"
gem "simplecov-lcov"
17 changes: 15 additions & 2 deletions Gemfile.lock
@@ -1,7 +1,16 @@
GIT
remote: https://github.com/hathitrust/push_metrics.git
revision: 90ed5f4ee823c1316e1ce0cd4c0db465a180103a
tag: v0.9.1
specs:
push_metrics (0.9.1)
milemarker (~> 1.0)
prometheus-client (~> 4.0)

PATH
remote: .
specs:
hathifiles_database (0.2.2)
hathifiles_database (0.4.0)
date_named_file
dotenv
ettin
@@ -37,21 +46,24 @@ GEM
library_stdnums (1.6.0)
lint_roller (1.1.0)
method_source (1.0.0)
milemarker (1.0.0)
mysql2 (0.5.4)
parallel (1.23.0)
parser (3.2.2.4)
ast (~> 2.4.1)
racc
pastel (0.8.0)
tty-color (~> 0.5)
prometheus-client (4.2.3)
base64
pry (0.14.1)
coderay (~> 1.1)
method_source (~> 1.0)
racc (1.7.1)
rainbow (3.1.1)
rake (13.0.6)
regexp_parser (2.8.2)
rexml (3.3.3)
rexml (3.3.6)
strscan
rspec (3.11.0)
rspec-core (~> 3.11.0)
@@ -129,6 +141,7 @@ DEPENDENCIES
bundler (~> 2.0)
hathifiles_database!
pry
push_metrics!
rake (~> 13.0)
rspec (~> 3.0)
simplecov
67 changes: 62 additions & 5 deletions README.md
@@ -1,7 +1,7 @@
# HathifilesDatabase

Code to take data from the [hathifiles]() and keep an up-to-date
set of tables in mysql for querying by HT staff.
Code to take data from the [hathifiles](https://github.com/hathitrust/hathifiles)
and keep an up-to-date set of tables in mysql for querying by HT staff.

## Developer Setup
```
@@ -15,24 +15,81 @@ docker compose run --rm test bundle exec standardrb

## Structure

There are five tables: one with all the data in the hathifiles
There are six tables: one with all the data in the hathifiles
(some of it normalized) in the same order we have there, and
four where we break out and index normalized versions of the
standard identifiers.

`hf_log` records each hathifile as it is successfully loaded (along with a timestamp)
so that updates can occur in batches, as needed, for date independence; a usage sketch
follows the table list.

* hf
* hf_isbn
* hf_issn
* hf_lccn
* hf_oclc
* hf_log
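
A minimal sketch (using the Sequel gem for illustration) of the bookkeeping this
enables; the `hathifile` and `time` column names are assumptions, not taken from
the actual schema:

```
require "sequel"

db = Sequel.connect(ENV["HATHIFILES_MYSQL_CONNECTION"])

# Record a successfully loaded hathifile along with a timestamp.
db[:hf_log].insert(hathifile: "hathi_upd_20240901.txt.gz", time: Time.now)

# Anything in the hathifiles directory not in this set is part of the
# agenda for the next run -- no date arithmetic required.
loaded = db[:hf_log].select_map(:hathifile)
```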

ISBNs, ISSNs, and OCLC numbers are indexed after normalization (just
runs of digits and potentially an upper-case 'X'). In addition, ISBNs
are indexed in both their 10- and 13-character forms.

LCCNs are more of a mess. They're stored twice, too -- once normalized
and once with whatever string was in the MARC record.
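
The normalization comes from the `library_stdnums` gem. A sketch of the kinds of
calls involved (illustrative; not necessarily the loader's exact code):

```
require "library_stdnums"

# ISBNs are indexed in both their 10- and 13-character forms.
StdNum::ISBN.allNormalizedValues("0-13-115164-x") # both forms, for a valid ISBN

# ISSNs normalize to a run of digits (possibly ending in an upper-case 'X').
StdNum::ISSN.normalize("0378-5955")

# LCCNs get a normalized form; the raw MARC string is stored as well.
StdNum::LCCN.normalize("n 78-890351")
```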


## Binaries
```
bin
├── console
└── setup
```
These are intended to be run under Docker for development purposes.

- `console` invokes `irb` with `hathifiles_database` pre-`require`d
- `setup` is just a shell wrapper around `bundle install` (see Developer Setup)

```
exe
├── catchup
├── daily_run
├── hathifiles_database_clear_everything_out
├── hathifiles_database_convert
├── hathifiles_database_full
├── hathifiles_database_full_update
├── hathifiles_database_update
└── swap_production_and_reindex
```
These are exported by the `gemspec` as the gem's executables.
- `catchup` _deprecated_ loads multiple `upd` files
- `daily_run` _deprecated_ (contains hardcoded paths) loads today's `upd` file
- `hathifiles_database_clear_everything_out` interactive script to reinitialize the database
- `hathifiles_database_convert` _deprecated_ interactive script to dump `hathifiles` database to tab-delimited files
- `hathifiles_database_full` _deprecated_ load a single `full` hathifile
- `hathifiles_database_full_update` the preferred date-independent method for loading `full` and `upd` hathifiles
- `hathifiles_database_update` _deprecated_ load a single `upd` hathifile
- `swap_production_and_reindex` _deprecated_ swaps tables between `hathifiles` and `hathifiles_reindex` databases

`swap_production_and_reindex` used to be part of the workflow for clearing and rebuilding the
production database from an auxiliary database. With Argo Workflows we should no longer need to
do this, as `hathifiles_database_full_update` should touch only the changed/deleted rows
in the `full` monthly hathifile.

## Pitfalls

The `hf` database does not record exactly the same data as the hathifiles.
In particular, standard numbers like ISBN and ISSN are normalized.
Furthermore, `access` is a Boolean in the database, but in the hathifiles it appears as
`allow` or `deny`. Because of the `library_stdnums` normalization, it is not possible
to do a round-trip conversion from database to hathifile, only the reverse.
As a result, the monthly update (which computes a diff before making changes to
the database) converts the `hathi_full_*` file into an intermediate "DB-ized" dialect
for comparison.
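
A simplified illustration of that one-way mapping (a sketch of the idea, not the
gem's actual conversion code):

```
# Hathifile -> database is well-defined: `access` maps cleanly.
ACCESS = {"allow" => 1, "deny" => 0}.freeze
ACCESS.fetch("allow") # => 1

# Database -> hathifile is not: many raw strings normalize to the same
# stored value, so the original string is unrecoverable.
require "library_stdnums"
StdNum::ISSN.normalize("0378-5955")      # same stored value as...
StdNum::ISSN.normalize("ISSN 0378-5955") # ...this
```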

The `push_metrics` gem, which is required for running `exe/hathifiles_database_full_update`,
is not part of the gemspec because it is currently unpublished. Code which uses `hathifiles_database`
as a gem should also declare a `push_metrics` dependency or use its own implementation
of `hathifiles_database_full_update`.
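
For example, a downstream Gemfile can pull the unpublished gem straight from its
repository, as this project's own Gemfile does:

```
gem "push_metrics", git: "https://github.com/hathitrust/push_metrics.git", tag: "v0.9.1"
```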

## Some query examples

Get info for records with the given issn
2 changes: 0 additions & 2 deletions bin/setup
@@ -4,5 +4,3 @@ IFS=$'\n\t'
set -vx

bundle install

# Do any other automated setup that you need to do here
26 changes: 24 additions & 2 deletions docker-compose.yml
@@ -1,3 +1,14 @@
---

x-condition-healthy: &healthy
condition: service_healthy

x-healthcheck-defaults: &healthcheck-defaults
interval: 5s
timeout: 10s
start_period: 10s
retries: 5

services:
test:
build: .
@@ -11,12 +22,13 @@ services:
# TODO: construct this based on the above variables
HATHIFILES_MYSQL_CONNECTION: "mysql2://ht_rights:ht_rights@mariadb/ht"
HATHIFILES_DIR: "/usr/src/app/spec/data"
PUSHGATEWAY: http://pushgateway:9091
volumes:
- .:/usr/src/app
- gem_cache:/gems
command: bash -c "/usr/local/bin/wait-for mariadb:3306 && bundle exec rspec"
command: bundle exec rspec
depends_on:
- mariadb
mariadb: *healthy

mariadb:
image: mariadb:latest
@@ -28,7 +40,17 @@
MYSQL_DATABASE: ht
MYSQL_USER: ht_rights
MYSQL_PASSWORD: ht_rights
healthcheck:
<<: *healthcheck-defaults
test: [ "CMD", "healthcheck.sh", "--su-mysql", "--connect", "--innodb_initialized" ]

pushgateway:
image: prom/pushgateway
ports:
- 9092:9091
healthcheck:
<<: *healthcheck-defaults
test: [ "CMD", "wget", "--quiet", "--tries=1", "-O", "/dev/null", "pushgateway:9091/-/healthy" ]

volumes:
gem_cache:
37 changes: 0 additions & 37 deletions exe/catchup
@@ -56,40 +56,3 @@ files.each do |f|
connection.logger.info "Starting work on #{f}"
connection.update_from_file f
end


__END__

require 'pathname'
require 'tty-prompt'


filename = ARGV[0]
tempdir = Pathname.new('.').realdirpath + 'tmp'

prompt = TTY::Prompt.new


p

#!/bin/bash

PATH=/l/local/rbenv/shims:/l/local/rbenv/bin:$PATH
HF_FILES=/htapps/archive/hathifiles
LOGFILE=../logs/hathifiles_database/20210304catchup

# DEV=""
DEV="dev"



#bundle exec ruby exe/hathifiles_database_full $HF_FILES/hathi_full_20200801.txt.gz > $LOGFILE 2>&1
#bundle exec ruby exe/hathifiles_database_update $HF_FILES/hathi_upd_20200731.txt.gz >> $LOGFILE 2>&1;

for i in 01 02 03 04; do
SOURCEFILE=$HF_FILES/hathi_upd_202103${i}.txt.gz;
bundle exec ruby exe/hathifiles_database_update $SOURCEFILE $DEV >> $LOGFILE 2>&1;
done



63 changes: 63 additions & 0 deletions exe/hathifiles_database_full_update
@@ -0,0 +1,63 @@
#!/usr/bin/env ruby

# This is the preferred, date-independent way of bringing the database
# completely up to date with the hathifiles inventory using the hathifiles.hf_log
# database table.

$LOAD_PATH.unshift "../lib"

require "dotenv"
require "pathname"
require "push_metrics"
require "tmpdir"

require "hathifiles_database"

envfile = Pathname.new(__dir__).parent + ".env"
Dotenv.load(envfile)

connection = HathifilesDatabase.new(ENV["HATHIFILES_MYSQL_CONNECTION"])
hathifiles = HathifilesDatabase::Hathifiles.new(
hathifiles_directory: ENV["HATHIFILES_DIR"],
connection: connection
)

tracker = PushMetrics.new(
# batch_size could be put in ENV but care would have to be taken with the integer conversion.
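# For example (the environment variable name here is hypothetical):
#   batch_size: Integer(ENV.fetch("HATHIFILES_DATABASE_BATCH_SIZE", "10000")),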
batch_size: 10_000,
job_name: ENV.fetch("HATHIFILES_DATABASE_JOB_NAME", "hathifiles_database"),
logger: connection.logger
)

Dir.mktmpdir do |tempdir|
# `missing_full_hathifiles` returns an Array with zero or one element
# since only the most recent monthly file (if any) is of interest.
#
# We always process the full file first, then any updates.
# Whether or not this is strictly necessary (the update released
# on the same day as the full file may be superfluous), this is how
# `hathitrust_catalog_indexer` does it.
connection.logger.info "full hathifiles: #{hathifiles.missing_full_hathifiles}"
if hathifiles.missing_full_hathifiles.any?
hathifile = File.join(ENV["HATHIFILES_DIR"], hathifiles.missing_full_hathifiles.first)
connection.logger.info "processing monthly #{hathifile}"
HathifilesDatabase::MonthlyUpdate.new(
connection: connection,
hathifile: hathifile,
output_directory: tempdir
).run do |records_inserted|
tracker.increment records_inserted
tracker.on_batch { |_t| connection.logger.info tracker.batch_line }
end
end
connection.logger.info "updates: #{hathifiles.missing_update_hathifiles}"
hathifiles.missing_update_hathifiles.each do |hathifile|
hathifile = File.join(ENV["HATHIFILES_DIR"], hathifile)
connection.logger.info "processing update #{hathifile}"
connection.update_from_file(hathifile) do |records_inserted|
tracker.increment records_inserted
tracker.on_batch { |_t| connection.logger.info tracker.batch_line }
end
end
end
tracker.log_final_line
36 changes: 0 additions & 36 deletions exe/swap_production_and_reindex
@@ -41,39 +41,3 @@ renames = tables.flat_map { |t| [[prod(t), tmp(t)], [ri(t), prod(t)], [tmp(t), r
sql = "RENAME TABLE " + renames.map { |x| x.join(" TO ") }.join(", ")

production.run(sql)

__END__

require 'pathname'
require 'tty-prompt'


filename = ARGV[0]
tempdir = Pathname.new('.').realdirpath + 'tmp'

prompt = TTY::Prompt.new


p

#!/bin/bash

PATH=/l/local/rbenv/shims:/l/local/rbenv/bin:$PATH
HF_FILES=/htapps/archive/hathifiles
LOGFILE=../logs/hathifiles_database/20210304catchup

# DEV=""
DEV="dev"



#bundle exec ruby exe/hathifiles_database_full $HF_FILES/hathi_full_20200801.txt.gz > $LOGFILE 2>&1
#bundle exec ruby exe/hathifiles_database_update $HF_FILES/hathi_upd_20200731.txt.gz >> $LOGFILE 2>&1;

for i in 01 02 03 04; do
SOURCEFILE=$HF_FILES/hathi_upd_202103${i}.txt.gz;
bundle exec ruby exe/hathifiles_database_update $SOURCEFILE $DEV >> $LOGFILE 2>&1;
done



3 changes: 3 additions & 0 deletions lib/hathifiles_database.rb
@@ -3,6 +3,9 @@
require "hathifiles_database/version"
require "hathifiles_database/datafile"
require "hathifiles_database/db/connection"
require "hathifiles_database/hathifiles"
require "hathifiles_database/log"
require "hathifiles_database/monthly_update"

module HathifilesDatabase
def self.new(connection_string)
3 changes: 3 additions & 0 deletions lib/hathifiles_database/constants.rb
@@ -33,5 +33,8 @@ module Constants
content_provider_code

]

LOG_TABLE = :hf_log
ALL_TABLES = [MAINTABLE] + FOREIGN_TABLES.values + [LOG_TABLE]
end
end