DEV-1373 make catalog indexing date independent (#52)
- Add `CICTL::Journal` class for writing dated files to a predefined location to record Zephir files indexed.
- Add `journal_directory` to `Services` with `ENV`-overridable default location for journal files.
- Add `cictl continue` command that calls `cictl all` or `cictl since` depending on presence or absence of relevant journals.
- TIDY: remove deprecated docker-compose.yml version.
- Address a number of nokogiri/rexml vulnerabilities identified by Dependabot.
- Address #50: availability maps should account for `icus`.
- Remove `standardrb` exception for `lib/translation_maps` (mostly) by changing single to double quotes.
- Remove `ht_namespace_map` and unused reference to it.
- Remove unused umich translation maps.
- Run many cictl tests in temp directory with `around` block.
- Add `CICTL::Examples` helpers to tidy up test setup.
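
The journal-driven choice between `cictl all` and `cictl since` described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `journal_name` and `next_action` are hypothetical helpers, the dates are made up, and a temporary directory stands in for `JOURNAL_DIRECTORY`.

```ruby
require "date"
require "fileutils"
require "tmpdir"

# Hypothetical helper: journal file name for a date, following the naming
# scheme used by this commit (full vs. upd, YYYYMMDD).
def journal_name(date, full: false)
  type = full ? "full" : "upd"
  "hathitrust_catalog_indexer_journal_#{type}_#{date.strftime("%Y%m%d")}.txt"
end

# Hypothetical helper: decide what a `continue`-style command should do.
# Returns [:all], [:since, date], or [:nothing].
def next_action(journal_dir, last_full_date, yesterday)
  full_journal = File.join(journal_dir, journal_name(last_full_date, full: true))
  # No journal for the latest full file: start over with a full index.
  return [:all] unless File.exist?(full_journal)

  # Otherwise find the first day without an update journal.
  (last_full_date..yesterday).each do |date|
    upd_journal = File.join(journal_dir, journal_name(date))
    return [:since, date] unless File.exist?(upd_journal)
  end
  [:nothing]
end

# Walk through three states of an (assumed) journal directory.
RESULTS = Dir.mktmpdir do |dir|
  last_full = Date.new(2024, 9, 30)
  yesterday = Date.new(2024, 10, 3)
  results = [next_action(dir, last_full, yesterday)]

  FileUtils.touch File.join(dir, journal_name(last_full, full: true))
  [last_full, last_full + 1].each { |d| FileUtils.touch File.join(dir, journal_name(d)) }
  results << next_action(dir, last_full, yesterday)

  [last_full + 2, last_full + 3].each { |d| FileUtils.touch File.join(dir, journal_name(d)) }
  results << next_action(dir, last_full, yesterday)
  results
end
```

Because the only state is a set of empty files, the decision can be re-run at any time and recovers from missed days without knowing the current date in advance.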
moseshll authored Oct 29, 2024
1 parent 2ade7c3 commit 7558bc3
Showing 23 changed files with 783 additions and 469 deletions.
1 change: 0 additions & 1 deletion .standard.yml
@@ -5,7 +5,6 @@ ignore:
- 'lib/ht_traject/**/*'
- 'lib/ht_traject.rb'
- 'lib/traject/**/*'
- 'lib/translation_maps/**/*'
- 'lib/umich_traject/**/*'
- 'lib/umich_traject.rb'
- 'readers/**/*'
1 change: 1 addition & 0 deletions Gemfile
@@ -2,6 +2,7 @@ source "https://rubygems.org"

group :development, :test do
gem "bundler", "~>2.0"
gem "climate_control"
gem "rake", "~> 13.0"
gem "standard"
gem "rspec"
10 changes: 6 additions & 4 deletions Gemfile.lock
@@ -6,6 +6,7 @@ GEM
ast (2.4.2)
builder (3.2.4)
canister (0.9.2)
climate_control (1.2.0)
coderay (1.1.3)
concurrent-ruby (1.2.2)
date_named_file (0.1.1)
@@ -61,11 +62,11 @@ GEM
match_map (3.0.0)
method_source (1.0.0)
naconormalizer (1.0.1-java)
nokogiri (1.16.2-arm64-darwin)
nokogiri (1.16.7-arm64-darwin)
racc (~> 1.4)
nokogiri (1.16.2-java)
nokogiri (1.16.7-java)
racc (~> 1.4)
nokogiri (1.16.2-x86_64-linux)
nokogiri (1.16.7-x86_64-linux)
racc (~> 1.4)
parallel (1.23.0)
parser (3.2.2.1)
@@ -83,7 +84,7 @@ GEM
rainbow (3.1.1)
rake (13.0.6)
regexp_parser (2.8.0)
rexml (3.2.5)
rexml (3.3.8)
rsolr (2.5.0)
builder (>= 2.1.2)
faraday (>= 0.9, < 3, != 2.0.0)
@@ -178,6 +179,7 @@ PLATFORMS
DEPENDENCIES
bundler (~> 2.0)
canister (~> 0.9.2)
climate_control
date_named_file
dotenv
http (~> 5.0)
14 changes: 13 additions & 1 deletion README.md
@@ -117,7 +117,16 @@ network (i.e. the one started with `docker-compose up` from this repository).
Solr should be reachable via the `solr-sdr-catalog` hostname.

## How to do the basics
### Date-Independent Indexing

For use in production environments where daily and monthly indexing are ongoing activities,
we enable the indexer to maintain state by writing "journal" files: empty datestamped
files in a known location (`JOURNAL_DIRECTORY`). The command `cictl index continue` does whatever
full or daily indexing is appropriate given the state of the journals.

Note that all of the `cictl index *` commands write journal files except
`cictl index file`, which takes only an `upd` MARC file rather than a MARC-deletes pair and is not
expected to be used in an environment where date independence is in force.

### Putting a new solr configuration into place

@@ -133,7 +142,7 @@ Solr should be reachable via the `solr-sdr-catalog` hostname.
* (Optional) If your new solr config requires a full reindex, go ahead and
get rid of the data with `rm -rf data`
* Fire solr back up: `systemctl start solr-current-catalog`
* Give it a minute and then go to http://beeftea-2.umdl.umich.edu:9033/solr` to make sure the core came back up.
* Give it a minute and then go to `http://beeftea-2.umdl.umich.edu:9033/solr` to make sure the core came back up.
* Do whatever indexing needs doing.

### Indexing
@@ -193,6 +202,7 @@ The `index` command has a number of possibilities:
> bundle exec bin/cictl help index
Commands:
cictl index all # Empty the catalog and index the most recent m...
cictl index continue # Index all files not represented in the indexe...
cictl index date YYYYMMDD # Run the catchup (delete and index) for a part...
cictl index file FILE # Index a single file
cictl index help [COMMAND] # Describe subcommands or one specific subcommand
@@ -283,6 +293,8 @@ and `config/env`. The defaults in the repository suffice for testing under Docke
## Environment variables

* `DDIR` data directory, defaults to `/htsolr/catalog/prep`
* `JOURNAL_DIRECTORY` location of journal files (see Date-Independent Indexing above) defaulting
to `journal/` inside the repo directory.
* `LOG_DIR` where to store logs, defaults to `/htsolr/catalog/prep`.
* `MYSQL_HOST`, `MYSQL_DATABASE`, `MYSQL_USER`, `MYSQL_PASSWORD` *required* unless run with `NO_DB`.
* `NO_DB` if you want to skip all the database stuff. Useful for testing. Implied by `NO_EXTERNAL_DATA`.
1 change: 0 additions & 1 deletion docker-compose.yml
@@ -1,5 +1,4 @@
---
version: '3'

services:
traject:
37 changes: 36 additions & 1 deletion lib/cictl/index_command.rb
@@ -3,12 +3,37 @@
require_relative "base_command"
require_relative "zephir_file"
require_relative "deleted_records"
require_relative "journal"

module CICTL
class IndexCommand < BaseCommand
class_option :reader, type: :string, desc: "Reader name/path"
class_option :writer, type: :string, desc: "Writer name/path"

desc "continue", "Index all files not represented in the indexer journals"
def continue
last_full = ZephirFile.full_files.last
fatal "unable to find full Zephir file" unless last_full
# Index the most recent full file and subsequent ones if the
# full file journal is missing.
full_journal = Journal.new(date: last_full.to_datetime.to_date, full: true)
if full_journal.missing?
logger.info "missing full journal #{full_journal}, calling `cictl all`"
call_all_command
# Otherwise, iterate from the last full file date to yesterday.
# If there is a missing journal, start indexing from that point.
else
(last_full.to_datetime.to_date..(Date.today - 1)).each do |date|
journal = Journal.new(date: date, full: false)
if journal.missing?
logger.info "missing update journal #{journal}, calling `cictl since #{journal.date}`"
call_since_command(journal.date)
break
end
end
end
end

desc "all", "Empty the catalog and index the most recent monthly followed by subsequent daily updates"
option :wait, type: :boolean, desc: "Wait 5 seconds for Control-C", default: true
def all
@@ -39,6 +64,8 @@ def all
solr_client.commit!
end

# Note: this command does not write a journal since it only processes the MARC file
# but not the deletes.
option :commit, type: :boolean, desc: "Commit changes to Solr", default: true
desc "file FILE", "Index a single MARC file"
def file(marcfile)
@@ -52,7 +79,11 @@ def date(date)
preflight
with_date(date) do |date|
index_deletes_for_date date
index_records_for_date date
if index_records_for_date date
journal = Journal.new(date: date)
logger.info("write journal file #{journal.path}")
journal.write!
end
end
end

@@ -84,6 +115,7 @@ def today
end

no_commands do
alias_method :call_all_command, :all
alias_method :call_date_command, :date
alias_method :call_file_command, :file
alias_method :call_since_command, :since
@@ -118,14 +150,17 @@ def marc_file_for_date(date)
ZephirFile.update_files.at(date)
end

# @return [Boolean] true if the marcfile for the given date exists
def index_records_for_date(date)
marcfile = marc_file_for_date date
if File.exist? marcfile
Indexer.new(reader: options[:reader], writer: options[:writer]).run marcfile
solr_client.commit!
logger.debug "index date(#{date}): Solr count now #{solr_client.count}"
true
else
logger.warn "could not find marcfile '#{marcfile}'"
false
end
end

66 changes: 66 additions & 0 deletions lib/cictl/journal.rb
@@ -0,0 +1,66 @@
# frozen_string_literal: true

require "date"
require "fileutils"
require_relative "../services"

module CICTL
# A class that enables date-independent catalog indexing using the filesystem.
#
# Each time a full or update file is indexed, writes an (empty) file of the form
# hathitrust_catalog_indexer_journal_upd_YYYYMMDD.txt or
# hathitrust_catalog_indexer_journal_full_YYYYMMDD.txt in the journal directory.
#
# When we use the index command `cictl continue`
# we calculate the earliest zephir file not yet indexed and proceed in order from
# that point.
#
# Nomenclature note: "journal" is the closest semantic match to "log" I could find.
# This is a log, of sorts, but the term was already taken.
class Journal
attr_reader :date

FILENAME_PATTERN = /hathitrust_catalog_indexer_journal_(full|upd)_(\d{8})\.txt/

def self.filename_for(date:, full:)
yyyymmdd = date.strftime "%Y%m%d"
type = full ? "full" : "upd"
"hathitrust_catalog_indexer_journal_#{type}_#{yyyymmdd}.txt"
end

def initialize(date: Date.today - 1, full: false)
@date = date
@full = full
end

# Use the built-in but append the date and full/upd because that's what we care about.
def to_s
super.tap do |s|
s.gsub!(/>$/, " [#{date} #{full? ? "full" : "upd"}]>")
end
end

def full?
@full
end

# Of the form `hathitrust_catalog_indexer_journal_(full|upd)_YYYYMMDD.txt`
def file
self.class.filename_for(date: date, full: full?)
end

def path
File.join(HathiTrust::Services[:journal_directory], file)
end

def exist?
File.exist? path
end

def missing?
!exist?
end

def write!
FileUtils.touch path
end
end
end
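
As a rough illustration of how the journal naming scheme can be read back, the following sketch parses a journal file name into its type and date. The regular expression is copied from `FILENAME_PATTERN` above; `parse_journal_filename` is a hypothetical helper that is not part of this commit.

```ruby
require "date"

# Same pattern as CICTL::Journal::FILENAME_PATTERN in the diff above.
FILENAME_PATTERN = /hathitrust_catalog_indexer_journal_(full|upd)_(\d{8})\.txt/

# Hypothetical helper: recover the journal type and date from a file name.
# Returns nil when the name does not look like a journal file.
def parse_journal_filename(name)
  match = FILENAME_PATTERN.match(name)
  return nil unless match

  {full: match[1] == "full", date: Date.strptime(match[2], "%Y%m%d")}
end

PARSED = parse_journal_filename("hathitrust_catalog_indexer_journal_upd_20241029.txt")
```

Note that the pattern is unanchored, so it would also match a journal name embedded in a longer path; a stricter variant could wrap it in `\A` and `\z`.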
5 changes: 0 additions & 5 deletions lib/ht_traject/ht_item.rb
@@ -21,7 +21,6 @@ class << self
attr_accessor :ht_ns, :ht_avail_us, :ht_avail_intl
end

self.ht_ns = ::Traject::TranslationMap.new('ht/ht_namespace_map')
self.ht_avail_us = ::Traject::TranslationMap.new('ht/availability_map_ht')
self.ht_avail_intl = ::Traject::TranslationMap.new('ht/availability_map_ht_intl')

@@ -256,10 +255,6 @@ def enum_pubdate=(e)
end
end

def source
ItemSet.ht_ns[namespace]
end

def us_availability
ItemSet.ht_avail_us[rights].first
end
8 changes: 8 additions & 0 deletions lib/services.rb
@@ -63,6 +63,14 @@ def env_local_file
ENV["LOG_DIR"] || default
end

Services.register(:journal_directory) do
(ENV["JOURNAL_DIRECTORY"] || File.join(HOME, "journal")).tap do |dir|
if !File.exist?(dir)
FileUtils.mkdir dir
end
end
end

Services.register(:redirect_file) do
# Start migrating from redirect_file to REDIRECT_FILE on principle of least surprise
ENV["redirect_file"] || ENV["REDIRECT_FILE"] || Redirects.default_redirects_file
33 changes: 21 additions & 12 deletions lib/translation_maps/ht/availability_map_ht.rb
@@ -1,18 +1,27 @@
require 'ht_traject/ht_constants'
require 'match_map'
require "ht_traject/ht_constants"
require "match_map"

mm = MatchMap.new

mm[/^umall$/] = HathiTrust::Constants::FT
mm[/world$/] = HathiTrust::Constants::FT # matches world, ic-world, und-world
mm[/^cc.*/] = HathiTrust::Constants::FT
mm[/^pd(?:us)?$/] = HathiTrust::Constants::FT # pd or pdus
# Note: orph, orphcand, and umall are unattested in rights_current as of Oct 2024

mm[/^ic$/] = HathiTrust::Constants::SO
mm[/^orph$/] = HathiTrust::Constants::SO
mm[/^nobody$/] = HathiTrust::Constants::SO
mm[/^und$/] = HathiTrust::Constants::SO
mm[/^pd-p/] = HathiTrust::Constants::SO # pd-pvt or pd-private
mm[/^opb?$/] = HathiTrust::Constants::SO
# Full Text
mm["pd"] = HathiTrust::Constants::FT # [1]
mm["ic-world"] = HathiTrust::Constants::FT # [7]
mm["pdus"] = HathiTrust::Constants::FT # [9]
mm[/^cc-/] = HathiTrust::Constants::FT # [10-15, 17, 20-25]
mm["und-world"] = HathiTrust::Constants::FT # [18]

# Search Only
mm["ic"] = HathiTrust::Constants::SO # [2]
mm["op"] = HathiTrust::Constants::SO # [3]
mm["orph"] = HathiTrust::Constants::SO # [4]
mm["und"] = HathiTrust::Constants::SO # [5]
mm["umall"] = HathiTrust::Constants::SO # [6]
mm["nobody"] = HathiTrust::Constants::SO # [8]
mm["orphcand"] = HathiTrust::Constants::SO # [16]
mm["icus"] = HathiTrust::Constants::SO # [19]
mm["pd-pvt"] = HathiTrust::Constants::SO # [26]
mm["supp"] = HathiTrust::Constants::SO # [27]

mm
31 changes: 20 additions & 11 deletions lib/translation_maps/ht/availability_map_ht_intl.rb
@@ -1,17 +1,26 @@
require 'ht_traject/ht_constants'
require "ht_traject/ht_constants"

mm = MatchMap.new

mm['umall'] = HathiTrust::Constants::FT
mm['world'] = HathiTrust::Constants::FT # matches world, ic-world, und-world
mm[/^cc.*/] = HathiTrust::Constants::FT
mm['pd'] = HathiTrust::Constants::FT
# Note: orph, orphcand, and umall are unattested in rights_current as of Oct 2024

mm['pdus'] = HathiTrust::Constants::SO
mm['ic'] = HathiTrust::Constants::SO
mm[/^opb?$/] = HathiTrust::Constants::SO
mm['orph'] = HathiTrust::Constants::SO
mm['nobody'] = HathiTrust::Constants::SO
mm['und'] = HathiTrust::Constants::SO
# Full Text
mm["pd"] = HathiTrust::Constants::FT # [1]
mm["ic-world"] = HathiTrust::Constants::FT # [7]
mm[/^cc-/] = HathiTrust::Constants::FT # [10-15, 17, 20-25]
mm["und-world"] = HathiTrust::Constants::FT # [18]
mm["icus"] = HathiTrust::Constants::FT # [19]

# Search Only
mm["ic"] = HathiTrust::Constants::SO # [2]
mm["op"] = HathiTrust::Constants::SO # [3]
mm["orph"] = HathiTrust::Constants::SO # [4]
mm["und"] = HathiTrust::Constants::SO # [5]
mm["umall"] = HathiTrust::Constants::SO # [6]
mm["nobody"] = HathiTrust::Constants::SO # [8]
mm["pdus"] = HathiTrust::Constants::SO # [9]
mm["orphcand"] = HathiTrust::Constants::SO # [16]
mm["pd-pvt"] = HathiTrust::Constants::SO # [26]
mm["supp"] = HathiTrust::Constants::SO # [27]

mm
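
The practical difference between the two availability maps is that `pdus` and `icus` swap sides depending on the viewer's location. The sketch below shows that with plain hashes rather than the `match_map` gem. `FT` and `SO` are placeholder symbols (the real values live in `HathiTrust::Constants` and are not shown in this diff), only a subset of the rights codes is included, and `availability` is a hypothetical lookup helper that returns a single value, whereas the real maps are read with `.first` in `ht_item.rb`.

```ruby
# Placeholder values; the real constants are defined elsewhere in the repo.
FT = :full_text
SO = :search_only

# Subset of the US map: pdus is full text, icus is search only.
US_RULES = {"pd" => FT, "pdus" => FT, /^cc-/ => FT, "ic" => SO, "icus" => SO}

# Subset of the international map: pdus is search only, icus is full text.
INTL_RULES = {"pd" => FT, "icus" => FT, /^cc-/ => FT, "ic" => SO, "pdus" => SO}

# Hypothetical lookup: String keys match exactly, Regexp keys match by pattern.
def availability(rules, rights)
  rules.each do |pattern, value|
    return value if pattern === rights
  end
  nil
end

US_PDUS = availability(US_RULES, "pdus")
INTL_PDUS = availability(INTL_RULES, "pdus")
US_ICUS = availability(US_RULES, "icus")
INTL_ICUS = availability(INTL_RULES, "icus")
```

Using exact strings for most codes, as the new maps do, avoids accidental matches such as a suffix pattern like `/world$/` catching a code it was never meant to cover.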