DEV-1373 make catalog indexing date independent #52

Merged
merged 16 commits into from
Oct 29, 2024
1 change: 0 additions & 1 deletion .standard.yml
@@ -5,7 +5,6 @@ ignore:
- 'lib/ht_traject/**/*'
- 'lib/ht_traject.rb'
- 'lib/traject/**/*'
- 'lib/translation_maps/**/*'
- 'lib/umich_traject/**/*'
- 'lib/umich_traject.rb'
- 'readers/**/*'
1 change: 1 addition & 0 deletions Gemfile
@@ -2,6 +2,7 @@ source "https://rubygems.org"

group :development, :test do
gem "bundler", "~>2.0"
gem "climate_control"
gem "rake", "~> 13.0"
gem "standard"
gem "rspec"
10 changes: 6 additions & 4 deletions Gemfile.lock
@@ -6,6 +6,7 @@ GEM
ast (2.4.2)
builder (3.2.4)
canister (0.9.2)
climate_control (1.2.0)
coderay (1.1.3)
concurrent-ruby (1.2.2)
date_named_file (0.1.1)
@@ -61,11 +62,11 @@ GEM
match_map (3.0.0)
method_source (1.0.0)
naconormalizer (1.0.1-java)
nokogiri (1.16.2-arm64-darwin)
nokogiri (1.16.7-arm64-darwin)
racc (~> 1.4)
nokogiri (1.16.2-java)
nokogiri (1.16.7-java)
racc (~> 1.4)
nokogiri (1.16.2-x86_64-linux)
nokogiri (1.16.7-x86_64-linux)
racc (~> 1.4)
parallel (1.23.0)
parser (3.2.2.1)
@@ -83,7 +84,7 @@ GEM
rainbow (3.1.1)
rake (13.0.6)
regexp_parser (2.8.0)
rexml (3.2.5)
rexml (3.3.8)
rsolr (2.5.0)
builder (>= 2.1.2)
faraday (>= 0.9, < 3, != 2.0.0)
@@ -178,6 +179,7 @@ PLATFORMS
DEPENDENCIES
bundler (~> 2.0)
canister (~> 0.9.2)
climate_control
date_named_file
dotenv
http (~> 5.0)
14 changes: 13 additions & 1 deletion README.md
@@ -117,7 +117,16 @@ network (i.e. the one started with `docker-compose up` from this repository).
Solr should be reachable via the `solr-sdr-catalog` hostname.

## How to do the basics
### Date-Independent Indexing

For production environments where daily and monthly indexing run on an ongoing basis,
the indexer maintains state by writing "journal" files: empty, datestamped
files in a known location (`JOURNAL_DIRECTORY`). The command `cictl index continue` runs whatever
full or daily indexing is appropriate given the state of the journals.

Note that all of the `cictl index *` commands write journal files, with the exception of
`cictl index file` which takes only an `upd` MARC file rather than a MARC-deletes pair, and is not
expected to be used in an environment where date independence is in force.
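
The decision that `cictl index continue` makes can be sketched in isolation. This is a minimal illustration, not the code from this PR: `journal_name` and `next_action` are hypothetical helpers, the journal directory is modelled as a plain `Set` of filenames, and the dates are made up. Only the filename scheme is taken from the PR.

```ruby
require "date"
require "set"

# Hypothetical helper (not part of cictl): build a journal filename using
# the scheme introduced in this PR.
def journal_name(date, full:)
  type = full ? "full" : "upd"
  "hathitrust_catalog_indexer_journal_#{type}_#{date.strftime("%Y%m%d")}.txt"
end

# Hypothetical helper: decide what a `continue` run would need to do.
#   existing  - Set of journal filenames assumed to be in the journal directory
#   last_full - Date of the most recent full file
#   yesterday - last Date for which an update journal is expected
def next_action(existing, last_full:, yesterday:)
  # No journal for the latest full file: everything must be reindexed.
  return [:all] unless existing.include?(journal_name(last_full, full: true))

  # Otherwise find the first date whose update journal is missing.
  first_gap = (last_full..yesterday).find do |date|
    !existing.include?(journal_name(date, full: false))
  end
  first_gap ? [:since, first_gap] : [:nothing]
end

# Made-up example state: full journal plus two update journals exist.
last_full = Date.new(2024, 10, 1)
yesterday = Date.new(2024, 10, 4)
existing = Set[
  journal_name(last_full, full: true),
  journal_name(Date.new(2024, 10, 1), full: false),
  journal_name(Date.new(2024, 10, 2), full: false)
]

p next_action(existing, last_full: last_full, yesterday: yesterday)
```

With this state the first missing update journal is the one for 2024-10-03, so the sketch reports that indexing should resume from that date.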

### Putting a new solr configuration into place

@@ -133,7 +142,7 @@ Solr should be reachable via the `solr-sdr-catalog` hostname.
* (Optional) If your new solr config requires a full reindex, go ahead and
get rid of the data with `rm -rf data`
* Fire solr back up: `systemctl start solr-current-catalog`
* Give it a minute and then go to http://beeftea-2.umdl.umich.edu:9033/solr` to make sure the core came back up.
* Give it a minute and then go to `http://beeftea-2.umdl.umich.edu:9033/solr` to make sure the core came back up.
* Do whatever indexing needs doing.

### Indexing
@@ -193,6 +202,7 @@ The `index` command has a number of possibilities:
> bundle exec bin/cictl help index
Commands:
cictl index all # Empty the catalog and index the most recent m...
cictl index continue # index all files not represented in the indexe...
cictl index date YYYYMMDD # Run the catchup (delete and index) for a part...
cictl index file FILE # Index a single file
cictl index help [COMMAND] # Describe subcommands or one specific subcommand
@@ -283,6 +293,8 @@ and `config/env`. The defaults in the repository suffice for testing under Docke
## Environment variables

* `DDIR` data directory, defaults to `/htsolr/catalog/prep`
* `JOURNAL_DIRECTORY` location of journal files (see Date-Independent Indexing above) defaulting
to `journal/` inside the repo directory.
* `LOG_DIR` where to store logs, defaults to `/htsolr/catalog/prep`.
* `MYSQL_HOST`, `MYSQL_DATABASE`, `MYSQL_USER`, `MYSQL_PASSWORD` *required* unless run with `NO_DB`.
* `NO_DB` if you want to skip all the database stuff. Useful for testing. Implied by `NO_EXTERNAL_DATA`.
1 change: 0 additions & 1 deletion docker-compose.yml
@@ -1,5 +1,4 @@
---
version: '3'

services:
traject:
37 changes: 36 additions & 1 deletion lib/cictl/index_command.rb
@@ -3,12 +3,37 @@
require_relative "base_command"
require_relative "zephir_file"
require_relative "deleted_records"
require_relative "journal"

module CICTL
class IndexCommand < BaseCommand
class_option :reader, type: :string, desc: "Reader name/path"
class_option :writer, type: :string, desc: "Writer name/path"

desc "continue", "Index all files not represented in the indexer journals"
def continue
last_full = ZephirFile.full_files.last
fatal "unable to find full Zephir file" unless last_full
# Index the most recent full file and subsequent ones if the
# full file journal is missing.
full_journal = Journal.new(date: last_full.to_datetime.to_date, full: true)
if full_journal.missing?
logger.info "missing full journal #{full_journal}, calling `cictl index all`"
call_all_command
# Otherwise, iterate from the last full file date to yesterday.
# If there is a missing journal, start indexing from that point.
else
(last_full.to_datetime.to_date..(Date.today - 1)).each do |date|
journal = Journal.new(date: date, full: false)
if journal.missing?
logger.info "missing update journal #{journal}, calling `cictl index since #{journal.date}`"
call_since_command(journal.date)
break
end
end
end
end

desc "all", "Empty the catalog and index the most recent monthly followed by subsequent daily updates"
option :wait, type: :boolean, desc: "Wait 5 seconds for Control-C", default: true
def all
@@ -39,6 +64,8 @@ def all
solr_client.commit!
end

# Note: this command does not write a journal since it only processes the MARC file
# but not the deletes.
option :commit, type: :boolean, desc: "Commit changes to Solr", default: true
desc "file FILE", "Index a single MARC file"
def file(marcfile)
@@ -52,7 +79,11 @@ def date(date)
preflight
with_date(date) do |date|
index_deletes_for_date date
index_records_for_date date
if index_records_for_date date
journal = Journal.new(date: date)
logger.info("write journal file #{journal.path}")
journal.write!
end
end
end

@@ -84,6 +115,7 @@ def today
end

no_commands do
alias_method :call_all_command, :all
alias_method :call_date_command, :date
alias_method :call_file_command, :file
alias_method :call_since_command, :since
@@ -118,14 +150,17 @@ def marc_file_for_date(date)
ZephirFile.update_files.at(date)
end

# @return [Boolean] true if the marcfile for the given date exists
def index_records_for_date(date)
marcfile = marc_file_for_date date
if File.exist? marcfile
Indexer.new(reader: options[:reader], writer: options[:writer]).run marcfile
solr_client.commit!
logger.debug "index date(#{date}): Solr count now #{solr_client.count}"
true
else
logger.warn "could not find marcfile '#{marcfile}'"
false
end
end

66 changes: 66 additions & 0 deletions lib/cictl/journal.rb
@@ -0,0 +1,66 @@
# frozen_string_literal: true

require_relative "../services"

module CICTL
# A class that enables date-independent catalog indexing using the filesystem.
#
# Each time a full or update file is indexed, writes an (empty) file of the form
# hathitrust_catalog_indexer_journal_upd_YYYYMMDD.txt or
# hathitrust_catalog_indexer_journal_full_YYYYMMDD.txt in the journal directory.
#
# When we use the index command `cictl index continue`
# we calculate the earliest zephir file not yet indexed and proceed in order from
# that point.
#
# Nomenclature note: "journal" is the closest semantic match to "log" I could find.
# This is a log, of sorts, but the term was already taken.
class Journal
attr_reader :date

FILENAME_PATTERN = /hathitrust_catalog_indexer_journal_(full|upd)_(\d{8})\.txt/

def self.filename_for(date:, full:)
yyyymmdd = date.strftime "%Y%m%d"
type = full ? "full" : "upd"
"hathitrust_catalog_indexer_journal_#{type}_#{yyyymmdd}.txt"
end

def initialize(date: Date.today - 1, full: false)
@date = date
@full = full
end

# Use the built-in but append the date and full/upd because that's what we care about.
def to_s
super.tap do |s|
s.gsub!(/>$/, " [#{date} #{full? ? "full" : "upd"}]>")
end
end

def full?
@full
end

# Of the form `hathitrust_catalog_indexer_journal_(full|upd)_YYYYMMDD.txt`
def file
self.class.filename_for(date: date, full: full?)
end

def path
File.join(HathiTrust::Services[:journal_directory], file)
end

def exist?
File.exist? path
end

def missing?
!exist?
end

def write!
FileUtils.touch path
end
end
end
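
For readers who want to try the touch-file idea outside the indexer, here is a minimal sketch. `SketchJournal` is a hypothetical stand-in for `CICTL::Journal`: it takes the directory as an argument instead of looking it up through `HathiTrust::Services`, and it runs against a temporary directory. The date is arbitrary.

```ruby
require "date"
require "fileutils"
require "tmpdir"

# Hypothetical stand-in for CICTL::Journal. The directory is passed in
# explicitly so the sketch does not depend on HathiTrust::Services.
class SketchJournal
  attr_reader :date

  def initialize(dir:, date:, full: false)
    @dir = dir
    @date = date
    @full = full
  end

  # Same filename scheme as the class in this PR.
  def path
    type = @full ? "full" : "upd"
    File.join(@dir, "hathitrust_catalog_indexer_journal_#{type}_#{date.strftime("%Y%m%d")}.txt")
  end

  def missing?
    !File.exist?(path)
  end

  # The journal is an empty marker file; touching it is enough.
  def write!
    FileUtils.touch(path)
  end
end

# Write one journal into a throwaway directory and report what happened:
# [missing before?, missing after?, file name, file size in bytes]
def journal_round_trip
  Dir.mktmpdir do |dir|
    journal = SketchJournal.new(dir: dir, date: Date.new(2024, 10, 28))
    before = journal.missing?
    journal.write!
    [before, journal.missing?, File.basename(journal.path), File.size(journal.path)]
  end
end

p journal_round_trip
```

Because the marker carries no content, the only state is whether the file exists, which keeps the check cheap and easy to inspect by listing the directory.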
5 changes: 0 additions & 5 deletions lib/ht_traject/ht_item.rb
@@ -21,7 +21,6 @@ class << self
attr_accessor :ht_ns, :ht_avail_us, :ht_avail_intl
end

self.ht_ns = ::Traject::TranslationMap.new('ht/ht_namespace_map')
self.ht_avail_us = ::Traject::TranslationMap.new('ht/availability_map_ht')
self.ht_avail_intl = ::Traject::TranslationMap.new('ht/availability_map_ht_intl')

@@ -256,10 +255,6 @@ def enum_pubdate=(e)
end
end

def source
ItemSet.ht_ns[namespace]
end

def us_availability
ItemSet.ht_avail_us[rights].first
end
8 changes: 8 additions & 0 deletions lib/services.rb
@@ -63,6 +63,14 @@ def env_local_file
ENV["LOG_DIR"] || default
end

Services.register(:journal_directory) do
(ENV["JOURNAL_DIRECTORY"] || File.join(HOME, "journal")).tap do |dir|
if !File.exist?(dir)
FileUtils.mkdir dir
end
end
end

Services.register(:redirect_file) do
# Start migrating from redirect_file to REDIRECT_FILE on principle of least surprise
ENV["redirect_file"] || ENV["REDIRECT_FILE"] || Redirects.default_redirects_file
33 changes: 21 additions & 12 deletions lib/translation_maps/ht/availability_map_ht.rb
@@ -1,18 +1,27 @@
require 'ht_traject/ht_constants'
require 'match_map'
require "ht_traject/ht_constants"
require "match_map"

mm = MatchMap.new

mm[/^umall$/] = HathiTrust::Constants::FT
mm[/world$/] = HathiTrust::Constants::FT # matches world, ic-world, und-world
mm[/^cc.*/] = HathiTrust::Constants::FT
mm[/^pd(?:us)?$/] = HathiTrust::Constants::FT # pd or pdus
# Note: orph, orphcand, and umall are unattested in rights_current as of Oct 2024

mm[/^ic$/] = HathiTrust::Constants::SO
mm[/^orph$/] = HathiTrust::Constants::SO
mm[/^nobody$/] = HathiTrust::Constants::SO
mm[/^und$/] = HathiTrust::Constants::SO
mm[/^pd-p/] = HathiTrust::Constants::SO # pd-pvt or pd-private
mm[/^opb?$/] = HathiTrust::Constants::SO
# Full Text
mm["pd"] = HathiTrust::Constants::FT # [1]
mm["ic-world"] = HathiTrust::Constants::FT # [7]
mm["pdus"] = HathiTrust::Constants::FT # [9]
mm[/^cc-/] = HathiTrust::Constants::FT # [10-15, 17, 20-25]
mm["und-world"] = HathiTrust::Constants::FT # [18]

# Search Only
mm["ic"] = HathiTrust::Constants::SO # [2]
mm["op"] = HathiTrust::Constants::SO # [3]
mm["orph"] = HathiTrust::Constants::SO # [4]
mm["und"] = HathiTrust::Constants::SO # [5]
mm["umall"] = HathiTrust::Constants::SO # [6]
mm["nobody"] = HathiTrust::Constants::SO # [8]
mm["orphcand"] = HathiTrust::Constants::SO # [16]
mm["icus"] = HathiTrust::Constants::SO # [19]
mm["pd-pvt"] = HathiTrust::Constants::SO # [26]
mm["supp"] = HathiTrust::Constants::SO # [27]

mm
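
The move from broad patterns to exact string keys can be illustrated with a small stand-in. `TinyMatchMap` below is hypothetical and much simpler than the `match_map` gem, whose behaviour may differ in detail; the sketch assumes string keys match only on equality and regular-expression keys match by pattern. The `FT` and `SO` symbols are placeholders for `HathiTrust::Constants::FT` and `HathiTrust::Constants::SO`, whose real values are not shown in this diff.

```ruby
# Hypothetical, simplified stand-in for a MatchMap-style lookup.
class TinyMatchMap
  def initialize
    @rules = []
  end

  def []=(pattern, value)
    @rules << [pattern, value]
  end

  # Returns every value whose key matches: string keys must be equal to
  # the lookup key, regular-expression keys must match it.
  def [](key)
    matching = @rules.select do |pattern, _value|
      pattern.is_a?(Regexp) ? pattern.match?(key) : pattern == key
    end
    matching.map { |_pattern, value| value }
  end
end

# Placeholder values; the real constants live in HathiTrust::Constants.
FT = :full_text
SO = :search_only

us = TinyMatchMap.new
us["pd"] = FT
us["pdus"] = FT
us[/^cc-/] = FT
us["icus"] = SO
us["pd-pvt"] = SO

p us["pd-pvt"].first
p us["cc-by-4.0"].first
p us["unknown"]
```

Under these assumptions each rights code hits exactly one rule, and a code the map does not know yields an empty result rather than an accidental pattern match.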
31 changes: 20 additions & 11 deletions lib/translation_maps/ht/availability_map_ht_intl.rb
@@ -1,17 +1,26 @@
require 'ht_traject/ht_constants'
require "ht_traject/ht_constants"

mm = MatchMap.new

mm['umall'] = HathiTrust::Constants::FT
mm['world'] = HathiTrust::Constants::FT # matches world, ic-world, und-world
mm[/^cc.*/] = HathiTrust::Constants::FT
mm['pd'] = HathiTrust::Constants::FT
# Note: orph, orphcand, and umall are unattested in rights_current as of Oct 2024

mm['pdus'] = HathiTrust::Constants::SO
mm['ic'] = HathiTrust::Constants::SO
mm[/^opb?$/] = HathiTrust::Constants::SO
mm['orph'] = HathiTrust::Constants::SO
mm['nobody'] = HathiTrust::Constants::SO
mm['und'] = HathiTrust::Constants::SO
# Full Text
mm["pd"] = HathiTrust::Constants::FT # [1]
mm["ic-world"] = HathiTrust::Constants::FT # [7]
mm[/^cc-/] = HathiTrust::Constants::FT # [10-15, 17, 20-25]
mm["und-world"] = HathiTrust::Constants::FT # [18]
mm["icus"] = HathiTrust::Constants::FT # [19]

# Search Only
mm["ic"] = HathiTrust::Constants::SO # [2]
mm["op"] = HathiTrust::Constants::SO # [3]
mm["orph"] = HathiTrust::Constants::SO # [4]
mm["und"] = HathiTrust::Constants::SO # [5]
mm["umall"] = HathiTrust::Constants::SO # [6]
mm["nobody"] = HathiTrust::Constants::SO # [8]
mm["pdus"] = HathiTrust::Constants::SO # [9]
mm["orphcand"] = HathiTrust::Constants::SO # [16]
mm["pd-pvt"] = HathiTrust::Constants::SO # [26]
mm["supp"] = HathiTrust::Constants::SO # [27]

mm