Merge pull request #33 from hathitrust/DEV-953
DEV-953: New report for items that change rights ic->pd
aelkiss authored Oct 30, 2023
2 parents 1757679 + b3ee8b4 commit 40d5414
Showing 5 changed files with 109 additions and 0 deletions.
7 changes: 7 additions & 0 deletions bin/bash_session.sh
@@ -0,0 +1,7 @@
#!/bin/bash
# Opens up a hathifiles bash session.
docker-compose up -d pushgateway
docker-compose run --rm "hf" bash
# Now do e.g. `bundle exec rspec` or whatever.
# Exit to be done with the session.
docker-compose down; yes | docker system prune
68 changes: 68 additions & 0 deletions bin/rights_change.sh
@@ -0,0 +1,68 @@
#!/bin/bash

# Compare 2 hathifiles and report which items have changed rights
# from ic/bib in file1 to pd/bib in file2.
#
# Invoke thusly:
# $ bash rights_change.sh f1 f2
# Results end up in ic_to_pd_YYYYMMDD.tsv in the current directory.
# Each record in the output is presumed to have changed from ic
# to pd between the generation of the 2 files.
# Script starts at the bottom.

run(){
  f1=$1
  f2=$2
  echo "Started"
  # First, simplify input hathifiles to the data we want.
  # All the ic/bib records from f1 into one file...
  cut_sort "$f1" ic > cut_sort_ic.tsv
  # And all the pd/bib records from f2 into another file.
  cut_sort "$f2" pd > cut_sort_pd.tsv
  # Then compare the 2 simplified files.
  isodate=$(date +'%Y%m%d')
  outfile="$(pwd)/ic_to_pd_${isodate}.tsv"
  diff_records cut_sort_ic.tsv cut_sort_pd.tsv > "$outfile"
  echo "Wrote $outfile"
  # Remove intermediate files
  rm cut_sort_ic.tsv cut_sort_pd.tsv
  echo "Finished"
}

# Turn a hathifile into fewer cols and sorted matching lines.
# Matching means: has rights:$rights and reason:bib
cut_sort(){
  file=$1
  rights=$2
  # Get these cols from the hathifiles:
  # 1 (id), 3 (rights), 14 (reason), 16 (govdoc),
  # grep to only get lines matching $rights,
  # and sort the output.
  zcat -f "$file" |
    cut -f1,3,14,16 |
    grep -P "\t${rights}\tbib\t[01]$" |
    collated_sort
}

# Compare 2 outputs from cut_sort, looking only at col 1 (id) and
# col 4 (govdoc). A record is output only when the same id + govdoc
# pair appears in both files, i.e. the item changed from ic to pd
# but kept the same govdoc status.
diff_records(){
  ic_file=$1
  pd_file=$2
  collated_comm -12 <(cut -f1,4 "$ic_file") <(cut -f1,4 "$pd_file")
}

# Sort and comm must use the same collation,
# or comm won't think the files are sorted...
# and the defaults may be different, so specify.
collated_comm(){
  LC_COLLATE=C comm "$@"
}
collated_sort(){
  LC_COLLATE=C sort "$@"
}

# Script starts here.
run "$1" "$2"
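
Review note: the core of the report is a sorted-set intersection. `comm -12` prints only the lines present in both inputs, and it expects both inputs to be sorted in the collation it uses itself, which is why the script pins `LC_COLLATE=C` for both `sort` and `comm`. A minimal sketch of that step on made-up `id<TAB>govdoc` pairs follows. The ids and the temporary directory are hypothetical, and temp files stand in for the script's process substitution so that the snippet should also run under plain `sh`.

```shell
#!/bin/sh
# Hypothetical demo data: "id<TAB>govdoc" pairs, as cut -f1,4 would emit them.
workdir=$(mktemp -d)
printf 'mdp.2\t0\nmdp.1\t0\nuc1.b3\t1\n' > "$workdir/ic.tsv"
printf 'uc1.b3\t1\nmdp.2\t0\nmdp.9\t0\n' > "$workdir/pd.tsv"

# Sort both inputs under the same collation that comm will use.
LC_COLLATE=C sort "$workdir/ic.tsv" > "$workdir/ic.sorted"
LC_COLLATE=C sort "$workdir/pd.tsv" > "$workdir/pd.sorted"

# -12 suppresses lines unique to file 1 and to file 2,
# leaving only the lines that appear in both.
common=$(LC_COLLATE=C comm -12 "$workdir/ic.sorted" "$workdir/pd.sorted")
printf '%s\n' "$common"

rm -r "$workdir"
```

Only `mdp.2` and `uc1.b3` appear in both inputs with the same govdoc flag, so those are the two lines printed.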
2 changes: 2 additions & 0 deletions spec/data/rights_change_file_1.txt
@@ -0,0 +1,2 @@
mdp.39015027625402 deny ic 000018677 MIU 990000186770106381 1613293 66014593 Go up for glory, by Bill Russell, as told to William McSweeny. Coward-McCann [1966] bib 0 1966 eng BK MIU umich umich google Russell, Bill, 1934-2022.
mdp.39015003746396 deny ic 000018677 MIU 990000186770106381 1613293 66014593 Go up for glory, by Bill Russell, as told to William McSweeny. Coward-McCann [1966] bib 0 1966 eng BK MIU umich umich google Russell, Bill, 1934-2022.
2 changes: 2 additions & 0 deletions spec/data/rights_change_file_2.txt
@@ -0,0 +1,2 @@
mdp.39015027625402 deny ic 000018677 MIU 990000186770106381 1613293 66014593 Go up for glory, by Bill Russell, as told to William McSweeny. Coward-McCann [1966] bib 0 1966 eng BK MIU umich umich google Russell, Bill, 1934-2022.
mdp.39015003746396 allow pd 000018677 MIU 990000186770106381 1613293 66014593 Go up for glory, by Bill Russell, as told to William McSweeny. Coward-McCann [1966] bib 0 1966 eng BK MIU umich umich google Russell, Bill, 1934-2022.
30 changes: 30 additions & 0 deletions spec/jobs/rights_change.rb
@@ -0,0 +1,30 @@
# frozen_string_literal: true

RSpec.describe "bin/rights_change.sh" do
  it "writes the expected report file" do
    # Setup.
    Dir.chdir("/tmp")
    isodate = Time.now.strftime("%Y%m%d")
    # Command to run:
    cmd = [
      "bash",
      "/usr/src/app/bin/rights_change.sh",
      "/usr/src/app/spec/data/rights_change_file_1.txt",
      "/usr/src/app/spec/data/rights_change_file_2.txt"
    ].join(" ")
    # Expect this outfile
    outfile = "/tmp/ic_to_pd_#{isodate}.tsv"
    FileUtils.rm_f(outfile)
    expect(File.exist?(outfile)).to be false
    # Now do it.
    system(cmd)
    # Expect a file with a single line...
    expect(File.exist?(outfile)).to be true
    lines = File.read(outfile).split("\n")
    expect(lines.count).to eq 1
    # and that single line looks like this:
    expect(lines).to eq ["mdp.39015003746396\t0"]
    # Cleanup
    FileUtils.rm_f(outfile)
  end
end
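
Review note: the fixture rows in `spec/data` need to be tab-separated on disk for `cut -f` and the `\t` patterns in `cut_sort` to match. A hedged sketch of building such rows and applying the same filter is below. `make_row` and `filter_rows` are hypothetical helpers, not part of this commit, the ids and the `con` reason value are made up for illustration, and `awk` is swapped in for the script's `cut | grep -P` pipeline so that the snippet does not depend on GNU grep's PCRE support.

```shell
#!/bin/sh
# Hypothetical helper: emit one 16-column, tab-separated row with only the
# columns the report reads filled in (1 id, 3 rights, 14 reason, 16 govdoc).
make_row() {
  awk -v id="$1" -v rights="$2" -v reason="$3" -v govdoc="$4" 'BEGIN {
    OFS = "\t"
    $16 = govdoc                      # assigning field 16 creates fields 1..16
    $1 = id; $3 = rights; $14 = reason
    print
  }'
}

# Hypothetical awk variant of cut_sort followed by cut -f1,4: keep rows with
# the given rights, reason "bib" and govdoc 0 or 1, then print id and govdoc.
filter_rows() {
  awk -F '\t' -v rights="$2" '
    $3 == rights && $14 == "bib" && ($16 == "0" || $16 == "1") {
      print $1 "\t" $16
    }' "$1" | LC_COLLATE=C sort
}

tmp=$(mktemp -d)
{
  make_row mdp.A ic bib 0   # still ic: does not pass the pd filter
  make_row mdp.B pd bib 0   # pd with reason bib: passes
  make_row mdp.C pd con 0   # pd, but reason is not bib: filtered out
} > "$tmp/f2.tsv"

pd_rows=$(filter_rows "$tmp/f2.tsv" pd)
printf '%s\n' "$pd_rows"

rm -r "$tmp"
```

Generating fixtures this way keeps the column positions explicit, which may be easier to maintain than hand-edited rows where empty fields are invisible.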
