Data valid process: Compare counts of bitstreams in Preservica to pages in Islandora (SYSDEV-2612) #6
Changes from all commits
6c5d8b4
83a949f
78466a3
2edc209
e1bade5
84137b6
56d1559
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
*.log | ||
*.txt | ||
*.xml | ||
.env | ||
|
||
#sample testing file | ||
input/ | ||
output/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
## Description | ||
The islandora-preservica-validation process compares each Islandora object's page-member count with the bitstream count of the corresponding Preservica object. An Islandora object is validated when the counts match, and its RDF is then updated by adding a new element whose value is that count. | ||
|
||
### Requirements | ||
* Python 3.12 | ||
* Python packages listed in requirements.txt (install with `pip install -r requirements.txt`) | ||
|
||
### Process | ||
* Execute islandoraObjCheck.py to retrieve all the Islandora objects needed. It generates an output file containing each object's ID, its page-member count, and the corresponding Preservica object reference ID. | ||
* Execute preservicaObjCapture.py to validate the bitstream counts from Preservica against Islandora. It prompts the user for Preservica login credentials in order to call the Preservica RESTful APIs. | ||
* Execute rdfUpdate.py to update the RDF of the validated Islandora objects and use drush to push the updates back to Islandora. | ||
|
||
## Disclaimer | ||
|
||
THIS SCRIPT IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT | ||
LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
pitt:31735073061008 | ||
pitt:31735073060927 | ||
pitt:31735073060901 | ||
pitt:31735073060943 | ||
pitt:2000.07.010 | ||
pitt:2000.07.011 | ||
pitt:2000.07.063 | ||
pitt:1935e49702 | ||
pitt:193xe49702 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
## Function: pageCount_of_Pid | ||
## Reads a pidlist file containing pitt identifiers and, for each one, queries | ||
## the islandora Solr API to count the PID's child objects, filtering | ||
## on 'RELS_EXT_isPageOf_uri_s' in the objects' metadata | ||
## @params: file_pids | ||
## @result: file_PgCount | ||
|
||
import requests, json, os | ||
import csv | ||
import subprocess | ||
from collections import defaultdict | ||
|
||
f_path = os.path.dirname(os.path.realpath(__file__)) | ||
file_pids = "./input/file-pids.csv" #intakes pidfile | ||
file_pgCount="./output/membercount.csv" | ||
|
||
#retrieve Object and its pageOf members from islandora | ||
def get_islandoraData(s_query): | ||
    #pid format convention: escape the ':' after the 'pitt' namespace for Solr | ||
    q_par = "PID:pitt\\" + s_query[4:] | ||
    q_pages = "RELS_EXT_isPageOf_uri_s: info\\:fedora\\/pitt\\" + s_query[4:] + " OR " + q_par | ||
    try: | ||
        #step1). retrieve object from islandora api request | ||
        url = 'https://gamera.library.pitt.edu/solr/uls_digital_core/select' | ||
        payload = {"q": q_pages, | ||
                   "fl": "PID,RELS_EXT_isPageOf_uri_s,RELS_EXT_hasModel_uri_ms,RELS_EXT_preservicaRef_literal_s", | ||
                   "sort": "PID asc", | ||
                   "rows": "100000", | ||
                   "wt": "json"} | ||
|
||
        responses = requests.get(url, params=payload) | ||
        #raise on an error status so the except branch below can report it | ||
        responses.raise_for_status() | ||
        return responses.json()['response'] | ||
    except requests.exceptions.RequestException as e: | ||
        print("Error: " + str(e)) | ||
        return None | ||
|
||
#define a dict value with a value of list holding islandora object and its pageOf count | ||
ms_items = defaultdict(list) | ||
|
||
# Helper function to compute the multpart objects via the relation mapping | ||
# 'RELS_EXT_isPageOf_uri_s' to the Object PID from solr api response | ||
Review comment: Please add documentation on the parameters and return values here. E.g. I recommend adding docstrings which cover the inputs, returns, and side effects for any methods. |
||
def get_multipart_count(objID): | ||
    """Count the pages that point to each parent object via RELS_EXT_isPageOf_uri_s. | ||
    Input: objID, a full PID such as 'pitt:31735073061008'. | ||
    Returns: ms_items, mapping parent PID -> {'counter': page count, 'preservica_RefID': reference ID or ''}. | ||
    Side effect: clears and refills the module-level ms_items on every call. | ||
    """ | ||
    #reset the shared dict so rows for previously processed PIDs are not repeated | ||
    ms_items.clear() | ||
    results = get_islandoraData(objID) | ||
    s_preservicaRef = "" | ||
    #make sure the response data is a dict before reading from it | ||
    if not isinstance(results, dict): | ||
        return ms_items | ||
|
||
    for data in results['docs']: | ||
        #capture the preservica reference ID associated to the ObjectID, if existing | ||
        if "RELS_EXT_preservicaRef_literal_s" in data: | ||
            s_preservicaRef = data["RELS_EXT_preservicaRef_literal_s"] | ||
|
||
        #count each page under the parent object named in its isPageOf relation | ||
        if "RELS_EXT_isPageOf_uri_s" in data: | ||
            #retrieve parent object associated | ||
            uri_obj = data["RELS_EXT_isPageOf_uri_s"].split("/")[-1] | ||
            if uri_obj not in ms_items: | ||
                ms_items[uri_obj] = {'counter': 0, 'preservica_RefID': ''} | ||
            #update the value for the key matching object ID | ||
            ms_items[uri_obj]['counter'] += 1 | ||
|
||
    #export the associated preservica reference ID if existing | ||
    if s_preservicaRef and objID in ms_items: | ||
        ms_items[objID]['preservica_RefID'] = s_preservicaRef | ||
|
||
    return ms_items | ||
|
||
# Main Function: takes in PIDfile in the format {PID}. It iterates pids to check on islandora via | ||
# solr search, and outputs a csv file containing total# of the Object's pageOf items from islandora, and | ||
# preservica referenceID associated to the pid, if existing | ||
def pageCount_of_Pid (inFile_pids): | ||
    with open(os.path.join(f_path, inFile_pids), 'r') as pid_f: | ||
        pidreader = csv.reader(pid_f) | ||
|
||
        #write output file | ||
        with open(os.path.join(f_path, file_pgCount), 'w', newline='') as match_f: | ||
            header_lst = ['PID', 'num_isPageOf_uri_s', 'preservica_refID'] | ||
            f_writer = csv.writer(match_f, delimiter=',') | ||
            f_writer.writerow(header_lst) | ||
            #now iterate each objs from response | ||
            for row in pidreader: | ||
                if not row: | ||
                    continue  #skip blank lines in the pid file | ||
                mydict = get_multipart_count(row[0]) | ||
|
||
                if mydict: | ||
                    for k,v in mydict.items(): | ||
                        f_writer.writerow([k, v['counter'], v['preservica_RefID']]) | ||
|
||
def drushfetchPids(): | ||
    file_name = os.getcwd() +"/input/file-pids.csv" | ||
    user = os.environ['USER'] if os.getenv("USER") is not None else os.environ['USERNAME'] | ||
    squery = 'RELS_EXT_preservicaRef_literal_s:* ' | ||
    squery += 'AND (RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:manuscriptCModel OR RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:newspaperIssueCModel OR RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:bookCModel)' | ||
Review comment: Shouldn't the backslashes be escaped here? |
Reply: Thanks. I think the subprocess call passes the argument through unchanged, so the doubled backslash makes no difference. It executed correctly both with the backslash escaped ("\\") and with a single \ before the special character. |
||
    squery += ' AND NOT RELS_EXT_preservicaChildCount_literal_s:*' | ||
|
||
    try: | ||
        subprocess.check_call(['drush', '--root=/var/www/html/drupal7/', '--user={}'.format(user), | ||
                               '--uri=http://gamera.library.pitt.edu', 'islandora_datastream_crud_fetch_pids', | ||
                               '--solr_query={}'.format(squery), '--pid_file={}'.format(file_name)]) | ||
|
||
    except subprocess.CalledProcessError as e: | ||
        print(f"Command failed with return code {e.returncode}") | ||
|
||
if __name__ == "__main__": | ||
    drushfetchPids() | ||
    pageCount_of_Pid(file_pids) |
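
On the backslash question raised in the review thread on `drushfetchPids`: the following is a minimal sketch, using only the standard library, of why both spellings reached drush unchanged. Python keeps an unrecognized escape such as `\:` as a literal backslash plus colon (recent versions warn about it), and `subprocess` hands list arguments to the program without shell interpretation. A raw string states the intent explicitly. The variable names are illustrative and the child process is a stand-in for drush, not part of this PR.

```python
import subprocess
import sys

# Raw string: the backslash is kept literally, with no escape processing.
raw_query = r'RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:bookCModel'
# Equivalent spelling with each backslash escaped.
escaped_query = 'RELS_EXT_hasModel_uri_ms:info\\:fedora/islandora\\:bookCModel'

# Both spellings produce the same text.
same_text = raw_query == escaped_query

# Echo the argument back through a child Python process to show that
# subprocess passes list arguments over without shell interpretation.
child = subprocess.run(
    [sys.executable, '-c', 'import sys; sys.stdout.write(sys.argv[1])', raw_query],
    capture_output=True, text=True, check=True,
)
received = child.stdout
```

Writing the Solr fragments as raw strings would keep the query text identical while avoiding the invalid-escape warning.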
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
PID,islandora_count,preservica_refID,bitstreamCount,isValid | ||
pitt:31735073061008,29,e55870c4-2b5b-48a6-a2f9-c3d13f2a96b0,29,Y | ||
pitt:31735073060927,6,4b2346c0-6e83-4e64-96ac-77a6ad220734,6,Y | ||
pitt:31735073060901,23,26f2f20b-4806-431b-85e3-a86cbb6fa425,23,Y | ||
pitt:31735073060943,8,bc19d052-068a-494d-aefb-4b28119dfb7e,8,Y | ||
pitt:2000.07.010,1,79c25b21-1f0d-492c-937e-b07a756ddc1e,1,Y | ||
pitt:2000.07.011,1,3ae034b2-d09a-44f2-9b77-cc326c9bdd76,1,Y | ||
pitt:2000.07.063,1,883fdbfa-34e8-4687-90e7-b53910a1d453,1,Y | ||
pitt:1935e49702,272,d2e0a615-898e-4896-a176-ad00b907fa82,272,Y | ||
pitt:193xe49702,288,8f0b8647-1209-49d3-a6b7-6a8646f225d7,287,N |
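
The `isValid` column in the sample above follows from comparing the two counts. A minimal sketch of that comparison, assuming rows shaped like this CSV; `mark_valid` is a hypothetical helper for illustration and is not taken from preservicaObjCapture.py:

```python
import csv
import io

def mark_valid(rows):
    """Add an isValid flag ('Y' or 'N') to each row by comparing the two counts."""
    checked = []
    for row in rows:
        matches = int(row['islandora_count']) == int(row['bitstreamCount'])
        checked.append({**row, 'isValid': 'Y' if matches else 'N'})
    return checked

# Two rows copied from the sample output, without the isValid column.
sample = io.StringIO(
    "PID,islandora_count,preservica_refID,bitstreamCount\n"
    "pitt:1935e49702,272,d2e0a615-898e-4896-a176-ad00b907fa82,272\n"
    "pitt:193xe49702,288,8f0b8647-1209-49d3-a6b7-6a8646f225d7,287\n"
)
results = mark_valid(csv.DictReader(sample))
flags = [r['isValid'] for r in results]
```

Here the first object is flagged `Y` (272 pages, 272 bitstreams) and the second `N` (288 pages, 287 bitstreams), matching the sample rows.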
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
import requests | ||
import getopt, sys | ||
from getpass import getpass | ||
|
||
#generate token to access preservica restful api | ||
sUrl="https://pitt.preservica.com/api/accesstoken/login" | ||
headers = {"Content-Type": "application/x-www-form-urlencoded"} | ||
|
||
def generateToken(): | ||
    #retrieve login credentials | ||
    testusr = getpass("Please enter login: ") | ||
    testpw = getpass("Please enter password: ") | ||
    if (testpw and testusr): | ||
        sdata = { | ||
            "username": testusr, | ||
            "password": testpw | ||
        } | ||
        #retrieve token | ||
        r = requests.post(sUrl, data=sdata, headers=headers) | ||
        if r.status_code != 200: | ||
            print("Error:", r.status_code) | ||
            sys.exit(-1) | ||
        else: | ||
            return [r.json()['token'], r.json()['refresh-token']] | ||
|
||
def getRefreshToken(s): | ||
    #s is the [token, refresh-token] pair returned by generateToken | ||
    sRefreshUrl = "https://pitt.preservica.com/api/accesstoken/refresh?refreshToken=" + s[1] | ||
    newheaders = {"Preservica-Access-Token": s[0], | ||
                  "Content-Type": "application/x-www-form-urlencoded"} | ||
|
||
    res = requests.post(sRefreshUrl, headers=newheaders) | ||
    if res.status_code == 200: | ||
        return [res.json()['token'], res.json()['refresh-token']] | ||
    else: | ||
        print("Error: Failed to get refresh token", res.status_code) | ||
        sys.exit(-1) |
Review comment: Why is the namespace "pitt" both hardcoded here and also removed from s_query via a substring? It would be better to formulate this as:
Or, more generally, escape any Solr metacharacters:
https://lucene.apache.org/core/6_5_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters
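
One way to read the "more generally" suggestion is sketched below, assuming the metacharacter list from the linked Lucene page. `escape_solr` and `build_page_query` are hypothetical names, not functions from this PR, and the sketch is not the reviewer's own proposed code.

```python
import re

# Single-character metacharacters listed in the Lucene classic query parser docs.
_SOLR_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/])')

def escape_solr(value):
    """Backslash-escape Lucene/Solr metacharacters in a literal term."""
    escaped = _SOLR_SPECIAL.sub(r'\\\1', value)
    # && and || are two-character operators; escape the first character of each.
    return escaped.replace('&&', r'\&&').replace('||', r'\||')

def build_page_query(pid):
    """Build the parent-or-pages query for a full PID such as 'pitt:2000.07.010'."""
    pid_term = escape_solr(pid)
    uri_term = escape_solr('info:fedora/' + pid)
    return 'RELS_EXT_isPageOf_uri_s:' + uri_term + ' OR PID:' + pid_term

example_query = build_page_query('pitt:31735073061008')
```

With a helper like this the namespace no longer needs to be hardcoded or sliced off with `s_query[4:]`, since the whole PID is escaped as given.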