-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data valid process: Compare counts of bitstreams in Preservica to pages in Islandora (SYSDEV-2612) #6
base: main
Are you sure you want to change the base?
Conversation
…t of the corresponding preservica object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some initial requests on a first read-through, and I'm hoping that Preservica can help us simplify the API interaction. Will read-through again if they recommend our current approach.
file_name = os.getcwd() +"/input/file-pids.csv" | ||
user = os.environ['USER'] if os.getenv("USER") is not None else os.environ['USERNAME'] | ||
squery = 'RELS_EXT_preservicaRef_literal_s:* ' | ||
squery += 'AND (RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:manuscriptCModel OR RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:newspaperIssueCModel OR RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:bookCModel)' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the backslashes be escaped here?
squery += 'AND (RELS_EXT_hasModel_uri_ms:info\\:fedora/islandora\\:manuscriptCModel OR RELS_EXT_hasModel_uri_ms:info\\:fedora/islandora\\:newspaperIssueCModel OR RELS_EXT_hasModel_uri_ms:info\\:fedora/islandora\\:bookCModel)'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think the subprocess call passing arguments ignores the double backslashes in the string. It executed correctly either by applying an escape \ to special char ("\" ) or just use a single \ before the special char.
ms_items = defaultdict(list) | ||
|
||
# Helper function to compute the multpart objects via the relation mapping | ||
# 'RELS_EXT_isPageOf_uri_s' to the Object PID from solr api response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add documentation on the parameters and return values here. E.g. objID
looks like it is coming in as a string of the full PID
(namespace + colon + identifier). What is the datastructure of ms_items
as a return value?
I recommend adding docstrings which cover the inputs, returns, and side effects for any methods.
#retrieve Object and its pageOf members from islandora | ||
def get_islandoraData(s_query): | ||
#pid format convention | ||
q_par = "PID:pitt\\" + s_query[4:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the namespace "pitt" both hardcoded here, but also removed from the s_query
via a substring? It would be better to formulate this as:
q_par = "PID:" + s_query.replace(":", "\\:")
Or, more generally, escape any Solr metacharacters:
https://lucene.apache.org/core/6_5_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters
|
||
#generate token to access preservica restful api | ||
sUrl="https://pitt.preservica.com/api/accesstoken/login" | ||
headers = {"Contnent-Type": "application/x-www-form-urlencoded"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Content-Type
is misspelled here.
for row in csvreader: | ||
#logic to check token expiration | ||
if (round((time.time()-st_timer)*10**3) - 600000 >0 ): | ||
new_session = token_fn.getRefreshToken(curr_session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getRefreshToken()
allows for an empty return in the case of a server error, but this condition is not caught here.
|
||
global st_timer, curr_session | ||
for row in csvreader: | ||
#logic to check token expiration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a description of these magic numbers (or turn them into named constants). I think this is multiplying a difference in seconds by 1,000, and then comparing against 600,000 ms (600 seconds, 10 minutes)?
st_timer = time.time() | ||
|
||
bitstream_dict ={} | ||
bitstream_dict = getRepresentionInfo(sInfoObj_baseUrl, row['preservica_refID']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like there ought to be a way to download the full XIP from Preservica in one API call rather than traverse the entire tree, and I've opened a ticket with them to ask.
curr_root =curr_tree.getroot() | ||
#add new tag "preservicaChildCount" | ||
desc_element = curr_root.find('rdf:Description', ns_dict) | ||
newNode =etree.SubElement(desc_element, etree.QName(ns_dict["islandora"], ele_name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this node already exists in the tree, will this add a new one? We should delete/replace any existing preservicaChildCount
node.
No description provided.