Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data valid process: Compare counts of bitstreams in Preservica to pages in Islandora (SYSDEV-2612) #6

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

rzhang152
Copy link

No description provided.

Copy link
Member

@ctgraham ctgraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some initial requests on a first read-through, and I'm hoping that Preservica can help us simplify the API interaction. Will read-through again if they recommend our current approach.

file_name = os.getcwd() +"/input/file-pids.csv"
user = os.environ['USER'] if os.getenv("USER") is not None else os.environ['USERNAME']
squery = 'RELS_EXT_preservicaRef_literal_s:* '
squery += 'AND (RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:manuscriptCModel OR RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:newspaperIssueCModel OR RELS_EXT_hasModel_uri_ms:info\:fedora/islandora\:bookCModel)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the backslashes be escaped here?
squery += 'AND (RELS_EXT_hasModel_uri_ms:info\\:fedora/islandora\\:manuscriptCModel OR RELS_EXT_hasModel_uri_ms:info\\:fedora/islandora\\:newspaperIssueCModel OR RELS_EXT_hasModel_uri_ms:info\\:fedora/islandora\\:bookCModel)'

Copy link
Author

@rzhang152 rzhang152 Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think the subprocess call passing arguments ignores the double backslashes in the string. It executed correctly either by applying an escape \ to special char ("\" ) or just use a single \ before the special char.

ms_items = defaultdict(list)

# Helper function to compute the multpart objects via the relation mapping
# 'RELS_EXT_isPageOf_uri_s' to the Object PID from solr api response
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add documentation on the parameters and return values here. E.g. objID looks like it is coming in as a string of the full PID (namespace + colon + identifier). What is the datastructure of ms_items as a return value?

I recommend adding docstrings which cover the inputs, returns, and side effects for any methods.

#retrieve Object and its pageOf members from islandora
def get_islandoraData(s_query):
#pid format convention
q_par = "PID:pitt\\" + s_query[4:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the namespace "pitt" both hardcoded here, but also removed from the s_query via a substring? It would be better to formulate this as:

q_par = "PID:" + s_query.replace(":", "\\:")

Or, more generally, escape any Solr metacharacters:
https://lucene.apache.org/core/6_5_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters


#generate token to access preservica restful api
sUrl="https://pitt.preservica.com/api/accesstoken/login"
headers = {"Contnent-Type": "application/x-www-form-urlencoded"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content-Type is misspelled here.

for row in csvreader:
#logic to check token expiration
if (round((time.time()-st_timer)*10**3) - 600000 >0 ):
new_session = token_fn.getRefreshToken(curr_session)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRefreshToken() allows for an empty return in the case of a server error, but this condition is not caught here.


global st_timer, curr_session
for row in csvreader:
#logic to check token expiration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a description of these magic numbers (or turn them into named constants). I think this is multiplying a difference in seconds by 1,000, and then comparing against 600,000 ms (600 seconds, 10 minutes)?

st_timer = time.time()

bitstream_dict ={}
bitstream_dict = getRepresentionInfo(sInfoObj_baseUrl, row['preservica_refID'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there ought to be a way to download the full XIP from Preservica in one API call rather than traverse the entire tree, and I've opened a ticket with them to ask.

curr_root =curr_tree.getroot()
#add new tag "preservicaChildCount"
desc_element = curr_root.find('rdf:Description', ns_dict)
newNode =etree.SubElement(desc_element, etree.QName(ns_dict["islandora"], ele_name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this node already exists in the tree, will this add a new one? We should delete/replace any existing preservicaChildCount node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants