-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update file reference #143
Conversation
sc_titles_from_datasets_in_submission.intersection(sc_titles_from_file_lists) | ||
) | ||
|
||
if not study_components_to_process: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be file lists in a submission that aren't in a dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so - however, files may be attached directly to submissions if the bioimaging template is not used by the ST (e.g. see https://www.ebi.ac.uk/biostudies/files/S-BSST24/S-BSST24.json ). Also I am not sure what is possible with direct pagetab submissions... Do you think this affects the logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are scanning a submission for all the file lists i guess it's sensible to check we also have all the datasets...
I would eventually consider throwing a warning if # of filelists =/= # of datasets provided perhaps, since that sounds like something weird has gone on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning added
# Get list of study component titles to process | ||
sc_titles_from_datasets_in_submission = { | ||
dataset.title_id for dataset in datasets_in_submission | ||
} | ||
file_list_dicts = find_file_lists_in_submission(submission) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function (and the function that it calls) should be renamed, since it returns whole objects as long as they have a file list.
Or better, maybe this should return {<section name>: <file_list_name>}
I doubt it's going to be used in a way in which we need all the extra information, and it's already looping through the sections, so this saves looping back over these objects (in terms of code readability, the lists are short so i'm not worried about time)
for file_list_dict in file_list_dicts: | ||
study_component_name = file_list_dict["Name"] | ||
fileref_to_study_components = {} | ||
datasets_to_process = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a little bit confusing at first read because it felt like we already had all this information. Can we not do this loop once above with:
datasets_to_process = {
ds.title_id: ds for ds in datasets_in_submission if ds.title_id in sc_titles_from_file_lists
}
if not datasets_to_process:
message = f""" ...
Instead of getting the lists of titles, doing an intersection, and then going back to the lists to filter them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed implementation to use the above suggestion. However, did not change names of functions in biostudies module - as biostudies does not use the concept 'dataset'. Instead created a utility function that groups the results from biostudies into a dict whose keys are the dataset titles and values a lists of the file lists.
Create a warning message if the number of datasets passed into the get_file_reference_by_dataset differs from the number of datasets computed from all the file lists in the submission
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Modify creation of FileReference during ingestion to align with new agreed format
FileReference now contains field pointing to its parent dataset
Write tests for ingestion of FileReference