Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update file reference #143

Merged
merged 7 commits into from
Aug 9, 2024
Merged

Update file reference #143

merged 7 commits into from
Aug 9, 2024

Conversation

kbab
Copy link
Contributor

@kbab kbab commented Aug 6, 2024

Modify creation of FileReference during ingestion to align with new agreed format

FileReference now contains field pointing to its parent dataset
Write tests for ingestion of FileReference

sc_titles_from_datasets_in_submission.intersection(sc_titles_from_file_lists)
)

if not study_components_to_process:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be file lists in a submission that aren't in a dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so - however, files may be attached directly to submissions if the bioimaging template is not used by the ST (e.g. see https://www.ebi.ac.uk/biostudies/files/S-BSST24/S-BSST24.json ). Also I am not sure what is possible with direct pagetab submissions... Do you think this affects the logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are scanning a submission for all the file lists i guess it's sensible to check we also have all the datasets...
I would eventually consider throwing a warning if # of filelists =/= # of datasets provided perhaps, since that sounds like something weird has gone on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning added

# Get list of study component titles to process
sc_titles_from_datasets_in_submission = {
dataset.title_id for dataset in datasets_in_submission
}
file_list_dicts = find_file_lists_in_submission(submission)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function (and the function that it calls) should be renamed, since it returns whole objects as long as they have a file list.

Or better, maybe this should return {<section name>: <file_list_name>} I doubt it's going to be used in a way in which we need all the extra information, and it's already looping through the sections, so this saves looping back over these objects (in terms of code readability, the lists are short so i'm not worried about time)

for file_list_dict in file_list_dicts:
study_component_name = file_list_dict["Name"]
fileref_to_study_components = {}
datasets_to_process = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a little bit confusing at first read because it felt like we already had all this information. Can we not do this loop once above with:

datasets_to_process = {
   ds.title_id: ds for ds in datasets_in_submission if ds.title_id in sc_titles_from_file_lists
}

if not datasets_to_process:
    message = f""" ...

Instead of getting the lists of titles, doing an intersection, and then going back to the lists to filter them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed implementation to use the above suggestion. However, did not change names of functions in biostudies module - as biostudies does not use the concept 'dataset'. Instead created a utility function that groups the results from biostudies into a dict whose keys are the dataset titles and values a lists of the file lists.

@kbab kbab marked this pull request as draft August 7, 2024 12:07
Create a warning message if the number of datasets passed into the
get_file_reference_by_dataset differs from the number of datasets
computed from all the file lists in the submission
@kbab kbab marked this pull request as ready for review August 8, 2024 14:13
Copy link
Contributor

@sherwoodf sherwoodf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kbab kbab merged commit b9d335a into main Aug 9, 2024
27 of 33 checks passed
@kbab kbab deleted the update-file-reference branch August 9, 2024 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants