Update file reference #143

kbab · 2024-08-06T16:28:11Z

Modify creation of FileReference during ingestion to align with new agreed format

FileReference now contains a field pointing to its parent dataset
Write tests for ingestion of FileReference

bia-ingest-shared-models/test/utils.py

sherwoodf · 2024-08-07T08:33:46Z

bia-ingest-shared-models/bia_ingest_sm/conversion/file_reference.py

+        sc_titles_from_datasets_in_submission.intersection(sc_titles_from_file_lists)
+    )
+
+    if not study_components_to_process:


Can there be file lists in a submission that aren't in a dataset?

I don't think so. However, files may be attached directly to submissions if the bioimaging template is not used by the ST (e.g. see https://www.ebi.ac.uk/biostudies/files/S-BSST24/S-BSST24.json ). I am also not sure what is possible with direct pagetab submissions... Do you think this affects the logic?

If we are scanning a submission for all the file lists, I guess it's sensible to check we also have all the datasets...
I would eventually consider throwing a warning if the number of file lists differs from the number of datasets provided, since that suggests something weird has gone on.

Warning added

sherwoodf · 2024-08-07T09:02:29Z

bia-ingest-shared-models/bia_ingest_sm/conversion/file_reference.py

+    # Get list of study component titles to process
+    sc_titles_from_datasets_in_submission = {
+        dataset.title_id for dataset in datasets_in_submission
+    }
    file_list_dicts = find_file_lists_in_submission(submission)


I think this function (and the function it calls) should be renamed, since it returns whole objects as long as they have a file list.

Or better, maybe this should return {<section name>: <file_list_name>}. I doubt it's going to be used in a way that needs all the extra information, and it's already looping through the sections, so this saves looping back over these objects. (This is about code readability; the lists are short, so I'm not worried about time.)
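A minimal sketch of the suggested return shape. It assumes a simplified, hypothetical section structure (plain dicts with `name`, `file_list` and `subsections` keys) and a hypothetical function name; the actual biostudies models in the repository will differ.

```python
from typing import Dict, Optional


def map_sections_to_file_lists(
    section: dict, found: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Walk a (hypothetical) nested section dict once and
    return {section name: file list name}."""
    if found is None:
        found = {}
    # Only sections that actually carry a file list are recorded.
    file_list = section.get("file_list")
    if file_list is not None:
        found[section["name"]] = file_list
    # Recurse into subsections, accumulating into the same dict.
    for subsection in section.get("subsections", []):
        map_sections_to_file_lists(subsection, found)
    return found


# Hypothetical example data, not taken from a real submission.
submission_section = {
    "name": "Study",
    "subsections": [
        {"name": "Study Component 1", "file_list": "file_list_1.json"},
        {"name": "Study Component 2", "file_list": "file_list_2.json"},
        {"name": "Author"},
    ],
}
section_to_file_list = map_sections_to_file_lists(submission_section)
```

Returning only names keeps the caller from having to loop back over whole section objects to pull out the two values it needs.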

sherwoodf · 2024-08-07T09:27:17Z

bia-ingest-shared-models/bia_ingest_sm/conversion/file_reference.py

-    for file_list_dict in file_list_dicts:
-        study_component_name = file_list_dict["Name"]
+    fileref_to_study_components = {}
+    datasets_to_process = {


This was a little bit confusing at first read because it felt like we already had all this information. Can we not do this loop once above with:

    datasets_to_process = {
        ds.title_id: ds
        for ds in datasets_in_submission
        if ds.title_id in sc_titles_from_file_lists
    }
    if not datasets_to_process:
        message = f""" ...

Instead of getting the lists of titles, doing an intersection, and then going back to the lists to filter them?
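To illustrate why the single pass is equivalent, here is a small sketch comparing both approaches. The `Dataset` class and the example titles are stand-ins invented for illustration; only `title_id`, `datasets_in_submission` and `sc_titles_from_file_lists` come from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    # Stand-in for the real dataset model; only title_id matters here.
    title_id: str


datasets_in_submission = [Dataset("Component A"), Dataset("Component B")]
sc_titles_from_file_lists = {"Component A", "Component C"}

# Two-step version: collect titles, intersect, then filter the list again.
sc_titles_from_datasets_in_submission = {
    ds.title_id for ds in datasets_in_submission
}
titles_to_process = sc_titles_from_datasets_in_submission.intersection(
    sc_titles_from_file_lists
)
two_step = {
    ds.title_id: ds
    for ds in datasets_in_submission
    if ds.title_id in titles_to_process
}

# Single-pass version: filter directly against the file list titles.
datasets_to_process = {
    ds.title_id: ds
    for ds in datasets_in_submission
    if ds.title_id in sc_titles_from_file_lists
}
```

Both dicts contain the same entries, because a title from `datasets_in_submission` is in the intersection exactly when it is in `sc_titles_from_file_lists`.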

Changed the implementation to use the above suggestion. However, I did not change the names of the functions in the biostudies module, as biostudies does not use the concept 'dataset'. Instead, I created a utility function that groups the results from biostudies into a dict whose keys are the dataset titles and whose values are lists of the file lists.
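A rough sketch of what such a grouping utility could look like. The function name and the `"File List"` key are assumptions made for illustration (only the `"Name"` key appears in the diff), so this is not the code that was merged.

```python
from collections import defaultdict
from typing import Dict, List


def group_file_lists_by_dataset_title(
    file_list_dicts: List[dict],
) -> Dict[str, List[str]]:
    """Group file lists under the title of the study component
    (dataset) they belong to."""
    grouped = defaultdict(list)
    for file_list_dict in file_list_dicts:
        # "Name" is the study component title; "File List" is an assumed key.
        grouped[file_list_dict["Name"]].append(file_list_dict["File List"])
    return dict(grouped)


# Hypothetical input resembling what the biostudies helper might return.
file_list_dicts = [
    {"Name": "Component A", "File List": "a1.json"},
    {"Name": "Component B", "File List": "b.json"},
    {"Name": "Component A", "File List": "a2.json"},
]
file_lists_by_title = group_file_lists_by_dataset_title(file_list_dicts)
```

Keeping the grouping in a separate utility leaves the biostudies functions free of the 'dataset' concept.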

Create a warning message if the number of datasets passed into the get_file_reference_by_dataset differs from the number of datasets computed from all the file lists in the submission
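A minimal sketch of such a check, using the standard `logging` module. The helper name, its parameters and the message wording are hypothetical; the merged code may report the discrepancy differently.

```python
import logging

logger = logging.getLogger(__name__)


def warn_on_dataset_count_mismatch(
    n_datasets_passed: int, n_datasets_from_file_lists: int
) -> bool:
    """Log a warning and return True when the two counts differ."""
    if n_datasets_passed != n_datasets_from_file_lists:
        logger.warning(
            "Number of datasets passed (%d) differs from the number "
            "computed from file lists in the submission (%d)",
            n_datasets_passed,
            n_datasets_from_file_lists,
        )
        return True
    return False
```

A warning rather than an exception lets ingestion continue while still flagging submissions where something unexpected has happened.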

sherwoodf

LGTM

kbab added 5 commits August 6, 2024 16:59

Save WIP after modifying file reference to store dataset uuid

e00783c

Implement creating FileReference objects and associated tests

1772cc6

Add test for FileReference objects

ab3844f

Fix bug in test due to arbitrary ordering of sets

d2fffc6

Fix missing version in FileReference format with Black

ba27429

kbab requested a review from sherwoodf August 6, 2024 16:28

kbab had a problem deploying to test August 6, 2024 16:28 — with GitHub Actions Failure

kbab temporarily deployed to test August 6, 2024 16:28 — with GitHub Actions Inactive

sherwoodf reviewed Aug 7, 2024


kbab marked this pull request as draft August 7, 2024 12:07

Save WIP addressing PR review comments

429d4cd

kbab had a problem deploying to test August 7, 2024 17:13 — with GitHub Actions Failure

Add warning message for discrepancy in number of datasets

0f97002


kbab had a problem deploying to test August 8, 2024 14:09 — with GitHub Actions Failure

kbab marked this pull request as ready for review August 8, 2024 14:13

sherwoodf approved these changes Aug 9, 2024


kbab merged commit b9d335a into main Aug 9, 2024
27 of 33 checks passed

kbab deleted the update-file-reference branch August 9, 2024 10:10
