Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address metadata during altexp merging and other metadata fixes #255

Merged
merged 22 commits into from
Dec 22, 2023

Conversation

sjspielman
Copy link
Member

@sjspielman sjspielman commented Dec 21, 2023

Closes #250

This (surprise!) PR deals with the metadata situation. To be frank, I did not formally diagnose whatever bug I thought might have been going on in #250 (indeed, there may have been no bug at all...), but I did get everything working as it should be which more or less is the goal anyways!

This was definitely necessary because, while writing tests for #150, I realized we had yet to fully deal with altExp metadata in the first place, and this itself was distracting me while writing said tests. So, I went ahead and tackled the metadata situation now via a bit of code refactoring. IMO, this makes our code much more streamlined and modular, and much easier to follow (for me, at least..) compared to what it was before.

This PR consolidates metadata preparation code as follows:

  • At the top of merge_sce_list(), I check that the expected metadata fields are present. This is handled by the new function check_metadata().
  • After merging, I update the previous metadata into the correct merged format and add it into the merged SCE. This is handled by the new function prepare_merged_metadata() - this function now handles every aspect of the metadata, which again I think greatly simplifies the script overall and makes it easier to follow.
    • For any altExps that didn't exist and were create on the fly with combineCols, I also create a "MVM" (minimum viable metadata) list to include in the final merged object with extract_metadata_for_altexp(). All other potential metadata fields will therefore end up get filled in with NULL.
      • Note that extract_metadata_for_altexp() could also be an anonymous function in the purrr::map() directly?
      • Note also that prepare_merged_metadata() takes a metadata list rather than SCE so that we can use it in situations where there is no metadata, ala for an altexp that didn't originally exist

All is tested and passing locally. Enjoy today's🎢 !

@sjspielman
Copy link
Member Author

sjspielman commented Dec 21, 2023

Noting I'm going to wait to officially request review here until #251 is merged, but it is ready (aka this is not "draft" status) if folks want to look sooner!

Base automatically changed from sjspielman/150-next-steps to main December 21, 2023 21:41
@sjspielman sjspielman requested a review from jashapiro December 21, 2023 21:41
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much of this looks good, but I think you made a pretty big change to the contents of library_metadata in that you are now returning the transposed data. I think we want to leave it as it was, with contents organized first by the SCE name, such that metadata(merged_sce)$library_metadata[["sce1"]] returns the original metadata as it went in for each SCE.

Note that you may not want to just re-transpose, as that will add in NULL values for any metadata fields that are not present in all samples.

I had a few other comments below, but I think most of those are more minor, except maybe for the one about preserving order in the altExp metadata.

R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
sce_list |>
purrr::keep(\(sce) altexp_name %in% altExpNames(sce)) |>
purrr::map(altExp, altexp_name) |>
check_metadata()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the library id & sample id in the altExp? I guess it doesn't hurt, but it seems redundant. We have it there mostly because it was there in the original import.

But it is small, so I guess it makes sense to keep it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it could sort of go either way. I figure if someone wanted to fully pull out the altexp it might be useful to have this information handy?

R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
tests/testthat/test-merge_sce_list.R Outdated Show resolved Hide resolved
@sjspielman
Copy link
Member Author

Noting the GHA error -

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    tensorstore==0.1.36 from https://files.pythonhosted.org/packages/0c/7e/ec209ed096cc55b10e4d25a84f6683b8916d4808fa0f92881e24de5a30bb/tensorstore-0.1.36-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (from -r /tmp/Rtmp6pk8WG/renv-requirements-14d51b25703a.txt (line 131)):
        Expected sha256 d50b27919cde623e3918fe6ba054f41e2da5d7dbf7817d46d43131b50bcc9df4
             Got        0e5e398e0345ef50eae04967b739f9517b534c5d5a3250d402bfbd451c35a045

@jashapiro
Copy link
Member

Noting the GHA error -

This can be safely ignored as a transient cache problem, unless we start to see it on every build. renv does not store the caches in the requirements file.

@sjspielman
Copy link
Member Author

Passing locally! ✅

R/merge_sce_list.R Outdated Show resolved Hide resolved
@sjspielman sjspielman requested a review from jashapiro December 22, 2023 16:19
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, a bunch of small comments, but I think we are almost there.

The biggest thing is probably just adding a few tests that we get the sample_metadata table correct when we merge the metadata. We do check it as part of the main merge checks, a bit, but it would be good to have those tests in the prepare_merged_metadata tests as well.

R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
for (altexp_name in names(altexp_attributes)) {
# For any SCEs without this altExp, create the library_id and sample_id metadata
additional_metadata <- sce_list |>
purrr::keep(\(sce) !(altexp_name %in% altExpNames(sce))) |>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the named functions and fewer bangs

Suggested change
purrr::keep(\(sce) !(altexp_name %in% altExpNames(sce))) |>
purrr::discard(\(sce) (altexp_name %in% altExpNames(sce))) |>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might actually pull the test into a separate line before the keep and discard, just to save recalculating it, and use that in both keep and discard.

has_altexp <- purrr::map_lgl(\(sce) (altexp_name %in% altExpNames(sce))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R/merge_sce_list.R Outdated Show resolved Hide resolved
tests/testthat/test-merge_sce_list.R Outdated Show resolved Hide resolved
)
})

test_that("prepare_merged_metadata works as expected, with sample_metadata present", {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a couple more tests in here that the sample_metadata is what we expect as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +661 to +686
test_that("prepare_merged_metadata works as expected from altExps metadata (aka, sample_metadata NOT present)", {
observed_metadata <- sce_list_with_altexp |>
purrr::map(altExp) |>
purrr::map(metadata) |>
prepare_merged_metadata()

expect_setequal(
names(observed_metadata),
c("library_id", "sample_id", "library_metadata")
)

expect_setequal(
observed_metadata$library_id,
glue::glue("library-{names(sce_list)}")
)
expect_setequal(
observed_metadata$sample_id,
glue::glue("sample-{names(sce_list)}")
)
expect_setequal(
observed_metadata$library_metadata |>
purrr::map(names) |>
purrr::reduce(intersect),
c("library_id", "sample_id", "mapped_reads", "ambient_profile")
)
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these tests are probably redundant. You already do much of the required testing for altExp metadata in the tests for the full merge with altExps, so this is really just repeating the prepare_merged_metadata() tests with a slightly different set of data. But it is good that you are testing that the behavior is correct when sample_metadata doesn't exist, so I guess maybe leave it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had almost the identical thought process about redundancy and then deciding maybe it was ok!

tests/testthat/test-merge_sce_list.R Outdated Show resolved Hide resolved
tests/testthat/test-merge_sce_list.R Outdated Show resolved Hide resolved
@sjspielman sjspielman requested a review from jashapiro December 22, 2023 19:25
Co-authored-by: Joshua Shapiro <[email protected]>
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made one small variable name suggestion, but I think this should be good to go.

R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
R/merge_sce_list.R Outdated Show resolved Hide resolved
@sjspielman sjspielman merged commit ef93b9f into main Dec 22, 2023
2 checks passed
@sjspielman sjspielman deleted the sjspielman/250-merge-metadata branch December 22, 2023 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix merging metadata bug
2 participants