Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address metadata during altexp merging and other metadata fixes #255

Merged
merged 22 commits into from
Dec 22, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
b8884e4
Remove previous metadata testing
sjspielman Dec 21, 2023
a7185b4
metadata handling overhaul to avoid duplicated code and improved checks
sjspielman Dec 21, 2023
77074d6
metadata function docs and a little reorg so they are all in the same…
sjspielman Dec 21, 2023
a46437c
Add tests for new functions, and modify a buggy part accordingly
sjspielman Dec 21, 2023
39e0854
use c() instead
sjspielman Dec 21, 2023
f35ed2f
add tests for altexp metadata for the condition when one altexp isn't…
sjspielman Dec 21, 2023
ee24fae
Merge sjspielman/150-next-steps and address conflict
sjspielman Dec 21, 2023
11e693e
Merge branch 'main' into sjspielman/250-merge-metadata
sjspielman Dec 21, 2023
64f3b8a
Apply suggestions from code review
sjspielman Dec 22, 2023
44b46fc
Don't transpose at all, use pluck instead. Update docs, and add to WO…
sjspielman Dec 22, 2023
9aea1e9
More plucking for library_metadata, and fix sample_metadata df constr…
sjspielman Dec 22, 2023
c237355
apparently transpose has been superceded by list_transpose
sjspielman Dec 22, 2023
ef6629a
Updated metadata tests
sjspielman Dec 22, 2023
c26d826
ensure correct metadata order for altexp metadata
sjspielman Dec 22, 2023
cf9a5d4
Apply suggestions from code review
sjspielman Dec 22, 2023
24e97b1
use has_altexp_name vector with keep and discard
sjspielman Dec 22, 2023
3598c94
Add metadata test for sample_metadata
sjspielman Dec 22, 2023
08351d8
dear pluck, you are now gone but not forgotten
sjspielman Dec 22, 2023
020b9bf
Merge branch 'main' into sjspielman/250-merge-metadata
jashapiro Dec 22, 2023
8ee0594
Update R/merge_sce_list.R
sjspielman Dec 22, 2023
f7b8037
Apply suggestions from code review
sjspielman Dec 22, 2023
6599cfc
Update R/merge_sce_list.R
jashapiro Dec 22, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
245 changes: 158 additions & 87 deletions R/merge_sce_list.R
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,9 @@ merge_sce_list <- function(
Please check that `retain_coldata_cols` was correctly specified.")
}

# Check metadata - library_id and sample_id must be present
check_metadata(sce_list)

# Check altExp compatibility, if we are including them
# altExps for a given name must all have the same features and the same assays
if (include_altexp) {
Expand All @@ -102,6 +105,17 @@ merge_sce_list <- function(

# Update retain_coldata_cols
retain_coldata_cols <- c(retain_coldata_cols, altexp_columns)

# Check each altExp metadata, for SCEs with the altExp
names(altexp_attributes) |>
purrr::walk(
\(altexp_name) {
sce_list |>
purrr::keep(\(sce) altexp_name %in% altExpNames(sce)) |>
purrr::map(altExp, altexp_name) |>
check_metadata()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the library id & sample id in the altExp? I guess it doesn't hurt, but it seems redundant. We have it there mostly because it was there in the original import.

But it is small, so I guess it makes sense to keep it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree it could sort of go either way. I figure if someone wanted to fully pull out the altexp it might be useful to have this information handy?

}
)
} else {
# Remove altexps if we are not including them
sce_list <- sce_list |>
Expand Down Expand Up @@ -136,17 +150,6 @@ merge_sce_list <- function(
warning("The provided `retain_coldata_cols` are not present in any SCEs.")
}

# check that library id and sample id are present in metadata
id_checks <- sce_list |>
purrr::map(\(sce) {
all(c("library_id", "sample_id") %in% names(metadata(sce)))
}) |>
unlist()

if (!all(id_checks)) {
stop("The metadata for each SCE object must contain `library_id` and `sample_id`.")
}

## Prepare main experiment of SCEs for merging --------------------
sce_list <- sce_list |>
purrr::imap(
Expand All @@ -159,45 +162,7 @@ merge_sce_list <- function(
)


## Handle metadata ---------------------------------------------
# get a list of metadata from the list of sce objects
# each library becomes an element within the metadata components
metadata_list <- sce_list |>
purrr::map(metadata) |>
purrr::transpose()

# flatten and reduce the library ids and sample ids
metadata_list$library_id <- metadata_list$library_id |>
unlist() |>
unique()

metadata_list$sample_id <- metadata_list$sample_id |>
unlist() |>
unique()

# if object has sample metadata then combine into a single data frame
if ("sample_metadata" %in% names(metadata_list)) {
sample_metadata <- metadata_list$sample_metadata |>
purrr::map(\(df) {
# make sure all column types are compatible first
df |>
dplyr::mutate(dplyr::across(dplyr::everything(), as.character))
}) |>
dplyr::bind_rows() |>
unique()

# check that all sample ids are found in the new sample metadata and warn if not
if (!all(metadata_list$sample_id %in% sample_metadata$sample_id)) {
warning("Not all sample ids are present in metadata(merged_sce)$sample_metadata.")
}

# replace sample metadata in metadata list
metadata_list$sample_metadata <- sample_metadata
}


## Handle altExps ------------------------------------------------------

# If we are including altExps, process them and save to list to add to merged SCE
if (include_altexp) {
for (altexp_name in names(altexp_attributes)) {
Expand All @@ -219,8 +184,35 @@ merge_sce_list <- function(
# Create the merged SCE from the processed list
merged_sce <- do.call(combineCols, unname(sce_list))

# Replace existing metadata list with merged metadata
metadata(merged_sce) <- metadata_list
# Update metadata in merged objects, using the unmerged sce_list
metadata(merged_sce) <- sce_list |>
purrr::map(metadata) |>
prepare_merged_metadata()

if (include_altexp) {
for (altexp_name in names(altexp_attributes)) {
has_altexp_name <- sce_list |>
purrr::map_lgl(\(sce) (altexp_name %in% altExpNames(sce)))

# For any SCEs without this altExp, create the library_id and sample_id metadata
additional_metadata <- sce_list |>
purrr::discard(has_altexp_name) |>
purrr::map(extract_metadata_for_altexp)

# Update metadata in altExps that were originally present
altexp_metadata_list <- sce_list |>
purrr::keep(has_altexp_name) |>
purrr::map(altExp, altexp_name) |>
purrr::map(metadata) |>
# Tack on the metadata we created for libraries without altExps
c(additional_metadata)

# Ensure correct order
altexp_metadata_list <- altexp_metadata_list[names(sce_list)]

metadata(altExp(merged_sce, altexp_name)) <- prepare_merged_metadata(altexp_metadata_list)
}
}

return(merged_sce)
}
Expand Down Expand Up @@ -304,9 +296,6 @@ prepare_sce_for_merge <- function(
colnames(sce) <- glue::glue("{sce_name}-{colnames(sce)}")
}

# Update metadata list
metadata(sce) <- update_sce_metadata(metadata(sce))

# return the processed SCE
return(sce)
}
Expand Down Expand Up @@ -358,39 +347,6 @@ prepare_altexp_for_merge <- function(
}



#' Helper function to update SCE metadata for merging
#'
#' @param metadata_list Current SCE metadata before modification
#'
#' @return Updated metadata list to store in the SCE
update_sce_metadata <- function(metadata_list) {
# first check that this library hasn't already been merged
if ("library_metadata" %in% names(metadata_list)) {
stop(paste(
"This SCE object appears to be a merged object",
"We do not support merging objects with objects that have already been merged."
))
}

# create library and sample metadata.
# library metadata will hold all the previous metadata fields, to avoid conflicts
library_metadata <- metadata_list[names(metadata_list) != "sample_metadata"]
sample_metadata <- metadata_list$sample_metadata

# combine into one list
metadata_list <- list(
library_id = metadata_list[["library_id"]],
sample_id = metadata_list[["sample_id"]],
library_metadata = library_metadata, # this will be all previous metadata for the given library
sample_metadata = sample_metadata # this will be the same as the previous sample_metadata
)

# return updated metadata
return(metadata_list)
}


#' Helper function to check altExp compatibility
#'
#' @param sce_list List of SCEs with altExps to check
Expand Down Expand Up @@ -445,3 +401,118 @@ get_altexp_attributes <- function(sce_list) {
}
return(altexp_attributes)
}





#' Prepare an updated list of metadata for the merged SCE
#'
#' The metadata will be reformatted to contain the following fields:
#' - `library_id`: vector of all library ids in the merged object
#' - `sample_id`: vector of all sample ids in the merged object
#' - `library_metadata`: pre-merge metadata list for each library, named by SCE name
#' - `sample_metadata`: a data frame with all distinct rows from the pre-merge sample_metadata data frames,
#' with all values coerced to character
#'
#' @param metadata_list List of metadata to update
#'
#' @return Updated metadata list to include in the merged SCE
prepare_merged_metadata <- function(metadata_list) {
# Define vectors of library and sample ids
metadata_library_ids <- metadata_list |>
purrr::map_chr("library_id")

metadata_sample_ids <- metadata_list |>
purrr::map_chr("sample_id")

# Grab names to check contents
transposed_names <- metadata_list |>
purrr::map(names) |>
purrr::reduce(union)

# first check that this library hasn't already been merged
if ("library_metadata" %in% transposed_names) {
stop(paste(
"This SCE object appears to be a merged object",
"We do not support merging objects with objects that have already been merged."
))
}

# Create new version of metadata_list without any sample_metadata fields
# Note that this will work even if `sample_metadata` is not present
library_metadata <- metadata_list |>
purrr::map(
\(meta) meta[which(names(meta) != "sample_metadata")]
)

# combine into final metadata list
final_metadata_list <- list(
# vector of all library ids
library_id = metadata_library_ids,
# vector of all sample ids
sample_id = metadata_sample_ids,
# list of existing metadata for each library, with sample_metadata removed
library_metadata = library_metadata
)

# if object has sample metadata then combine into a single data frame,
# and add this to the final_metadata_list
if ("sample_metadata" %in% transposed_names) {
sample_metadata <- metadata_list |>
purrr::map("sample_metadata") |>
purrr::map(\(df) {
# make sure all column types are compatible first
df |>
dplyr::mutate(dplyr::across(dplyr::everything(), as.character))
}) |>
dplyr::bind_rows() |>
dplyr::distinct()

# check that all sample ids are found in the new sample metadata and warn if not
if (!all(metadata_sample_ids %in% sample_metadata$sample_id)) {
warning("Not all sample ids are present in metadata(merged_sce)$sample_metadata.")
}

# add to final_metadata_list
final_metadata_list$sample_metadata <- sample_metadata
}

return(final_metadata_list)
}



#' Helper function to check that expected metadata fields are present in a given
#' list of SCEs.
#'
#' The default expected fields are `library_id` and `sample_id`. If either is missing from
#' any SCE, an error is thrown. This function does not return anything.
#'
#' @param sce_list List of SCEs to check
#' @param expected_fields a vector of metadata fields that should be present
check_metadata <- function(sce_list, expected_fields = c("library_id", "sample_id")) {
metadata_checks <- sce_list |>
purrr::map_lgl(\(sce) {
all(expected_fields %in% names(metadata(sce)))
})

if (!all(metadata_checks)) {
stop(glue::glue("
The metadata for each SCE object must contain {stringr::str_flatten_comma(expected_fields, ', and ')}.
If `include_altexp` is TRUE, these fields must also be present in all altExps.
"))
}
}

#' Helper function to extract main SCE metadata for inclusion in an altExp
#'
#' @param sce SCE object to extract metadata from
#'
#' @return List with fields `library_id` and `sample_id`
extract_metadata_for_altexp <- function(sce) {
list(
library_id = metadata(sce)$library_id,
sample_id = metadata(sce)$sample_id
)
}
4 changes: 4 additions & 0 deletions inst/WORDLIST
Original file line number Diff line number Diff line change
@@ -1,10 +1,14 @@

altExp
Alevin
CMD
commenter
Dockerfile
Kallisto
nf
pre
SCE
SCEs
PRs
RStudio
ScPCA
Expand Down
18 changes: 18 additions & 0 deletions man/check_metadata.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

17 changes: 17 additions & 0 deletions man/extract_metadata_for_altexp.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

22 changes: 22 additions & 0 deletions man/prepare_merged_metadata.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

17 changes: 0 additions & 17 deletions man/update_sce_metadata.Rd

This file was deleted.

Loading
Loading