
readVizgen() and readXenium() #12

Merged: 136 commits merged into pachterlab:devel from alikhuseynov:seurat_v4 on Jan 31, 2024

Conversation

@alikhuseynov (Collaborator) commented Sep 8, 2023

Hi,
as discussed in #3, I did a major update to readVizgen() to handle both older and the newest Vizgen/Merscope output data, as well as .filter_polygons().

  • New optional args; they default to FALSE, though I'm not sure what the best defaults would be:
    • filter_counts = FALSE # whether to keep only cells with counts > 0; this will subset the polys as well
    • add_molecules = FALSE # whether to add molecule/transcript coords to rowGeometries
    • use_bboxes = FALSE # if no segmentation output is present, use cell_metadata to make bounding boxes instead

I tested it on a few older and newer datasets we have here; all went fine on my side.
Only plotting that includes the image_id has a small issue with Voyager:::geom_spi_rgb(., max_col_value = 255); I will open a separate issue for Voyager on this.
I use the tidyverse and the pipe operator from magrittr a lot in the implementation; I hope that's ok.

I will work further on a converter between SFE and Seurat (SFE <-> Seurat); is utils.R the proper place to add it?

This is the run example:

# install this PR
devtools::install_github(repo = "alikhuseynov/SpatialFeatureExperiment", ref = "seurat_v4")

# load libs ----
suppressPackageStartupMessages({
  library(ggplot2)
  library(Seurat)
  library(dplyr)
  library(magrittr)
  library(BiocParallel)
  library(scuttle)
  library(SpatialFeatureExperiment)
  library(Voyager)
  library(sf)
})

# set directory to read from
dir_use <- "./merfish_seurat_debug/" # or something like "./region_0/"

message("Reading data from:" , "\n", dir_use)
gc() %>% invisible()

start.time <- Sys.time()
sfe <-
  readVizgen(data_dir = dir_use,
             z = 3L,
             sample_id = "test.merfish",
             min_area = 15,
             image = "DAPI",
             flip = "geometry", 
             max_flip = "50 MB",
             filter_counts = TRUE, # keep cells with counts > 0, this will subset cell segmentation as well 
             add_molecules = TRUE, # add molecule/transcript coords to `rowGeometries`
             use_bboxes = TRUE, # if no segmentation output is present, use `cell_metadata` to make bounding boxes
             BPPARAM = BiocParallel::MulticoreParam(14, 
                                                    tasks = 80L,
                                                    force.GC = FALSE, 
                                                    progressbar = TRUE)
  )
end.time <- Sys.time()
message("Time taken to read data = ", 
        round(end.time - start.time, digits = 2), " ", 
        attr(c(end.time - start.time), "units"))
sfe 

update from `devel` branch
..major update for `readVizgen()` and a few helper functions; it supports old and new Vizgen/Merscope output
..added helper function `callMeta()` to return a df from an SFE object
@lambdamoses (Collaborator)

Thank you for the PR! I can change the magrittr pipe to the base pipe so I don't introduce another dependency. The PR will go into the Bioc 3.18 version of Voyager, and R 4.3 is required anyway so the base pipe should be available. Please put the Seurat conversion function in the coerce.R file where the SCE and SPE conversion functions are. After reviewing the PR, I might add you as an author of the Voyager paper before (re)submitting it.

@alikhuseynov (Collaborator, Author)

Sounds great, thanks! 👍
I did it on R 4.3.1 and can change the magrittr pipes in this PR to the base pipe operator, no problem. If you have a few Vizgen datasets for independent testing on your side, that would be great.

@alikhuseynov (Collaborator, Author)

It seems that the base pipe |> does not like curly braces:
function '{' not supported in RHS call of a pipe
Also, according to an R pipes comparison video on youtu.be, the execution speed of the base pipe is slower than magrittr's, especially when used with dplyr or the tidyverse.

I think adding magrittr as a dependency would be the easiest solution; what do you think? Otherwise all the code would need to be changed to a base-R style. The pipe operator just makes the code easier to follow, especially when it's linear.
Also, I could not find a base-pipe equivalent of magrittr's assignment pipe %<>%, which is quite handy.
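A minimal illustration (assuming magrittr is attached; the commented base-pipe line reproduces the error above):

# magrittr allows a braced expression on the right-hand side
library(magrittr)
x <- c(5, 1, 3)
x %>% { sum(.) / length(.) }

# the base pipe (R >= 4.1) rejects the same form:
# x |> { sum(.) / length(.) }
# Error: function '{' not supported in RHS call of a pipe

# magrittr's assignment pipe modifies x in place; there is no base equivalent
x %<>% sort(decreasing = TRUE)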

Let me know what changes I need to make here.
Thanks! 😊

..some functions are masked by other loaded libs, which seems to break things. Using explicit calls, e.g. `dplyr::slice()`, rather than `importFrom` -> r-lib/roxygen2#217
@lambdamoses (Collaborator)

While I use the Tidyverse in my analyses, I usually stick to base R as far as I can in packages to avoid additional dependencies. Unless using the Tidyverse brings a significant performance improvement that matters to users, I would prefer base R. So that you don't have to spend too much time on this, I can translate the code to base R.

@lambdamoses (Collaborator)

Since dplyr and magrittr are already commonly used, it's most likely fine, though it doesn't seem too difficult to translate your code to base R, so I'll still try translating.

@alikhuseynov (Collaborator, Author) commented Sep 9, 2023

Ok, sounds good. Thanks for translating to base R; tidyverse and magrittr are really just a personal preference.
The last commit fixed a few bugs related to functions masked by different libs. Loading the example dataset -> system.file("extdata/vizgen", package = "SpatialFeatureExperiment") also works, for both .parquet and .hdf5 files. That directory lacks a detected_transcripts file, so it won't work when add_molecules = TRUE.
Hopefully it will now pass the unit tests.
Thanks

@alikhuseynov mentioned this pull request Sep 9, 2023
@lambdamoses (Collaborator)

I'm reviewing your pull request and I have translated your code into base R. That said, I might end up using the tidyverse or data.table to speed up the transcript coordinate part when I test it on that 30 GB transcript spot file. I noticed that you search recursively for the cell metadata and transcript spot files if they are not found at the top level. I based my original function on the directory structure of the example datasets on the Vizgen website, like this:

Are there other standard directory structures in which the cell metadata and transcript coordinates are inside another directory?

At this point I'm also not sure whether it would be more efficient to load the transcript coordinates into memory and convert them into MULTIPOINT, or to convert them to a raster with terra in a separate function, write it to disk, and let terra handle the on-disk operations without loading the whole thing into memory. I'm not sure what use there is for loading the transcript coordinates into memory; maybe one use is to plot them over a very zoomed-in image of cells. I talked to the Giotto team at the Bioconductor conference in Boston; they said converting to raster makes it faster to match transcript spots to cells.
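As a rough sketch of the raster idea (not the eventual implementation; the column names global_x/global_y are assumed from typical Vizgen detected_transcripts output, and the 10 µm bin size is arbitrary):

library(terra)
# spots: data.frame of transcript coordinates for one gene
pts <- vect(spots, geom = c("global_x", "global_y"))
template <- rast(ext(pts), resolution = 10)     # empty raster over the spots' extent
dens <- rasterize(pts, template, fun = length)  # spot count per bin
plot(dens)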

@lambdamoses (Collaborator) commented Sep 23, 2023

Can you give me a schematic for the directory structure standards for the older and newest Vizgen output you mentioned? This should also include file name conventions. If the column names in the cell metadata and detected transcripts are different between versions, please provide those as well. This is so that I can make toy test data and write unit tests for your function.

@alikhuseynov (Collaborator, Author) commented Sep 23, 2023

> I'm reviewing your pull request and I have translated your code into base R. That said, I might end up using the tidyverse or data.table to speed up the transcript coordinate part when I test it on that 30 GB transcript spot file. I noticed that you search recursively for the cell metadata and transcript spot files if they are not found at the top level. I based my original function on the directory structure of the example datasets on the Vizgen website, like this:
>
> Are there other standard directory structures in which the cell metadata and transcript coordinates are inside another directory?
>
> At this point I'm also not sure whether it would be more efficient to load the transcript coordinates into memory and convert them into MULTIPOINT, or to convert them to a raster with terra in a separate function, write it to disk, and let terra handle the on-disk operations without loading the whole thing into memory. I'm not sure what use there is for loading the transcript coordinates into memory; maybe one use is to plot them over a very zoomed-in image of cells. I talked to the Giotto team at the Bioconductor conference in Boston; they said converting to raster makes it faster to match transcript spots to cells.

Thanks a lot. tidyverse and data.table (then one doesn't need to specify anything in the vroom call) would be the most efficient ways to load the data; that's what I used for the Seurat PR.
As for the transcript coordinates, loading them is optional and up to the user. If one can handle them with terra and turn them into a raster, that would be awesome.
Most of the time those molecule coordinates are used for plots (zoomed-in or cropped/bbox-like areas). For the whole section, I would suggest subsampling them prior to plotting, using an arg like n_mols = 1000 inside the plotting function, and overlaying them together with polygon geometries (centroids or cell boundaries), e.g. as in the sketch below.
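Something along these lines, as a rough sketch only (mols stands for an sf POINT data frame of molecule coordinates for one gene, and n_mols = 1000 is the hypothetical argument mentioned above):

library(sf)
library(ggplot2)
set.seed(1)
n_mols <- 1000
mols_sub <- mols[sample(nrow(mols), min(n_mols, nrow(mols))), ]
ggplot() +
  geom_sf(data = cellSeg(sfe), fill = NA, linewidth = 0.1) +  # cell boundaries
  geom_sf(data = mols_sub, size = 0.3, alpha = 0.5)           # subsampled molecules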

@alikhuseynov (Collaborator, Author)

> Can you give me a schematic for the directory structure standards for the older and newest Vizgen output you mentioned? This should also include file name conventions. If the column names in the cell metadata and detected transcripts are different between versions, please provide those as well. This is so that I can make toy test data and write unit tests for your function.

Sure, yeah, since users might have both older and newer data. Here are examples of the output dirs (a minimal lookup sketch follows the column notes below):

# for hdf5 files
region_0/
|-- cell_boundaries
|-- cell_by_gene.csv
|-- cell_metadata.csv
|-- detected_transcripts.csv
|-- images
`-- summary.png

# this one has Cellpose dir
region_0/
|-- Cellpose
|   |-- cellpose_cell_by_gene.csv
|   |-- cellpose_cell_metadata.csv
|   |-- cellpose_micron_space.parquet
|   `-- cellpose_mosaic_space.parquet
|-- detected_transcripts.csv
|-- images
`-- summary.png

# same as above but Cellpose dir has another name
region_0/
|-- Cellpose_polyT_CB3
|   |-- cellpose_cell_by_gene.csv
|   |-- cellpose_cell_metadata.csv
|   |-- cellpose_micron_space.parquet
|   `-- cellpose_mosaic_space.parquet
|-- detected_transcripts.csv
|-- images
`-- summary.png

# the most recent version
|-- cell_boundaries.parquet
|-- cell_by_gene.csv
|-- cell_metadata.csv
|-- detected_transcripts.csv
`-- images

# some users had no segmentations
|-- cell_by_gene.csv
|-- cell_metadata.csv
|-- detected_transcripts.csv
`-- images

For cell_metadata:

  • a transcript_count col was added,
  • cell IDs were previously stored in ...1 (the unnamed first column); the new version has a cell col

For detected_transcripts:

  • a cell_id col was added to match molecule coordinates to cell IDs; -1 means not assigned to a cell.
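Given the layouts above, a minimal lookup sketch in base R (find_file is a hypothetical helper, not the actual readVizgen code; the patterns follow the file names shown above):

find_file <- function(data_dir, pattern) {
  f <- list.files(data_dir, pattern = pattern, full.names = TRUE)
  if (!length(f))  # fall back to subdirectories such as Cellpose/
    f <- list.files(data_dir, pattern = pattern, full.names = TRUE, recursive = TRUE)
  f
}
meta_file  <- find_file(dir_use, "cell_metadata")
polys_file <- find_file(dir_use, "micron_space\\.parquet$|cell_boundaries")
mols_file  <- find_file(dir_use, "detected_transcripts")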

@alikhuseynov (Collaborator, Author) left a comment


use_cellpose is no longer needed

@alikhuseynov (Collaborator, Author)

As for upcoming PRs, should I write in a more base-R style so that you don't need to do much additional translation from the magrittr/tidyverse pipe style? The pipe style is just easier to follow/understand, especially where there are a number of operations; generally I try to avoid a nested coding style for that reason.

@lambdamoses (Collaborator)

> use_cellpose is no longer needed

I put it back for this reason: when running code in a separate R session, such as rendering the package website with pkgdown or running unit tests on CI, sometimes the arrow package can't be loaded; I don't know why, because it works when run interactively. I always run unit tests locally before pushing, and I need to force HDF5 for the remote processes to work.

@lambdamoses (Collaborator)

I can try dusting off my C++ to stream the ginormous transcript spot file line by line, constructing the raster or a parquet file without ever loading the whole thing into memory. In principle it can be parallelized. If all we need is to plot a very zoomed-in region, then it shouldn't be necessary to load the whole thing, and for a heatmap of transcript density for one gene, it shouldn't be necessary to load the other genes.
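An R-level alternative to the C++ streaming could be arrow's lazy datasets, sketched here under the assumption of typical detected_transcripts column names (gene, global_x/global_y/global_z); only the filtered rows are ever materialized:

library(arrow)
library(dplyr)
ds <- open_dataset(file.path(dir_use, "detected_transcripts.csv"), format = "csv")
spots_one_gene <- ds |>
  filter(gene == "GeneA", global_z == 3) |>  # hypothetical gene and z-plane
  select(global_x, global_y, gene) |>
  collect()                                  # only this subset is loaded into memory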

@alikhuseynov (Collaborator, Author)

> I can try dusting off my C++ to stream the ginormous transcript spot file line by line, constructing the raster or a parquet file without ever loading the whole thing into memory. In principle it can be parallelized. If all we need is to plot a very zoomed-in region, then it shouldn't be necessary to load the whole thing, and for a heatmap of transcript density for one gene, it shouldn't be necessary to load the other genes.

Yeah, if it's only for visualization, then it shouldn't load everything/all genes into memory.
However, if one wants to do spatial binning on the transcript coords using sf::st_make_grid, then the rowGeometries would need to be present/loaded into memory, e.g. as in the sketch below.
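For example, a minimal binning sketch (mols again stands for an sf POINT geometry of spots for one gene; the 100 µm grid cell size is arbitrary):

library(sf)
grid <- st_make_grid(mols, cellsize = 100, square = TRUE)
bins <- st_sf(geometry = grid)
bins$n_spots <- lengths(st_intersects(bins, mols))  # spots per grid cell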

@alikhuseynov (Collaborator, Author) commented Sep 26, 2023

You have probably already heard of MoleculeExperiment, which uses BumpyMatrix to store molecule coords.
They emphasize the analysis value of molecules that are not assigned to a segmented cell.
Could BumpyDataFrameMatrix perhaps be used instead to store rowGeometries?
Generally, one could leave it up to the user whether to load all molecules and use them for analysis; if it's only for visualization, then your approach of not loading all molecule coords into memory would be enough.

@lambdamoses (Collaborator)

> You have probably already heard of MoleculeExperiment, which uses BumpyMatrix to store molecule coords. They emphasize the analysis value of molecules that are not assigned to a segmented cell. Could BumpyDataFrameMatrix perhaps be used instead to store rowGeometries? Generally, one could leave it up to the user whether to load all molecules and use them for analysis; if it's only for visualization, then your approach of not loading all molecule coords into memory would be enough.

I still want something that supports geometric operations, so I would prefer sf, geoarrow, or terra. One could perform point process analyses with spatstat, but that requires spatstat's own class.

@lambdamoses (Collaborator)

I'm still working on this pull request, and I really tend to overthink. I noted that you filter the transcript spot coordinates by z-plane, but these coordinates are actually 3D, and there are also images at each z-plane. It's just that, at least for the example datasets from Vizgen, the cell segmentation is the same in all z-planes. Is that always the case? Since you work with MERFISH data a lot, I wonder if you find it useful to visualize transcript spots from all z-planes or only one z-plane. I think it could be useful to have the option to load data from all z-planes.

Also, how useful do you think the cell IDs for transcript spots are? Having the transcript assignment to cells can help with analyses of transcript localization within each cell, but when the cell segmentation is only in one z-plane, these analyses can't get very far. If cell IDs are useful, then maybe the transcript spots should be loaded into colGeometries. Maybe I can write a function just to read the transcript spots, with options to use cell IDs and read into colGeometries, or to read all spots into rowGeometries regardless of whether they're in cells.

I'm not prioritizing ways to avoid loading everything into memory for now, since I have other features to work on for Bioc 3.18.

@alikhuseynov (Collaborator, Author)

Currently the SFE Xenium count matrix is of class 'DelayedMatrix', whereas the Vizgen one is of class 'dgCMatrix':

assay(sfe_xen, "counts") %>% str
Formal class 'DelayedMatrix' [package "DelayedArray"] with 1 slot
...
assay(sfe_vz, "counts") %>% str
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
...

Should we keep them the same, i.e. use either 'DelayedMatrix' or 'dgCMatrix' for both?

@lambdamoses (Collaborator)

Yes, please keep them the same. Having DelayedMatrix is helpful for larger datasets so you don't have to load everything into memory, but it makes computation slower.

@alikhuseynov (Collaborator, Author)

Should we always keep cell IDs in $ID in cellSeg (and nucSeg)? It is currently removed in readVizgen (readCosMX keeps $ID in the polys as well).

@alikhuseynov (Collaborator, Author)

> Yes, please keep them the same. Having DelayedMatrix is helpful for larger datasets so you don't have to load everything into memory, but it makes computation slower.

Sounds good; readCosMX also has 'dgCMatrix', so that should be fixed as well then.

@lambdamoses (Collaborator)

Oh, hold on, I don't mean making them all DelayedMatrix. What I mean is to keep whatever class is originally read in. In the case of Xenium, DropletUtils::read10xCounts reads the h5 file as a DelayedMatrix. dgCMatrix is fine for other technologies where CSVs are read in.
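For reference, a quick check under these assumptions (the object names sfe_xen and sfe_vz are taken from the snippet above; the coercion is something a user could do afterwards, not part of the readers, and assumes the DelayedMatrix has a sparse backend):

library(SummarizedExperiment)
class(assay(sfe_xen, "counts"))  # DelayedMatrix, backed by the Xenium h5 file
class(assay(sfe_vz,  "counts"))  # dgCMatrix, read into memory from csv
# densify on demand if in-memory computation is preferred (can be memory-heavy)
counts_mem <- as(assay(sfe_xen, "counts"), "dgCMatrix")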

@alikhuseynov (Collaborator, Author) commented Jan 4, 2024

> Oh, hold on, I don't mean making them all DelayedMatrix. What I mean is to keep whatever class is originally read in. In the case of Xenium, DropletUtils::read10xCounts reads the h5 file as a DelayedMatrix. dgCMatrix is fine for other technologies where CSVs are read in.

Ok, then we keep whatever class it is read in as.

@alikhuseynov (Collaborator, Author)

> Should we always keep cell IDs in $ID in cellSeg (and nucSeg)? It is currently removed in readVizgen (readCosMX keeps $ID in the polys as well).

I think keeping $ID in the polys is better for convenience; I will add it to readVizgen.
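Assuming $ID is kept, something like this should then work (cellSeg() is the existing SFE accessor; the ID column name is the convention proposed here):

polys <- cellSeg(sfe)  # sf data frame of cell polygons
head(polys$ID)         # cell IDs kept alongside the geometry
# e.g. check that polygons match the count matrix by cell ID
all(polys$ID %in% colnames(sfe))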

`sample_id` is now in row/colGeometries
if `min_phred = NULL`, all features/genes will be present; otherwise the SFE will have only the features present in rowGeometries
@lambdamoses merged commit 2574409 into pachterlab:devel on Jan 31, 2024
@alikhuseynov deleted the seurat_v4 branch on March 25, 2024