
readVizgen() and readXenium() #12

Merged: 136 commits merged into pachterlab:devel from alikhuseynov:seurat_v4 on Jan 31, 2024

Conversation

@alikhuseynov (Collaborator) commented Sep 8, 2023

Hi,
as discussed in #3, I did a major update to readVizgen() to handle both older and the newest Vizgen/Merscope output data, as well as .filter_polygons().

  • New optional args; they default to FALSE, though I'm not sure what the best defaults would be:
    • filter_counts = FALSE # whether to keep only cells with counts > 0; this will subset the polys as well
    • add_molecules = FALSE # whether to add molecule/transcript coords to rowGeometries
    • use_bboxes = FALSE # if no segmentation output is present, use cell_metadata to make bounding boxes instead

I tested it on a few older and newer datasets we have here; all went fine on my side.
Only plotting that includes the image_id has a small issue with Voyager:::geom_spi_rgb(., max_col_value = 255); I will open a separate issue for Voyager on this.
I use the tidyverse and the pipe operator from magrittr a lot in the implementation; I hope that's ok.

I will work further on a converter between SFE and Seurat (SFE <-> Seurat); is utils.R the proper place to add it?

This is the run example:

# install this PR
devtools::install_github(repo = "alikhuseynov/SpatialFeatureExperiment", ref = "seurat_v4")

# load libs ----
suppressPackageStartupMessages({
  library(ggplot2)
  library(Seurat)
  library(dplyr)
  library(magrittr)
  library(BiocParallel)
  library(scuttle)
  library(SpatialFeatureExperiment)
  library(Voyager)
  library(sf)
})

# set directory to read from
dir_use <- "./merfish_seurat_debug/" # or something like "./region_0/"

message("Reading data from:" , "\n", dir_use)
gc() %>% invisible()

start.time <- Sys.time()
sfe <-
  readVizgen(data_dir = dir_use,
             z = 3L,
             sample_id = "test.merfish",
             min_area = 15,
             image = "DAPI",
             flip = "geometry", 
             max_flip = "50 MB",
             filter_counts = TRUE, # keep cells with counts > 0, this will subset cell segmentation as well 
             add_molecules = TRUE, # add molecule/transcript coords to `rowGeometries`
             use_bboxes = TRUE, # if no segmentation output is present, use `cell_metadata` to make bounding boxes
             BPPARAM = BiocParallel::MulticoreParam(14, 
                                                    tasks = 80L,
                                                    force.GC = FALSE, 
                                                    progressbar = TRUE)
  )
end.time <- Sys.time()
message("Time taken to read data = ", 
        round(end.time - start.time, digits = 2), " ", 
        attr(c(end.time - start.time), "units"))
sfe 

update from `devel` branch
..major update for `readVizgen()` and a few helper functions; it supports old and new Vizgen/Merscope output
..added helper function `callMeta()` to return a df from an SFE object
@lambdamoses (Collaborator)

Thank you for the PR! I can change the magrittr pipe to the base pipe so I don't introduce another dependency. The PR will go into the Bioc 3.18 version of Voyager, and R 4.3 is required anyway so the base pipe should be available. Please put the Seurat conversion function in the coerce.R file where the SCE and SPE conversion functions are. After reviewing the PR, I might add you as an author of the Voyager paper before (re)submitting it.

@alikhuseynov (Collaborator, Author)

Sounds great, thanks! 👍
I did it on R 4.3.1 and can change the magrittr pipes in this PR to the base pipe operator, no problem. If you have a few Vizgen datasets for independent testing on your side, that would be great.

@alikhuseynov (Collaborator, Author)

It seems that the base pipe |> does not like curly braces:
function '{' not supported in RHS call of a pipe
Also, according to an R pipes comparison video on youtu.be, the execution speed of the base pipe is slower than magrittr's, especially when used with dplyr or the tidyverse.

I think adding magrittr as a dependency would be the easiest solution; what do you think? Otherwise all the code would need to be changed to a base-R style. The pipe operator just makes the code easier to follow, especially when it's linear.
Also, I could not find a base-pipe equivalent of magrittr's assignment pipe %<>%, which is quite handy.
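A minimal illustration (assuming magrittr is attached; the commented base-pipe line reproduces the error above):

# magrittr allows a braced expression on the right-hand side
library(magrittr)
x <- c(5, 1, 3)
x %>% { sum(.) / length(.) }

# the base pipe (R >= 4.1) rejects the same form:
# x |> { sum(.) / length(.) }
# Error: function '{' not supported in RHS call of a pipe

# magrittr's assignment pipe modifies x in place; there is no base equivalent
x %<>% sort(decreasing = TRUE)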

Let me know what changes I need to make here.
Thanks! 😊

..some functions are masked by other loaded libs, which seems to break things. Using explicit calls, e.g. `dplyr::slice()`, rather than `importFrom` -> r-lib/roxygen2#217
@lambdamoses (Collaborator)

While I use the Tidyverse in my analyses, I usually stick to base R as far as I can in packages to avoid additional dependencies. Unless using the Tidyverse brings a significant performance improvement that matters to users, I would prefer base R. So that you don't have to spend too much time on this, I can translate the code to base R.

@lambdamoses (Collaborator)

Since dplyr and magrittr are already commonly used, it's most likely fine, though it doesn't seem too difficult to translate your code to base R, so I'll still try translating.

@alikhuseynov (Collaborator, Author) commented Sep 9, 2023

Ok, sounds good. Thanks for translating to base R; tidyverse and magrittr are really just a personal preference.
The last commit fixed a few bugs related to functions masked by different libs. Loading the example dataset -> system.file("extdata/vizgen", package = "SpatialFeatureExperiment") also works, for both .parquet and .hdf5 files. That directory lacks a detected_transcripts file, so it won't work when add_molecules = TRUE.
Hopefully it will now pass the unit tests.
Thanks

@alikhuseynov mentioned this pull request Sep 9, 2023
@lambdamoses (Collaborator)

I'm reviewing your pull request and I have translated your code into base R. That said, I might end up using the tidyverse or data.table to speed up the transcript coordinate part when I test it on that 30 GB transcript spot file. I noticed that you search recursively for the cell metadata and transcript spot files if they are not found at the top level. I based my original function on the directory structure of the example datasets on the Vizgen website, like this:

Are there other standard directory structures in which the cell metadata and transcript coordinates are inside another directory?

At this point I'm also not sure whether it would be more efficient to load the transcript coordinates into memory and convert them into MULTIPOINT, or to convert them to a raster with terra in a separate function, write it to disk, and let terra handle the on-disk operations without loading the whole thing into memory. I'm not sure what use there is for loading the transcript coordinates into memory; maybe one use is to plot them over a very zoomed-in image of cells. I talked to the Giotto team at the Bioconductor conference in Boston; they said converting to raster makes it faster to match transcript spots to cells.
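As a rough sketch of the raster idea (not the eventual implementation; the column names global_x/global_y are assumed from typical Vizgen detected_transcripts output, and the 10 µm bin size is arbitrary):

library(terra)
# spots: data.frame of transcript coordinates for one gene
pts <- vect(spots, geom = c("global_x", "global_y"))
template <- rast(ext(pts), resolution = 10)     # empty raster over the spots' extent
dens <- rasterize(pts, template, fun = length)  # spot count per bin
plot(dens)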

@lambdamoses (Collaborator) commented Sep 23, 2023

Can you give me a schematic for the directory structure standards for the older and newest Vizgen output you mentioned? This should also include file name conventions. If the column names in the cell metadata and detected transcripts are different between versions, please provide those as well. This is so that I can make toy test data and write unit tests for your function.

@alikhuseynov (Collaborator, Author) commented Sep 23, 2023

> I'm reviewing your pull request and I have translated your code into base R. That said, I might end up using the tidyverse or data.table to speed up the transcript coordinate part when I test it on that 30 GB transcript spot file. I noticed that you search recursively for the cell metadata and transcript spot files if they are not found at the top level. I based my original function on the directory structure of the example datasets on the Vizgen website, like this:
>
> Are there other standard directory structures in which the cell metadata and transcript coordinates are inside another directory?
>
> At this point I'm also not sure whether it would be more efficient to load the transcript coordinates into memory and convert them into MULTIPOINT, or to convert them to a raster with terra in a separate function, write it to disk, and let terra handle the on-disk operations without loading the whole thing into memory. I'm not sure what use there is for loading the transcript coordinates into memory; maybe one use is to plot them over a very zoomed-in image of cells. I talked to the Giotto team at the Bioconductor conference in Boston; they said converting to raster makes it faster to match transcript spots to cells.

Thanks a lot. tidyverse and data.table (then one doesn't need to specify anything in the vroom call) would be the most efficient ways to load the data; that's what I used for the Seurat PR.
As for the transcript coordinates, loading them is optional and up to the user. If one can handle them with terra and turn them into a raster, that would be awesome.
Most of the time those molecule coordinates are used for plots (zoomed-in or cropped/bbox-like areas). For the whole section, I would suggest subsampling them prior to plotting, using an arg like n_mols = 1000 inside the plotting function, and overlaying them together with polygon geometries (centroids or cell boundaries), e.g. as in the sketch below.
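Something along these lines, as a rough sketch only (mols stands for an sf POINT data frame of molecule coordinates for one gene, and n_mols = 1000 is the hypothetical argument mentioned above):

library(sf)
library(ggplot2)
set.seed(1)
n_mols <- 1000
mols_sub <- mols[sample(nrow(mols), min(n_mols, nrow(mols))), ]
ggplot() +
  geom_sf(data = cellSeg(sfe), fill = NA, linewidth = 0.1) +  # cell boundaries
  geom_sf(data = mols_sub, size = 0.3, alpha = 0.5)           # subsampled molecules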

@alikhuseynov (Collaborator, Author)

> Can you give me a schematic for the directory structure standards for the older and newest Vizgen output you mentioned? This should also include file name conventions. If the column names in the cell metadata and detected transcripts are different between versions, please provide those as well. This is so that I can make toy test data and write unit tests for your function.

Sure, yeah, since users might have both older and newer data. Here are examples of the output dirs (a minimal lookup sketch follows the column notes below):

# for hdf5 files
region_0/
|-- cell_boundaries
|-- cell_by_gene.csv
|-- cell_metadata.csv
|-- detected_transcripts.csv
|-- images
`-- summary.png

# this one has Cellpose dir
region_0/
|-- Cellpose
|   |-- cellpose_cell_by_gene.csv
|   |-- cellpose_cell_metadata.csv
|   |-- cellpose_micron_space.parquet
|   `-- cellpose_mosaic_space.parquet
|-- detected_transcripts.csv
|-- images
`-- summary.png

# same as above but Cellpose dir has another name
region_0/
|-- Cellpose_polyT_CB3
|   |-- cellpose_cell_by_gene.csv
|   |-- cellpose_cell_metadata.csv
|   |-- cellpose_micron_space.parquet
|   `-- cellpose_mosaic_space.parquet
|-- detected_transcripts.csv
|-- images
`-- summary.png

# the most recent version
|-- cell_boundaries.parquet
|-- cell_by_gene.csv
|-- cell_metadata.csv
|-- detected_transcripts.csv
`-- images

# some users had no segmentations
|-- cell_by_gene.csv
|-- cell_metadata.csv
|-- detected_transcripts.csv
`-- images

For cell_metadata:

  • a transcript_count col was added,
  • cell IDs were previously stored in ...1 (the unnamed first column); the new version has a cell col

For detected_transcripts:

  • a cell_id col was added to match molecule coordinates to cell IDs; -1 means not assigned to a cell.
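Given the layouts above, a minimal lookup sketch in base R (find_file is a hypothetical helper, not the actual readVizgen code; the patterns follow the file names shown above):

find_file <- function(data_dir, pattern) {
  f <- list.files(data_dir, pattern = pattern, full.names = TRUE)
  if (!length(f))  # fall back to subdirectories such as Cellpose/
    f <- list.files(data_dir, pattern = pattern, full.names = TRUE, recursive = TRUE)
  f
}
meta_file  <- find_file(dir_use, "cell_metadata")
polys_file <- find_file(dir_use, "micron_space\\.parquet$|cell_boundaries")
mols_file  <- find_file(dir_use, "detected_transcripts")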

@alikhuseynov (Collaborator, Author) left a comment


use_cellpose is no longer needed

@alikhuseynov (Collaborator, Author)

As for upcoming PRs, should I write in a more base-R style so that you don't need to do much additional translation from the magrittr/tidyverse pipe style? The pipe style is just easier to follow/understand, especially where there are a number of operations; generally I try to avoid a nested coding style for that reason.

@lambdamoses (Collaborator)

> use_cellpose is no longer needed

I put it back for this reason: when running code in a separate R session, such as rendering the package website with pkgdown or running unit tests on CI, sometimes the arrow package can't be loaded; I don't know why, because it works when run interactively. I always run unit tests locally before pushing, and I need to force HDF5 for the remote processes to work.

@lambdamoses (Collaborator)

I can try dusting off my C++ to stream the ginormous transcript spot file line by line, constructing the raster or a parquet file without ever loading the whole thing into memory. In principle it can be parallelized. If all we need is to plot a very zoomed-in region, then it shouldn't be necessary to load the whole thing, and for a heatmap of transcript density for one gene, it shouldn't be necessary to load the other genes.
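An R-level alternative to the C++ streaming could be arrow's lazy datasets, sketched here under the assumption of typical detected_transcripts column names (gene, global_x/global_y/global_z); only the filtered rows are ever materialized:

library(arrow)
library(dplyr)
ds <- open_dataset(file.path(dir_use, "detected_transcripts.csv"), format = "csv")
spots_one_gene <- ds |>
  filter(gene == "GeneA", global_z == 3) |>  # hypothetical gene and z-plane
  select(global_x, global_y, gene) |>
  collect()                                  # only this subset is loaded into memory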

@alikhuseynov (Collaborator, Author)

> I can try dusting off my C++ to stream the ginormous transcript spot file line by line, constructing the raster or a parquet file without ever loading the whole thing into memory. In principle it can be parallelized. If all we need is to plot a very zoomed-in region, then it shouldn't be necessary to load the whole thing, and for a heatmap of transcript density for one gene, it shouldn't be necessary to load the other genes.

Yeah, if it's only for visualization, then it shouldn't load everything/all genes into memory.
However, if one wants to do spatial binning on the transcript coords using sf::st_make_grid, then the rowGeometries would need to be present/loaded into memory, e.g. as in the sketch below.
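For example, a minimal binning sketch (mols again stands for an sf POINT geometry of spots for one gene; the 100 µm grid cell size is arbitrary):

library(sf)
grid <- st_make_grid(mols, cellsize = 100, square = TRUE)
bins <- st_sf(geometry = grid)
bins$n_spots <- lengths(st_intersects(bins, mols))  # spots per grid cell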

@alikhuseynov (Collaborator, Author) commented Sep 26, 2023

You have probably already heard of MoleculeExperiment, which uses BumpyMatrix to store molecule coords.
They emphasize the analysis value of molecules that are not assigned to a segmented cell.
Could BumpyDataFrameMatrix perhaps be used instead to store rowGeometries?
Generally, one could leave it up to the user whether to load all molecules and use them for analysis; if it's only for visualization, then your approach of not loading all molecule coords into memory would be enough.

@lambdamoses (Collaborator)

> You have probably already heard of MoleculeExperiment, which uses BumpyMatrix to store molecule coords. They emphasize the analysis value of molecules that are not assigned to a segmented cell. Could BumpyDataFrameMatrix perhaps be used instead to store rowGeometries? Generally, one could leave it up to the user whether to load all molecules and use them for analysis; if it's only for visualization, then your approach of not loading all molecule coords into memory would be enough.

I still want something that supports geometric operations, so I would prefer sf, geoarrow, or terra. One could perform point process analyses with spatstat, but that requires spatstat's own class.

@lambdamoses (Collaborator)

I'm still working on this pull request, and I really tend to overthink. I noted that you filter the transcript spot coordinates by z-plane, but these coordinates are actually 3D, and there are also images at each z-plane. It's just that, at least for the example datasets from Vizgen, the cell segmentation is the same in all z-planes. Is that always the case? Since you work with MERFISH data a lot, I wonder if you find it useful to visualize transcript spots from all z-planes or only one z-plane. I think it could be useful to have the option to load data from all z-planes.

Also, how useful do you think the cell IDs for transcript spots are? Having the transcript assignment to cells can help with analyses of transcript localization within each cell, but when the cell segmentation is only in one z-plane, these analyses can't get very far. If cell IDs are useful, then maybe the transcript spots should be loaded into colGeometries. Maybe I can write a function just to read the transcript spots, with options to use cell IDs and read into colGeometries, or to read all spots into rowGeometries regardless of whether they're in cells.

I'm not prioritizing ways to avoid loading everything into memory for now, since I have other features to work on for Bioc 3.18.

@alikhuseynov (Collaborator, Author)

Currently the SFE Xenium count matrix is of class 'DelayedMatrix', whereas the Vizgen one is of class 'dgCMatrix':

assay(sfe_xen, "counts") %>% str
Formal class 'DelayedMatrix' [package "DelayedArray"] with 1 slot
...
assay(sfe_vz, "counts") %>% str
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
...

Should we keep them the same, i.e. use either 'DelayedMatrix' or 'dgCMatrix' for both?

@lambdamoses (Collaborator)

Yes, please keep them the same. Having DelayedMatrix is helpful for larger datasets so you don't have to load everything into memory, but it makes computation slower.

@alikhuseynov (Collaborator, Author)

Should we always keep cell IDs in $ID in cellSeg (and nucSeg)? It is currently removed in readVizgen (readCosMX keeps $ID in the polys as well).

@alikhuseynov (Collaborator, Author)

> Yes, please keep them the same. Having DelayedMatrix is helpful for larger datasets so you don't have to load everything into memory, but it makes computation slower.

Sounds good; readCosMX also has 'dgCMatrix', so that should be fixed as well then.

@lambdamoses (Collaborator)

Oh, hold on, I don't mean making them all DelayedMatrix. What I mean is to keep whatever class is originally read in. In the case of Xenium, DropletUtils::read10xCounts reads the h5 file as a DelayedMatrix. dgCMatrix is fine for other technologies where CSVs are read in.
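For reference, a quick check under these assumptions (the object names sfe_xen and sfe_vz are taken from the snippet above; the coercion is something a user could do afterwards, not part of the readers, and assumes the DelayedMatrix has a sparse backend):

library(SummarizedExperiment)
class(assay(sfe_xen, "counts"))  # DelayedMatrix, backed by the Xenium h5 file
class(assay(sfe_vz,  "counts"))  # dgCMatrix, read into memory from csv
# densify on demand if in-memory computation is preferred (can be memory-heavy)
counts_mem <- as(assay(sfe_xen, "counts"), "dgCMatrix")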

@alikhuseynov (Collaborator, Author) commented Jan 4, 2024

> Oh, hold on, I don't mean making them all DelayedMatrix. What I mean is to keep whatever class is originally read in. In the case of Xenium, DropletUtils::read10xCounts reads the h5 file as a DelayedMatrix. dgCMatrix is fine for other technologies where CSVs are read in.

Ok, then we keep whatever class it is read in as.

@alikhuseynov (Collaborator, Author)

> Should we always keep cell IDs in $ID in cellSeg (and nucSeg)? It is currently removed in readVizgen (readCosMX keeps $ID in the polys as well).

I think keeping $ID in the polys is better for convenience; I will add it to readVizgen.
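Assuming $ID is kept, something like this should then work (cellSeg() is the existing SFE accessor; the ID column name is the convention proposed here):

polys <- cellSeg(sfe)  # sf data frame of cell polygons
head(polys$ID)         # cell IDs kept alongside the geometry
# e.g. check that polygons match the count matrix by cell ID
all(polys$ID %in% colnames(sfe))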

`sample_id` is now in row/colGeometries
if `min_phred = NULL`, all features/genes will be present; otherwise the SFE will have only the features present in rowGeometries
@lambdamoses merged commit 2574409 into pachterlab:devel on Jan 31, 2024
@alikhuseynov deleted the seurat_v4 branch on March 25, 2024