feat: add new wrapper to create annotation tables via Ensembl biomart (#3072)

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running `snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).
1 parent 9d47dd7 commit 07dc088
Showing 6 changed files with 531 additions and 3 deletions.
244 changes: 244 additions & 0 deletions
bio/reference/ensembl-biomart-table/environment.linux-64.pin.txt
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
channels: | ||
- conda-forge | ||
- bioconda | ||
- nodefaults | ||
dependencies: | ||
- bioconductor-biomart =2.58 | ||
- r-nanoparquet =0.3 | ||
- r-tidyverse =2.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
name: ensembl-biomart-table | ||
description: > | ||
Create a table of annotations available via the ``bioconductor-biomart`` package, | ||
with one column per specified annotation (for example ``ensembl_gene_id``, | ||
``ensembl_transcript_id``, ``external_gene_name``, ... for the human reference). For | ||
reference, have a look at the | ||
`Ensembl biomart online <https://www.ensembl.org/biomart/martview>`_ | ||
or at the ``biomaRt`` package documentation linked in the ``URL`` field. | ||
url: https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html | ||
authors: | ||
- David Lähnemann | ||
output: | ||
- > | ||
tab-separated values (``.tsv``); for supported compression extensions, see | ||
`the write_tsv documentation page <https://readr.tidyverse.org/reference/write_delim.html#output>`_ | ||
- > | ||
parquet (``.parquet``) file; for supported compression algorithms, see | ||
`the write_parquet documentation page <https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments>`_ | ||
params: | ||
- biomart: > | ||
for example, 'genes'; for options, see | ||
`the documentation on identifying databases <https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#step1-identifying-the-database-you-need>`_ | ||
- species: > | ||
species that has a 'genes' database / dataset available via the Ensembl | ||
BioMart, for example 'homo_sapiens'; for available species, check the | ||
`Ensembl species list <https://www.ensembl.org/info/about/species.html>`_ | ||
- build: build available for the selected species, for example 'GRCh38' | ||
- release: release from which the species and build are available, for example '112' | ||
- attributes: > | ||
A list of wanted annotation columns ("database attributes"). For | ||
finding available attributes, see the | ||
`instructions in the biomaRt documentation <https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#how-to-build-a-biomart-query>`_. | ||
Note that these need to be available for the combination of species, | ||
build and release from the specified biomart database. | ||
- filters: > | ||
(optional) This will restrict the download and output to the filters you | ||
specify. The format is a dictionary, for example | ||
``{"chromosome_name": ["X", "Y"]}``. Note that non-existing filter values | ||
(for example a ``chromosome_name`` of ``"Z"``) will simply be ignored | ||
without error or warning. For finding available filters, see the | ||
`instructions in the biomaRt documentation <https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#how-to-build-a-biomart-query>`_. |
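The `filters` parameter described above is a dictionary of filter names mapped to the values to keep. The wrapper in this commit hands the names and the values to `getBM()` as two parallel arguments. The following Python sketch illustrates that split; the helper name `split_filters` is hypothetical, and wrapping a scalar value into a one-element list is an added assumption, not something the wrapper itself does.

```python
def split_filters(filters):
    """Split a filters mapping into parallel lists of names and values.

    Hypothetical helper for illustration only; it is not part of the wrapper.
    """
    if not filters:
        # No filters given: nothing to restrict the query by.
        return [], []
    names = list(filters.keys())
    # Assumption: a scalar such as "19" is treated like the list ["19"].
    values = [v if isinstance(v, list) else [v] for v in filters.values()]
    return names, values


names, values = split_filters({"chromosome_name": ["X", "Y"]})
print(names)
print(values)
```

Keeping names and values in the same order matters, because each entry in the values list is matched to the filter name at the same position.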
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
rule create_transcripts_to_genes_mapping: | ||
output: | ||
table="resources/ensembl_transcripts_to_genes_mapping.tsv.gz", # .gz extension is optional, but recommended | ||
params: | ||
biomart="genes", | ||
species="homo_sapiens", | ||
build="GRCh38", | ||
release="112", | ||
attributes=[ | ||
"ensembl_transcript_id", | ||
"ensembl_gene_id", | ||
"external_gene_name", | ||
"genecards", | ||
"chromosome_name", | ||
], | ||
filters={ "chromosome_name": ["22", "X"] }, # optional: restrict output by using filters | ||
log: | ||
"logs/create_transcripts_to_genes_mapping.log", | ||
cache: "omit-software" # save space and time with between-workflow caching (see docs) | ||
wrapper: | ||
"master/bio/reference/ensembl-biomart-table" | ||
|
||
|
||
rule create_transcripts_to_genes_mapping_parquet: | ||
output: | ||
table="resources/ensembl_transcripts_to_genes_mapping.parquet.gz", # .gz extension is optional, but recommended | ||
params: | ||
biomart="genes", | ||
species="mus_musculus", | ||
build="GRCm39", | ||
release="112", | ||
attributes=["ensembl_transcript_id", "ensembl_gene_id"], | ||
# filters={ "chromosome_name": "19"}, # optional: restrict output by using filters | ||
log: | ||
"logs/create_transcripts_to_genes_mapping_parquet.log", | ||
cache: "omit-software" # save space and time with between-workflow caching (see docs) | ||
wrapper: | ||
"master/bio/reference/ensembl-biomart-table" |
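In both example rules, the extension of the `table` output decides the file format and compression. The Python sketch below restates the extension handling of the R wrapper in this commit, using the same patterns and the same extension-to-compression mapping. The function name `output_format` and the use of `ValueError` are illustrative assumptions. For `.tsv` outputs the sketch returns no compression value, since the wrapper leaves that to `write_tsv`.

```python
import re
from pathlib import PurePath

# Last file extension -> parquet compression, as listed in the R wrapper.
PARQUET_COMPRESSION = {
    "parquet": "uncompressed",
    "gz": "gzip",
    "zst": "zstd",
    "sz": "snappy",
}


def output_format(filename):
    """Return (format, compression) for an output file name.

    Hypothetical helper for illustration; raises ValueError for
    unsupported extensions.
    """
    if re.search(r"tsv(\.(gz|bz2|xz))?$", filename):
        # Compression of TSV files is inferred by the writer itself.
        return ("tsv", None)
    if re.search(r"\.parquet", filename):
        last_ext = PurePath(filename).suffix.lstrip(".")
        if last_ext not in PARQUET_COMPRESSION:
            raise ValueError(f"unsupported parquet extension: '{last_ext}'")
        return ("parquet", PARQUET_COMPRESSION[last_ext])
    raise ValueError(f"unsupported output file format: '{filename}'")


print(output_format("resources/ensembl_transcripts_to_genes_mapping.tsv.gz"))
print(output_format("resources/ensembl_transcripts_to_genes_mapping.parquet.gz"))
```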
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
# __author__ = "David Lähnemann" | ||
# __copyright__ = "Copyright 2024, David Lähnemann" | ||
# __email__ = "[email protected]" | ||
# __license__ = "MIT" | ||
|
||
log <- file(snakemake@log[[1]], open="wt") | ||
sink(log) | ||
sink(log, type="message") | ||
|
||
library("tidyverse") | ||
library("nanoparquet") | ||
rlang::global_entrace() | ||
library("fs") | ||
library("cli") | ||
|
||
library("biomaRt") | ||
|
||
wanted_biomart <- snakemake@params[["biomart"]] | ||
# bioconductor-biomart needs the species as something like `hsapiens` instead | ||
# of `homo_sapiens`, and `chyarkandensis` instead of `cervus_hanglu_yarkandensis` | ||
species_name_components <- str_split(snakemake@params[["species"]], "_")[[1]] | ||
if (length(species_name_components) == 2) { | ||
wanted_species <- str_c( | ||
str_sub(species_name_components[1], 1, 1), | ||
species_name_components[2] | ||
) | ||
} else if (length(species_name_components) == 3) { | ||
wanted_species <- str_c( | ||
str_sub(species_name_components[1], 1, 1), | ||
str_sub(species_name_components[2], 1, 1), | ||
species_name_components[3] | ||
) | ||
} else { | ||
cli_abort(c( | ||
"Unsupported species name '{snakemake@params[['species']]}'.", | ||
"x" = "Splitting on underscores led to unexpected number of name components: {length(species_name_components)}.", | ||
"i" = "Expected species name with 2 (e.g. `homo_sapiens`) or 3 (e.g. `cervus_hanglu_yarkandensis`) components.", | ||
"Anything else either does not exist in Ensembl, or we don't yet handle it properly.", | ||
"In case you are sure the species you specified is correct and exists in Ensembl, please", | ||
"file a bug report as an issue on GitHub, referencing this file: ", | ||
"https://github.com/snakemake/snakemake-wrappers/blob/master/bio/reference/ensembl-biomart-table/wrapper.R" | ||
)) | ||
} | ||
|
||
wanted_release <- snakemake@params[["release"]] | ||
wanted_build <- snakemake@params[["build"]] | ||
|
||
wanted_filters <- snakemake@params[["filters"]] | ||
|
||
wanted_columns <- snakemake@params[["attributes"]] | ||
|
||
output_filename <- snakemake@output[["table"]] | ||
|
||
if (wanted_build == "GRCh37") { | ||
grch <- "37" | ||
version <- NULL | ||
cli_warn(c( | ||
"As you specified build 'GRCh37' in your configuration yaml, biomart forces", | ||
"us to ignore the release you specified ('{wanted_release}')." | ||
)) | ||
} else { | ||
grch <- NULL | ||
version <- wanted_release | ||
} | ||
|
||
get_mart <- function(biomart, species, build, version, grch, dataset) { | ||
mart <- useEnsembl( | ||
biomart = biomart, | ||
dataset = str_c(species, "_", dataset), | ||
version = version, | ||
GRCh = grch | ||
) | ||
|
||
if (build == "GRCh37") { | ||
retrieved_build <- str_remove(listDatasets(mart)$version, "\\..*") | ||
} else { | ||
retrieved_build <- str_remove(searchDatasets(mart, species)$version, "\\..*") | ||
} | ||
|
||
if (retrieved_build != build) { | ||
# `wanted_release` is looked up in the enclosing (global) environment | ||
cli_abort(c( | ||
"The Ensembl release and genome build number you specified are not compatible.", | ||
"x" = "Genome build '{build}' not available via biomart for Ensembl release '{wanted_release}'.", | ||
"i" = "Ensembl release '{wanted_release}' only provides build '{retrieved_build}'.", | ||
" " = "Please fix your configuration yaml file's reference entry, you have two options:", | ||
"*" = "Change the build entry to '{retrieved_build}'.", | ||
"*" = "Change the release entry to one that provides build '{build}'. You have to determine this from biomart by yourself." | ||
)) | ||
} | ||
mart | ||
} | ||
|
||
gene_ensembl <- get_mart(wanted_biomart, wanted_species, wanted_build, version, grch, "gene_ensembl") | ||
|
||
if ( !is.null(wanted_filters) ) { | ||
table <- getBM( | ||
attributes = wanted_columns, | ||
filters = names(wanted_filters), | ||
values = unname(wanted_filters), | ||
mart = gene_ensembl | ||
) |> as_tibble() | ||
} else { | ||
table <- getBM( | ||
attributes = wanted_columns, | ||
mart = gene_ensembl | ||
) |> as_tibble() | ||
} | ||
|
||
|
||
|
||
if ( str_detect(output_filename, "tsv(\\.(gz|bz2|xz))?$") ) { | ||
write_tsv( | ||
x = table, | ||
file = output_filename | ||
) | ||
} else if ( str_detect(output_filename, "\\.parquet") ) { | ||
last_ext <- path_ext(output_filename) | ||
compression <- case_match( | ||
last_ext, | ||
"parquet" ~ "uncompressed", | ||
"gz" ~ "gzip", | ||
"zst" ~ "zstd", | ||
"sz" ~ "snappy" | ||
) | ||
if ( is.na(compression) ) { | ||
# all message parts must be passed as one character vector | ||
cli_abort(c( | ||
"File extension '{last_ext}' not supported for writing with the used nanoparquet version.", | ||
"x" = "Cannot write to a file '{output_filename}', because the version of the package", | ||
"nanoparquet used does not support writing files of type '{last_ext}'.", | ||
"i" = "For supported file types, see: https://r-lib.github.io/nanoparquet/reference/write_parquet.html" | ||
)) | ||
} | ||
write_parquet( | ||
x = table, | ||
file = output_filename, | ||
compression = compression | ||
) | ||
} else { | ||
cli_abort(c( | ||
"Unsupported file format in output file '{output_filename}'.", | ||
"x" = "Only '.tsv' and '.parquet' files are supported, with certain compression variants each.", | ||
"i" = "For supported compression extensions, see:", | ||
"*" = "tsv: https://readr.tidyverse.org/reference/write_delim.html#output", | ||
"*" = "parquet: https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments" | ||
)) | ||
} |
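The wrapper converts an Ensembl species name such as `homo_sapiens` into the short form that biomaRt datasets use (`hsapiens`) and then appends the dataset suffix `gene_ensembl`. The Python sketch below restates that rule for two- and three-part names. The function names `biomart_species` and `dataset_name` are hypothetical, and `ValueError` stands in for the wrapper's `cli_abort()` call.

```python
def biomart_species(species):
    """Abbreviate an Ensembl species name the way the R wrapper does.

    Two components: first letter of the first + the full second component.
    Three components: first letters of the first two + the full third.
    """
    parts = species.split("_")
    if len(parts) == 2:
        return parts[0][:1] + parts[1]
    if len(parts) == 3:
        return parts[0][:1] + parts[1][:1] + parts[2]
    raise ValueError(
        f"unsupported species name '{species}': "
        f"expected 2 or 3 underscore-separated components, got {len(parts)}"
    )


def dataset_name(species, dataset="gene_ensembl"):
    """Build the dataset identifier, e.g. 'hsapiens_gene_ensembl'."""
    return f"{biomart_species(species)}_{dataset}"


print(biomart_species("homo_sapiens"))
print(biomart_species("cervus_hanglu_yarkandensis"))
print(dataset_name("mus_musculus"))
```

Names with any other number of components are rejected rather than guessed at, matching the wrapper's choice to abort with a request for a bug report.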