feat: add new wrapper to create annotation tables via Ensembl biomart (#3072)

<!-- Ensure that the PR title follows conventional commit style (<type>:
<description>)-->
<!-- Possible types are here:
https://github.com/commitizen/conventional-commit-types/blob/master/index.json
-->

<!-- Add a description of your PR here-->

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tool's purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running
`snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal number of channels, in the recommended
order. E.g. for bioconda, use `conda-forge`, `bioconda`, `nodefaults`
(conda-forge should have the highest priority, and the defaults channel is
usually not needed because most packages are in conda-forge nowadays).
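
As a hedged illustration of the temporary-file item in the checklist above (not taken from this PR): Python's `tempfile` defaults already create their files below `tempfile.gettempdir()`, so a wrapper script that relies on them satisfies that item. The function name `write_scratch_file` and the file name `intermediate.txt` are made up for this sketch.

```python
import tempfile
from pathlib import Path


def write_scratch_file(payload: str) -> str:
    """Write `payload` to a throwaway file and return what was read back."""
    # With no `dir=` argument, TemporaryDirectory() creates its folder below
    # tempfile.gettempdir() and removes it when the `with` block exits.
    with tempfile.TemporaryDirectory() as tmpdir:
        scratch = Path(tmpdir) / "intermediate.txt"  # hypothetical file name
        scratch.write_text(payload)
        return scratch.read_text()


print(write_scratch_file("temporary content"))
```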
dlaehnemann authored Jul 24, 2024
1 parent 9d47dd7 commit 07dc088
Showing 6 changed files with 531 additions and 3 deletions.
244 changes: 244 additions & 0 deletions bio/reference/ensembl-biomart-table/environment.linux-64.pin.txt

Large diffs are not rendered by default.

8 changes: 8 additions & 0 deletions bio/reference/ensembl-biomart-table/environment.yaml
@@ -0,0 +1,8 @@
channels:
  - conda-forge
  - bioconda
  - nodefaults
dependencies:
  - bioconductor-biomart =2.58
  - r-nanoparquet =0.3
  - r-tidyverse =2.0
41 changes: 41 additions & 0 deletions bio/reference/ensembl-biomart-table/meta.yaml
@@ -0,0 +1,41 @@
name: ensembl-biomart-table
description: >
  Create a table of annotations available via ``bioconductor-biomart``,
  with one column per specified annotation (for example ``ensembl_gene_id``,
  ``ensembl_transcript_id``, ``external_gene_name``, ... for the human reference). For
  reference, have a look at the
  `Ensembl biomart online <https://www.ensembl.org/biomart/martview>`_
  or at the ``biomaRt`` package documentation linked in the ``URL`` field.
url: https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html
authors:
  - David Lähnemann
output:
  - >
    tab-separated values (``.tsv``); for supported compression extensions, see
    `the write_tsv documentation page <https://readr.tidyverse.org/reference/write_delim.html#output>`_
  - >
    parquet (``.parquet``) file; for supported compression algorithms, see
    `the write_parquet documentation page <https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments>`_
params:
  - biomart: >
      for example, 'genes'; for options, see
      `the documentation on identifying databases <https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#step1-identifying-the-database-you-need>`_
  - species: >
      species that has a 'genes' database / dataset available via the Ensembl
      BioMart (for example, 'homo_sapiens'); to find valid names, check the
      `Ensembl species list <https://www.ensembl.org/info/about/species.html>`_
  - build: build available for the selected species, for example 'GRCh38'
  - release: release from which the species and build are available, for example '112'
  - attributes: >
      A list of wanted annotation columns ("database attributes"). For
      finding available attributes, see the
      `instructions in the biomaRt documentation <https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#how-to-build-a-biomart-query>`_.
      Note that these need to be available for the combination of species,
      build and release from the specified biomart database.
  - filters: >
      (optional) This will restrict the download and output to the filters you
      specify. The format is a dictionary, for example
      ``{"chromosome_name": ["X", "Y"]}``. Note that non-existing filter values
      (for example a ``chromosome_name`` of ``"Z"``) will simply be ignored
      without error or warning. For finding available filters, see the
      `instructions in the biomaRt documentation <https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#how-to-build-a-biomart-query>`_.
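
A rough Python sketch, not part of the commit, of how a `filters` dictionary like the one described above decomposes into filter names and filter values. The wrapper's R code passes these two pieces separately to `getBM` (as `names(wanted_filters)` and `unname(wanted_filters)`). The helper name `split_filters` is hypothetical, and wrapping a scalar value into a one-element list is an assumption made for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

FilterValue = Union[str, Sequence[str]]


def split_filters(
    filters: Optional[Dict[str, FilterValue]],
) -> Tuple[List[str], List[List[str]]]:
    """Split a filters dictionary into parallel lists of names and values."""
    if not filters:
        # No filters given: the query stays unrestricted.
        return [], []
    names = list(filters.keys())
    # Assumption: wrap scalar values so every filter carries a list of values.
    values = [[v] if isinstance(v, str) else list(v) for v in filters.values()]
    return names, values


names, values = split_filters({"chromosome_name": ["22", "X"]})
print(names, values)
```

Keeping names and values in matching order matters, because each entry in the values list is interpreted as belonging to the filter name at the same position.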
38 changes: 38 additions & 0 deletions bio/reference/ensembl-biomart-table/test/Snakefile
@@ -0,0 +1,38 @@
rule create_transcripts_to_genes_mapping:
    output:
        table="resources/ensembl_transcripts_to_genes_mapping.tsv.gz",  # .gz extension is optional, but recommended
    params:
        biomart="genes",
        species="homo_sapiens",
        build="GRCh38",
        release="112",
        attributes=[
            "ensembl_transcript_id",
            "ensembl_gene_id",
            "external_gene_name",
            "genecards",
            "chromosome_name",
        ],
        filters={ "chromosome_name": ["22", "X"] },  # optional: restrict output by using filters
    log:
        "logs/create_transcripts_to_genes_mapping.log",
    cache: "omit-software"  # save space and time with between workflow caching (see docs)
    wrapper:
        "master/bio/reference/ensembl-biomart-table"


rule create_transcripts_to_genes_mapping_parquet:
    output:
        table="resources/ensembl_transcripts_to_genes_mapping.parquet.gz",  # .gz extension is optional, but recommended
    params:
        biomart="genes",
        species="mus_musculus",
        build="GRCm39",
        release="112",
        attributes=["ensembl_transcript_id", "ensembl_gene_id"],
        # filters={ "chromosome_name": "19"}, # optional: restrict output by using filters
    log:
        "logs/create_transcripts_to_genes_mapping_parquet.log",
    cache: "omit-software"  # save space and time with between workflow caching (see docs)
    wrapper:
        "master/bio/reference/ensembl-biomart-table"
146 changes: 146 additions & 0 deletions bio/reference/ensembl-biomart-table/wrapper.R
@@ -0,0 +1,146 @@
# __author__ = "David Lähnemann"
# __copyright__ = "Copyright 2024, David Lähnemann"
# __email__ = "[email protected]"
# __license__ = "MIT"

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

library("tidyverse")
library("nanoparquet")
rlang::global_entrace()
library("fs")
library("cli")

library("biomaRt")

wanted_biomart <- snakemake@params[["biomart"]]
# bioconductor-biomart needs the species as something like `hsapiens` instead
# of `homo_sapiens`, and `chyarkandensis` instead of `cervus_hanglu_yarkandensis`
species_name_components <- str_split(snakemake@params[["species"]], "_")[[1]]
if (length(species_name_components) == 2) {
  wanted_species <- str_c(
    str_sub(species_name_components[1], 1, 1),
    species_name_components[2]
  )
} else if (length(species_name_components) == 3) {
  wanted_species <- str_c(
    str_sub(species_name_components[1], 1, 1),
    str_sub(species_name_components[2], 1, 1),
    species_name_components[3]
  )
} else {
  cli_abort(c(
    "Unsupported species name '{snakemake@params[['species']]}'.",
    "x" = "Splitting on underscores led to unexpected number of name components: {length(species_name_components)}.",
    "i" = "Expected species name with 2 (e.g. `homo_sapiens`) or 3 (e.g. `cervus_hanglu_yarkandensis`) components.",
    "Anything else either does not exist in Ensembl, or we don't yet handle it properly.",
    "In case you are sure the species you specified is correct and exists in Ensembl, please",
    "file a bug report as an issue on GitHub, referencing this file: ",
    "https://github.com/snakemake/snakemake-wrappers/blob/master/bio/reference/ensembl-biomart-table/wrapper.R"
  ))
}

wanted_release <- snakemake@params[["release"]]
wanted_build <- snakemake@params[["build"]]

wanted_filters <- snakemake@params[["filters"]]

wanted_columns <- snakemake@params[["attributes"]]

output_filename <- snakemake@output[["table"]]

if (wanted_build == "GRCh37") {
  grch <- "37"
  version <- NULL
  # `wanted_release` is the variable defined above; `release` does not exist here
  cli_warn(c(
    "As you specified build 'GRCh37' in your configuration yaml, biomart forces",
    "us to ignore the release you specified ('{wanted_release}')."
  ))
} else {
  grch <- NULL
  version <- wanted_release
}

get_mart <- function(biomart, species, build, version, grch, dataset) {
  mart <- useEnsembl(
    biomart = biomart,
    dataset = str_c(species, "_", dataset),
    version = version,
    GRCh = grch
  )

  if (build == "GRCh37") {
    retrieved_build <- str_remove(listDatasets(mart)$version, "\\..*")
  } else {
    retrieved_build <- str_remove(searchDatasets(mart, species)$version, "\\..*")
  }

  if (retrieved_build != build) {
    # the release is passed into this function as `version`
    cli_abort(c(
      "The Ensembl release and genome build number you specified are not compatible.",
      "x" = "Genome build '{build}' not available via biomart for Ensembl release '{version}'.",
      "i" = "Ensembl release '{version}' only provides build '{retrieved_build}'.",
      " " = "Please fix your configuration yaml file's reference entry, you have two options:",
      "*" = "Change the build entry to '{retrieved_build}'.",
      "*" = "Change the release entry to one that provides build '{build}'. You have to determine this from biomart by yourself."
    ))
  }
  mart
}

gene_ensembl <- get_mart(wanted_biomart, wanted_species, wanted_build, version, grch, "gene_ensembl")

if ( !is.null(wanted_filters) ) {
  table <- getBM(
    attributes = wanted_columns,
    filters = names(wanted_filters),
    values = unname(wanted_filters),
    mart = gene_ensembl
  ) |> as_tibble()
} else {
  table <- getBM(
    attributes = wanted_columns,
    mart = gene_ensembl
  ) |> as_tibble()
}



if ( str_detect(output_filename, "\\.tsv(\\.(gz|bz2|xz))?$") ) {
  write_tsv(
    x = table,
    file = output_filename
  )
} else if ( str_detect(output_filename, "\\.parquet") ) {
  last_ext <- path_ext(output_filename)
  # case_match() returns NA for any extension not listed here
  compression <- case_match(
    last_ext,
    "parquet" ~ "uncompressed",
    "gz" ~ "gzip",
    "zst" ~ "zstd",
    "sz" ~ "snappy"
  )
  if ( is.na(compression) ) {
    # the message parts must be combined into one vector with c()
    cli_abort(c(
      "File extension '{last_ext}' not supported for writing with the used nanoparquet version.",
      "x" = "Cannot write to a file '{output_filename}', because the version of the package",
      "nanoparquet used does not support writing files of type '{last_ext}'.",
      "i" = "For supported file types, see: https://r-lib.github.io/nanoparquet/reference/write_parquet.html"
    ))
  }
  write_parquet(
    x = table,
    file = output_filename,
    compression = compression
  )
} else {
  cli_abort(c(
    "Unsupported file format in output file '{output_filename}'.",
    "x" = "Only '.tsv' and '.parquet' files are supported, with certain compression variants each.",
    "i" = "For supported compression extensions, see:",
    "*" = "tsv: https://readr.tidyverse.org/reference/write_delim.html#output",
    "*" = "parquet: https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments"
  ))
}
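
For readers more at home in Python, the following is a minimal sketch of two pieces of logic from the R wrapper above: shortening the Ensembl species name to the prefix biomaRt expects, and choosing a parquet compression from the last file extension. It is an illustrative re-implementation and not part of the commit. The names `biomart_species_prefix`, `parquet_compression` and `PARQUET_COMPRESSION` are hypothetical; the extension mapping is copied from the wrapper's `case_match()` call.

```python
from pathlib import PurePosixPath


def biomart_species_prefix(species: str) -> str:
    """Turn e.g. `homo_sapiens` into `hsapiens`, as the wrapper does."""
    parts = species.split("_")
    if len(parts) == 2:
        # first letter of the genus + full species name
        return parts[0][:1] + parts[1]
    if len(parts) == 3:
        # first letters of the first two components + full last component
        return parts[0][:1] + parts[1][:1] + parts[2]
    raise ValueError(
        f"Unsupported species name '{species}': expected 2 or 3 "
        f"underscore-separated components, got {len(parts)}."
    )


# Mapping copied from the wrapper's `case_match()` call.
PARQUET_COMPRESSION = {
    "parquet": "uncompressed",
    "gz": "gzip",
    "zst": "zstd",
    "sz": "snappy",
}


def parquet_compression(filename: str) -> str:
    """Pick the compression from the last extension of `filename`."""
    last_ext = PurePosixPath(filename).suffix.lstrip(".")
    if last_ext not in PARQUET_COMPRESSION:
        raise ValueError(f"File extension '{last_ext}' is not supported.")
    return PARQUET_COMPRESSION[last_ext]


# The wrapper appends "_gene_ensembl" to the prefix to get the dataset name.
print(biomart_species_prefix("homo_sapiens") + "_gene_ensembl")
print(parquet_compression("resources/mapping.parquet.gz"))
```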
57 changes: 54 additions & 3 deletions test.py
@@ -5602,23 +5602,74 @@ def test_ensembl_annotation_gtf_gz():
 def test_ensembl_regulatory_gff3_gz():
     run(
         "bio/reference/ensembl-regulation",
-        ["snakemake", "--cores", "1", "resources/regulatory_features.gff3.gz", "--use-conda", "-F"],
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/regulatory_features.gff3.gz",
+            "--use-conda",
+            "-F",
+        ],
     )


 @skip_if_not_modified
 def test_ensembl_regulatory_features_grch37_gff():
     run(
         "bio/reference/ensembl-regulation",
-        ["snakemake", "--cores", "1", "resources/regulatory_features.gff", "--use-conda", "-F"],
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/regulatory_features.gff",
+            "--use-conda",
+            "-F",
+        ],
     )


 @skip_if_not_modified
 def test_ensembl_regulatory_features_mouse_gff_gz():
     run(
         "bio/reference/ensembl-regulation",
-        ["snakemake", "--cores", "1", "resources/regulatory_features.mouse.gff.gz", "--use-conda", "-F"],
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/regulatory_features.mouse.gff.gz",
+            "--use-conda",
+            "-F",
+        ],
     )
+
+
+@skip_if_not_modified
+def test_ensembl_transcripts_to_genes_mapping():
+    run(
+        "bio/reference/ensembl-biomart-table",
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/ensembl_transcripts_to_genes_mapping.tsv.gz",
+            "--use-conda",
+            "-F",
+        ],
+    )
+
+
+@skip_if_not_modified
+def test_ensembl_transcripts_to_genes_mapping_parquet():
+    run(
+        "bio/reference/ensembl-biomart-table",
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/ensembl_transcripts_to_genes_mapping.parquet.gz",
+            "--use-conda",
+            "-F",
+        ],
+    )

