feat: add new wrapper to create annotation tables via Ensembl biomart (…

…#3072)

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tool's purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running `snakedeploy pin-conda-envs environment.yaml` on a Linux machine,
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io).
* Conda environments use a minimal number of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults), as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays.
snakemake · Jul 24, 2024 · 07dc088 · 07dc088
1 parent 9d47dd7
commit 07dc088
Showing 6 changed files with 531 additions and 3 deletions.
diff --git a/bio/reference/ensembl-biomart-table/environment.linux-64.pin.txt b/bio/reference/ensembl-biomart-table/environment.linux-64.pin.txt
diff --git a/bio/reference/ensembl-biomart-table/environment.yaml b/bio/reference/ensembl-biomart-table/environment.yaml
@@ -0,0 +1,8 @@
+channels:
+  - conda-forge
+  - bioconda
+  - nodefaults
+dependencies:
+  - bioconductor-biomart =2.58
+  - r-nanoparquet =0.3
+  - r-tidyverse =2.0
diff --git a/bio/reference/ensembl-biomart-table/meta.yaml b/bio/reference/ensembl-biomart-table/meta.yaml
@@ -0,0 +1,41 @@
+name: ensembl-biomart-table
+description: >
+    Create a table of annotations available via ``bioconductor-biomart``,
+    with one column per specified annotation (for example ``ensembl_gene_id``,
+    ``ensembl_transcript_id``, ``external_gene_name``, ... for the human reference). For
+    reference, have a look at the
+    `Ensembl biomart online <https://www.ensembl.org/biomart/martview>`_
+    or at the ``biomaRt`` package documentation linked in the ``URL`` field.
+url: https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html
+authors:
+  - David Lähnemann
+output:
+  - >
+    tab-separated values (``.tsv``); for supported compression extensions, see
+    `the write_tsv documentation page <https://readr.tidyverse.org/reference/write_delim.html#output>`_
+  - >
+    parquet (``.parquet``) file; for supported compression algorithms, see
+    `the write_parquet documentation page <https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments>`_
+params:
+  - biomart: >
+      for example, 'genes'; for options, see
+      `the documentation on identifying databases <https://bioconductor.org/packages/devel/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#step1-identifying-the-database-you-need>`_
+  - species: >
+      species that has a 'genes' database / dataset available via the Ensembl
+      BioMart (for example, 'homo_sapiens'); to find valid names, check the
+      `Ensembl species list <https://www.ensembl.org/info/about/species.html>`_
+  - build: build available for the selected species, for example 'GRCh38'
+  - release: release from which the species and build are available, for example '112'
+  - attributes: >
+      A list of wanted annotation columns ("database attributes"). For
+      finding available attributes, see the
+      `instructions in the biomaRt documentation <https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#how-to-build-a-biomart-query>`_.
+      Note that these need to be available for the combination of species,
+      build and release from the specified biomart database.
+  - filters: >
+      (optional) This will restrict the download and output to the filters you
+      specify. The format is a dictionary, for example
+      ``{"chromosome_name": ["X", "Y"]}``. Note that non-existing filter values
+      (for example a ``chromosome_name`` of ``"Z"``) will simply be ignored
+      without error or warning. For finding available filters, see the
+      `instructions in the biomaRt documentation <https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#how-to-build-a-biomart-query>`_.
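
The `filters` dictionary described above is split by the wrapper into filter names and filter values (via `names()` and `unname()` in `wrapper.R`) before being handed to `getBM()`. As a rough illustration of that split, here is a minimal Python sketch. The helper name `split_filters` is hypothetical and not part of the wrapper, and wrapping a scalar value into a one-element list is an assumption made for this example only.

```python
def split_filters(filters):
    """Split a filters dict into parallel lists of names and values.

    Hypothetical helper for illustration only; the wrapper itself performs
    this split in R with names() and unname().
    """
    if not filters:
        # No filters given: nothing to restrict the query by.
        return [], []
    names = list(filters.keys())
    # Assumption for this sketch: wrap scalar values (e.g. "19") in a list,
    # so that every filter has a list of accepted values.
    values = [
        list(v) if isinstance(v, (list, tuple)) else [v] for v in filters.values()
    ]
    return names, values


names, values = split_filters({"chromosome_name": ["22", "X"]})
print(names, values)
```

Keeping names and values in the same order matters, because the query pairs them up by position.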
diff --git a/bio/reference/ensembl-biomart-table/test/Snakefile b/bio/reference/ensembl-biomart-table/test/Snakefile
@@ -0,0 +1,38 @@
+rule create_transcripts_to_genes_mapping:
+    output:
+        table="resources/ensembl_transcripts_to_genes_mapping.tsv.gz",  # .gz extension is optional, but recommended
+    params:
+        biomart="genes",
+        species="homo_sapiens",
+        build="GRCh38",
+        release="112",
+        attributes=[
+            "ensembl_transcript_id",
+            "ensembl_gene_id",
+            "external_gene_name",
+            "genecards",
+            "chromosome_name",
+        ],
+        filters={"chromosome_name": ["22", "X"]},  # optional: restrict output by using filters
+    log:
+        "logs/create_transcripts_to_genes_mapping.log",
+    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
+    wrapper:
+        "master/bio/reference/ensembl-biomart-table"
+
+
+rule create_transcripts_to_genes_mapping_parquet:
+    output:
+        table="resources/ensembl_transcripts_to_genes_mapping.parquet.gz",  # .gz extension is optional, but recommended
+    params:
+        biomart="genes",
+        species="mus_musculus",
+        build="GRCm39",
+        release="112",
+        attributes=["ensembl_transcript_id", "ensembl_gene_id"],
+        # filters={"chromosome_name": "19"},  # optional: restrict output by using filters
+    log:
+        "logs/create_transcripts_to_genes_mapping_parquet.log",
+    cache: "omit-software"  # save space and time with between-workflow caching (see docs)
+    wrapper:
+        "master/bio/reference/ensembl-biomart-table"
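
A downstream step might consume the table written by the first rule roughly as follows. This is a minimal sketch using only the Python standard library. It assumes the TSV header uses the requested attribute names (`ensembl_transcript_id`, `ensembl_gene_id`) as column names, and the function name `load_transcript_to_gene` is hypothetical.

```python
import csv
import gzip


def load_transcript_to_gene(path):
    """Read a (optionally gzipped) TSV and map transcript IDs to gene IDs.

    Assumption: the header row uses the attribute names requested in the
    rule's `attributes` list as column names.
    """
    # Pick the opener based on the file extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {
            row["ensembl_transcript_id"]: row["ensembl_gene_id"] for row in reader
        }
```

Calling `load_transcript_to_gene("resources/ensembl_transcripts_to_genes_mapping.tsv.gz")` after the rule has run would then return one dictionary entry per transcript.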
diff --git a/bio/reference/ensembl-biomart-table/wrapper.R b/bio/reference/ensembl-biomart-table/wrapper.R
@@ -0,0 +1,146 @@
+# __author__ = "David Lähnemann"
+# __copyright__ = "Copyright 2024, David Lähnemann"
+# __email__ = "[email protected]"
+# __license__ = "MIT"
+
+log <- file(snakemake@log[[1]], open="wt")
+sink(log)
+sink(log, type="message")
+
+library("tidyverse")
+library("nanoparquet")
+rlang::global_entrace()
+library("fs")
+library("cli")
+
+library("biomaRt")
+
+wanted_biomart <- snakemake@params[["biomart"]]
+# bioconductor-biomart needs the species as something like `hsapiens` instead
+# of `homo_sapiens`, and `chyarkandensis` instead of `cervus_hanglu_yarkandensis`
+species_name_components <- str_split(snakemake@params[["species"]], "_")[[1]]
+if (length(species_name_components) == 2) {
+  wanted_species <- str_c(
+    str_sub(species_name_components[1], 1, 1),
+    species_name_components[2]
+  )
+} else if (length(species_name_components) == 3) {
+  wanted_species <- str_c(
+    str_sub(species_name_components[1], 1, 1),
+    str_sub(species_name_components[2], 1, 1),
+    species_name_components[3]
+  )
+} else {
+  cli_abort(c(
+          "Unsupported species name '{snakemake@params[['species']]}'.",
+    "x" = "Splitting on underscores led to unexpected number of name components: {length(species_name_components)}.",
+    "i" = "Expected species name with 2 (e.g. `homo_sapiens`) or 3 (e.g. `cervus_hanglu_yarkandensis`) components.",
+          "Anything else either does not exist in Ensembl, or we don't yet handle it properly.",
+          "In case you are sure the species you specified is correct and exists in Ensembl, please",
+          "file a bug report as an issue on GitHub, referencing this file: ",
+          "https://github.com/snakemake/snakemake-wrappers/blob/master/bio/reference/ensembl-biomart-table/wrapper.R"
+  ))
+}
+
+wanted_release <- snakemake@params[["release"]]
+wanted_build <- snakemake@params[["build"]]
+
+wanted_filters <- snakemake@params[["filters"]]
+
+wanted_columns <- snakemake@params[["attributes"]]
+
+output_filename <- snakemake@output[["table"]]
+
+if (wanted_build == "GRCh37") {
+  grch <- "37"
+  version <- NULL
+  cli_warn(c(
+    "As you specified build 'GRCh37' in your configuration yaml, biomart forces",
+    "us to ignore the release you specified ('{wanted_release}')."
+  ))
+} else {
+  grch <- NULL
+  version <- wanted_release
+}
+
+get_mart <- function(biomart, species, build, version, grch, dataset) {
+  mart <- useEnsembl(
+    biomart = biomart,
+    dataset = str_c(species, "_", dataset),
+    version = version,
+    GRCh = grch
+  )
+
+  if (build == "GRCh37") {
+    retrieved_build <- str_remove(listDatasets(mart)$version, "\\..*")
+  } else {
+    retrieved_build <- str_remove(searchDatasets(mart, species)$version, "\\..*")
+  }
+
+  if (retrieved_build != build) {
+    cli_abort(c(
+            "The Ensembl release and genome build number you specified are not compatible.",
+      "x" = "Genome build '{build}' not available via biomart for Ensembl release '{wanted_release}'.",
+      "i" = "Ensembl release '{wanted_release}' only provides build '{retrieved_build}'.",
+      " " = "Please fix your configuration yaml file's reference entry, you have two options:",
+      "*" = "Change the build entry to '{retrieved_build}'.",
+      "*" = "Change the release entry to one that provides build '{build}'. You have to determine this from biomart by yourself."
+    ))
+  }
+  mart
+}
+
+gene_ensembl <- get_mart(wanted_biomart, wanted_species, wanted_build, version, grch, "gene_ensembl")
+
+if ( !is.null(wanted_filters) ) {
+  table <- getBM(
+    attributes = wanted_columns,
+    filters = names(wanted_filters),
+    values = unname(wanted_filters),
+    mart = gene_ensembl
+  ) |> as_tibble()
+} else {
+  table <- getBM(
+    attributes = wanted_columns,
+    mart = gene_ensembl
+  ) |> as_tibble()
+}
+
+
+
+if ( str_detect(output_filename, "\\.tsv(\\.(gz|bz2|xz))?$") ) {
+  write_tsv(
+    x = table,
+    file = output_filename
+  )
+} else if ( str_detect(output_filename, "\\.parquet") ) {
+  last_ext <- path_ext(output_filename)
+  compression <- case_match(
+    last_ext,
+    "parquet" ~ "uncompressed",
+    "gz" ~ "gzip",
+    "zst" ~ "zstd",
+    "sz" ~ "snappy"
+  )
+  if ( is.na(compression) ) {
+    cli_abort(c(
+            "File extension '{last_ext}' not supported for writing with the used nanoparquet version.",
+      "x" = "Cannot write to a file '{output_filename}', because the version of the package",
+            "nanoparquet used does not support writing files of type '{last_ext}'.",
+      "i" = "For supported file types, see: https://r-lib.github.io/nanoparquet/reference/write_parquet.html"
+    ))
+  }
+  write_parquet(
+    x = table,
+    file = output_filename,
+    compression = compression
+  )
+} else {
+  cli_abort(c(
+    "Unsupported file format in output file '{output_filename}'.",
+    "x" = "Only '.tsv' and '.parquet' files are supported, with certain compression variants each.",
+    "i" = "For supported compression extensions, see:",
+    "*" = "tsv: https://readr.tidyverse.org/reference/write_delim.html#output",
+    "*" = "parquet: https://r-lib.github.io/nanoparquet/reference/write_parquet.html#arguments"
+  ))
+}
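
Two pieces of logic in `wrapper.R` above are easy to check in isolation: the abbreviation of the species name into the prefix that biomaRt expects (`homo_sapiens` becomes `hsapiens`), and the inference of the output format and parquet compression from the file extension. The following Python sketch mirrors both. The function names are hypothetical and it is only an illustration of the rules, not the wrapper's implementation. It assumes a `.tsv` extension with an optional `.gz`, `.bz2` or `.xz` suffix, and the parquet extension mapping shown in the R `case_match()` call.

```python
import re

# Mapping from the last file extension to a parquet compression name,
# mirroring the case_match() call in wrapper.R.
PARQUET_COMPRESSION = {
    "parquet": "uncompressed",
    "gz": "gzip",
    "zst": "zstd",
    "sz": "snappy",
}


def dataset_prefix(species):
    """Abbreviate e.g. 'homo_sapiens' to 'hsapiens' (hypothetical helper)."""
    parts = species.split("_")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"Unsupported species name '{species}'.")
    # First letter of every component except the last, then the last in full.
    return "".join(part[0] for part in parts[:-1]) + parts[-1]


def infer_output_format(filename):
    """Return (format, parquet_compression) for an output file name."""
    if re.search(r"\.tsv(\.(gz|bz2|xz))?$", filename):
        # For TSV, compression is left to the writer, so none is returned.
        return "tsv", None
    if ".parquet" in filename:
        last_ext = filename.rsplit(".", 1)[-1]
        if last_ext not in PARQUET_COMPRESSION:
            raise ValueError(f"File extension '{last_ext}' not supported.")
        return "parquet", PARQUET_COMPRESSION[last_ext]
    raise ValueError(f"Unsupported file format in output file '{filename}'.")


print(dataset_prefix("cervus_hanglu_yarkandensis"))
print(infer_output_format("resources/mapping.parquet.gz"))
```

Raising an error for unknown extensions, rather than silently falling back to a default, matches the wrapper's choice to abort with an explanatory message.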
diff --git a/test.py b/test.py
@@ -5602,23 +5602,74 @@ def test_ensembl_annotation_gtf_gz():
 def test_ensembl_regulatory_gff3_gz():
     run(
         "bio/reference/ensembl-regulation",
-        ["snakemake", "--cores", "1", "resources/regulatory_features.gff3.gz", "--use-conda", "-F"],
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/regulatory_features.gff3.gz",
+            "--use-conda",
+            "-F",
+        ],
     )
 
 
 @skip_if_not_modified
 def test_ensembl_regulatory_features_grch37_gff():
     run(
         "bio/reference/ensembl-regulation",
-        ["snakemake", "--cores", "1", "resources/regulatory_features.gff", "--use-conda", "-F"],
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/regulatory_features.gff",
+            "--use-conda",
+            "-F",
+        ],
     )
 
 
 @skip_if_not_modified
 def test_ensembl_regulatory_features_mouse_gff_gz():
     run(
         "bio/reference/ensembl-regulation",
-        ["snakemake", "--cores", "1", "resources/regulatory_features.mouse.gff.gz", "--use-conda", "-F"],
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/regulatory_features.mouse.gff.gz",
+            "--use-conda",
+            "-F",
+        ],
+    )
+
+
+@skip_if_not_modified
+def test_ensembl_transcripts_to_genes_mapping():
+    run(
+        "bio/reference/ensembl-biomart-table",
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/ensembl_transcripts_to_genes_mapping.tsv.gz",
+            "--use-conda",
+            "-F",
+        ],
+    )
+
+
+@skip_if_not_modified
+def test_ensembl_transcripts_to_genes_mapping_parquet():
+    run(
+        "bio/reference/ensembl-biomart-table",
+        [
+            "snakemake",
+            "--cores",
+            "1",
+            "resources/ensembl_transcripts_to_genes_mapping.parquet.gz",
+            "--use-conda",
+            "-F",
+        ],
     )