fix: Allow concurrent and large refgenie downloads #1172

Open · wants to merge 20 commits into base: master
8 changes: 5 additions & 3 deletions .github/workflows/conventional-prs.yml
@@ -1,4 +1,4 @@
name: PR
name: Lint PR Title
on:
pull_request_target:
types:
@@ -9,8 +9,10 @@ on:

jobs:
title-format:
runs-on: ubuntu-latest
runs-on: [self-hosted, Linux, v1.0.2]
steps:
- uses: amannn/action-semantic-pull-request@v3.4.0
- uses: amannn/action-semantic-pull-request@v3.6.0
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
validateSingleCommit: true
23 changes: 14 additions & 9 deletions .github/workflows/main.yml
@@ -10,7 +10,7 @@ on:

jobs:
docs:
runs-on: ubuntu-latest
runs-on: [self-hosted, Linux, python-3.10]
steps:
- uses: actions/checkout@v1

@@ -27,21 +27,26 @@ jobs:
make html

testing:
runs-on: ubuntu-latest
runs-on: [self-hosted, Linux, python-3.10]
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
with:
submodules: recursive
fetch-depth: 0

- name: Setup mamba
# This is a hybrid of things, might need some tweaking
- name: create environment with mamba
uses: conda-incubator/setup-miniconda@v2
env:
CONDA: "/usr/bin/conda"
with:
mamba-version: "*"
# ?miniforge-variant: Mambaforge
# ?miniforge-version: latest
channels: conda-forge, bioconda
auto-activate-base: true
activate-environment: snakemake
channels: "conda-forge, bioconda"
miniforge-variant: Mambaforge
miniforge-version: latest

use-only-tar-bz2: true # IMPORTANT: This needs to be set for caching to work properly!
auto-update-conda: true
- name: Setup Snakemake environment
shell: bash -el {0}
run: |
5 changes: 2 additions & 3 deletions .github/workflows/qc.yml
@@ -2,11 +2,10 @@ name: Code quality

on:
pull_request:
branches_ignore: []

jobs:
formatting:
runs-on: ubuntu-latest
runs-on: [self-hosted, Linux, python-3.10]
steps:
- uses: actions/checkout@v3

@@ -37,7 +36,7 @@ jobs:
# snakefmt --check $(git diff origin/master --name-only | grep Snakefile)

linting:
runs-on: ubuntu-latest
runs-on: [self-hosted, Linux, python-3.10]
steps:
- uses: actions/checkout@v1

130 changes: 130 additions & 0 deletions CHANGELOG.md


18 changes: 16 additions & 2 deletions bio/refgenie/test/Snakefile
@@ -1,11 +1,25 @@
rule obtain_asset:
output:
# the name refers to the refgenie seek key (see attributes on http://refgenomes.databio.org)
fai="refs/genome.fasta"
fai="refs/genome.fasta",
# Multiple outputs/seek keys are possible here.
params:
genome="human_alu",
asset="fasta",
tag="default"
tag="default",
log:
"logs/refgenie/obtain_large_asset.log",
wrapper:
"master/bio/refgenie"

rule obtain_large_asset:
output:
star_index=directory("refs/star_index/hg38/star_index"),
params:
genome="hg38",
asset="star_index",
tag="default",
log:
"logs/refgenie/obtain_large_asset.log",
wrapper:
"master/bio/refgenie"
16 changes: 13 additions & 3 deletions bio/refgenie/wrapper.py
@@ -5,17 +5,27 @@

import os
import refgenconf
from refgenconf.exceptions import RefgenconfError

genome = snakemake.params.genome
asset = snakemake.params.asset
tag = snakemake.params.tag

conf_path = os.environ["REFGENIE"]

rgc = refgenconf.RefGenConf(conf_path, writable=True)

# BUG If there are multiple concurrent refgenie commands, this will fail due to
# unable to acquire lock of the config file.
try:
rgc = refgenconf.RefGenConf(conf_path, writable=True)
except RefgenconfError:
# If read lock timeout, attempt to skip the read lock
rgc = refgenconf.RefGenConf(
conf_path, writable=True, skip_read_lock=True, genome_exact=False
Contributor:

I couldn't really find out what the exact implications of skip_read_lock=True are, but it seems dangerous to me to use. Have you also tried increasing wait_max= as an alternative?

Contributor (Author):

I didn't attempt to, but I suspect that this might not be a great choice either. If someone is downloading an asset over a slow connection, even raising wait_max from its default of 60 to 600 might not make a difference, and could still result in a hard-to-diagnose timeout error.

I'm not sure whether this was some sort of conflict with the Snakemake locking system as well. If we rely on that to protect other files, then the wrapper either produces the output file, or the rule fails with a RefgenconfError that recommends setting skip_read_lock=True to try to fix the issue.

Contributor:

From what I gathered by poking around a little, I think that the lock only happens while something is written to the conf file. So I would think that this lock is not in place for the whole duration of the download, and that wait_max= should already help. But the documentation on this is not very clear and I didn't immediately find the mechanism in the code, so I might be misunderstanding this lock.

Do you have the possibility to try wait_max= in your use case and test whether it actually helps?

)
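The wait_max= alternative discussed above could be combined with a small retry loop instead of skipping the read lock entirely. A minimal sketch, assuming RefGenConf raises RefgenconfError on a lock timeout; the load_config_with_retry helper and its retry/backoff parameters are this sketch's own additions, not part of refgenconf:

```python
import time


def load_config_with_retry(make_conf, retries=3, base_delay=5.0):
    """Call make_conf() a few times with linear backoff before giving up.

    make_conf would be something like:
        lambda: refgenconf.RefGenConf(conf_path, writable=True, wait_max=600)
    """
    last_err = None
    for attempt in range(retries):
        try:
            return make_conf()
        except Exception as err:  # e.g. RefgenconfError on a lock timeout
            last_err = err
            time.sleep(base_delay * (attempt + 1))
    raise last_err
```

This keeps the read lock semantics intact and only tolerates transient contention between concurrent refgenie invocations.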
# pull asset if necessary
gat, archive_data, server_url = rgc.pull(genome, asset, tag, force=False)
gat, archive_data, server_url = rgc.pull(
genome, asset, tag, force=False, force_large=True
Contributor:

Is force_large=True a good general default, or would it make more sense to make this settable via the params: keyword in the rule definition? I am assuming their default of prompting has a reason, namely avoiding accidental downloads of huge reference data, and having to explicitly specify this via params: would at least be a minimal sanity check that the user knows what they are doing.

Contributor (Author):

I think that is a good alternative to implement. As it stands, there is no way to override this while using the wrapper.

Contributor:

Do you feel comfortable implementing this?

I'd introduce an (optional) params: force_large=True in one of the examples, and parse it here in wrapper.py with force_large=snakemake.params.get("force_large", None), so it defaults to whatever the original function's default is, only changing it if this is a deliberate choice by the user.

)
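The params-based approach suggested above could be sketched roughly as follows. The build_pull_kwargs helper is hypothetical; snakemake.params.get and rgc.pull are used as they appear in this wrapper, and leaving force_large unset would keep refgenie's own default behavior:

```python
def build_pull_kwargs(params):
    """Pass force_large through to rgc.pull only if the user set it.

    `params` stands in for snakemake.params here; any dict-like object
    with a .get() method works for this sketch.
    """
    kwargs = {"force": False}
    force_large = params.get("force_large", None)
    if force_large is not None:
        kwargs["force_large"] = force_large
    return kwargs


# In the wrapper this would be used as:
# gat, archive_data, server_url = rgc.pull(
#     genome, asset, tag, **build_pull_kwargs(snakemake.params)
# )
```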

for seek_key, out in snakemake.output.items():
path = rgc.seek(genome, asset, tag_name=tag, seek_key=seek_key, strict_exists=True)
19 changes: 13 additions & 6 deletions bio/rsem/calculate-expression/wrapper.py
@@ -5,7 +5,7 @@


import os

from pathlib import Path
from snakemake.shell import shell

bam = snakemake.input.get("bam", "")
@@ -60,15 +60,22 @@
"output.isoforms_results file name malformed "
"(rsem will append .isoforms.results suffix)"
)

reference_prefix = os.path.splitext(snakemake.input.reference[0])[0]

# BUG input_string is 'r' given the input, but it should be the reference base path?
# subprocess.CalledProcessError: Command 'set -euo pipefail; rsem-calculate-expression --num-threads 24 --estimate-rspd --calc-ci --strandedness reverse --time --paired-end --alignments results/star/D-1/Aligned.toTranscriptome.out.bam r results/rsem/D-1/D-1 > logs/rsem/calculate_expression/D-1.log 2>&1' returned non-zero exit status 255.
reference_path = Path(snakemake.input.reference[0])
reference_prefix = str(reference_path.parents[0]/reference_path.stem)
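The pathlib-based prefix computation above strips only the final extension while keeping the parent directory. A self-contained check of the same logic; the reference_prefix helper and the example paths are illustrative only:

```python
from pathlib import Path


def reference_prefix(reference_file):
    """Drop the last extension but keep the directory, e.g.
    'refs/reference.seq' -> 'refs/reference'."""
    p = Path(reference_file)
    return str(p.parents[0] / p.stem)
```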

extra = snakemake.params.get("extra", "")
threads = snakemake.threads
log = snakemake.log_fmt_shell(stdout=True, stderr=True)
shell(
"rsem-calculate-expression --num-threads {snakemake.threads} {extra} "
"{paired_end_string} {input_bam} {input_string} "
"{reference_prefix} {output_prefix} "
"rsem-calculate-expression --num-threads {snakemake.threads} "
"{extra} "
"{paired_end_string} "
"{input_bam} "
"{input_string} "
"{reference_prefix} "
"{output_prefix} "
"{log}"
)