…uiggle into main
denisbeslic committed Nov 26, 2024
2 parents ce6deb0 + fa2c34a commit c236a22
Showing 16 changed files with 801 additions and 437 deletions.
102 changes: 102 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,102 @@
# seq2squiggle contributing guide

This guide provides an overview of the contribution workflow, from setting up a development environment and testing your changes to submitting a pull request and performing a release.

It's based on the contribution guide of [breakfast written by Matthew Huska](https://github.com/rki-mf1/breakfast), which follows the packaging guidelines from ["Hypermodern Python" by Claudio Jolowicz](https://cjolowicz.github.io/posts/hypermodern-python-01-setup/).

## New contributor guide
To get an overview of the project itself, read the [README](README.md).

### Setting up your development tools

Some tooling needs to be set up before you can work on seq2squiggle. We install these tools with mamba, a faster replacement for the conda package manager, and place them in their own environment:

```sh
mamba create -n seq2squiggle-dev python=3 poetry fortran-compiler nox pre-commit
```

Then when you want to work on the project, or at the very least if you want to use poetry commands or run tests, you need to switch to this environment:

```sh
mamba activate seq2squiggle-dev
```

The rest of this document assumes that you have the seq2squiggle-dev environment active.

### Installing the package

As you develop, you can install your current working copy into the seq2squiggle-dev conda environment with `poetry install`:

```sh
poetry install
seq2squiggle version
```

### Testing

**Not implemented yet**

### Adding dependencies, updating dependency versions

You can add dependencies using poetry:

```sh
poetry add scikit-learn
poetry add pandas
```

You can update a dependency to the newest release allowed by its version constraint in pyproject.toml (typically the newest minor or patch release) like this:

```sh
poetry update pandas
```

For a new major release you have to be more explicit and change the version constraint itself, because `poetry update` only accepts package names. Assuming you're moving from 1.x to 2.x:

```sh
poetry add "pandas@^2.0"
```

### Releasing a new version

First update the version in pyproject.toml using poetry:

```sh
poetry version patch
# <it will say the new version number here, e.g. 0.3.1>
git commit -am "Bump version"
git push
```

Then tag the commit with the same version number (note the "v" prefix) and push the tag:

```sh
git tag v0.3.1
git push origin v0.3.1
```

Now go to github.com and create a release, selecting the version tag you just pushed. This automatically triggers testing of the new version and, if the tests pass, publishing to PyPI.
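
The relationship between the bump rule, the new version number, and the tag name can be sketched in Python. This is only an illustration of the naming convention described above, not part of poetry or seq2squiggle; the helper `bump` is hypothetical and handles plain `major.minor.patch` versions only.

```python
def bump(version: str, rule: str) -> str:
    """Return the next version for a simple "major.minor.patch" string.

    Hypothetical helper: mirrors what `poetry version patch|minor|major`
    does for plain three-part versions, nothing more.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if rule == "major":
        return f"{major + 1}.0.0"
    if rule == "minor":
        return f"{major}.{minor + 1}.0"
    if rule == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump rule: {rule}")


new_version = bump("0.3.0", "patch")
tag = f"v{new_version}"  # the git tag carries a "v" prefix
print(new_version, tag)
```

The point of the sketch is that the tag is always derived from the version in pyproject.toml, so the two should never be typed independently.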

### Updating the python version dependency

Aside from updating package dependencies, it is also sometimes useful to update the dependency on python itself. One way to do this is to edit the pyproject.toml file and change the python version constraint. The available constraint syntax is documented in the [poetry docs](https://python-poetry.org/docs/dependency-specification/):

```
[tool.poetry.dependencies]
python = "^3.10" # <-- this
```
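
To make the caret syntax concrete, here is a small Python sketch of how a caret constraint expands into a version range, following the rule in the poetry docs that the leftmost non-zero component may not change. The function `caret_bounds` is a hypothetical helper written for this guide, not a poetry API.

```python
def caret_bounds(constraint: str) -> tuple[str, str]:
    """Expand a caret constraint such as "^3.10" into (lower, upper) bounds.

    The lower bound is inclusive and the upper bound is exclusive.
    Hypothetical helper for illustration only.
    """
    parts = [int(p) for p in constraint.lstrip("^").split(".")]
    upper = parts.copy()
    for i, value in enumerate(parts):
        # Bump the leftmost non-zero component (or the last one if all are zero)
        # and reset everything to its right to zero.
        if value != 0 or i == len(parts) - 1:
            upper[i] = value + 1
            upper[i + 1:] = [0] * (len(parts) - i - 1)
            break
    lower = ".".join(str(p) for p in parts)
    return lower, ".".join(str(p) for p in upper)


print(caret_bounds("^3.10"))
```

So `python = "^3.10"` allows any python from 3.10 up to, but not including, 4.0.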

Afterwards, you need to use poetry to update the poetry.lock file to reflect the change that you just made to the pyproject.toml file. Be sure to use the `--no-update` flag so that the locked versions of the other dependency packages are not updated.

```sh
poetry lock --no-update
```

Then run your tests to make sure everything is working, and commit and push the changes.

You might also need to update/change the version of python in your conda environment, but I'm not certain about that.

### Updating the bioconda package when dependencies, dependency versions, or the python version has been changed

For package updates that don't lead to added/removed dependencies, changes to dependency versions, or changes to the allowed python version, a normal release (as above) is sufficient to automatically update both the PyPI and bioconda packages. However, for changes that do affect dependencies it is necessary to update the bioconda meta.yaml file. This is explained in the [bioconda docs](https://bioconda.github.io/contributor/updating.html), which also provide tools to help you with this.
44 changes: 23 additions & 21 deletions README.md
@@ -2,11 +2,11 @@

`seq2squiggle` is a deep learning-based tool for generating artificial nanopore signals from DNA sequence data.

<img src="/img/seq2squiggle_architecture.png" width="750">
<img src="/img/seq2squiggle.png" width="750">


Please cite the following publication if you use `seq2squiggle` in your work:
- Beslic, D., Kucklick, M., Engelmann, S., Fuchs, S., Renards, B.Y., Körber, N. End-to-end simulation of nanopore sequencing signals with feed-forward transformers. bioRxiv (2024).
- Beslic, D., Kucklick, M., Engelmann, S., Fuchs, S., Renard, B. Y., & Körber, N. (2024). End-to-end simulation of nanopore sequencing signals with feed-forward transformers. bioRxiv. doi:10.1101/2024.08.12.607296

## Installation

@@ -50,6 +50,10 @@ Generate 10,000 reads from a fasta file:
```
seq2squiggle predict example.fasta -o example.blow5 -n 10000
```
Generate 10,000 reads using R9.4.1 chemistry on a MinION:
```
seq2squiggle predict example.fasta -o example.blow5 -n 10000 --profile dna_r9_min
```
Generate reads with a coverage of 30:
```
seq2squiggle predict example.fasta -o example.blow5 -c 30
@@ -67,41 +71,38 @@ Export as pod5:
seq2squiggle predict example.fastq -o example.pod5 --read-input
```

## Noise options
`seq2squiggle` provides flexible options for generating signal data with various noise configurations. By default, it uses its duration sampler and noise sampler modules to predict event durations and amplitude noise levels specific to each input k-mer. Alternatively, you can deactivate these modules (`--noise-sampler False --duration-sampler False`) and use static distributions to sample event durations and amplitude noise. The static distributions can be configured using the options `--noise-std`, `--dwell-std`, and `--dwell-mean`.
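
To illustrate what sampling from static distributions means, here is a minimal Python sketch. It is not seq2squiggle's implementation: the function `simulate_static` and the per-k-mer `levels` are hypothetical, the default values are arbitrary, and only the parameter names mirror the `--dwell-mean`, `--dwell-std`, and `--noise-std` options.

```python
import random


def simulate_static(levels, dwell_mean=10.0, dwell_std=4.0, noise_std=1.0, seed=0):
    """Expand per-k-mer signal levels into samples using static normal distributions.

    Hypothetical sketch: defaults are illustrative, not the tool's defaults.
    """
    rng = random.Random(seed)
    signal = []
    for level in levels:
        # Draw an event duration; keep at least one sample per k-mer.
        dwell = max(1, round(rng.gauss(dwell_mean, dwell_std)))
        # Repeat the level for the whole event and add amplitude noise per sample.
        signal.extend(level + rng.gauss(0.0, noise_std) for _ in range(dwell))
    return signal


# With both standard deviations set to 0, every k-mer yields exactly
# dwell_mean noise-free samples ("ideal event lengths").
ideal = simulate_static([80.0, 95.5, 70.2], dwell_mean=10.0, dwell_std=0.0, noise_std=0.0)
print(len(ideal))
```

Under these assumptions, setting a standard deviation to 0 removes the corresponding source of variation, which matches how the examples further down combine `--dwell-std 0.0` and `--noise-std 0.0`.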

### Examples using different noise options

## Different noise options
`seq2squiggle` supports different options for generating the signal data.
Per default, the noise sampler and duration sampler are used.

### Examples

Generate reads using both the noise sampler and duration sampler:
Default configuration (noise sampler and duration sampler enabled):
```
seq2squiggle predict example.fasta -o example.blow5
```
Generate reads using the noise sampler with an increased factor and duration sampler:
Using the noise sampler with increased noise standard deviation and the duration sampler:
```
seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.5
```
Generate reads using a static normal distribution for the noise and duration sampler:
Using a static normal distribution for the amplitude noise and the duration sampler:
```
seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.5 --noise-sampling False
seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.0 --noise-sampling False
```
Generate reads using only the noise sampler and a static normal distribution for the event length:
Using the noise sampler and a static normal distribution for event durations:
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length -1
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-std 4.0
```
Generate reads using only the noise sampler and ideal event lengths:
Using the noise sampler with ideal event lengths (each k-mer event will have a length of 10):
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-mean 10.0 --dwell-std 0.0
```
Generate reads using a static normal distribution for the amplitude noise and ideal event lengths:
Using a static normal distribution for amplitude noise and ideal event lengths:
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0 --noise-sampling False --noise-std 1.0
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-mean 10.0 --dwell-std 0.0 --noise-sampling False --noise-std 1.0
```
Generate reads using no amplitude noise and ideal event lengths:
Generating reads with no amplitude noise and ideal event lengths:
```
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0 --noise-sampling False --noise-std -1
seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-mean 10.0 --dwell-std 0.0 --noise-sampling False --noise-std 0.0
```

## Train a new model
@@ -125,4 +126,5 @@ seq2squiggle train train_dir valid_dir --config my_config.yml --model last.ckpt
```

## Acknowledgement
The model is based on [xcmyz's implementation of FastSpeech](https://github.com/xcmyz/FastSpeech). Some code snippets for preprocessing DNA-signal chunks have been taken from [bonito](https://github.com/nanoporetech/bonito).
The model is based on [xcmyz's implementation of FastSpeech](https://github.com/xcmyz/FastSpeech). Some code snippets for preprocessing DNA-signal chunks have been taken from [bonito](https://github.com/nanoporetech/bonito). We also incorporated code snippets from [Casanovo](https://github.com/Noble-Lab/casanovo) for different functionalities, including downloading weights, logging, and the design of the main function.
Additionally, we used parameter profiles from squigulator for various chemistries to set digitisation, sample-rate, range, median_before, and other signal parameters. These profiles are detailed in [squigulator's documentation](https://hasindu2008.github.io/squigulator/docs/profile.html).
Binary file added img/seq2squiggle.png
Binary file removed img/seq2squiggle_architecture.png
Binary file not shown.
Empty file added src/seq2squiggle/__init.py__
Empty file.
10 changes: 5 additions & 5 deletions src/seq2squiggle/config.yaml
@@ -9,7 +9,7 @@ log_name: "Human-R1041-4khz"
wandb_logger_state: disabled # disabled, online, offline

### Preprocessing parameters
max_chunks_train: 170000000
max_chunks_train: 210000000
max_chunks_valid: 100000
scaling_max_value: 165.0
# If valid_dir is not provided, validation data will be generated from the training dataset.
@@ -28,9 +28,9 @@ encoder_layers: 2
encoder_heads: 8
decoder_layers: 2
decoder_heads: 8
encoder_dropout: 0.1
decoder_dropout: 0.1
duration_dropout: 0.1
encoder_dropout: 0.2
decoder_dropout: 0.2
duration_dropout: 0.2

### Learning rate parameters
train_batch_size: 512
@@ -39,7 +39,7 @@ save_model: True
# Optimizer. Allowed options: Adam, AdamW, SGD, RMSProp,
optimizer: "Adam"
warmup_ratio: 0.01 # Percentage of total steps used for warmup
lr: 0.00025
lr: 0.0005
weight_decay: 0.0
# Schedule for learning rate. Allowed options: warmup_cosine, warmup_constant, constant, warmup_cosine_restarts, one_cycle
lr_schedule: "warmup_cosine"
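
For readers unfamiliar with the `warmup_cosine` schedule, the following Python sketch shows one common formulation: a linear warmup followed by cosine decay. It is an assumption that seq2squiggle's scheduler behaves this way; the function `warmup_cosine_lr` is hypothetical, and `warmup_ratio` is treated as a fraction of the total steps.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=0.0005, warmup_ratio=0.01):
    """Learning rate at a given step: linear warmup, then cosine decay to zero.

    Hypothetical sketch of a common schedule, not the project's scheduler.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, 10, 500, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000))
```

With `lr: 0.0005` and `warmup_ratio: 0.01`, this formulation reaches the peak learning rate after the first 1% of steps and then decays smoothly towards zero.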