Merge branch 'main' of https://github.com/ZKI-PH-ImageAnalysis/seq2squiggle into main
ZKI-PH-ImageAnalysis · Nov 26, 2024 · c236a22
2 parents ce6deb0 + fa2c34a
commit c236a22
Showing 16 changed files with 801 additions and 437 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,102 @@
+# seq2squiggle contributing guide
+
+This guide provides an overview of the contribution workflow: setting up a development environment, testing your changes, submitting a pull request, and performing a release.
+
+
+It's based on the contribution guide of [breakfast written by Matthew Huska](https://github.com/rki-mf1/breakfast), which follows the packaging guidelines ["Hypermodern Python" by Claudio Jolowicz](https://cjolowicz.github.io/posts/hypermodern-python-01-setup/).
+
+## New contributor guide
+To get an overview of the project itself, read the [README](README.md).
+
+### Setting up your development tools
+
+Some tooling needs to be set up before you can work on seq2squiggle. We install these tools with mamba, a faster replacement for the conda package manager, and place them in their own environment:
+
+```sh
+mamba create -n seq2squiggle-dev python=3 poetry fortran-compiler nox pre-commit
+```
+
+Then when you want to work on the project, or at the very least if you want to use poetry commands or run tests, you need to switch to this environment:
+
+```sh
+mamba activate seq2squiggle-dev
+```
+
+The rest of this document assumes that you have the seq2squiggle-dev environment active.
+
+### Installing the package
+
+As you're developing, you can install your working copy into the seq2squiggle-dev conda environment with `poetry install`:
+
+```sh
+poetry install
+seq2squiggle version
+```
+
+### Testing
+
+**Not implemented yet**
+
+### Adding dependencies, updating dependency versions
+
+You can add dependencies using poetry:
+
+```sh
+poetry add scikit-learn
+poetry add pandas
+```
+
+You can automatically update the dependency to the newest minor or patch release like this:
+
+```sh
+poetry update pandas
+```
+
+and for major releases you have to be more explicit, assuming you're coming from 1.x to 2.x:
+
+```sh
+poetry add pandas@^2.0
+```
+
+### Releasing a new version
+
+First update the version in pyproject.toml using poetry:
+
+```sh
+poetry version patch
+# <it will say the new version number here, e.g. 0.3.1>
+git commit -am "Bump version"
+git push
+```
+
+Then tag the commit with the same version number (note the "v" prefix), push the code and push the tag:
+
+```sh
+git tag v0.3.1
+git push origin v0.3.1
+```
+
+Now go to github.com and do a release, selecting the version number tag you just pushed. This will automatically trigger the new version being tested and pushed to PyPI if the tests pass.
+
+### Updating the python version dependency
+
+Aside from updating package dependencies, it is also sometimes useful to update the dependency on python itself. One way to do this is to edit the pyproject.toml file and change the python version description. Versions can be specified using constraints that are documented in the [poetry docs](https://python-poetry.org/docs/dependency-specification/):
+
+```
+[tool.poetry.dependencies]
+python = "^3.10"  # <-- this
+```
+
+Afterwards, you need to use poetry to update the poetry.lock file to reflect the change that you just made to the pyproject.toml file. Be sure to use the `--no-update` flag so that the locked versions of the dependency packages are not updated.
+
+```sh
+poetry lock --no-update
+```
+
+Then you need to run your tests to make sure everything is working, commit and push the changes.
+
+You might also need to update/change the version of python in your conda environment, but I'm not certain about that.
+
+### Updating the bioconda package when dependencies, dependency versions, or the python version has been changed
+
+For package updates that don't lead to added/removed dependencies, changes to dependency versions, or changes to the allowed python version, a normal release (as above) is sufficient to automatically update both the PyPI and bioconda packages. However, for changes that do affect dependencies, it is necessary to update the bioconda meta.yaml file. This is explained in the [bioconda docs](https://bioconda.github.io/contributor/updating.html), which also provide tools to help you with this.
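
A note on the caret constraints that CONTRIBUTING.md relies on (for example `python = "^3.10"` or `^2.0`): Poetry's caret operator allows updates that keep the leftmost non-zero version component unchanged, so `^3.10` is equivalent to `>=3.10.0,<4.0.0`. The snippet below is a minimal Python sketch of that rule, not Poetry's own implementation. The helper names `caret_bounds` and `allows` are hypothetical, and only plain numeric versions with up to three components are handled.

```python
def _as_tuple(version):
    """Parse 'X', 'X.Y' or 'X.Y.Z' into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def caret_bounds(spec):
    """Return (lower, upper) for a caret constraint; lower is inclusive, upper exclusive."""
    parts = [int(p) for p in spec.lstrip("^").split(".")]
    lower = tuple(parts + [0] * (3 - len(parts)))
    # Bump the first non-zero component; if all are zero, bump the last one given.
    for index, part in enumerate(parts):
        if part != 0:
            break
    else:
        index = len(parts) - 1
    upper = tuple(parts[:index] + [parts[index] + 1] + [0] * (2 - index))
    return lower, upper


def allows(spec, version):
    """Check whether a concrete version satisfies the caret constraint."""
    lower, upper = caret_bounds(spec)
    return lower <= _as_tuple(version) < upper


print(caret_bounds("^3.10"))       # lower and upper bound as tuples
print(allows("^3.10", "3.12.1"))   # a 3.x release newer than 3.10
print(allows("^2.0", "1.5.3"))     # a 1.x release against a 2.x constraint
```

This is why moving from pandas 1.x to 2.x requires changing the constraint itself: a 1.x-based caret constraint never admits a 2.x release.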
diff --git a/README.md b/README.md
@@ -2,11 +2,11 @@
 
 `seq2squiggle` is a deep learning-based tool for generating artificial nanopore signals from DNA sequence data.
 
-<img src="/img/seq2squiggle_architecture.png" width="750">
+<img src="/img/seq2squiggle.png" width="750">
 
 
 Please cite the following publication if you use `seq2squiggle` in your work:
-- Beslic,  D., Kucklick, M., Engelmann, S., Fuchs, S., Renards, B.Y., Körber, N. End-to-end simulation of nanopore sequencing signals with feed-forward transformers. bioRxiv (2024).
+- Beslic, D., Kucklick, M., Engelmann, S., Fuchs, S., Renard, B. Y., & Körber, N. (2024). End-to-end simulation of nanopore sequencing signals with feed-forward transformers. bioRxiv. doi:10.1101/2024.08.12.607296 
 
 ## Installation 
 
@@ -50,6 +50,10 @@ Generate 10,000 reads from a fasta file:
 ```
 seq2squiggle predict example.fasta -o example.blow5 -n 10000
 ```
+Generate 10,000 reads using R9.4.1 chemistry on a MinION:
+```
+seq2squiggle predict example.fasta -o example.blow5 -n 10000 --profile dna_r9_min
+```
 Generate reads with a coverage of 30:
 ```
 seq2squiggle predict example.fasta -o example.blow5 -c 30
@@ -67,41 +71,38 @@ Export as pod5:
 seq2squiggle predict example.fastq -o example.pod5 --read-input
 ```
 
+## Noise options
+`seq2squiggle` provides flexible options for generating signal data with various noise configurations. By default, it uses its duration sampler and noise sampler modules to predict event durations and amplitude noise levels specific to each input k-mer. Alternatively, you can deactivate these modules (`--noise-sampler False --duration-sampler False`) and use static distributions to sample event durations and amplitude noise. The static distributions can be configured using the options `--noise-std`, `--dwell-std`, and `--dwell-mean`.
 
+### Examples using different noise options
 
-## Different noise options
-`seq2squiggle` supports different options for generating the signal data.
-Per default, the noise sampler and duration sampler are used.
-
-### Examples
-
-Generate reads using both the noise sampler and duration sampler: 
+Default configuration (noise sampler and duration sampler enabled): 
 ```
 seq2squiggle predict example.fasta -o example.blow5
 ```
-Generate reads using the noise sampler with an increased factor and duration sampler:
+Using the noise sampler with increased noise standard deviation and the duration sampler:
 ```
 seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.5
 ```
-Generate reads using a static normal distribution for the noise and duration sampler:
+Using a static normal distribution for the amplitude noise and the duration sampler:
 ```
-seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.5 --noise-sampling False
+seq2squiggle predict example.fasta -o example.blow5 --noise-std 1.0 --noise-sampling False
 ```
-Generate reads using only the noise sampler and a static normal distribution for the event length:
+Using the noise sampler and a static normal distribution for event durations:
 ```
-seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length -1
+seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-std 4.0
 ```
-Generate reads using only the noise sampler and ideal event lengths:
+Using the noise sampler with ideal event lengths (each k-mer event will have a length of 10):
 ```
-seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0
+seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-mean 10.0 --dwell-std 0.0
 ```
-Generate reads using a static normal distribution for the amplitude noise and ideal event lengths:
+Using a static normal distribution for amplitude noise and ideal event lengths:
 ```
-seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0 --noise-sampling False --noise-std 1.0
+seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-mean 10.0 --dwell-std 0.0 --noise-sampling False --noise-std 1.0
 ```
-Generate reads using no amplitude noise and ideal event lengths:
+Generating reads with no amplitude noise and ideal event lengths:
 ```
-seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --ideal-event-length 10.0 --noise-sampling False --noise-std -1
+seq2squiggle predict example.fasta -o example.blow5 --duration-sampling False --dwell-mean 10.0 --dwell-std 0.0 --noise-sampling False --noise-std 0.0
 ```
 
 ## Train a new model
@@ -125,4 +126,5 @@ seq2squiggle train train_dir valid_dir --config my_config.yml --model last.ckpt
 ```
 
 ## Acknowledgement
-The model is based on [xcmyz's implementation of FastSpeech](https://github.com/xcmyz/FastSpeech). Some code snippets for preprocessing DNA-signal chunks have been taken from [bonito](https://github.com/nanoporetech/bonito). 
+The model is based on [xcmyz's implementation of FastSpeech](https://github.com/xcmyz/FastSpeech). Some code snippets for preprocessing DNA-signal chunks have been taken from [bonito](https://github.com/nanoporetech/bonito). We also incorporated code snippets from [Casanovo](https://github.com/Noble-Lab/casanovo) for different functionalities, including downloading weights, logging, and the design of the main function. 
+Additionally, we used parameter profiles from squigulator for various chemistries to set digitisation, sample-rate, range, median_before, and other signal parameters. These profiles are detailed in [squigulator's documentation](https://hasindu2008.github.io/squigulator/docs/profile.html).
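
The static options described in the README's noise section can be pictured with a small Python sketch. It assumes, as a simplification, that each k-mer has a single expected signal level, that the event length is drawn from a normal distribution (rounded and clipped to at least one sample), and that amplitude noise is zero-mean Gaussian. The function `simulate_static` and its parameters are hypothetical and only mirror the option names `--dwell-mean`, `--dwell-std`, and `--noise-std`; this is not seq2squiggle's implementation.

```python
import random


def simulate_static(levels, dwell_mean=10.0, dwell_std=0.0, noise_std=0.0, seed=0):
    """Expand per-k-mer signal levels into a sample sequence using static distributions."""
    rng = random.Random(seed)
    signal = []
    for level in levels:
        # Event duration: normal distribution, rounded, at least one sample.
        dwell = max(1, round(rng.gauss(dwell_mean, dwell_std)))
        # Amplitude: expected level plus zero-mean Gaussian noise per sample.
        signal.extend(level + rng.gauss(0.0, noise_std) for _ in range(dwell))
    return signal


# Hypothetical expected levels for two consecutive k-mers.
levels = [80.0, 95.5]

ideal = simulate_static(levels, dwell_mean=10.0, dwell_std=0.0, noise_std=0.0)
noisy = simulate_static(levels, dwell_mean=10.0, dwell_std=4.0, noise_std=1.0)

print(len(ideal), ideal[:3])
print(len(noisy))
```

With `dwell_std=0.0` and `noise_std=0.0`, every k-mer contributes exactly ten identical samples, which corresponds to the "no amplitude noise and ideal event lengths" case in the README.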
diff --git a/img/seq2squiggle.png b/img/seq2squiggle.png
diff --git a/img/seq2squiggle_architecture.png b/img/seq2squiggle_architecture.png
diff --git a/src/seq2squiggle/__init.py__ b/src/seq2squiggle/__init.py__
diff --git a/src/seq2squiggle/config.yaml b/src/seq2squiggle/config.yaml
@@ -9,7 +9,7 @@ log_name: "Human-R1041-4khz"
 wandb_logger_state: disabled # disabled, online, offline
 
 ### Preprocessing parameters
-max_chunks_train: 170000000 
+max_chunks_train: 210000000
 max_chunks_valid: 100000
 scaling_max_value: 165.0
 # If valid_dir is not provided, validation data will be generated from the training dataset.
@@ -28,9 +28,9 @@ encoder_layers: 2
 encoder_heads: 8
 decoder_layers: 2 
 decoder_heads: 8
-encoder_dropout: 0.1
-decoder_dropout: 0.1 
-duration_dropout: 0.1
+encoder_dropout: 0.2
+decoder_dropout: 0.2 
+duration_dropout: 0.2
 
 ### Learning rate parameters
 train_batch_size: 512
@@ -39,7 +39,7 @@ save_model: True
 # Optimizer. Allowed options: Adam, AdamW, SGD, RMSProp
 optimizer: "Adam"
 warmup_ratio: 0.01 # Percentage of total steps used for warmup
-lr: 0.00025
+lr: 0.0005
 weight_decay: 0.0
 # Schedule for learning rate. Allowed options: warmup_cosine, warmup_constant, constant, warmup_cosine_restarts, one_cycle
 lr_schedule: "warmup_cosine"
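
To put the new `lr: 0.0005` in context with `warmup_ratio: 0.01` and `lr_schedule: "warmup_cosine"`, here is a minimal Python sketch of a linear warmup followed by cosine decay. It assumes the common textbook definition of such a schedule (decay to zero over the remaining steps). The function name `warmup_cosine_lr` and the step count are hypothetical, and the schedule actually used by seq2squiggle may differ in detail.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=0.0005, warmup_ratio=0.01):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


total = 10_000  # hypothetical number of training steps
for step in (0, 99, 100, 5_050, 10_000):
    print(step, warmup_cosine_lr(step, total))
```

Under these assumptions, doubling `lr` from 0.00025 to 0.0005 doubles the rate at every step, since the schedule only scales the base value.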