diff --git a/.github/release_checklist.md b/.github/release_checklist.md
index ba16295c..dd8b23df 100644
--- a/.github/release_checklist.md
+++ b/.github/release_checklist.md
@@ -1,5 +1,7 @@
Release checklist
- [ ] Check outstanding issues on JIRA and Github.
+- [ ] Check that the [latest documentation](https://sequali.readthedocs.io/en/latest)
+  looks fine.
- [ ] Create a release branch.
- [ ] Change current development version in `CHANGELOG.rst` to stable version.
- [ ] Check memory leaks with `tox -e asan`
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index e9867022..40a67f6f 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -13,24 +13,14 @@ on:
- "*"
jobs:
- lint:
- runs-on: ubuntu-20.04
- steps:
- - uses: actions/checkout@v2.3.4
- - name: Set up Python 3.8
- uses: actions/setup-python@v2.2.1
- with:
- python-version: 3.8
- - name: Install tox
- run: pip install tox
- - name: Lint
- run: tox -e lint
package-checks:
strategy:
matrix:
tox_env:
- twine_check
+ - docs
+ - lint
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v2.3.4
@@ -86,7 +76,6 @@ jobs:
# test-arch:
# if: startsWith(github.ref, 'refs/tags') || github.ref == 'refs/heads/develop' || github.ref == 'refs/heads/main'
# runs-on: "ubuntu-latest"
-# needs: lint
# strategy:
# matrix:
# distro: [ "ubuntu20.04" ]
@@ -108,7 +97,7 @@ jobs:
deploy:
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags')
runs-on: ${{ matrix.os }}
- needs: [lint, package-checks, test]
+ needs: [package-checks, test]
strategy:
matrix:
os:
diff --git a/.gitignore b/.gitignore
index b6e47617..0a898c4b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,5 @@
+src/sequali/_version.py
+
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
diff --git a/.readthedocs.yml b/.readthedocs.yml
new file mode 100644
index 00000000..190dfe69
--- /dev/null
+++ b/.readthedocs.yml
@@ -0,0 +1,16 @@
+version: 2
+formats: [] # Do not build epub and pdf
+
+python:
+ install:
+ - requirements: "docs/requirements-docs.txt"
+ - method: "pip"
+ path: "."
+
+sphinx:
+ configuration: docs/conf.py
+
+build:
+ os: "ubuntu-22.04"
+ tools:
+ python: "3"
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index b853f2a9..84e74fed 100644
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -7,6 +7,21 @@ Changelog
.. This document is user facing. Please word the changes in such a way
.. that users understand how the changes affect the new version.
+version 0.6.0
+-----------------
++ Add links to the documentation in the report.
++ Moved documentation to readthedocs and added extensive module documentation.
++ Change the ``--deduplication-estimate-bits`` option to the more
+  understandable ``--duplication-max-stored-fingerprints``.
++ Add a small table that lists how many reads are >=Q5, >=Q7 etc. in the
+ per sequence average quality report.
++ The progress bar can track progress through more file formats.
++ The deduplication fingerprint that is used is now configurable from the
+ command line.
++ The deduplication module starts by gathering all sequences rather than half
+  of them, so that with a big enough hash table all sequences can be
+  considered.
+
version 0.5.1
-----------------
+ Fix a bug in the overrepresented sequence sampling where the fragments from
diff --git a/README.rst b/README.rst
index 5be1940c..faa6d1ee 100644
--- a/README.rst
+++ b/README.rst
@@ -14,10 +14,25 @@
:target: https://github.com/rhpvorderman/sequali/blob/main/LICENSE
:alt:
+.. image:: https://readthedocs.org/projects/sequali/badge/?version=latest
+ :target: https://sequali.readthedocs.io/en/latest/?badge=latest
+ :alt:
+
+.. image:: https://codecov.io/gh/rhpvorderman/sequali/graph/badge.svg?token=MSR1A6BEGC
+ :target: https://codecov.io/gh/rhpvorderman/sequali
+ :alt:
+
+.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.10854010.svg
+ :target: https://doi.org/10.5281/zenodo.10854010
+ :alt:
+
========
-sequali
+Sequali
========
-Sequence quality metrics
+
+.. introduction start
+
+Sequence quality metrics for FASTQ and uBAM files.
Features:
@@ -36,11 +51,18 @@ Features:
Example reports:
-+ `GM24385_1.fastq.gz `_;
++ `GM24385_1.fastq.gz `_;
HG002 (Genome In A Bottle) on ultra-long Nanopore Sequencing. `Sequence file download `_.
+.. introduction end
+
+For more information check `the documentation `_.
+
Supported formats
=================
+
+.. formats start
+
- FASTQ. Only the Sanger variation with a phred offset of 33 and the error rate
calculation of 10 ^ (-phred/10) is supported. All sequencers use this
format today.
@@ -55,9 +77,13 @@ Supported formats
- For uBAM data as delivered by dorado additional nanopore plots will be
provided.
+.. formats end
+
Installation
============
+.. installation start
+
Installation via pip is available with::
pip install sequali
@@ -66,83 +92,37 @@ Sequali is also distributed via bioconda. It can be installed with::
conda install -c conda-forge -c bioconda sequali
-Usage
-=====
+.. installation end
+
+Quickstart
+==========
+
+.. quickstart start
.. code-block::
- usage: sequali [-h] [--json JSON] [--html HTML] [--outdir OUTDIR]
- [--adapter-file ADAPTER_FILE]
- [--overrepresentation-threshold-fraction FRACTION]
- [--overrepresentation-min-threshold THRESHOLD]
- [--overrepresentation-max-threshold THRESHOLD]
- [--overrepresentation-max-unique-fragments N]
- [--overrepresentation-fragment-length LENGTH]
- [--overrepresentation-sample-every DIVISOR]
- [--deduplication-estimate-bits BITS] [-t THREADS] [--version]
- INPUT
-
- Create a quality metrics report for sequencing data.
-
- positional arguments:
- INPUT Input FASTQ or uBAM file. The format is autodetected
- and compressed formats are supported.
-
- options:
- -h, --help show this help message and exit
- --json JSON JSON output file. default: '.json'.
- --html HTML HTML output file. default: '.html'.
- --outdir OUTDIR, --dir OUTDIR
- Output directory for the report files. default:
- current working directory.
- --adapter-file ADAPTER_FILE
- File with adapters to search for. See default file for
- formatting. Default: src/sequali/adapters/adapter_list.tsv.
- --overrepresentation-threshold-fraction FRACTION
- At what fraction a sequence is determined to be
- overrepresented. The threshold is calculated as
- fraction times the number of sampled sequences.
- Default: 0.001 (1 in 1,000).
- --overrepresentation-min-threshold THRESHOLD
- The minimum amount of occurrences for a sequence to be
- considered overrepresented, regardless of the bound
- set by the threshold fraction. Useful for smaller
- files. Default: 100.
- --overrepresentation-max-threshold THRESHOLD
- The amount of occurrences for a sequence to be
- considered overrepresented, regardless of the bound
- set by the threshold fraction. Useful for very large
- files. Default: unlimited.
- --overrepresentation-max-unique-fragments N
- The maximum amount of unique fragments to store.
- Larger amounts increase the sensitivity of finding
- overrepresented sequences at the cost of increasing
- memory usage. Default: 5,000,000.
- --overrepresentation-fragment-length LENGTH
- The length of the fragments to sample. The maximum is
- 31. Default: 21.
- --overrepresentation-sample-every DIVISOR
- How often a read should be sampled. More samples leads
- to better precision, lower speed, and also towards
- more bias towards the beginning of the file as the
- fragment store gets filled up with more sequences from
- the beginning. Default: 1 in 8.
- --deduplication-estimate-bits BITS
- Determines how many sequences are maximally stored to
- estimate the deduplication rate. Maximum stored
- sequences: 2 ** bits * 7 // 10. Memory required: 2 **
- bits * 24. Default: 21.
- -t THREADS, --threads THREADS
- Number of threads to use. If greater than one sequali
- will use an additional thread for gzip decompression.
- Default: 2.
- --version show program's version number and exit
+ sequali path/to/my.fastq.gz
+
+This will create an HTML report ``my.fastq.gz.html`` and a JSON file
+``my.fastq.gz.json`` in the current working directory.
+
+.. quickstart end
+
+For all command line options, check the
+`usage documentation `_.
+
+For more extensive information about the module options check the
+`documentation on the module options
+`_.
Acknowledgements
================
+
+.. acknowledgements start
+
+ `FastQC `_ for
its excellent selection of relevant metrics. For this reason these metrics
- are also gathered by sequali.
+ are also gathered by Sequali.
+ The matplotlib team for their excellent work on colormaps. Their work was
an inspiration for how to present the data and their RdBu colormap is used
to represent quality score data. Check their `writings on colormaps
@@ -152,11 +132,17 @@ Acknowledgements
scores `_.
+ Marcel Martin for providing very extensive feedback.
+.. acknowledgements end
+
License
=======
+.. license start
+
This project is licensed under the GNU Affero General Public License v3. Mainly
to avoid commercial parties from using it without notifying the users that they
-can run it themselves. If you want to include code from sequali in your
+can run it themselves. If you want to include code from Sequali in your
open source project, but it is not compatible with the AGPL, please contact me
and we can discuss a separate license.
+
+.. license end
\ No newline at end of file
diff --git a/codecov.yml b/codecov.yml
new file mode 100644
index 00000000..cbf6103b
--- /dev/null
+++ b/codecov.yml
@@ -0,0 +1,8 @@
+coverage:
+ status:
+ project:
+ default:
+ target: 90 # let's try to hit high standards
+ patch:
+ default:
+ target: 90 # Tests should be written for new features
diff --git a/docs/_static/images/fingerprint.fodg b/docs/_static/images/fingerprint.fodg
new file mode 100644
index 00000000..fc1c7753
--- /dev/null
+++ b/docs/_static/images/fingerprint.fodg
@@ -0,0 +1,456 @@
[456 lines of LibreOffice flat-ODG XML drawing source; the drawing labels three example reads as "Sequence #1", "Sequence #2" and "#3" and is the editable source of fingerprint.svg.]
diff --git a/docs/_static/images/fingerprint.svg b/docs/_static/images/fingerprint.svg
new file mode 100644
index 00000000..154b3e70
--- /dev/null
+++ b/docs/_static/images/fingerprint.svg
@@ -0,0 +1,138 @@
[138 lines of SVG markup: the fingerprint sampling diagram referenced from docs/module_options.rst.]
diff --git a/docs/_static/images/overrepresented_sampling.fodg b/docs/_static/images/overrepresented_sampling.fodg
new file mode 100644
index 00000000..27b19a0f
--- /dev/null
+++ b/docs/_static/images/overrepresented_sampling.fodg
@@ -0,0 +1,509 @@
[509 lines of LibreOffice flat-ODG XML drawing source; the drawing shows "Sequence #1" and "Sequence #2" with "barcode" and "adapter" segments and is the editable source of overrepresented_sampling.svg.]
diff --git a/docs/_static/images/overrepresented_sampling.svg b/docs/_static/images/overrepresented_sampling.svg
new file mode 100644
index 00000000..3cf7716a
--- /dev/null
+++ b/docs/_static/images/overrepresented_sampling.svg
@@ -0,0 +1,264 @@
[264 lines of SVG markup: the overrepresented-sequence sampling diagram referenced from docs/module_options.rst.]
diff --git a/docs/conf.py b/docs/conf.py
new file mode 100644
index 00000000..9eeafbcd
--- /dev/null
+++ b/docs/conf.py
@@ -0,0 +1,33 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+import importlib.metadata
+
+project = 'Sequali'
+copyright = '2023, Leiden University Medical Center'
+author = 'Ruben Vorderman'
+version = importlib.metadata.version("sequali")
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = ['sphinxarg.ext']
+
+templates_path = ['_templates']
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = 'sphinx_rtd_theme'
+html_static_path = ['_static']
+
+html_theme_options = dict(
+ display_version=True,
+)
diff --git a/docs/includes/CHANGELOG.rst b/docs/includes/CHANGELOG.rst
new file mode 120000
index 00000000..bfa394db
--- /dev/null
+++ b/docs/includes/CHANGELOG.rst
@@ -0,0 +1 @@
+../../CHANGELOG.rst
\ No newline at end of file
diff --git a/docs/includes/README.rst b/docs/includes/README.rst
new file mode 120000
index 00000000..c768ff7d
--- /dev/null
+++ b/docs/includes/README.rst
@@ -0,0 +1 @@
+../../README.rst
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
new file mode 100644
index 00000000..194c7d43
--- /dev/null
+++ b/docs/index.rst
@@ -0,0 +1,70 @@
+.. Sequali documentation master file, created by
+ sphinx-quickstart on Mon Mar 25 14:47:42 2024.
+ You can adapt this file completely to your liking, but it should at least
+ contain the root `toctree` directive.
+
+===================================
+Welcome to Sequali's documentation!
+===================================
+
+.. contents:: Table of contents
+
+
+==================
+Introduction
+==================
+
+.. include:: includes/README.rst
+ :start-after: .. introduction start
+ :end-before: .. introduction end
+
+==================
+Supported formats
+==================
+
+.. include:: includes/README.rst
+ :start-after: .. formats start
+ :end-before: .. formats end
+
+==================
+Installation
+==================
+
+.. include:: includes/README.rst
+ :start-after: .. installation start
+ :end-before: .. installation end
+
+==================
+Quickstart
+==================
+
+.. include:: includes/README.rst
+ :start-after: .. quickstart start
+ :end-before: .. quickstart end
+
+For a complete overview of the available command line options check the
+usage below.
+
+For more information about how the different modules work, see the
+`Module option explanations`_.
+
+==================
+Usage
+==================
+
+.. argparse::
+ :module: sequali.__main__
+ :func: argument_parser
+ :prog: sequali
+
+.. include:: module_options.rst
+
+==================
+Acknowledgements
+==================
+
+.. include:: includes/README.rst
+ :start-after: .. acknowledgements start
+ :end-before: .. acknowledgements end
+
+.. include:: includes/CHANGELOG.rst
diff --git a/docs/module_options.rst b/docs/module_options.rst
new file mode 100644
index 00000000..28dab224
--- /dev/null
+++ b/docs/module_options.rst
@@ -0,0 +1,154 @@
+==========================
+Module option explanations
+==========================
+
+Adapter Content Module
+----------------------
+
+The adapter content module searches for adapter stubs that are 12 bp in length.
+These adapter probes are saved in the default adapter file which has the
+following structure:
+
+.. csv-table:: adapter_file.tsv
+ :header: "#Name", "Sequencing Technology", "Probe sequence", "Sequence position"
+
+ "Illumina Universal Adapter", "illumina", "AGATCGGAAGAG", "end"
+ "Illumina Small RNA 3' adapter", "illumina", "TGGAATTCTCGG", "end"
+
+All empty rows and rows starting with ``#`` are ignored. The file is tab
+separated. The columns are as follows:
+
++ Name: The name of the sequence that shows up in the report.
++ Sequencing Technology: The name of the technology. Currently ``illumina``,
+  ``nanopore`` and ``all`` are supported. Sequali detects the technology from
+  the file header and only loads the adapters for that technology as well as
+  those marked ``all``.
++ Probe sequence: the sequence to probe for. Can be up to 64 bp in length.
+  Since exact matching is used, false positives versus false negatives need to
+  be weighed when choosing the probe length.
++ Sequence position: Whether the adapter occurs at the beginning or the end.
+  In the resulting adapter graph, counts for this adapter will accumulate
+  towards the beginning or the end depending on this field.
+
+A new adapter file can be set with the ``--adapter-file`` flag on the CLI.
+
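+Because the file is plain tab-separated text it is easy to inspect or extend.
+The sketch below shows one way such a file could be read in Python; it is
+purely illustrative, not the parser Sequali uses internally, and the
+``Adapter`` tuple is made up for the example:
+
+.. code-block:: python
+
+    import csv
+    from typing import List, NamedTuple
+
+    class Adapter(NamedTuple):
+        name: str
+        technology: str
+        probe: str
+        position: str  # "begin" or "end"
+
+    def read_adapter_file(path: str, technology: str) -> List[Adapter]:
+        adapters = []
+        with open(path, newline="") as handle:
+            for row in csv.reader(handle, delimiter="\t"):
+                # Empty rows and rows starting with '#' are ignored.
+                if not row or row[0].startswith("#"):
+                    continue
+                name, tech, probe, position = row[:4]
+                # Keep adapters for the detected technology plus those
+                # marked "all".
+                if tech in (technology, "all"):
+                    adapters.append(Adapter(name, tech, probe, position))
+        return adapters
+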
+Overrepresented Sequences Module
+----------------------------------
+Determining overrepresented sequences is challenging. One way is to take
+all the k-mers of each sequence and count all the k-mer occurrences. To avoid
+issues with read orientation the canonical k-mers should be taken [#F1]_.
+Storing and counting all k-mers is very compute-intensive, as a k-mer has to
+be calculated and stored for every position in the sequence.
+
+Sequali therefore divides a sequence in fragments of length k. Unlike k-mers
+which are overlapping, this ensures that each part of the sequence is
+represented by just one fragment. The disadvantage is that these fragments
+can be caught in different frames, unlike k-mers which capture all possible
+frames for length k. This hampers the detection rate.
+
+Since most overrepresented sequences will be adapter and helper sequences
+and since most of these sequences will be anchored at the beginning and end
+of the read, this problem is alleviated by capturing the fragments from the
+ends towards the middle. This means that the first fragment will always be
+the first 21 bp of the read and the last fragment the last 21 bp. As such,
+adapter sequences will always be sampled in the same frame.
+
+.. figure:: _static/images/overrepresented_sampling.svg
+
+ This figure shows how fragments are sampled in Sequali. The silver elements
+ represent the fragments. Sequence #1 is longer and hence more fragments are
+ sampled. Despite the length difference between sequence #1 and sequence #2,
+ the fragments for the adapter and barcode sequences are the same.
+ In sequence #1 the fragments sampled from the back overlap somewhat with
+ fragments sampled from the front. This is necessary to ensure all of the
+ sequence is sampled when the length is not divisible by the fragment
+ length.
+
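+The sampling scheme can be illustrated with a short pure-Python sketch. It is
+not the actual implementation (which works on a two-bit encoding in C) and the
+helper names are made up, but the arithmetic mirrors the description above:
+
+.. code-block:: python
+
+    def canonical(fragment: str) -> str:
+        # Lexicographically smallest of the fragment and its reverse
+        # complement, so both read orientations count together.
+        revcomp = fragment.translate(str.maketrans("ACGT", "TGCA"))[::-1]
+        return min(fragment, revcomp)
+
+    def sample_fragments(sequence: str, fragment_length: int = 21) -> list:
+        total = (len(sequence) + fragment_length - 1) // fragment_length
+        from_back = total // 2
+        # Point where sampling from the back starts. Front and back
+        # fragments may overlap slightly when the sequence length is not
+        # divisible by the fragment length.
+        mid_point = len(sequence) - from_back * fragment_length
+        front = [sequence[i:i + fragment_length]
+                 for i in range(0, mid_point, fragment_length)]
+        back = [sequence[i:i + fragment_length]
+                for i in range(mid_point, len(sequence), fragment_length)]
+        return [canonical(f) for f in front + back]
+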
+Fragments are stored and counted in a hash table. When the hash table is full,
+only fragments that are already present will be counted. To limit the time
+spent in this module, by default 1 in 8 sequences is analysed.
+
+After the module is run, the counts of the stored fragments are checked. If a
+count exceeds a certain threshold the fragment is considered overrepresented.
+Sequali does a k-mer analysis of the overrepresented sequences and compares
+them with sequences from the NCBI UniVec database to determine possible origins.
+
+The following command line parameters affect this module:
+
++ ``--overrepresentation-threshold-fraction``: If count / total exceeds this
+ fraction, the fragment is considered overrepresented.
++ ``--overrepresentation-min-threshold``: The minimum count to be considered
+ overrepresented.
++ ``--overrepresentation-max-threshold``: The count above which a fragment is
+  always considered overrepresented, regardless of the threshold fraction. On
+  large libraries with billions of sampled fragments this can be used to force
+  detection at a fixed count.
++ ``--overrepresentation-max-unique-fragments``: The maximum number of unique
+  fragments to store.
++ ``--overrepresentation-sample-every``: How often a sequence is sampled. Default
+ is every 8 sequences.
+
+.. [#F1] A canonical k-mer is whichever of a k-mer and its reverse complement
+         has the lowest sort order. This way the canonical k-mer is always the
+         same regardless of whether the sequence or its reverse complement is
+         read. This is useful to identify sequences regardless of orientation.
+
+Duplication Estimation Module
+-----------------------------
+Properly evaluating duplication in a reference-free fashion requires an
+all-to-all alignment of the sequences and a predefined set of criteria to
+ascertain whether two sequences are duplicates. This is impractical.
+
+For a practical estimate it is common practice to take a small part of the
+sequence as a fingerprint and use a hash table to store and count fingerprints.
+Since the fingerprint is small, sequence errors do not affect it heavily. As
+such this can provide a reasonable estimate, which is good enough for detecting
+problematic libraries.
+
+Sequali fingerprints sequences by collecting a small sample from the front and
+back of the sequence. To avoid adapter sequences, the samples are taken at an
+offset. If the sequence is short, the offsets are shrunk proportionally. If the
+sequence is shorter than the combined sample lengths, its entire length
+is sampled.
+
+.. figure:: _static/images/fingerprint.svg
+
+ Sequali fingerprinting. Small samples are taken from the front and back
+ of the sequence at an offset. Sequence #1 shows the common situation where
+ the sequence is long. Sequence #2 is smaller than the combined length of
+ the offsets and the samples, so the offsets are shrunk proportionally.
+ Sequence #3 is smaller than the sample length, so it is sampled entirely.
+
+The two samples are then concatenated and hashed. The hash seed is the
+sequence length integer-divided by 64. The resulting hash is the fingerprint.
+
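+In rough Python terms the fingerprint is constructed as in the sketch below.
+Python's built-in ``hash`` stands in for the MurmurHash3 variant used
+internally, so the values differ, but the slicing and the seed follow the
+description above:
+
+.. code-block:: python
+
+    def fingerprint(sequence: str,
+                    front_length: int = 8, back_length: int = 8,
+                    front_offset: int = 64, back_offset: int = 64) -> int:
+        sample_length = front_length + back_length
+        if len(sequence) <= sample_length:
+            # Short sequences are hashed in their entirety.
+            return hash((sequence, 0))
+        remainder = len(sequence) - sample_length
+        # Cap the offsets so both samples still fit in the sequence.
+        front_offset = min(remainder // 2, front_offset)
+        back_offset = min(remainder // 2, back_offset)
+        end = len(sequence) - back_offset
+        sample = (sequence[front_offset:front_offset + front_length] +
+                  sequence[end - back_length:end])
+        # The hash seed is the sequence length integer-divided by 64.
+        return hash((sample, len(sequence) // 64))
+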
+Since not all fingerprints can be counted due to memory constraints, `a hash
+subsampling technique from the file storage world
+`_ is used.
+
+This technique first counts all the fingerprints. Then when the hash table is
+full, a new hash table is created. The already counted fingerprints are inserted
+but only if the last bit of the hash is ``0``. This eliminates on average half
+of the fingerprints. The fingerprinting and counting is then continued, but
+only hashes that end in ``0`` are considered. If the hash table is full again,
+the process is repeated but now only hashes that end with the last two bits
+``00`` are considered, and so on.
+
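+A minimal sketch of this subsampled counting, using a plain dictionary where
+the real implementation uses a fixed-size hash table in C:
+
+.. code-block:: python
+
+    class SubsampledCounter:
+        def __init__(self, max_stored: int = 1_000_000):
+            self.max_stored = max_stored
+            self.counts = {}      # fingerprint hash -> count
+            self.modulo_bits = 0  # required number of trailing zero bits
+
+        def add(self, fingerprint_hash: int) -> None:
+            mask = (1 << self.modulo_bits) - 1
+            if fingerprint_hash & mask:
+                return  # not enough trailing zero bits: ignored
+            while (fingerprint_hash not in self.counts
+                    and len(self.counts) >= self.max_stored):
+                # Table is full: require one more trailing zero bit and
+                # drop stored fingerprints that no longer qualify.
+                self.modulo_bits += 1
+                mask = (1 << self.modulo_bits) - 1
+                self.counts = {h: c for h, c in self.counts.items()
+                               if not h & mask}
+                if fingerprint_hash & mask:
+                    return
+            self.counts[fingerprint_hash] = (
+                self.counts.get(fingerprint_hash, 0) + 1)
+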
+The advantage of this technique is that only part of the fingerprints needs to
+be stored, which limits memory usage. As stated in the paper, this technique is
+much less biased towards unique sequences than simply storing only the
+fingerprints from the beginning of the file.
+
+The following command line options affect this module:
+
++ ``--duplication-max-stored-fingerprints``: The maximum amount of stored
+ fingerprints. More fingerprints lead to more accurate estimates but also more
+ memory usage.
+
+The following options control how the fingerprint is taken; a usage sketch
+follows the list:
+
++ ``--fingerprint-front-length``
++ ``--fingerprint-back-length``
++ ``--fingerprint-front-offset``
++ ``--fingerprint-back-offset``
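+
+These parameters map directly onto the underlying ``DedupEstimator``
+extension type. A minimal usage sketch, with the import path taken from the
+type stubs and the built-in defaults spelled out:
+
+.. code-block:: python
+
+    from sequali._qc import DedupEstimator
+
+    estimator = DedupEstimator(
+        max_stored_fingerprints=1_000_000,
+        front_sequence_length=8,
+        back_sequence_length=8,
+        front_sequence_offset=64,
+        back_sequence_offset=64,
+    )
+    estimator.add_sequence("ACGT" * 100)
+    estimator.add_sequence("ACGT" * 100)   # exact duplicate of the first read
+    print(estimator.tracked_sequences)     # number of stored fingerprints
+    print(estimator.duplication_counts())  # per-fingerprint counts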
diff --git a/docs/requirements-docs.txt b/docs/requirements-docs.txt
new file mode 100644
index 00000000..91ff9bd4
--- /dev/null
+++ b/docs/requirements-docs.txt
@@ -0,0 +1,3 @@
+sphinx
+sphinx-argparse
+sphinx_rtd_theme
diff --git a/pyproject.toml b/pyproject.toml
index ada053bc..c57ab4b0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -32,7 +32,7 @@ classifiers = [
]
requires-python = ">3.8"
dependencies = [
- "xopen>=1.8.0",
+ "xopen>=2.0.0",
"pygal>=3.0.4",
"tqdm",
]
@@ -57,7 +57,9 @@ sequali = [
'style/*'
]
[project.urls]
+"Documentation" = "https://sequali.readthedocs.io"
"Homepage" = "https://github.com/rhpvorderman/sequali"
+"Issue tracker" = "https://github.com/rhpvorderman/sequali/issues"
[tool.setuptools_scm]
write_to = "src/sequali/_version.py"
diff --git a/scripts/finger_print_quality.py b/scripts/finger_print_quality.py
index 06a3a1c4..042db5e3 100644
--- a/scripts/finger_print_quality.py
+++ b/scripts/finger_print_quality.py
@@ -12,28 +12,44 @@ def fingerprint_sequence_original(sequence: str):
return sequence[:16] + sequence[-16:]
-def new_fingerprint(sequence: str, fingerprint_length=32, max_offset=32):
- fingerprint_part_length = fingerprint_length // 2
+def new_fingerprint(sequence: str,
+ front_length: int,
+ back_length: int,
+ front_offset: int,
+ back_offset: int):
+ fingerprint_length = front_length + back_length
if len(sequence) < fingerprint_length:
return sequence
+
remainder = len(sequence) - fingerprint_length
- offset = max(remainder // 2, max_offset)
- return sequence[offset: offset + fingerprint_part_length] + sequence[-(offset + fingerprint_part_length):-offset]
+ front_offset = min(remainder // 2, front_offset)
+ back_offset = min(remainder // 2, back_offset)
+    # Compute the slice end explicitly to avoid an empty back sample when
+    # back_offset is 0 (a slice ending at -0 would be empty).
+    end = len(sequence) - back_offset
+    return (sequence[front_offset: front_offset + front_length] +
+            sequence[end - back_length:end])
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("fastq")
- parser.add_argument("--fingerprint-length", nargs="?", default=32, type=int)
- parser.add_argument("--offset", nargs="?", default=32, type=int)
+ parser.add_argument("--front-length", default=8, type=int)
+ parser.add_argument("--back-length", default=8, type=int)
+ parser.add_argument("--front-offset", default=64, type=int)
+ parser.add_argument("--back-offset", default=64, type=int)
args = parser.parse_args()
- offset = args.offset
- fingerprint_length = args.fingerprint_length
+ front_length = args.front_length
+ back_length = args.back_length
+ fingerprint_length = front_length + back_length
expected_errors = [0 for _ in range(fingerprint_length + 1)]
with dnaio.open(args.fastq, mode="r", open_threads=1) as reader:
for read in reader: # type: dnaio.SequenceRecord
- fingerprint_quals = new_fingerprint(read.qualities, fingerprint_length, offset)
+ fingerprint_quals = new_fingerprint(
+ read.qualities,
+ front_length=front_length,
+ back_length=back_length,
+ front_offset=args.front_offset,
+ back_offset=args.back_offset,
+ )
prob = 0.0
for q in fingerprint_quals.encode("ascii"):
prob += QUAL_TO_PHRED[q]
diff --git a/scripts/fingerprinter.py b/scripts/fingerprinter.py
index 2e3328ec..79905ce1 100644
--- a/scripts/fingerprinter.py
+++ b/scripts/fingerprinter.py
@@ -23,4 +23,5 @@ def fastq_file_to_hashes(fastq_file):
dupcounter = collections.Counter(counter.values())
print(dict(sorted(dupcounter.items())))
estimated_fractions = DuplicationCounts.estimated_counts_to_fractions(dupcounter.items())
- print(estimated_fractions)
\ No newline at end of file
+ print(estimated_fractions)
+ print(DuplicationCounts.deduplicated_fraction(dupcounter))
\ No newline at end of file
diff --git a/src/sequali/__init__.py b/src/sequali/__init__.py
index 27930719..a77e1efa 100644
--- a/src/sequali/__init__.py
+++ b/src/sequali/__init__.py
@@ -1,18 +1,19 @@
# Copyright (C) 2023 Leiden University Medical Center
-# This file is part of sequali
+# This file is part of Sequali
#
-# sequali is free software: you can redistribute it and/or modify
+# Sequali is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
-# sequali is distributed in the hope that it will be useful,
+# Sequali is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
-# along with sequali. If not, see argparse.ArgumentParser:
parser.add_argument("--overrepresentation-min-threshold", type=int,
metavar="THRESHOLD",
default=100,
- help=f"The minimum amount of occurrences for a sequence "
- f"to be considered overrepresented, regardless of "
- f"the bound set by the threshold fraction. Useful for "
- f"smaller files. Default: {100}.")
+ help="The minimum amount of occurrences for a sequence "
+ "to be considered overrepresented, regardless of "
+ "the bound set by the threshold fraction. Useful for "
+ "smaller files. Default: 100.")
parser.add_argument("--overrepresentation-max-threshold", type=int,
metavar="THRESHOLD",
default=sys.maxsize,
@@ -101,18 +114,47 @@ def argument_parser() -> argparse.ArgumentParser:
f"gets filled up with more sequences from the "
f"beginning. "
f"Default: 1 in {DEFAULT_UNIQUE_SAMPLE_EVERY}.")
- parser.add_argument("--deduplication-estimate-bits", type=int,
- default=DEFAULT_DEDUP_HASH_TABLE_SIZE_BITS,
- metavar="BITS",
- help=f"Determines how many sequences are maximally "
- f"stored to estimate the deduplication rate. "
- f"Maximum stored sequences: 2 ** bits * 7 // 10. "
- f"Memory required: 2 ** bits * 24. "
- f"Default: {DEFAULT_DEDUP_HASH_TABLE_SIZE_BITS}.")
+ parser.add_argument("--duplication-max-stored-fingerprints", type=int,
+ default=DEFAULT_DEDUP_MAX_STORED_FINGERPRINTS,
+ metavar="N",
+ help=f"Determines how many fingerprints are maximally "
+ f"stored to estimate the duplication rate. "
+ f"More fingerprints leads to a more accurate "
+ f"estimate, but also more memory usage. "
+ f"Default: {DEFAULT_DEDUP_MAX_STORED_FINGERPRINTS:,}.")
+ parser.add_argument("--fingerprint-front-length", type=int,
+ default=DEFAULT_FINGERPRINT_FRONT_SEQUENCE_LENGTH,
+ metavar="LENGTH",
+ help=f"Set the number of bases to be taken for the "
+ f"deduplication fingerprint from the front of "
+ f"the sequence. "
+ f"Default: {DEFAULT_FINGERPRINT_FRONT_SEQUENCE_LENGTH}.")
+
+ parser.add_argument("--fingerprint-back-length", type=int,
+ default=DEFAULT_FINGERPRINT_BACK_SEQUENCE_LENGTH,
+ metavar="LENGTH",
+ help=f"Set the number of bases to be taken for the "
+ f"deduplication fingerprint from the back of "
+ f"the sequence. "
+ f"Default: {DEFAULT_FINGERPRINT_BACK_SEQUENCE_LENGTH}.")
+ parser.add_argument("--fingerprint-front-offset", type=int,
+ default=DEFAULT_FINGERPRINT_FRONT_SEQUENCE_OFFSET,
+ metavar="LENGTH",
+ help=f"Set the offset for the front part of the "
+ f"deduplication fingerprint. Useful for avoiding "
+ f"adapter sequences. "
+ f"Default: {DEFAULT_FINGERPRINT_FRONT_SEQUENCE_OFFSET}.")
+ parser.add_argument("--fingerprint-back-offset", type=int,
+ default=DEFAULT_FINGERPRINT_BACK_SEQUENCE_OFFSET,
+ metavar="LENGTH",
+ help=f"Set the offset for the back part of the "
+ f"deduplication fingerprint. Useful for avoiding "
+ f"adapter sequences. "
+ f"Default: {DEFAULT_FINGERPRINT_BACK_SEQUENCE_OFFSET}.")
parser.add_argument("-t", "--threads", type=int, default=2,
help="Number of threads to use. If greater than one "
- "sequali will use an additional thread for gzip "
- "decompression. Default: 2.")
+ "an additional thread for gzip "
+ "decompression will be used. Default: 2.")
parser.add_argument("--version", action="version",
version=__version__)
return parser
@@ -133,32 +175,38 @@ def main() -> None:
sample_every=args.overrepresentation_sample_every
)
dedup_estimator = DedupEstimator(
- hash_table_size_bits=args.deduplication_estimate_bits)
+ max_stored_fingerprints=args.duplication_max_stored_fingerprints,
+ front_sequence_length=args.fingerprint_front_length,
+ front_sequence_offset=args.fingerprint_front_offset,
+ back_sequence_length=args.fingerprint_back_length,
+ back_sequence_offset=args.fingerprint_back_offset,
+ )
nanostats = NanoStats()
filename: str = args.input
threads = args.threads
if threads < 1:
raise ValueError(f"Threads must be greater than 1, got {threads}.")
- with xopen.xopen(filename, "rb", threads=threads-1) as file: # type: ignore
- progress = ProgressUpdater(filename, file)
- if filename.endswith(".bam") or (
- hasattr(file, "peek") and file.peek(4)[:4] == b"BAM\1"):
- reader = BamParser(file)
- seqtech = guess_sequencing_technology_from_bam_header(reader.header)
- else:
- reader = FastqParser(file) # type: ignore
- seqtech = guess_sequencing_technology_from_file(file) # type: ignore
- adapters = list(adapters_from_file(args.adapter_file, seqtech))
- adapter_counter = AdapterCounter(adapter.sequence for adapter in adapters)
- with progress:
- for record_array in reader:
- metrics.add_record_array(record_array)
- per_tile_quality.add_record_array(record_array)
- adapter_counter.add_record_array(record_array)
- sequence_duplication.add_record_array(record_array)
- nanostats.add_record_array(record_array)
- dedup_estimator.add_record_array(record_array)
- progress.update(record_array)
+ with open(filename, "rb") as raw:
+ progress = ProgressUpdater(raw)
+ with xopen.xopen(raw, "rb", threads=threads - 1) as file:
+ if filename.endswith(".bam") or (
+ hasattr(file, "peek") and file.peek(4)[:4] == b"BAM\1"):
+ reader = BamParser(file)
+ seqtech = guess_sequencing_technology_from_bam_header(reader.header)
+ else:
+ reader = FastqParser(file) # type: ignore
+ seqtech = guess_sequencing_technology_from_file(file) # type: ignore
+ adapters = list(adapters_from_file(args.adapter_file, seqtech))
+ adapter_counter = AdapterCounter(adapter.sequence for adapter in adapters)
+ with progress:
+ for record_array in reader:
+ metrics.add_record_array(record_array)
+ per_tile_quality.add_record_array(record_array)
+ adapter_counter.add_record_array(record_array)
+ sequence_duplication.add_record_array(record_array)
+ nanostats.add_record_array(record_array)
+ dedup_estimator.add_record_array(record_array)
+ progress.update(record_array)
report_modules = calculate_stats(
filename,
metrics,
@@ -204,6 +252,5 @@ def sequali_report():
output = ".".join(in_json.split(".")[:-1]) + ".html"
with open(in_json) as j:
json_data = json.load(j)
- timestamp = os.stat(in_json).st_mtime
write_html_report(dict_to_report_modules(json_data), output,
- output.rstrip(".html"), timestamp)
+                      output[:-len(".html")])
diff --git a/src/sequali/_qc.pyi b/src/sequali/_qc.pyi
index c1494ec0..939fe0e8 100644
--- a/src/sequali/_qc.pyi
+++ b/src/sequali/_qc.pyi
@@ -1,18 +1,18 @@
# Copyright (C) 2023 Leiden University Medical Center
-# This file is part of sequali
+# This file is part of Sequali
#
-# sequali is free software: you can redistribute it and/or modify
+# Sequali is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
-# sequali is distributed in the hope that it will be useful,
+# Sequali is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
-# along with sequali. If not, see None: ...
def add_record_array(self, __record_array: FastqRecordArrayView) -> None: ...
- def get_tile_averages(self) -> List[Tuple[int, List[float]]]: ...
def get_tile_counts(self) -> List[Tuple[int, List[float], List[int]]]: ...
class SequenceDuplication:
@@ -108,14 +112,25 @@ class SequenceDuplication:
min_threshold: int = 1,
max_threshold: int = sys.maxsize,
) -> List[Tuple[int, float, str]]: ...
- def duplication_counts(self) -> array.ArrayType: ...
class DedupEstimator:
_modulo_bits: int
_hash_table_size: int
tracked_sequences: int
+ front_sequence_length: int
+ back_sequence_length: int
+ front_sequence_offset: int
+ back_sequence_offset: int
- def __init__(self, hash_table_size_bits: int = 21): ...
+ def __init__(
+ self,
+ max_stored_fingerprints: int = DEFAULT_DEDUP_MAX_STORED_FINGERPRINTS,
+ *,
+ front_sequence_length: int = DEFAULT_FINGERPRINT_FRONT_SEQUENCE_LENGTH,
+ back_sequence_length: int = DEFAULT_FINGERPRINT_BACK_SEQUENCE_LENGTH,
+ front_sequence_offset: int = DEFAULT_FINGERPRINT_FRONT_SEQUENCE_OFFSET,
+ back_sequence_offset: int = DEFAULT_FINGERPRINT_BACK_SEQUENCE_OFFSET,
+ ): ...
def add_sequence(self, __sequence: str) -> None: ...
def add_record_array(self, __record_array: FastqRecordArrayView) -> None: ...
def duplication_counts(self) -> array.ArrayType: ...
diff --git a/src/sequali/_qcmodule.c b/src/sequali/_qcmodule.c
index 747aec05..2e0f3437 100644
--- a/src/sequali/_qcmodule.c
+++ b/src/sequali/_qcmodule.c
@@ -1,19 +1,19 @@
/*
Copyright (C) 2023 Leiden University Medical Center
-This file is part of sequali
+This file is part of Sequali
-sequali is free software: you can redistribute it and/or modify
+Sequali is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
-sequali is distributed in the hope that it will be useful,
+Sequali is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
-along with sequali. If not, see tp_name);
- return NULL;
- }
if (!PyUnicode_IS_COMPACT_ASCII(name_obj)) {
PyErr_Format(PyExc_ValueError,
"name should contain only ASCII characters: %R",
name_obj);
return NULL;
}
- if (!PyUnicode_CheckExact(sequence_obj)) {
- PyErr_Format(PyExc_TypeError,
- "sequence should be of type str, got %s.",
- Py_TYPE(sequence_obj)->tp_name);
- return NULL;
- }
if (!PyUnicode_IS_COMPACT_ASCII(sequence_obj)) {
PyErr_Format(PyExc_ValueError,
"sequence should contain only ASCII characters: %R",
sequence_obj);
return NULL;
}
- if (!PyUnicode_CheckExact(qualities_obj)) {
- PyErr_Format(PyExc_TypeError,
- "qualities should be of type str, got %s.",
- Py_TYPE(qualities_obj)->tp_name);
- return NULL;
- }
if (!PyUnicode_IS_COMPACT_ASCII(qualities_obj)) {
PyErr_Format(PyExc_ValueError,
"qualities should contain only ASCII characters: %R",
qualities_obj);
return NULL;
}
-
uint8_t *name = PyUnicode_DATA(name_obj);
size_t name_length = PyUnicode_GET_LENGTH(name_obj);
@@ -2183,13 +2164,13 @@ AdapterCounter__new__(PyTypeObject *type, PyObject *args, PyObject *kwargs)
}
}
self = PyObject_New(AdapterCounter, type);
- uint64_t **uint64_tmp = PyMem_Malloc(sizeof(uint64_t *) * number_of_adapters);
- if (uint64_tmp == NULL) {
+ uint64_t **counter_tmp = PyMem_Malloc(sizeof(uint64_t *) * number_of_adapters);
+ if (counter_tmp == NULL) {
PyErr_NoMemory();
goto error;
}
- memset(uint64_tmp, 0, sizeof(uint64_t *) * number_of_adapters);
- self->adapter_counter = uint64_tmp;
+ memset(counter_tmp, 0, sizeof(uint64_t *) * number_of_adapters);
+ self->adapter_counter = counter_tmp;
self->adapters = NULL;
self->matchers = NULL;
self->max_length = 0;
@@ -2622,7 +2603,7 @@ static PyMemberDef AdapterCounter_members[] = {
{NULL},
};
-static PyTypeObject Adapteruint64_type = {
+static PyTypeObject AdapterCounter_type = {
.tp_name = "_qc.AdapterCounter",
.tp_basicsize = sizeof(AdapterCounter),
.tp_dealloc = (destructor)AdapterCounter_dealloc,
@@ -2942,68 +2923,6 @@ PerTileQuality_add_record_array(PerTileQuality *self, FastqRecordArrayView *reco
}
-PyDoc_STRVAR(PerTileQuality_get_tile_averages__doc__,
-"get_tile_averages($self, /)\n"
-"--\n"
-"\n"
-"Get a list of tuples with the tile IDs and a list of their averages. \n"
-);
-
-#define PerTileQuality_get_tile_averages_method METH_NOARGS
-
-static PyObject *
-PerTileQuality_get_tile_averages(PerTileQuality *self, PyObject *Py_UNUSED(ignore))
-{
- TileQuality *tile_qualities = self->tile_qualities;
- size_t maximum_tile = self->number_of_tiles;
- size_t tile_length = self->max_length;
- PyObject *result = PyList_New(0);
- if (result == NULL) {
- return PyErr_NoMemory();
- }
-
- for (size_t i=0; itotal_errors;
- uint64_t *length_counts = tile_quality->length_counts;
- if (length_counts == NULL && total_errors == NULL) {
- continue;
- }
- PyObject *entry = PyTuple_New(2);
- PyObject *tile_id = PyLong_FromSize_t(i);
- PyObject *averages_list = PyList_New(tile_length);
- if (entry == NULL || tile_id == NULL || averages_list == NULL) {
- Py_DECREF(result);
- return PyErr_NoMemory();
- }
-
- /* Work back from the lenght counts. If we have 200 reads total and a
- 100 are length 150 and a 100 are length 120. This means we have
- a 100 bases at each position 120-150 and 200 bases at 0-120. */
- uint64_t total_bases = 0;
- for (Py_ssize_t j=tile_length - 1; j >= 0; j -= 1) {
- total_bases += length_counts[j];
- double error_count = total_errors[j];
- double average = error_count / (double)total_bases;
- PyObject *average_obj = PyFloat_FromDouble(average);
- if (average_obj == NULL) {
- Py_DECREF(result);
- return PyErr_NoMemory();
- }
- PyList_SET_ITEM(averages_list, j, average_obj);
- }
- PyTuple_SET_ITEM(entry, 0, tile_id);
- PyTuple_SET_ITEM(entry, 1, averages_list);
- int ret = PyList_Append(result, entry);
- if (ret != 0) {
- Py_DECREF(result);
- return NULL;
- }
- Py_DECREF(entry);
- }
- return result;
-}
-
PyDoc_STRVAR(PerTileQuality_get_tile_counts__doc__,
"get_tile_counts($self, /)\n"
"--\n"
@@ -3073,9 +2992,6 @@ static PyMethodDef PerTileQuality_methods[] = {
PerTileQuality_add_read_method, PerTileQuality_add_read__doc__},
{"add_record_array", (PyCFunction)PerTileQuality_add_record_array,
PerTileQuality_add_record_array_method, PerTileQuality_add_record_array__doc__},
- {"get_tile_averages", (PyCFunction)PerTileQuality_get_tile_averages,
- PerTileQuality_get_tile_averages_method,
- PerTileQuality_get_tile_averages__doc__},
{"get_tile_counts", (PyCFunction)PerTileQuality_get_tile_counts,
PerTileQuality_get_tile_counts_method,
PerTileQuality_get_tile_counts__doc__},
@@ -3343,12 +3259,28 @@ SequenceDuplication_add_meta(SequenceDuplication *self, struct FastqMeta *meta)
return 0;
}
uint8_t *sequence = meta->record_start + meta->sequence_offset;
- Py_ssize_t mid_point = (sequence_length + 1) / 2;
+ /* A full fragment at the beginning and the end is desired so that adapter
+ fragments at the beginning and end do not get added to the hash table in
+ a lot of different frames. To do so sample from the beginning and end
+ with a little overlap in the middle
+
+ | <- mid_point
+ sequence ==========================================
+ from front |------||------||------|
+ from back |------||------||------|
+
+    The mid_point is not the exact middle, but the approximate middle point
+    where sampling from the back starts.
+
+ If the sequence length is exactly divisible by the fragment length, this
+ results in exactly no overlap between front and back fragments, while
+ still all of the sequence is being sampled.
+ */
Py_ssize_t total_fragments = (sequence_length + fragment_length - 1) / fragment_length;
Py_ssize_t from_mid_point_fragments = total_fragments / 2;
- Py_ssize_t mid_point_start = sequence_length - (from_mid_point_fragments * fragment_length);
+ Py_ssize_t mid_point = sequence_length - (from_mid_point_fragments * fragment_length);
bool warn_unknown = false;
- // Save all fragments starting from 0 and up to the midpoint.
+ // Sample front sequences
for (Py_ssize_t i = 0; i < mid_point; i += fragment_length) {
int64_t kmer = sequence_to_canonical_kmer(sequence + i, fragment_length);
if (kmer < 0) {
@@ -3361,10 +3293,9 @@ SequenceDuplication_add_meta(SequenceDuplication *self, struct FastqMeta *meta)
uint64_t hash = wanghash64(kmer);
Sequence_duplication_insert_hash(self, hash);
}
- // Save all subsequences of length k starting from the end until the point
- // where the previous loop has saved the sequences. There might be slight
- // overlap in the middle..
- for (Py_ssize_t i = mid_point_start; i < sequence_length; i += fragment_length) {
+
+ // Sample back sequences
+ for (Py_ssize_t i = mid_point; i < sequence_length; i += fragment_length) {
int64_t kmer = sequence_to_canonical_kmer(sequence + i, fragment_length);
if (kmer < 0) {
if (kmer == TWOBIT_UNKNOWN_CHAR) {
@@ -3613,40 +3544,6 @@ SequenceDuplication_overrepresented_sequences(SequenceDuplication *self,
return NULL;
}
-PyDoc_STRVAR(SequenceDuplication_duplication_counts__doc__,
-"duplication_counts($self)\n"
-"--\n"
-"\n"
-"Return a array.array with only the counts.\n"
-);
-
-#define SequenceDuplication_duplication_counts_method METH_NOARGS
-
-static PyObject *
-SequenceDuplication_duplication_counts(SequenceDuplication *self,
- PyObject *Py_UNUSED(ignore))
-{
- uint64_t number_of_uniques = self->number_of_unique_fragments;
- uint64_t *counts = PyMem_Calloc(number_of_uniques, sizeof(uint64_t));
- if (counts == NULL) {
- return PyErr_NoMemory();
- }
- uint32_t *counters = self->counts;
- size_t count_index = 0;
- size_t hash_table_size = self->hash_table_size;
-
- for (size_t i=0; i < hash_table_size; i+=1) {
- uint32_t count = counters[i];
- if (count != 0) {
- counts[count_index] = count;
- count_index += 1;
- }
- }
- PyObject *result = PythonArray_FromBuffer('Q', counts, number_of_uniques * sizeof(uint64_t));
- PyMem_Free(counts);
- return result;
-}
-
static PyMethodDef SequenceDuplication_methods[] = {
{"add_read", (PyCFunction)SequenceDuplication_add_read,
SequenceDuplication_add_read_method,
@@ -3661,10 +3558,6 @@ static PyMethodDef SequenceDuplication_methods[] = {
(PyCFunction)(void(*)(void))SequenceDuplication_overrepresented_sequences,
SequenceDuplication_overrepresented_sequences_method,
SequenceDuplication_overrepresented_sequences__doc__},
- {"duplication_counts",
- (PyCFunction)(void(*)(void))SequenceDuplication_duplication_counts,
- SequenceDuplication_duplication_counts_method,
- SequenceDuplication_duplication_counts__doc__},
{NULL},
};
@@ -3710,10 +3603,25 @@ Fei Xie, Michael Condict, Sandip Shete
https://www.usenix.org/system/files/conference/atc13/atc13-xie.pdf
*/
-// 2 ** 21 * 12 is 24MB which balloons to 48MB when creating a new table.
-// This allows storing up to 1.46 million sequences which leads to quite
-// accurate results.
-#define DEFAULT_DEDUP_HASH_TABLE_SIZE_BITS 21
+/*
+Store 1 million fingerprints. This requires 24MB which balloons to 48MB when
+creating a new table. Between 500,000 and 1,000,000 sequences will lead to a
+quite accurate result.
+*/
+#define DEFAULT_DEDUP_MAX_STORED_FINGERPRINTS 1000000
+
+/*
+Avoid the beginning and end of the sequence by at most 64 bp to avoid
+any adapters. Take the 8 bp after the start offset and the 8 bp before
+the end offset. This creates a small 16 bp fingerprint. Hash it using
+MurmurHash. 16 bp is small and therefore relatively insensitive to
+sequencing errors while still offering 4^16 or 4 billion distinct
+fingerprints.
+*/
+#define DEFAULT_FINGERPRINT_FRONT_SEQUENCE_LENGTH 8
+#define DEFAULT_FINGERPRINT_BACK_SEQUENCE_LENGTH 8
+#define DEFAULT_FINGERPRINT_FRONT_SEQUENCE_OFFSET 64
+#define DEFAULT_FINGERPRINT_BACK_SEQUENCE_OFFSET 64
// Use packing at the 4-byte boundary to save 4 bytes of storage.
#pragma pack(4)
@@ -3729,47 +3637,108 @@ typedef struct _DedupEstimatorStruct {
size_t hash_table_size;
size_t max_stored_entries;
size_t stored_entries;
+ size_t front_sequence_length;
+ size_t front_sequence_offset;
+ size_t back_sequence_length;
+ size_t back_sequence_offset;
+ uint8_t *fingerprint_store;
struct EstimatorEntry *hash_table;
} DedupEstimator;
static void
DedupEstimator_dealloc(DedupEstimator *self) {
PyMem_Free(self->hash_table);
+ PyMem_Free(self->fingerprint_store);
Py_TYPE(self)->tp_free((PyObject *)self);
}
static PyObject *
DedupEstimator__new__(PyTypeObject *type, PyObject *args, PyObject *kwargs) {
- Py_ssize_t hash_table_size_bits = DEFAULT_DEDUP_HASH_TABLE_SIZE_BITS;
- static char *kwargnames[] = {"hash_table_size_bits", NULL};
- static char *format = "|n:DedupEstimator";
+ Py_ssize_t max_stored_fingerprints = DEFAULT_DEDUP_MAX_STORED_FINGERPRINTS;
+ Py_ssize_t front_sequence_length = DEFAULT_FINGERPRINT_FRONT_SEQUENCE_LENGTH;
+ Py_ssize_t front_sequence_offset = DEFAULT_FINGERPRINT_FRONT_SEQUENCE_OFFSET;
+ Py_ssize_t back_sequence_length = DEFAULT_FINGERPRINT_BACK_SEQUENCE_LENGTH;
+ Py_ssize_t back_sequence_offset = DEFAULT_FINGERPRINT_BACK_SEQUENCE_OFFSET;
+ static char *kwargnames[] = {
+ "max_stored_fingerprints",
+ "front_sequence_length",
+ "back_sequence_length",
+ "front_sequence_offset",
+ "back_sequence_offset",
+ NULL
+ };
+ static char *format = "|n$nnnn:DedupEstimator";
if (!PyArg_ParseTupleAndKeywords(args, kwargs, format, kwargnames,
- &hash_table_size_bits)) {
+ &max_stored_fingerprints,
+ &front_sequence_length,
+ &back_sequence_length,
+ &front_sequence_offset,
+ &back_sequence_offset)) {
return NULL;
}
- if (hash_table_size_bits < 8 || hash_table_size_bits > 58) {
+
+    if (max_stored_fingerprints < 100) {
PyErr_Format(
PyExc_ValueError,
- "hash_table_size_bits must be between 8 and 58, not %zd",
- hash_table_size_bits
+ "max_stored_fingerprints must be at least 100, not %zd",
+ max_stored_fingerprints
);
return NULL;
}
+ size_t hash_table_size_bits = (size_t)(log2(max_stored_fingerprints * 1.5) + 1);
+
+ Py_ssize_t lengths_and_offsets[4] = {
+ front_sequence_length,
+ back_sequence_length,
+ front_sequence_offset,
+ back_sequence_offset,
+ };
+ for (size_t i=0; i < 4; i++) {
+ if (lengths_and_offsets[i] < 0) {
+ PyErr_Format(
+ PyExc_ValueError,
+ "%s must be at least 0, got %zd.",
+ kwargnames[i+1],
+ lengths_and_offsets[i]
+ );
+ return NULL;
+ }
+ }
+ size_t fingerprint_size = front_sequence_length + back_sequence_length;
+ if (fingerprint_size == 0) {
+ PyErr_SetString(
+ PyExc_ValueError,
+            "The sum of front_sequence_length and back_sequence_length must be greater than 0"
+ );
+ return NULL;
+ }
+
size_t hash_table_size = 1ULL << hash_table_size_bits;
+ uint8_t *fingerprint_store = PyMem_Malloc(fingerprint_size);
+ if (fingerprint_store == NULL) {
+ return PyErr_NoMemory();
+ }
struct EstimatorEntry *hash_table = PyMem_Calloc(hash_table_size, sizeof(struct EstimatorEntry));
if (hash_table == NULL) {
+ PyMem_Free(fingerprint_store);
return PyErr_NoMemory();
}
DedupEstimator *self = PyObject_New(DedupEstimator, type);
if (self == NULL) {
+ PyMem_Free(fingerprint_store);
PyMem_Free(hash_table);
return PyErr_NoMemory();
}
+ self->front_sequence_length = front_sequence_length;
+ self->front_sequence_offset = front_sequence_offset;
+ self->back_sequence_length = back_sequence_length;
+ self->back_sequence_offset = back_sequence_offset;
+ self->fingerprint_store = fingerprint_store;
self->hash_table_size = hash_table_size;
// Get about 70% occupancy max
- self->max_stored_entries = (hash_table_size * 7) / 10;
+ self->max_stored_entries = max_stored_fingerprints;
self->hash_table = hash_table;
- self->modulo_bits = 1;
+ self->modulo_bits = 0;
self->stored_entries = 0;
return (PyObject *)self;
}
@@ -3816,35 +3785,30 @@ DedupEstimator_increment_modulo(DedupEstimator *self)
return 0;
}
-/*
-Avoid the beginning and end of the sequence by at most 64 bp to avoid
-any adapters. Take the 8 bp after the start offset and the 8 bp before
-the end offset. This creates a small 16 bp fingerprint. Hash it using
-MurmurHash. 16 bp is small and therefore relatively insensitive to
-sequencing errors while still offering 4^16 or 4 billion distinct
-fingerprints.
-*/
-#define FINGERPRINT_MAX_OFFSET 64
-#define FINGERPRINT_LENGTH 16
-
static int
DedupEstimator_add_sequence_ptr(DedupEstimator *self,
uint8_t *sequence, size_t sequence_length)
{
uint64_t hash;
- if (sequence_length < 16) {
+ size_t front_sequence_length = self->front_sequence_length;
+ size_t back_sequence_length = self->back_sequence_length;
+ size_t front_sequence_offset = self->front_sequence_offset;
+ size_t back_sequence_offset = self->back_sequence_offset;
+ size_t fingerprint_length = front_sequence_length + back_sequence_length;
+ uint8_t *fingerprint = self->fingerprint_store;
+ if (sequence_length <= fingerprint_length) {
hash = MurmurHash3_x64_64(sequence, sequence_length, 0);
} else {
uint64_t seed = sequence_length >> 6;
- uint8_t fingerprint[FINGERPRINT_LENGTH];
- size_t remainder = sequence_length - FINGERPRINT_LENGTH;
- size_t offset = Py_MIN(remainder / 2, FINGERPRINT_MAX_OFFSET);
- memcpy(fingerprint, sequence + offset, FINGERPRINT_LENGTH / 2);
- memcpy(fingerprint + (FINGERPRINT_LENGTH / 2),
- sequence + sequence_length - (offset + (FINGERPRINT_LENGTH / 2)),
- (FINGERPRINT_LENGTH / 2));
- hash = MurmurHash3_x64_64(fingerprint, FINGERPRINT_LENGTH, seed);
+ size_t remainder = sequence_length - fingerprint_length;
+ size_t front_offset = Py_MIN(remainder / 2, front_sequence_offset);
+ size_t back_offset = Py_MIN(remainder / 2, back_sequence_offset);
+ memcpy(fingerprint, sequence + front_offset, front_sequence_length);
+ memcpy(fingerprint + front_sequence_length,
+ sequence + sequence_length - (back_offset + back_sequence_length),
+ back_sequence_length);
+ hash = MurmurHash3_x64_64(fingerprint, fingerprint_length, seed);
}
size_t modulo_bits = self->modulo_bits;
size_t ignore_mask = (1ULL << modulo_bits) - 1;
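
The generalized fingerprint above replaces the hard-coded 8 + 8 bp samples at 64 bp offsets with configurable lengths and offsets. A sketch of the same selection logic in Python, where the defaults mirror the previous constants; `blake2b` only stands in for `MurmurHash3_x64_64`, which has no standard-library equivalent, so the hash values themselves will differ from the extension's:

```python
import hashlib

def fingerprint_hash(sequence: bytes,
                     front_length: int = 8, back_length: int = 8,
                     front_offset: int = 64, back_offset: int = 64) -> int:
    """Sketch of the configurable fingerprint; not the extension's implementation."""
    fingerprint_length = front_length + back_length
    if len(sequence) <= fingerprint_length:
        sample, seed = sequence, 0
    else:
        # Reads in the same 64 bp length bucket share a seed, so reads of very
        # different lengths end up with different fingerprints.
        seed = len(sequence) >> 6
        remainder = len(sequence) - fingerprint_length
        # Clamp the offsets on short reads so the front and back samples never overlap.
        front_off = min(remainder // 2, front_offset)
        back_off = min(remainder // 2, back_offset)
        sample = (sequence[front_off:front_off + front_length]
                  + sequence[len(sequence) - back_off - back_length:
                             len(sequence) - back_off])
    # blake2b stands in for MurmurHash3 so this sketch runs without dependencies.
    digest = hashlib.blake2b(sample, digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")
```
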
@@ -3998,6 +3962,14 @@ static PyMemberDef DedupEstimator_members[] = {
READONLY, NULL},
{"tracked_sequences", T_ULONGLONG, offsetof(DedupEstimator, stored_entries),
READONLY, NULL},
+ {"front_sequence_length", T_ULONGLONG,
+ offsetof(DedupEstimator, front_sequence_length), READONLY, NULL},
+ {"back_sequence_length", T_ULONGLONG,
+ offsetof(DedupEstimator, back_sequence_length), READONLY, NULL},
+ {"front_sequence_offset", T_ULONGLONG,
+ offsetof(DedupEstimator, front_sequence_offset), READONLY, NULL},
+ {"back_sequence_offset", T_ULONGLONG,
+ offsetof(DedupEstimator, back_sequence_offset), READONLY, NULL},
{NULL},
};
@@ -4481,7 +4453,7 @@ PyInit__qc(void)
if (python_module_add_type(m, &QCMetrics_Type) != 0) {
return NULL;
}
- if (python_module_add_type(m, &Adapteruint64_type) != 0) {
+ if (python_module_add_type(m, &AdapterCounter_type) != 0) {
return NULL;
}
if (python_module_add_type(m, &PerTileQuality_Type) != 0) {
@@ -4514,8 +4486,12 @@ PyInit__qc(void)
PyModule_AddIntMacro(m, PHRED_MAX);
PyModule_AddIntMacro(m, MAX_SEQUENCE_SIZE);
PyModule_AddIntMacro(m, DEFAULT_MAX_UNIQUE_FRAGMENTS);
- PyModule_AddIntMacro(m, DEFAULT_DEDUP_HASH_TABLE_SIZE_BITS);
+ PyModule_AddIntMacro(m, DEFAULT_DEDUP_MAX_STORED_FINGERPRINTS);
PyModule_AddIntMacro(m, DEFAULT_FRAGMENT_LENGTH);
PyModule_AddIntMacro(m, DEFAULT_UNIQUE_SAMPLE_EVERY);
+ PyModule_AddIntMacro(m, DEFAULT_FINGERPRINT_FRONT_SEQUENCE_LENGTH);
+ PyModule_AddIntMacro(m, DEFAULT_FINGERPRINT_BACK_SEQUENCE_LENGTH);
+ PyModule_AddIntMacro(m, DEFAULT_FINGERPRINT_FRONT_SEQUENCE_OFFSET);
+ PyModule_AddIntMacro(m, DEFAULT_FINGERPRINT_BACK_SEQUENCE_OFFSET);
return m;
}
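
A hypothetical usage sketch of the reworked constructor. The keyword names are inferred from the validation loop and the new read-only members in the hunks above; the Python-level signature itself is not part of this diff, so treat the call below as an assumption rather than the documented API:

```python
from sequali._qc import DedupEstimator

# Assumed keyword names, matching kwargnames[] and the member table above.
estimator = DedupEstimator(
    max_stored_fingerprints=1_000_000,
    front_sequence_length=8,
    back_sequence_length=8,
    front_sequence_offset=64,
    back_sequence_offset=64,
)
# The new read-only members expose the fingerprint configuration to the report.
print(estimator.front_sequence_length, estimator.back_sequence_offset,
      estimator.tracked_sequences)
```
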
diff --git a/src/sequali/adapters.py b/src/sequali/adapters.py
index 0cce6f4d..7df7b415 100644
--- a/src/sequali/adapters.py
+++ b/src/sequali/adapters.py
@@ -1,18 +1,18 @@
# Copyright (C) 2023 Leiden University Medical Center
-# This file is part of sequali
+# This file is part of Sequali
#
-# sequali is free software: you can redistribute it and/or modify
+# Sequali is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
-# sequali is distributed in the hope that it will be useful,
+# Sequali is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
-# along with sequali. If not, see str:
return f"""
-
- Every sequence is fingerprinted by skipping the first 64 bases and
- taking the first 8 bases after that, as well as getting the
- 8 bases before the last 64 bases. This gives a small 16 bp
- sequence from two 8 bp stubs from the beginning and end.
- This sequence is hashed using the length divided by 64 as a seed,
- which results in the final fingerprint.
- This ensures that sequences that have very different lengths get
- different fingerprints. The 64 bp offset ensures that sequencing adapters
- at the beginning or end of the sequence are not taken into account. On
- short sequences, the offsets are proportionally shrunk.
- The 16 bp length of the sequence used as base for the hash limits
- the effect of sequencing errors, especially on long-read sequencing
- technologies.
- A subsample of the fingerprints is stored to
- estimate the duplication rate. See,
-
- the paper describing the methodology.
-
The subsample for this file consists of
- {self.tracked_unique_sequences:,} fingerprints.
+            Fingerprints are taken by sampling a stretch of the sequence at an
+            offset from the beginning and another at an offset from the end.
+            The two samples are combined and hashed, using the sequence length
+            as a seed. A subsample of these fingerprints is stored to estimate
+            the duplication rate. See
+
+ the documentation for a complete explanation.
+
+
+
+
Fingerprint front sequence length
+
+ {self.fingerprint_front_sequence_length:,}
+
+
+
+
Fingerprint front sequence offset
+
+ {self.fingerprint_front_sequence_offset:,}
+
+
+
+
Fingerprint back sequence length
+
+                    {self.fingerprint_back_sequence_length:,}
+
+
+
+
Fingerprint back sequence offset
+
+ {self.fingerprint_back_sequence_offset:,}
+
+
+
+
Subsampled fingerprints
+
+ {self.tracked_unique_sequences:,}
+
+
+
+
Estimated remaining sequences if deduplicated
+
{self.remaining_fraction:.2%}
+
+
-
Estimated remaining sequences if deduplicated:
- {self.remaining_fraction:.2%}
- """
+ """
return f"""
Duplication percentages
{first_part}
@@ -1094,6 +1146,10 @@ def from_dedup_estimator(cls, dedup_est: DedupEstimator):
duplication_counts=sorted(duplication_categories.items()),
estimated_duplication_fractions=estimated_duplication_fractions,
remaining_fraction=deduplicated_fraction,
+ fingerprint_front_sequence_length=dedup_est.front_sequence_length,
+ fingerprint_back_sequence_length=dedup_est.back_sequence_length,
+ fingerprint_front_sequence_offset=dedup_est.front_sequence_offset,
+ fingerprint_back_sequence_offset=dedup_est.back_sequence_offset,
)
@@ -1157,7 +1213,9 @@ def to_html(self) -> str:
Fragments are stored in their canonical representation. That is
either the sequence or the reverse complement, whichever has
the lowest sort order. Both representations are shown in the
- table.
+ table. See
+
+ the documentation for a complete explanation.
The percentage shown is an estimate based on the number of
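
The canonical-representation rule described in this hunk can be sketched in a few lines of Python. This is only an illustration of the selection rule, not the C code used by the module, which works on encoded sequences:

```python
def canonical(fragment: str) -> str:
    """Return whichever of the fragment or its reverse complement sorts first."""
    # Reverse complement via a translation table, then reverse the string.
    revcomp = fragment.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return min(fragment, revcomp)

print(canonical("ACCGGGTT"))  # AACCCGGT: the reverse complement sorts first
print(canonical("AACCCGGT"))  # AACCCGGT: both strands map to the same canonical form
```
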
@@ -1546,11 +1604,11 @@ def write_html_report(report_modules: Iterable[ReportModule],
{os.path.basename(filename)}: Sequali Report
- Report created by sequali. Please visit the
+ Report created by Sequali. Please visit the
homepage
for bug reports and feature requests.
-
sequali report
+
Sequali report
""")
# size: {os.stat(filename).st_size / (1024 ** 3):.2f}GiB
for module in report_modules:
diff --git a/src/sequali/sequence_identification.py b/src/sequali/sequence_identification.py
index db839a1c..7c32dad3 100644
--- a/src/sequali/sequence_identification.py
+++ b/src/sequali/sequence_identification.py
@@ -1,18 +1,18 @@
# Copyright (C) 2023 Leiden University Medical Center
-# This file is part of sequali
+# This file is part of Sequali
#
-# sequali is free software: you can redistribute it and/or modify
+# Sequali is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
-# sequali is distributed in the hope that it will be useful,
+# Sequali is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
-# along with sequali. If not, see <https://www.gnu.org/licenses/>.