This version includes a distance table that stores calculated distances, and a new get_distance_new function that checks and updates it. This should give a roughly 2000x speed improvement over the older PHE version. The distance between two samples is now only ever calculated once; previously it was calculated 2 or 3 times per cluster merge.
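The idea behind the distance table can be sketched in a few lines. This is a minimal in-memory illustration, not the code used in snapper3: the class DistanceCache, the toy variant sets and the snp_distance function are all hypothetical, and a dict stands in for the database table.

```python
class DistanceCache(object):
    """In-memory stand-in for the distance table (hypothetical)."""

    def __init__(self, compute):
        self._compute = compute   # expensive pairwise distance function
        self._table = {}          # (sample_a, sample_b) -> distance
        self.computed = 0         # number of times compute was really called

    @staticmethod
    def _key(a, b):
        # Distance is symmetric, so each pair is stored under one ordered key.
        return (a, b) if a <= b else (b, a)

    def get_distance(self, a, b):
        if a == b:
            return 0
        key = self._key(a, b)
        if key not in self._table:
            # First request for this pair: calculate and remember it.
            self._table[key] = self._compute(a, b)
            self.computed += 1
        return self._table[key]


# Toy data: each sample is a set of variant positions (made up for this example).
variants = {"s1": {10, 20, 30}, "s2": {10, 20, 40}, "s3": {50}}


def snp_distance(a, b):
    # Toy distance: number of positions present in exactly one of the samples.
    return len(variants[a] ^ variants[b])


cache = DistanceCache(snp_distance)
cache.get_distance("s1", "s2")   # calculated and stored
cache.get_distance("s2", "s1")   # served from the table, not recalculated
print(cache.computed)
```

Every later lookup for the same pair, in either order, is answered from the table, which is where the saving during cluster merges comes from.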
A beta version of the software is provided in /phengs/hpc_software/phe/snapper3/3-0. To use it:
module load phe/snapper3/3-0
Prerequisites/dependencies:
- Python >= 2.7.6
- psycopg2 >= 2.5.2
The postgres server must be running postgres >=9.6 and the contrib package must be installed so that the intarray extension can be created for each database.
All commands require a connection string of the format:
"host='db host IP' dbname='database_name' user='uname' password='password'"
New databases can be created with the script 'reset.sh'. The script can also be used to scrap existing databases and start from scratch.
sh reset.sh dbuser 158.119.123.123 my_snapper3_db
A script is provided to migrate an existing snapper v2 database to the new format.
usage: migrate_to_snapperV3.py [-h] --reference REFNAME --oldconnstring
CONNECTION --newconnstring CONNECTION
version 0.1, date 30Sep2016, author [email protected]
optional arguments:
-h, --help show this help message and exit
--reference REFNAME, -r REFNAME
The sample_name of the reference genome in the
database.
--oldconnstring CONNECTION, -o CONNECTION
Connection string for old db ('source')
--newconnstring CONNECTION, -n CONNECTION
Connection string for new db ('target')
To create a new database, run the reset.sh script and then use the add_reference subcommand of snapper3.py:
usage: snapper3.py add_reference [-h] --connstring CONNECTION --reference
FASTAFILE --input JSONFILE [--ref-name NAME]
Takes variants for a sample in json format and adds them to the database.
optional arguments:
-h, --help show this help message and exit
--connstring CONNECTION, -c CONNECTION
REQUIRED. Connection string for db.
--reference FASTAFILE
REQUIRED. Fasta reference file.
--input JSONFILE, -i JSONFILE
REQUIRED. Path to an input file.
--ref-name NAME, -r NAME
The name of the reference to go into the db [default: reference file name before 1st dot]
The best way to add the variants of a new sample to the database is to submit them in json format. These json files can be made from vcf files with the latest (1-4) version of Phenix. For add_reference, the json file defines the ignore positions within the reference. It is usually made by running phenix run_snp_pipeline with simulated reads on the reference and applying the usual filters to the resulting vcf, which is then converted to json and used here.
To add samples other than the reference to the database use:
usage: snapper3.py add_sample [-h] --input FILE --format FORMAT --connstring
CONNECTION --refname REFNAME
[--sample-name NAME] [--reference FASTAFILE]
[--min-coverage FLOAT]
Takes variants for a sample in json format and adds them to the database.
optional arguments:
-h, --help show this help message and exit
--input FILE, -i FILE
REQUIRED. Path to an input file.
--format FORMAT, -f FORMAT
REQUIRED. Choose from 'json' or 'fasta'.
--connstring CONNECTION, -c CONNECTION
REQUIRED. Connection string for db.
--refname REFNAME, -r REFNAME
REQUIRED. The sample_name of the reference genome in the database.
--sample-name NAME, -s NAME
The name of the sample to go into the db [default: input file name before 1st dot]
--reference FASTAFILE
Path to reference for this sample. Must be the same as used for the database.
REQUIRED when format is fasta, else ignored.
--min-coverage FLOAT Minimum coverage required to allow sample in database. Only applicable with json
format, ignored for fasta. This will check for the coverageMetaData annotation calculated by Phenix
and only allow the sample in, if the mean coverage is >= this value.
[default: do not check this]
Note:
- The name of the reference in the database is required here, because reference-ignore-positions are not stored for each sample individually.
- The sample can also be added from a fasta file (consensus sequence). This requires the reference as a fasta file (--reference).
- A minimum coverage can be specified (--min-coverage). This is read from the Phenix annotation in the json input file (-i). If the coverage is below the threshold the sample is not added to the database.
add_sample only creates entries for the sample in the samples and variants tables; it does no clustering.
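When add_sample is called from a script, it can help to assemble the command line in one place. The following is a minimal sketch, assuming that snapper3.py is on the PATH after loading the module. The helper add_sample_command, the file name sample1.json, the reference name my_reference and the password are hypothetical placeholders; the flags are the ones listed in the usage text above.

```python
def add_sample_command(input_file, fmt, connstring, refname,
                       sample_name=None, reference=None, min_coverage=None):
    """Build the argument list for 'snapper3.py add_sample' (hypothetical helper)."""
    if fmt not in ("json", "fasta"):
        raise ValueError("format must be 'json' or 'fasta'")
    if fmt == "fasta" and reference is None:
        # The usage text says --reference is REQUIRED when format is fasta.
        raise ValueError("--reference is required when format is fasta")
    cmd = ["snapper3.py", "add_sample",
           "--input", input_file,
           "--format", fmt,
           "--connstring", connstring,
           "--refname", refname]
    if sample_name is not None:
        cmd += ["--sample-name", sample_name]
    if reference is not None:
        cmd += ["--reference", reference]
    if min_coverage is not None:
        cmd += ["--min-coverage", str(min_coverage)]
    return cmd


# Connection string in the documented format; all values are placeholders.
connstring = "host='{0}' dbname='{1}' user='{2}' password='{3}'".format(
    "158.119.123.123", "my_snapper3_db", "dbuser", "secret")

cmd = add_sample_command("sample1.json", "json", connstring, "my_reference",
                         min_coverage=30.0)
print(" ".join(cmd))

# To actually run it (needs snapper3.py on the PATH and a reachable database):
#   import subprocess
#   subprocess.call(cmd)
```

Passing the arguments as a list rather than one shell string avoids quoting problems with the single quotes inside the connection string.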
To cluster a sample use:
usage: snapper3.py cluster_sample [-h] --connstring CONNECTION --sample-name
NAME [--no-zscore-check]
[--with-registration] [--force-merge]
After the variants for a sample have been added to the database, use this to
determine the clustering for this sample. Will perform all statistical checks and
merging if necessary and update the database accordingly. If statistical checks fail
database will not be updated.
optional arguments:
-h, --help show this help message and exit
--connstring CONNECTION, -c CONNECTION
Connection string for db. REQUIRED
--sample-name NAME, -s NAME
The name of the sample to go into the db. REQUIRED.
--no-zscore-check Do not perform checks and just add the sample. It's fine.
[Default: Perform checks.]
--with-registration Register the clustering for this sample in the database
and update the cluster stats. [Default: Do not register.]
--force-merge Add the sample even if it causes clusters to merge.
[Default: Do not add if merge required.]
Note:
- --no-zscore-check will switch off all zscore calculations and cluster the sample. A sample added in this way will be excluded from all sample statistics and not be considered in future z-score checks.
- --with-registration Without this option no permanent change is made to the database. Effectively, the postgres transaction is not committed before exiting.
- --force-merge The default behaviour is that if a sample requires a merge, the sample is not clustered. With this option the sample will be clustered and all merging will be done.
If a sample is dodgy and you do not want to cluster it, use remove_sample to either remove all traces of the sample from the database or set the cluster to ignored=yes.
Further options:
- get_alignment Takes a blank-separated list of sample names and provides an alignment in the same way that Phenix vcf2fasta does.
- remove_sample If you use this for a sample that has already been clustered, it removes the sample and undoes all the clustering. This TAKES FOREVER because it has to check whether removing the sample causes any clusters to be split. Do not use unless you have to (i.e. no backup available and starting from scratch not an option).
- export_sample_variants Exports the variants for a sample in json format.
- get_closest For a given sample, return the N closest samples in the database, or all samples closer than x SNPs. This uses the SnapperDBInterrogation class.
API: The same class also serves as the most glorious API. Use it like this:
from lib.SnapperDBInterrogation import SnapperDBInterrogation, SnapperDBInterrogationError

# conf (a dict of connection settings) and dbname must be defined by the caller.
with SnapperDBInterrogation(host=conf['db_host'],
                            dbname=dbname,
                            user=conf['db_username'],
                            password=conf['db_password']) as sdbi:
    print sdbi.get_closest_samples("123456_H1234567890-2", 10)
    print sdbi.get_samples_below_threshold("123456_H1234567890-2", 20)
    print sdbi.get_snp_address("123456_H1234567890-2")
    print sdbi.get_nearest("123456_H1234567890-2")
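
The two query modes of get_closest (the N closest samples, or all samples closer than x SNPs) can be made concrete with a small in-memory sketch. This is not how SnapperDBInterrogation is implemented: the functions closest_n and below_threshold and the distances dict are hypothetical, and treating the threshold as a strict "less than" is an assumption.

```python
def closest_n(distances, n):
    """Return the n (sample, distance) pairs with the smallest distance."""
    # Sort by distance, then by name so that ties come out in a stable order.
    return sorted(distances.items(), key=lambda kv: (kv[1], kv[0]))[:n]


def below_threshold(distances, x):
    """Return all (sample, distance) pairs strictly closer than x SNPs."""
    hits = [(s, d) for s, d in distances.items() if d < x]
    return sorted(hits, key=lambda kv: (kv[1], kv[0]))


# Made-up SNP distances from one query sample to four other samples.
distances = {"s2": 3, "s3": 12, "s4": 0, "s5": 25}

print(closest_n(distances, 2))
print(below_threshold(distances, 12))
```

With this toy data both calls return the same two samples, s4 and s2, because s3 sits exactly on the threshold and is excluded by the strict comparison.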