A centralised Python implementation of InterPro production procedures.
- Python 3.11+, with packages `oracledb`, `mysqlclient`, `psycopg3`, and `mundone`
- GCC with the `sqlite3.h` header
pip install .
The `pyinterprod` package relies on three configuration files:

- `main.conf`: contains database connection strings, paths to files provided by/to UniProtKB, and various workflow parameters.
- `members.conf`: contains paths to files used to update InterPro's member databases (e.g. files containing signatures, HMM files, etc.).
- `analyses.conf`: contains settings for the InterProScan match calculation (`ipr-calc`).
All files can be renamed. `main.conf` is passed as a command line argument, and the paths to `members.conf` and `analyses.conf` are defined in `main.conf`.
The expected format for database connection strings is `user/password@host:port/service`. For Oracle databases, `user/password@service` may work as well, depending on `tnsnames.ora`.
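The two connection string formats can be checked before a workflow is started. The following is a minimal sketch, not the parser used by `pyinterprod`: `ConnectionInfo` and `parse_connection_string` are hypothetical names, and passwords containing `@` are not handled.

```python
import re
from typing import NamedTuple, Optional


class ConnectionInfo(NamedTuple):
    user: str
    password: str
    host: Optional[str]
    port: Optional[int]
    service: str


# Matches user/password@host:port/service, or user/password@service
# (the short form, which relies on tnsnames.ora for Oracle databases).
_PATTERN = re.compile(
    r"^(?P<user>[^/@]+)/(?P<password>[^@]+)@"
    r"(?:(?P<host>[^:/@]+):(?P<port>\d+)/)?(?P<service>[^:/@]+)$"
)


def parse_connection_string(value: str) -> ConnectionInfo:
    """Split a connection string into its components (hypothetical helper)."""
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError("expected user/password@host:port/service")
    port = match.group("port")
    return ConnectionInfo(
        user=match.group("user"),
        password=match.group("password"),
        host=match.group("host"),
        port=int(port) if port else None,
        service=match.group("service"),
    )
```

With the short form, `host` and `port` are `None`, which makes it easy to tell the two formats apart.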
- oracle:
  - ipro-interpro: connection string for the `interpro` user in the InterPro database
  - ipro-iprscan: connection string for the `iprscan` user in the InterPro database
  - ipro-uniparc: connection string for the `uniparc` user in the InterPro database
  - iscn-iprscan: connection string for the `iprscan` user in the InterProScan database
  - iscn-uniparc: connection string for the `uniparc` user in the InterProScan database
  - unpr-goapro: connection string for the GOA database
  - unpr-swpread: connection string for the Swiss-Prot database
  - unpr-uapro: connection string for the UniParc production database
  - unpr-uaread: connection string for the UniParc database
- postgresql:
  - pronto: connection string
- uniprot:
  - version: release number (e.g. `2019_08`)
  - date: date for the public release (e.g. `18-Sep-2019`)
  - swiss-prot: path to Swiss-Prot flat file
  - trembl: path to TrEMBL flat file
  - unirule: path to file listing InterPro entries and member database signatures used in UniRule
  - xrefs: path to directory where to export InterPro cross-references (generated for UniProt)
- emails:
  - server: outgoing server (format: `host:port`)
  - sender: sender's email address (e.g. user running the workflow)
  - aa: email address of the Automatic Annotation team
  - aa_dev: email address of the Automatic Annotation development team
  - interpro: email address of the InterPro team
  - uniprot_db: email address of the UniProt database team
  - uniprot_db: email address of the UniProt production team
  - unirule: email address of the UniRule team (curators from EMBL-EBI, SIB, and PIR)
  - sib: email address of the Swiss-Prot team
- misc:
  - analyses: path to the `analyses.conf` config file
  - members: path to the `members.conf` config file
  - scheduler: scheduler and queue (format: `scheduler:queue`, e.g. `lsf:production`)
  - pronto_url: URL of the Pronto curation application
  - data_dir: directory where to store staging files
  - match_calc_dir: directory where to run InterProScan match calculation
  - temporary_dir: directory for temporary files
  - workflows_dir: directory for workflows SQLite files, and jobs' input/output files
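To illustrate how the `misc` settings tie the three configuration files together, here is a minimal sketch. It assumes that `main.conf` is an INI file readable with `configparser` and that its section names match the list above. `load_misc_settings` is a hypothetical helper and the paths are placeholders.

```python
import configparser


def load_misc_settings(text: str) -> dict:
    """Parse main.conf-like text and return selected [misc] settings."""
    # Assumption: main.conf uses the INI syntax understood by configparser.
    config = configparser.ConfigParser()
    config.read_string(text)
    misc = config["misc"]
    # 'scheduler' is documented as scheduler:queue, e.g. lsf:production
    scheduler, _, queue = misc["scheduler"].partition(":")
    return {
        "members": misc["members"],
        "analyses": misc["analyses"],
        "scheduler": scheduler,
        "queue": queue or None,
    }


# Placeholder content, for illustration only
example = """
[misc]
members = /path/to/members.conf
analyses = /path/to/analyses.conf
scheduler = lsf:production
"""
settings = load_misc_settings(example)
print(settings["scheduler"], settings["queue"])
```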
Each section corresponds to a member database (or a sequence feature database), e.g.
[profile]
signatures =
Supported properties are:
Name | Description |
---|---|
`signatures` | Path to the source of database signatures. |
`hmm` | Path to an HMM file, used for databases that employ HMMER3-based models. Required when running `ipr-hmms`. |
`fasta` | Path to sequences used by models, in the FASTA format. |
`members` | Path to file containing the clan-signature mapping. |
`go-terms` | Path to file or directory of GO annotations. PANTHER and NCBIfam only. |
`summary` | Path to file of summary information. CDD only. |
`seed` | Path to file of SEED alignments. Pfam only. |
`full` | Path to file of full alignments. Pfam only. |
`clans` | Path to file of clan information. Pfam only. |
`mapping` | Path to file of model-signature mapping. CATH-Gene3D only. |
`classes` | Path to file of information about classes. ELM only. |
`instances` | Path to file of information about instances. ELM only. |
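A misspelt property name is easy to miss in a file with many sections. The sketch below flags properties that are not listed in the table above. It assumes `members.conf` can be read with `configparser`. `check_members_config` is a hypothetical helper and the paths are placeholders.

```python
import configparser

# Property names copied from the table above
KNOWN_PROPERTIES = {
    "signatures", "hmm", "fasta", "members", "go-terms", "summary",
    "seed", "full", "clans", "mapping", "classes", "instances",
}


def check_members_config(text: str) -> dict[str, list[str]]:
    """Return, for each section, the properties missing from the table."""
    config = configparser.ConfigParser()
    config.read_string(text)
    unknown = {}
    for section in config.sections():
        extra = sorted(set(config[section]) - KNOWN_PROPERTIES)
        if extra:
            unknown[section] = extra
    return unknown


# Placeholder content: 'clan' is a deliberate typo for 'clans'
example = """
[pfam]
signatures = /path/to/signatures
seed = /path/to/seed
clan = /path/to/clans
"""
print(check_members_config(example))
```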
The `DEFAULT` section defines the default values for the following properties:

- `job_cpu`: number of processes to request when submitting a job.
- `job_mem`: the maximum amount of memory a job should be allowed to use (in MB).
- `job_size`: the number of sequences to process in each job.
- `job_timeout`: the number of hours a job is allowed to run for before being killed. Any value lower than 1 disables the timeout.
The default values can be overridden. For instance, adding the following block under the `DEFAULT` section ensures that MobiDB-Lite jobs time out after 48 hours and that PRINTS jobs are allocated 16 GB of memory:
[mobidb-lite]
job_timeout = 48
[prints]
job_mem = 16384
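The override mechanism matches the way Python's `configparser` treats a `DEFAULT` section: every other section inherits its values unless it redefines them. The sketch below shows the resolved settings for the two sections above. The values in `[DEFAULT]` are placeholders chosen for the example, not the project's real defaults, and `job_settings` is a hypothetical helper that assumes integer values.

```python
import configparser

# The [DEFAULT] values are placeholders for illustration
ANALYSES_CONF = """
[DEFAULT]
job_cpu = 1
job_mem = 4096
job_size = 10000
job_timeout = 0

[mobidb-lite]
job_timeout = 48

[prints]
job_mem = 16384
"""


def job_settings(config: configparser.ConfigParser, section: str) -> dict:
    """Resolve job settings for one section, falling back to DEFAULT."""
    options = config[section]
    timeout = options.getint("job_timeout")
    return {
        "cpu": options.getint("job_cpu"),
        "mem_mb": options.getint("job_mem"),
        "size": options.getint("job_size"),
        # Any value lower than 1 disables the timeout
        "timeout_hours": timeout if timeout >= 1 else None,
    }


config = configparser.ConfigParser()
config.read_string(ANALYSES_CONF)
for name in config.sections():
    print(name, job_settings(config, name))
```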
Update proteins and matches to the latest private UniProt release.
$ ipr-uniprot [OPTIONS] main.conf
The optional arguments are:

- `-t, --tasks`: list of tasks to run, by default all tasks are run (see Tasks for a description of available tasks)
- `--dry-run`: do not run tasks, only list those about to be run
Name | Description | Dependencies |
---|---|---|
update-uniparc | Import UniParc cross-references | |
taxonomy | Import the latest taxonomy data from UniProt | |
update-ipm-matches | Update protein matches from ISPRO | |
update-ipm-sites | Update protein site matches from ISPRO | |
update-proteins | Import the new Swiss-Prot and TrEMBL proteins, and compare with the current ones | |
delete-proteins | Delete obsolete proteins in all production tables | update-proteins |
check-proteins | Track UniParc sequences (UPI) associated to UniProt entries that need to be imported (e.g. new or updated sequence) | delete-proteins, update-uniparc |
update-matches | Update protein matches for new or updated sequences, run various checks, and track changes in protein counts for InterPro entries | update-ipm-matches, check-proteins |
update-fmatches | Update protein matches for sequence features (e.g. MobiDB-lite, Coils, etc.) | update-matches |
export-sib | Export Oracle tables required by the Swiss-Prot team | update-matches |
report-changes | Report recent integration changes to the UniRule team | update-matches |
aa-iprscan | Build the AA_IPRSCAN table, required by the Automatic Annotation team | update-matches |
xref-condensed | Build the XREF_CONDENSED table for the Automatic Annotation team (contains representations of protein matches for InterPro entries) | update-matches |
xref-summary | Build the XREF_SUMMARY table for the Automatic Annotation team (contains protein matches for integrated member database signatures) | report-changes |
export-xrefs | Export text files containing protein matches for the UniProt database team | xref-summary |
notify-interpro | Notify the InterPro team that all tables required by the Automatic Annotation team are ready, so we can take a snapshot of our database | update-fmatches, aa-iprscan, xref-condensed, xref-summary |
swissprot-de | Export Swiss-Prot descriptions associated to member database signatures in the public release of UniProt (i.e. the release we are updating *from*) | |
unirule | Update the list of signatures used by UniRule, so InterPro curators are warned if they attempt to unintegrate one of these signatures. | |
update-varsplic | Update splice variant matches | update-ipm-matches |
update-sites | Update residue annotations | update-ipm-sites, update-matches |
Pronto | Update the Pronto PostgreSQL table | taxonomy, update-fmatches, swissprot-de, unirule |
send-report | Send reports to curators, and inform them that Pronto is ready | Pronto tasks |
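The Dependencies column defines a directed acyclic graph, so a valid execution order can be derived with a topological sort. The sketch below uses the standard library's `graphlib` on a few rows copied from the table. It only illustrates how to read the table and makes no claim about how the workflow manager schedules tasks internally.

```python
from graphlib import TopologicalSorter

# task -> dependencies, copied from a subset of the rows above
DEPENDENCIES = {
    "update-uniparc": [],
    "update-ipm-matches": [],
    "update-proteins": [],
    "delete-proteins": ["update-proteins"],
    "check-proteins": ["delete-proteins", "update-uniparc"],
    "update-matches": ["update-ipm-matches", "check-proteins"],
    "update-fmatches": ["update-matches"],
    "report-changes": ["update-matches"],
    "xref-summary": ["report-changes"],
    "export-xrefs": ["xref-summary"],
}

# static_order() yields each task after all of its dependencies
order = list(TopologicalSorter(DEPENDENCIES).static_order())
for position, task in enumerate(order, start=1):
    print(position, task)
```

Several orders are valid for the same graph. The only guarantee is that a task never appears before one of its dependencies.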
Update models and protein matches for one or more member databases.
Before running the update, the following command must be run once for each member database, where `-n` is the name of the database (case-insensitive), `-d` is the release date (of the member database), and `-v` is the release version:
$ ipr-pre-memdb main.conf -n DATABASE -d YYYY-MM-DD -v VERSION
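When several member databases are updated together, the per-database commands can be generated rather than typed by hand. This is a minimal sketch: `pre_memdb_commands` is a hypothetical helper, and the release date and version below are placeholders, not real release data.

```python
import shlex
from datetime import datetime


def pre_memdb_commands(config_path: str, releases) -> list[str]:
    """Build one ipr-pre-memdb command per (name, date, version) tuple."""
    commands = []
    for name, released, version in releases:
        # strptime raises ValueError if the date is not year-month-day;
        # isoformat() then emits the zero-padded YYYY-MM-DD form.
        parsed = datetime.strptime(released, "%Y-%m-%d").date()
        commands.append(shlex.join([
            "ipr-pre-memdb", config_path,
            "-n", name, "-d", parsed.isoformat(), "-v", version,
        ]))
    return commands


# Placeholder release information, for illustration only
for command in pre_memdb_commands("main.conf", [("pfam", "2024-01-15", "1.0")]):
    print(command)
```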
Then, the actual update can be run:
$ ipr-memdb [OPTIONS] main.conf database [database ...]
The optional arguments are:

- `-t, --tasks`: list of tasks to run, by default all tasks are run (see Tasks for a description of available tasks)
- `--dry-run`: do not run tasks, only list those about to be run
Name | Description | Dependencies |
---|---|---|
update-ipm-matches | Update protein matches from ISPRO | |
load-signatures | Import member database signatures for the version to update to | |
track-changes | Compare signatures between versions (e.g. name, description, matched proteins) | load-signatures |
delete-obsoletes | Remove signatures that are not in the latest version of the member database(s) | track-changes |
update-signatures | Update metadata for existing signatures, and add new signatures | delete-obsoletes |
update-matches | Update and check matches in production tables | update-ipm-matches, update-signatures |
update-varsplic | Update splice variant matches | update-ipm-matches, update-signatures |
persist-pfam-a | Parse Pfam-A files and store relevant information (only when updating Pfam) | update-ipm-matches, update-signatures |
persist-pfam-c | Parse Pfam-C to store clan information (only when updating Pfam) | update-ipm-matches, update-signatures |
update-features | Update sequence features for non-member databases (e.g. MobiDB-lite, COILS, etc.) | update-ipm-matches |
update-fmatches | Update matches for sequence features | update-features |
update-ipm-sites | Update protein site matches from ISPRO | |
update-sites | Update residue annotations (if updating a member database with residue annotations) | update-ipm-sites, update-matches |
Pronto | Update the Pronto PostgreSQL tables | update-matches |
send-report | Send reports to curators, and inform them that Pronto is ready | Pronto tasks |
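Before restricting a run with `-t`, it can help to know everything a task depends on, directly or indirectly. The sketch below walks the Dependencies column for a few rows of the table. Whether `ipr-memdb` runs these upstream tasks automatically is not stated here, so the helper (`upstream`, a hypothetical name) is only an aid for reading the table.

```python
# task -> dependencies, copied from a subset of the rows above
DEPENDENCIES = {
    "update-ipm-matches": [],
    "load-signatures": [],
    "track-changes": ["load-signatures"],
    "delete-obsoletes": ["track-changes"],
    "update-signatures": ["delete-obsoletes"],
    "update-matches": ["update-ipm-matches", "update-signatures"],
    "update-ipm-sites": [],
    "update-sites": ["update-ipm-sites", "update-matches"],
}


def upstream(task: str, dependencies: dict[str, list[str]]) -> set[str]:
    """Return every task that must complete before `task`."""
    seen = set()
    stack = list(dependencies[task])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependencies[current])
    return seen


print(sorted(upstream("update-sites", DEPENDENCIES)))
```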
$ ipr-pronto [OPTIONS] main.conf
The optional arguments are:

- `-t, --tasks`: list of tasks to run, by default all tasks are run (see Tasks for a description of available tasks)
- `--dry-run`: do not run tasks, only list those about to be run
Name | Description | Dependencies |
---|---|---|
go-terms | Import publications associated to protein annotations | |
go-constraints | Import GO taxonomic constraints | |
proteins-similarities | Import UniProt general annotations (comments) on sequence similarities | |
proteins-names | Import UniProt sequence names | |
databases | Import database information (e.g. version, release date) | |
proteins | Import general information on proteins (e.g. accession, length, species) | |
init-matches | Create the match table (empty) | |
export-matches | Export protein matches for member database signatures | init-matches |
insert-matches | Insert protein matches for member database signatures | export-matches |
insert-fmatches | Insert protein matches for sequence features (AntiFam, etc.) | init-matches |
index-matches | Index and cluster the match table | insert-matches, insert-fmatches |
insert-signature2proteins | Associate member database signatures with UniProt proteins, UniProt descriptions, taxonomic origins, and GO terms | export-matches, proteins-names |
index-signature2proteins | Index the signature2proteins table | insert-signature2proteins |
signatures | Import and compare member database signatures | databases, export-matches |
taxonomy | Import UniProt taxonomy | |
structures | Import structural matches | |
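Tasks whose dependencies are all satisfied do not depend on each other, so the table can also be read as a series of batches. The sketch below groups a few of the rows above into such batches with `graphlib`. It illustrates the dependency structure only and makes no claim about how many tasks the workflow actually runs at once.

```python
from graphlib import TopologicalSorter

# task -> dependencies, copied from a subset of the rows above
DEPENDENCIES = {
    "proteins-names": [],
    "databases": [],
    "init-matches": [],
    "export-matches": ["init-matches"],
    "insert-matches": ["export-matches"],
    "insert-fmatches": ["init-matches"],
    "index-matches": ["insert-matches", "insert-fmatches"],
    "insert-signature2proteins": ["export-matches", "proteins-names"],
    "index-signature2proteins": ["insert-signature2proteins"],
    "signatures": ["databases", "export-matches"],
}


def batches(dependencies: dict[str, list[str]]) -> list[list[str]]:
    """Group tasks into successive batches of mutually independent tasks."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    result = []
    while sorter.is_active():
        # get_ready() returns the tasks whose dependencies are all done
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for number, batch in enumerate(batches(DEPENDENCIES), start=1):
    print(number, ", ".join(batch))
```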
$ ipr-calc main.conf [COMMAND] [OPTIONS]
The available commands (and their optional arguments) are:

- `import`: import sequences from the UniParc Oracle database
  - `--top-up`: import new sequences only
- `clean`: delete obsolete data
  - `-a, --analyses`: IDs of analyses to clean (default: all)
- `search`: scan sequences using InterProScan
  - `--dry-run`: show the number of jobs to run and exit
  - `-l, --list`: list active analyses and exit
  - `-a, --analyses`: IDs of analyses to run (default: all)
  - `-t, --threads`: number of monitoring threads (default: 8)
  - `--concurrent-jobs`: maximum number of concurrently running InterProScan jobs (default: 1000)
  - `--max-jobs`: maximum number of jobs to run per analysis before exiting (default: disabled)
  - `--max-retries`: number of times a failed job is resubmitted (default: disabled)
  - `--keep none|all|failed`: keep input/output files (default: none)
Import new UniParc sequences:
ipr-calc main.conf import --top-up
Process jobs for analysis `42` only, allow each job to run three times (i.e. restart twice), but keep all temporary files, regardless of the job success/failure:
ipr-calc main.conf search -a 42 --max-retries 2 --keep all
Run 10 jobs per analysis, and keep failed jobs to investigate:
ipr-calc main.conf search --max-jobs 10 --keep failed
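For scripted runs, the `search` command line can be assembled from the options documented above. This is a sketch only: `search_command` is a hypothetical helper, it accepts a single analysis ID because the syntax for passing several IDs is not shown here, and it only builds the command string without running it.

```python
import shlex
from typing import Optional

# Values accepted by --keep, as documented above
KEEP_CHOICES = ("none", "all", "failed")


def search_command(
    config_path: str,
    analysis: Optional[int] = None,
    max_jobs: Optional[int] = None,
    max_retries: Optional[int] = None,
    keep: str = "none",
) -> str:
    """Build an 'ipr-calc ... search' command line from documented flags."""
    if keep not in KEEP_CHOICES:
        raise ValueError(f"keep must be one of {KEEP_CHOICES}")
    args = ["ipr-calc", config_path, "search"]
    if analysis is not None:
        args += ["-a", str(analysis)]
    if max_jobs is not None:
        args += ["--max-jobs", str(max_jobs)]
    if max_retries is not None:
        # A job may run up to max_retries + 1 times in total
        args += ["--max-retries", str(max_retries)]
    args += ["--keep", keep]
    return shlex.join(args)


print(search_command("main.conf", analysis=42, max_retries=2, keep="all"))
```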
Update clans and run profile-profile alignments.
$ ipr-clans [OPTIONS] main.conf database [database ...]
The optional arguments are:

- `-t, --threads`: number of alignment workers
- `-T, --tempdir`: directory to use for temporary files
Load HMMs in the database.
$ ipr-hmms main.conf database [database ...]