Multi-task multilabel deep neural networks for identification and classification of integrons.
INTNet is designed to identify and classify integron integrases, predict their bacterial hosts, and associate antibiotic resistance genes (ARGs, multi-label) with integrons. It supports the following input types:
- Long Amino Acid Sequences (Full Length/Contigs)
- Long Nucleotide Sequences
- Short Amino Acid Reads (30-50 aa)
- Short Nucleotide Reads (100-150 nt)
All inputs should be in FASTA format.
INTNet Components
INTNet comprises two specialized models to accommodate different read lengths:
- INTNet-s: Optimized for short reads, enhancing prediction accuracy for sequences ranging from 30 to 50 amino acids or 100 to 150 nucleotides.
- INTNet-l: Tailored for long sequences, ensuring robust predictions for full-length contigs or long nucleotide sequences.
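When it is unclear which model fits a given FASTA file, the sequence lengths can be checked first. The following is a minimal sketch, not part of INTNet: the helper names `read_fasta_lengths` and `suggest_model` are hypothetical, and treating the upper ends of the short-read ranges above (50 aa, 150 nt) as a hard cut-off on the median length is an assumption.

```python
from statistics import median


def read_fasta_lengths(path):
    """Return a list of sequence lengths, one per FASTA record."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def suggest_model(lengths, seq_type):
    """Suggest 'intnet-s' or 'intnet-l' from the median sequence length.

    Assumption: sequences up to 50 aa or 150 nt count as short reads.
    """
    if not lengths:
        raise ValueError("no sequences found")
    short_max = {"aa": 50, "nt": 150}[seq_type]
    return "intnet-s" if median(lengths) <= short_max else "intnet-l"
```

For example, `suggest_model(read_fasta_lengths("reads.fasta"), "aa")` returns the value to pass to `--model`; files that mix short and long sequences are better split before prediction.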
Clone the program to your local machine:
git clone https://github.com/patience111/INTNet
1. Setting up environment
1.1 Installation with conda
1.1.1 For CPU inference, install the program from the conda YAML file in the installation directory with the following commands:
cd ./installation
conda env create -f INTNet-CPU.yml -n INTNet-cpu
conda activate INTNet-cpu
(Tested on Ubuntu 16.04 and 20.04, Windows 10, and macOS 14.1.1.)
1.1.2 For GPU inference, install the program from the conda YAML file in the installation directory with the following commands:
cd ./installation
conda env create -f INTNet-GPU.yml -n INTNet-gpu
conda activate INTNet-gpu
(Tested on Ubuntu 16.04 with CUDA 10.1 and driver version 430.64.)
1.2 Alternatively, if you prefer to install the dependencies manually, the following information may be useful:
The program was tested with the package versions below; you can install exactly these versions or other compatible ones.
Biopython: 1.79
tensorflow: 2.2.0
cuda: 10.2 (for GPU use)
cudnn: 7.6.5.32 (for GPU use)
numpy: 1.18.5
scikit-learn: 0.24.1
tqdm: 4.56.0
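After a manual installation, the installed versions can be compared with the tested ones. This is a sketch under assumptions: the distribution names in `TESTED` are the usual PyPI names and may differ in some conda setups, `compare_versions` is a hypothetical helper, and cuda and cudnn are left out because they are not Python packages.

```python
try:
    from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
except ImportError:  # older interpreters need the importlib_metadata backport
    from importlib_metadata import version, PackageNotFoundError

# Versions listed above; the keys are assumed distribution names.
TESTED = {
    "biopython": "1.79",
    "tensorflow": "2.2.0",
    "numpy": "1.18.5",
    "scikit-learn": "0.24.1",
    "tqdm": "4.56.0",
}


def compare_versions(tested=TESTED):
    """Map each package name to a (tested, installed) pair.

    The installed entry is None when the package is not found.
    """
    report = {}
    for name, wanted in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions().items():
        print(f"{name}: tested {wanted}, installed {installed or 'not found'}")
```

A mismatch is not necessarily a problem, since other compatible versions may work, but it is a useful first thing to check when an import fails.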
2. Getting trained models
cd ./models
bash get-models.sh
For long sequences:
python intnet.py --input input_path_data --type aa/nt --model intnet-l --outname output_file_name
For short reads:
python intnet.py --input input_path_data --type aa/nt --model intnet-s --outname output_file_name
general options:
--input/-i the test file as input
--type/-t molecular type of your test data (aa for amino acid, nt for nucleotide)
--model/-m the model you assign to make the prediction (intnet-l for long sequences, intnet-s for short reads)
--outname/-on the output file name
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        the test data as input
  -t {aa,nt}, --type {aa,nt}
                        molecular type of your input file
  -m {intnet-s,intnet-l}, --model {intnet-s,intnet-l}
                        the model to make the prediction
  -on OUTNAME, --outname OUTNAME
                        the name of results output
For example, to predict short amino acid reads with the INTNet-s model, run the following command (assuming you are in the INTNet directory):
python3 ./scripts/intnet.py --input ../tests/test2091_50aa.fasta --type aa --model intnet-s --outname intnet_50aa_test_gpu.txt
The output is saved in the results folder and contains the following columns:
The first column, test_id, is the label of the input sequence.
The second column, inti_type, is the prediction for the input sequence: "integron" or "non-integron".
The third column, pre_prob, is the model's confidence in the integron prediction.
The fourth column, bacterial_host, is the predicted bacterial host, reported if the sequence is first predicted to be an integron.
The fifth column, pre_prob, is the confidence of the bacterial host prediction, again reported if the sequence is first predicted to be an integron.
The last column, resistance_category, is the multi-label prediction of the ARGs associated with the input sequence.
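The columns described above can be read back for downstream filtering. This sketch assumes the results file is tab-delimited with one header row and exactly six columns in the order listed; the README does not state the delimiter, so check your output file first. Columns are read by position because the name pre_prob appears twice; the field names `integron_prob` and `host_prob`, and the helpers themselves, are hypothetical.

```python
import csv
import io
from collections import namedtuple

# Positional field names; the two pre_prob columns are renamed to tell them apart.
Prediction = namedtuple(
    "Prediction",
    ["test_id", "inti_type", "integron_prob",
     "bacterial_host", "host_prob", "resistance_category"],
)


def read_predictions(handle, delimiter="\t"):
    """Read an INTNet results table from an open text handle.

    Assumes a header row followed by six-column rows; other rows are skipped.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row
    return [Prediction(*fields) for fields in reader if len(fields) == 6]


def integron_hits(predictions, min_prob=0.5):
    """Keep rows predicted as integron with confidence of at least min_prob."""
    return [
        p for p in predictions
        if p.inti_type == "integron" and float(p.integron_prob) >= min_prob
    ]
```

The `min_prob` threshold of 0.5 is an arbitrary default for illustration, not a value recommended by the authors. An in-memory table can be passed through `io.StringIO` for quick checks.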
If you'd like to contribute to INTNet, check out https://github.com/patience111/INTNet.
We hope you enjoy your INTNet journey. If you run into any problems, please contact [email protected]