diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100755 index 0000000..51d7d37 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,339 @@ +# Changelog + +# mpwt 0.7.0 (2022-02-03) + +This version should be compatible with Pathway Tools 25.5. + +The behaviour of mpwt has been changed in this version. Before 0.7.0, mpwt waited for all the PathoLogic processes to end before going to the next step. But if there was an error with one process, it stopped and no outputs were created, even for successful processes. +Now, mpwt runs the processes independently. For example, if you have 3 organisms (Org_A, Org_B and Org_C) and the process for Org_A fails but the processes for Org_B and Org_C succeed, then you will have an output folder containing results for Org_B and Org_C (whereas in older versions no results were created). + +## Add: + +- mpwt can take as input PathoLogic files without fasta. It will output a warning message but it should be usable. This should make mpwt compatible with PathoLogic files created by EsMeCaTa. +- new option --permission/permission to choose the permission level for PGDB in ptools-local or output folder (useful when working with a container in a cluster). +- changelog file. +- citation in readme. + +## Fix: + +- issue with missing tmp path in run_move_pgdb. +- issue with topf (especially when using a GFF file some fields like dbxref were not correctly used). +- issue that prevents mpwt from killing Pathway Tools process that enters the Lisp Listener. +- issue with pathway prediction score not being correctly returned to its previous value. +- String not Boolean and path incorrect in Readme (issue #70). +- Issue with Pathway Tools 25.5 when checking log (issue #71). The way of parsing the 'PGDB contains' line has been modified to be more robust. The parsing with str.split() has been replaced with a parsing with regex. +- numerous typos.
+ +## Modify: + +- refactor mpwt code to run each process independently from the others (issue #68). Creation of the functions run_mpwt (to launch a complete run of mpwt on one organism) and independent_mpwt (to launch run_mpwt on each input). +- refactor log creation. Instead of a function taking as input a list of dictionaries (check_pwt), use 2 functions (check_mpwt_pathologic_runs and extract_pathologic). The first one takes as input a list of paths to the input folders containing a pathologic.log file and an output folder. Then it uses the second function (extract_pathologic) for each pathologic file to extract the information. +- replace docopt with argparse. +- update readme. +- year in license code. +- update mpwt workflow svg according to the changes in mpwt code. Put a white background (useful when using a GitHub dark theme). + +# mpwt 0.6.3 (2021-04-02) + +This version should be compatible with Pathway Tools 25.0. + +## Fix: + +- Issue with Pathway Tools 25.0 when checking log (issue #67). + +# mpwt 0.6.2 (2021-04-01) + +## Fix: + +- Issue with encoding of flat_files_creation.log (issue #66). + +## Modify: + +- Use setup.cfg for version. + +# mpwt 0.6.1 (2021-03-16) + +## Modify: + +- replace GPL license with LGPL license. + +# mpwt 0.6.0 (2020-12-14) + +**Warning**: +In this version, some option names have been changed: +- "--dat" ("dat_creation" in python code) has been replaced by "--flat" ("flat_creation" in python code) to better reflect the attribute-values flat files created by this option. + +## Add: + +- new options to extract other files to the output folder (issue #63): + - "--mx" ("xml_extraction") to extract XML files created by MetaFlux. + - "--mo" ("owl_extraction") to extract owl files. + - "--mc" ("col_extraction") to extract tabulated files (".col" extension). + +## Fix: + +- issue with "-r" (size_reduction) option: all files were extracted instead of only requested output files.
+ +## Modify: + +- replace "--dat" ("dat_creation") with "--flat" ("flat_creation") to better reflect data created by this command. +- use PGDB ID as output name when using mpwt without input_folder (like "mpwt -o output_folder"). +- update readme. + +# mpwt 0.5.9 (2020-11-20) + +## Add: + +- \_\_version\_\_ in mpwt init file to handle version (issue #59). + +## Fix: + +- error when creating dat files without input folder (issue #61). + +## Modify: + +- refactor how paths are handled (issue #60). This is a first step towards compatibility of mpwt with Windows, but this will need more tests and modifications. + +# mpwt 0.5.8 (2020-10-12) + +mpwt now has a new dependency on **chardet** (linked to issue #56). + +## Fix: + +- issue with encoding when writing the terminal logs (issue #56). +- issue with Pathway Tools pop-up showing after attribute-value flat files creation (issue #57). + +# mpwt 0.5.7 (2020-10-01) + +## Add: + +- an option (--tp/patho_transporter_inference) to use Transport Inference Parser with PathoLogic (issue #53). +- a badge showing last Pathway Tools release compatible with mpwt. +- a picture trying to show how mpwt works. +- clean option when using topf. +- GitHub Actions to release mpwt on PyPI with GitHub Release. + +## Fix: + +- no join and close of Pool map if an error occurred (issue #52). +- issue where an error occurring after PGDB creation was not taken into account (issue #54). + +## Modify: + +- replace Pool map with starmap (issue #51). + +# mpwt 0.5.6 (2020-07-28) + +## Add: + +- a check for unfinished builds. If during a previous run a build failed, mpwt will delete it and relaunch it (issue #49). +- a badge with a link to the preprint. + +## Fix: + +- numerous issues in topf (issue #30). +- numerous issues with taxon_id file when using circular, codon_table or element_type. +- numerous issues with raised error messages not showing. +- typos. + +# mpwt 0.5.5 (2020-05-27) + +## Add: + +- support for .fsa extension (issue #47).
+- a license (issue #45). +- a --version argument to mpwt command-line (issue #44). +- pseudogene linked to mRNA in topf (issue #30). + +## Fix: + +- issue with encoding in mpwt_wrapper. +- issue when --md could be used without -o argument (issue #46). + +Thanks to @cfrioux for her work on this release. + +# mpwt 0.5.4 (2020-04-10) + +## Add: + +- support for .gbff extension for genbank files (issue #39). +- an option to use pathway prediction score of Pathway Tools (issue #42). +- a logger for mpwt (look at [For developer](https://github.com/AuReMe/mpwt#for-developer) for more information). +- a check for ptools-init.dat file. + +## Fix: + +- issue when checking the input files (issue #37). +- issue when checking the existence of taxon_id.tsv (issue #38). +- issue if there are multiple files in the same input folder (issue #40). +- mpwt killing Pathway Tools process to create attribute-values files even if there is only a non-fatal error (issue #41). +- numerous issues with to_pf (issue #30). Add DBLINK and a check for gene name length. Rewrite how CDS and RNAs are handled. But this option needs more testing and modification in 0.5.5. +- typos in Readme. + +## Modify: + +- pwt_wrapper: refactor the code and add a new log (pwt_terminal.log) containing Pathway Tools log from the terminal. + +## Check: + +- no_download_articles/--nc option is currently not working. + +# mpwt 0.5.3 (2020-01-09) + +## Add: + +- an option to create taxon_id.tsv files from GenBank and GFF (issue #35). +- an option to use operon predictor of Pathway Tools (issue #33). +- an option to avoid downloading PubMed entries using a parameter in ptools-init.dat (added in Pathway Tools 23.5). But it needs more testing and maybe modification in 0.5.4 (issue #34). +- a test for to_pf using GenBank and GFF from test data (issue #30). +- errors and warnings counts in log files. + +## Fix: + +- numerous issues with to_pf (issue #30). But this option needs more testing and modification in 0.5.4.
+- incompatibility with Pathway Tools 23.5. +- typos in Readme. + +# mpwt 0.5.2 (2019-10-17) + +## Add: + +- to_pf argument, allowing conversion of GBK or GFF into PF files (issue #30). Needs more polishing in 0.5.3. +- preprint citation in Readme. + +## Fix: + +- error message when a GFF has no region feature (issue #31). + +# mpwt 0.5.1 (2019-07-31) + +## Add: + +- taxon_id.tsv can be used to give other information (element type, codon table, circularity of genome) to Pathway Tools (issue #28). + +## Fix: + +- test not working on Pathway Tools 23.0. +- error message if the user does not give an integer for the number of CPUs. + +## Modification: + +- verbose and logging interactions. + +# mpwt 0.5.0 (2019-07-02) + +**Warning**: +In this version, the behaviours of some arguments have been changed. +- the -r/size_reduction argument will delete PGDB inside ptools-local and compress the results into a zip (issue #26). +- the --clean argument (when used with -f/input_folder argument) will delete only PGDBs corresponding to species in the input_folder (issue #23). + +## Add: + +- support for Pathologic Format (PF) file (issue #19). +- taxon_id.tsv file to manage taxon_id for species. With the --taxon-id/taxon_file argument, you can force mpwt to use taxon from taxon_id.tsv for all types of files (Genbank, GFF or PF). +- a new argument --ignore-error/ignore_error to ignore PathoLogic failed builds (issue #21). With this argument, mpwt will continue to run even if PathoLogic has crashed for some species. It will perform the rest of the workflow for the successful builds. +- time measurement of each step of mpwt. They will be printed if you use -v/verbose or they can be accessed at the end of the log_error.txt file (created with --log). + +## Fix: + +- interaction between --clean and the other arguments. +- error if -r/size_reduction is used without output_folder argument. +- a typo in the name of a function (remove_pgdbs).
+ + +## Modification: + +- split multipwt.py in multiple modules to ease reading (issue #25). +- --clean behaviour when used with -f/input_folder (issue #23) to delete only input species PGDB. +- -r/size_reduction behaviour: it will delete the PGDB from ptools-local and it will move the results into a compressed zip file. +- add table and list in Readme. + +# mpwt 0.4.2.4 (2019-06-07) + +## Fix: + +- wrong argument number given in the fix of the error with --clean when used with -v and --cpu (issue #20). + +# mpwt 0.4.2.3 (2019-06-07) + +## Fix: + +- error with --clean when used with -v and --cpu (issue #20). +- error if PathoLogic build is aborted (issue #22). +- typos. + +# mpwt 0.4.2.2 (2019-04-18) + +## Fix: + +- issue with BioPAX/dat creation (issue #17). +- issue with error report with BioPAX/dat creation. + +## Modification: + +- use by default 1 CPU, instead of using all CPUs available. + +# mpwt 0.4.2.1 (2019-03-28) + +## Fix: + +- error with outdated argument in cleaning_input. +- several typos. + +# mpwt 0.4.2 (2019-03-21) + +## Add: + +- logging instead of print (issue #12 ). +- mpwt can create BioPAX/attribute-values files when PGDBs are already in ptools-local but not in output folder (issue #13 ). In the same run, mpwt can launch PathoLogic on a set of species having no PGDBs and then launch BioPAX/attribute-values files creation on this set of species and on other species with PGDBs but with no results in the output folder. +- fatal errors from pathologic.log are now printed (issue #14 ). + +## Fix: + +- number_cpu argument not used by cleaning function. + +# mpwt 0.4.1 (2019-03-18) + +## Add: + +- option to use Pathway-Tools Hole-Filler (issue #9 ). +- error message if Pathway-Tools is not installed or not in PATH (issue #10 ). +- fasta requirement for GFF file (issue #7 ). + +## Fix: + +- mpwt could take as input hidden files/folders (issue #8 ). Add test to check this case. 
+- issue with the name of the species output folder when giving an output folder but not an input folder. + +## Modification: + +- remove global variables. +- version of dependencies. + +# mpwt 0.4.0 (2019-02-25) + +First release on GitHub. + +**Warning**: +In this version mpwt arguments have been changed to make the tool more flexible. Now to launch PathoLogic inference, you need the --patho argument. If you want to create BioPAX/attribute-values files, you need to use --dat; to move only the attribute-values dat files, you must use the --md option. More information in the readme. + +## Add: + +- multiprocessing when creating input files, moving result files and deleting PGDBs (issue #1 ). + +- better Pathway-Tools error handling (issue #2 ). + +- more flexible tool (issue #3 ). Change how the arguments work to make the tool more flexible. + +- accept GFF file (issue #5 ). Only partial support. + +- an argument to show the PGDBs in ptools-local (--list). + + +## Fix: + +- issue with pop-up (issue #4 ). To avoid all the pop-ups, you need to use at least version 22.5 of Pathway-Tools. + +- when there are multiple db_xref qualifiers in the 'source' feature of the Genbank file, take the correct one. + +- mpwt stops when there are PGDBs in ptools-local folder. diff --git a/README.rst b/README.rst index 94f7020..8fe76ce 100755 --- a/README.rst +++ b/README.rst @@ -1,10 +1,10 @@ .. image:: https://img.shields.io/pypi/v/mpwt.svg - :target: https://pypi.python.org/pypi/mpwt + :target: https://pypi.python.org/pypi/mpwt .. image:: https://img.shields.io/badge/doi-10.7554/eLife.61968-blueviolet.svg - :target: https://doi.org/10.7554/eLife.61968 + :target: https://doi.org/10.7554/eLife.61968 -.. image:: https://img.shields.io/badge/Pathway%20Tools-25.0-brightgreen +.. 
image:: https://img.shields.io/badge/Pathway%20Tools-25.5-brightgreen :target: https://bioinformatics.ai.sri.com/ptools/release-notes.html mpwt: Multiprocessing Pathway Tools @@ -19,7 +19,7 @@ mpwt: Pipeline summary The following picture shows the main argument of mpwt: -.. image:: mpwt_pipeline.svg +.. image:: mpwt_runs.svg .. contents:: Table of contents :backlinks: top @@ -32,7 +32,7 @@ Requirements ~~~~~~~~~~~~ mpwt needs at least **Python 3.6**. -mpwt requires three python depedencies (`biopython `__, `docopt `__ and `gffutils `__) and **Pathway Tools**. For the multiprocessing, mpwt uses the `multiprocessing library of Python 3 `__. +mpwt requires three python dependencies (`biopython `__, `chardet `__ and `gffutils `__) and **Pathway Tools**. For the multiprocessing, mpwt uses the `multiprocessing library of Python 3 `__. You must have an environment where Pathway Tools is installed. Pathway Tools can be obtained `here `__. @@ -40,13 +40,14 @@ Pathway Tools needs **Blast**, so it must be install on your system. Depending o /!\\ For all OS, Pathway-Tools must be in ``$PATH``. -On Linux and MacOS: ``export PATH=$PATH:your/install/directory/pathway-tools``. +On Linux and MacOS: ``export PATH=$PATH:/your/install/directory/pathway-tools``. -Consider adding Pathway Tools in ``$PATH`` permanently by running: +Consider adding Pathway Tools to ``$PATH`` permanently by using the following command and then sourcing ``~/.bashrc``: .. code:: sh - echo 'export PATH="$PATH:your/install/directory/pathway-tools:"' >> ~/.bashrc + echo 'export PATH="$PATH:/your/install/directory/pathway-tools:"' >> ~/.bashrc + source ~/.bashrc If your OS doesn't support Pathway Tools, you can use a docker container. If it's your case, look at `Pathway Tools Multiprocessing Docker `__. It is a dockerfile that will create a container with Pathway Tools, its dependencies and this package. You just need to give a Pathway Tools installer as input. @@ -59,7 +60,7 @@ Using pip .. 
code:: sh - pip install mpwt + pip install mpwt Use --- @@ -84,10 +85,10 @@ The script takes a folder containing sub-folders as input. Each sub-folder conta │ └── scaffold_1.fasta │ └── scaffold_2.pf │ └── scaffold_2.fsa - taxon_id.tsv + ├── taxon_id.tsv .. -Input files must have the same name as the folder in which they are located and also finished with a .gbk/.gbff or a .gff. +Input files must have the same name as the folder in which they are located and must end with .gbk/.gbff or .gff (the name must not be entirely uppercase, otherwise this can cause issues with Pathway Tools such as this one: ``Error: Cannot use the organism identifier ORGID as a genetic element ID.``). For PF files, there is one file for each scaffold/contig and one corresponding fasta file. @@ -178,7 +179,7 @@ PathoLogic Format │ └── scaffold_1.fasta │ └── scaffold_2.pf │ └── scaffold_2.fsa - taxon_id.tsv + ├── taxon_id.tsv .. PF file example: @@ -202,7 +203,7 @@ PF file example: Look at the `Pathologic format `__ for more informations. -You have to provide one nucleotide sequence (either '.fasta' or '.fsa' extension) for each pathologic containing one scaffold/contig. +You have to provide one nucleotide sequence (either '.fasta' or '.fsa' extension) for each pathologic containing one scaffold/contig. This is optional since mpwt 0.7.0. .. code-block:: text @@ -313,7 +314,7 @@ mpwt can be used with the command lines: .. 
code:: sh - mpwt -f=FOLDER [-o=FOLDER] [--patho] [--hf] [--op] [--tp] [--nc] [--flat] [--md] [--mx] [--mo] [--mc] [-p=FLOAT] [--cpu=INT] [-r] [-v] [--clean] [--log=FOLDER] [--ignore-error] [--taxon-file] + mpwt -f=FOLDER [-o=FOLDER] [--patho] [--hf] [--op] [--tp] [--nc] [--flat] [--md] [--mx] [--mo] [--mc] [-p=FLOAT] [--cpu=INT] [-r] [-v] [--clean] [--log=FOLDER] [--taxon-file] mpwt --flat [-f=FOLDER] [-o=FOLDER] [--md] [--mx] [--mo] [--mc] [--cpu=INT] [-v] mpwt -o=FOLDER [--md] [--mx] [--mo] [--mc] [--cpu=INT] [-v] mpwt --clean [--cpu=INT] [-v] @@ -334,24 +335,24 @@ mpwt can be used in a python script with an import: folder_output = "path/to/folder/output" mpwt.multiprocess_pwt(input_folder=folder_input, - output_folder=folder_output, - patho_inference=optional_boolean, - patho_hole_filler=optional_boolean, - patho_operon_predictor=optional_boolean, - patho_transporter_inference=patho_transporter_inference, - no_download_articles=optional_boolean, - flat_creation=optional_boolean, - dat_extraction=optional_boolean, - xml_extraction=optional_boolean, - owl_extraction=optional_boolean, - col_extraction=optional_boolean, - size_reduction=optional_boolean, - number_cpu=int, - patho_log=optional_folder_pathname, - ignore_error=optional_boolean, - pathway_score=pathway_score, - taxon_file=optional_boolean, - verbose=optional_boolean) + output_folder=folder_output, + patho_inference=optional_boolean, + patho_hole_filler=optional_boolean, + patho_operon_predictor=optional_boolean, + patho_transporter_inference=patho_transporter_inference, + no_download_articles=optional_boolean, + flat_creation=optional_boolean, + dat_extraction=optional_boolean, + xml_extraction=optional_boolean, + owl_extraction=optional_boolean, + col_extraction=optional_boolean, + size_reduction=optional_boolean, + number_cpu=int, + patho_log=optional_folder_pathname, + pathway_score=pathway_score, + taxon_file=optional_str, + verbose=optional_boolean, + permission=optional_str) 
+-------------------------+------------------------------------------------+-------------------------------------------------------------------------+ | Command line argument | Python argument | description | @@ -392,9 +393,9 @@ mpwt can be used in a python script with an import: +-------------------------+------------------------------------------------+-------------------------------------------------------------------------+ | --clean | mpwt.cleaning() | Delete all PGDBs in ptools-local folder or only PGDB from input folder | +-------------------------+------------------------------------------------+-------------------------------------------------------------------------+ -| --ignore-error | ignore_error(boolean) | Ignore errors and continue the workflow for successful build | +| --taxon-file | taxon_file(string: file pathname) | Force mpwt to use the taxon ID in the taxon_id.tsv file | +-------------------------+------------------------------------------------+-------------------------------------------------------------------------+ -| --taxon-file | taxon_file(boolean) | Force mpwt to use the taxon ID in the taxon_id.tsv file | +| --permission | permission(string: 'all', 'group') | Choose permission access to PGDB in ptools-local and output files | +-------------------------+------------------------------------------------+-------------------------------------------------------------------------+ | -v | verbose(boolean) | Print some information about the processing of mpwt | +-------------------------+------------------------------------------------+-------------------------------------------------------------------------+ @@ -412,7 +413,8 @@ There is also another argument: This argument reads the input data inside the input folder. Then it converts Genbank and GFF files into PathoLogic Format files. And if there is already PathoLogic files it copies them. -It can be used to avoid issues with parsing Genbank and GFF files. 
But it is an early Work in Progress. +It can be used to avoid issues with parsing Genbank and GFF files. But it is an early work in progress: at the moment, the PathoLogic files created do not produce the same PGDB as the corresponding GenBank/GFF files. +In particular, some genes are missing in the PGDB. PathoLogic Hole Filler ++++++++++++++++++++++ @@ -653,13 +655,16 @@ Useful functions no_download_articles=optional_boolean, flat_creation=optional_boolean, dat_extraction=optional_boolean, + xml_extraction=optional_boolean, + owl_extraction=optional_boolean, + col_extraction=optional_boolean, size_reduction=optional_boolean, number_cpu=int, patho_log=optional_folder_pathname, - ignore_error=optional_boolean, pathway_score=pathway_score, - taxon_file=optional_boolean, - verbose=optional_boolean) + taxon_file=optional_str, + verbose=optional_boolean, + permission=optional_str) - Delete all the previous PGDB and the metadata files @@ -779,9 +784,6 @@ Also, if there is an error in pathologic.log, it will be shown after **=== Error There is a `Pathway Tools forum `__ where you can find informations on Pathway Tools errors. -You can also ignore PathoLogic errors by using the argument ``--ignore-error/ignore_error``. -This option will ignore error and continue the mpwt workflow on the successful PathoLogic build. - Output ~~~~~~ @@ -974,7 +976,25 @@ Bibliography Citation -------- -Belcour* A, Frioux* C, Aite M, Bretaudeau A, Hildebrand F, Siegel A. Metage2Metabo, microbiota-scale metabolic complementarity for the identification of key species. eLife 2020;9:e61968 `https://doi.org/10.7554/eLife.61968 `__. +Belcour* A, Frioux* C, Aite M, Bretaudeau A, Hildebrand F, Siegel A. Metage2Metabo, microbiota-scale metabolic complementarity for the identification of key species. eLife 2020, 9, e61968 `https://doi.org/10.7554/eLife.61968 `__.
+ +``mpwt`` depends on the following tools: + +- ``Pathway Tools`` for the reconstruction of draft metabolic networks (the article may not be up to date; see the `Publications `__ on the BioCyc site): + +Karp P D, Midford P E, Billington R, Kothari A, Krummenacker M, Latendresse M, Ong W K, Subhraveti P, Caspi R, Fulcher C, Keseler I M, Paley SM. Pathway Tools version 23.0 update: software for pathway/genome informatics and systems biology. Briefings in Bioinformatics 2021, 22, 109–126 `https://doi.org/10.1093/bib/bbz104 `__. + +- ``Biopython`` for GenBank parsing: + +Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., de Hoon, M.J.L. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25, 1422–1423 `https://doi.org/10.1093/bioinformatics/btp163 `__. + +- ``gffutils`` for GFF parsing: + +GitHub repository: `https://github.com/daler/gffutils `__ + +- ``chardet`` for character encoding detection: + +GitHub repository: `https://github.com/chardet/chardet `__ Acknowledgements ---------------- diff --git a/mpwt/__init__.py b/mpwt/__init__.py index 37fe92d..c91a681 100755 --- a/mpwt/__init__.py +++ b/mpwt/__init__.py @@ -1,4 +1,4 @@ -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -17,4 +17,4 @@ from mpwt.utils import cleaning, cleaning_input, find_ptools_path, list_pgdb, pubmed_citations, remove_pgdbs from mpwt.to_pathologic import create_pathologic_file -__version__='0.6.3' \ No newline at end of file +__version__='0.7.0' \ No newline at end of file diff --git a/mpwt/__main__.py b/mpwt/__main__.py index b14b884..f79535e 100755 --- 
a/mpwt/__main__.py +++ b/mpwt/__main__.py @@ -1,7 +1,7 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -21,7 +21,7 @@ The script takes a folder name as argument. usage: - mpwt -f=FOLDER [-o=FOLDER] [--patho] [--hf] [--op] [--tp] [--nc] [--flat] [--md] [--mx] [--mo] [--mc] [-p=FLOAT] [--cpu=INT] [-r] [-v] [--clean] [--log=FOLDER] [--ignore-error] [--taxon-file] + mpwt -f=FOLDER [-o=FOLDER] [--patho] [--hf] [--op] [--tp] [--nc] [--flat] [--md] [--mx] [--mo] [--mc] [-p=FLOAT] [--cpu=INT] [-r] [-v] [--clean] [--log=FOLDER] [--taxon-file] mpwt --flat [-f=FOLDER] [-o=FOLDER] [--md] [--mx] [--mo] [--mc] [--cpu=INT] [-v] mpwt -o=FOLDER [--md] [--mx] [--mo] [--mc] [--cpu=INT] [-v] mpwt --clean [--cpu=INT] [-v] @@ -51,15 +51,15 @@ --cpu=INT Number of cpu to use for the multiprocessing (default=1). [default: 1] --log=FOLDER Create PathoLogic log files inside the given folder (use it with --patho). --list List all PGDBs inside the ptools-local folder. - --ignore-error Ignore errors (PathoLogic and flat-files creation) and continue for successful builds. --taxon-file For the use of the taxon_id.tsv file to find the taxon ID. + --permission Choose permission access to PGDB in ptools-local and output files, either 'all' or 'group' (by default it is user). -v Verbose. --version Version topf Will convert Genbank and/or GFF files into PathoLogic Format file. 
""" -import docopt +import argparse import logging import os import sys @@ -67,14 +67,13 @@ from mpwt import utils, to_pathologic from mpwt.mpwt_workflow import multiprocess_pwt -from multiprocessing import Pool logging.basicConfig(format='%(message)s', level=logging.CRITICAL) logger = logging.getLogger(__name__) -logging.getLogger("mpwt").setLevel(logging.CRITICAL) +logging.getLogger('mpwt').setLevel(logging.CRITICAL) -VERSION = pkg_resources.get_distribution("mpwt").version -LICENSE = """Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss\n +VERSION = pkg_resources.get_distribution('mpwt').version +LICENSE = """Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss\n This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or @@ -93,39 +92,230 @@ def run_mpwt(): """ Function used with a mpwt call in the terminal. """ - args = docopt.docopt(__doc__) - - input_folder = args['-f'] - output_folder = args['-o'] - patho_inference = args['--patho'] - patho_hole_filler = args['--hf'] - patho_operon_predictor = args['--op'] - patho_transporter_inference = args['--tp'] - no_download_articles = args['--nc'] - flat_creation = args['--flat'] - move_dat = args['--md'] - move_xml = args['--mx'] - move_owl = args['--mo'] - move_col = args['--mc'] - size_reduction = args['-r'] - number_cpu = args['--cpu'] - patho_log = args['--log'] - clean_arg = args['--clean'] - pgdb_to_deletes = args['--delete'] - pgdb_list = args['--list'] - ignore_error = args['--ignore-error'] - taxon_file = args['--taxon-file'] - pathway_score = args['-p'] - verbose = args['-v'] - topf = args['topf'] - version = args['--version'] + parser = argparse.ArgumentParser( + 'mpwt', + description='For specific help on each subcommand use: mpwt --help', + allow_abbrev=False, + ) + + parser.add_argument( + '-f', + dest='input', + required=False, + help='Working 
folder containing sub-folders with Genbank/GFF/PF files.', + metavar='INPUT_DIR') + + parser.add_argument( + '-o', + dest='output', + required=False, + help='Output folder path. Will create a output folder in this folder.', + metavar='OUPUT_DIR') + + parser.add_argument( + '--patho', + dest='patho', + help='Will run an inference of Pathologic on the input files.', + required=False, + action='store_true', + default=False, + ) + + parser.add_argument( + '--hf', + dest='hf', + help='Use with --patho. Run the Hole Filler using Blast.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--op', + dest='op', + help='Use with --patho. Run the Operon predictor of Pathway-Tools.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--tp', + dest='tp', + help='Use with --patho. Run the Transport Inference Parser of Pathway-Tools.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--nc', + dest='nc', + help='Use with --patho. Turn off loading of Pubmed entries.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '-p', + dest='p', + help='Use with --patho. Modify PathoLogic pathway prediction score. 
Must be a float between 0 and 1.', + required=False, + ) + + parser.add_argument( + '--flat', + dest='flat', + help='Will create BioPAX/attribute-value flat files from PGDB.', + required=False, + action='store_true', + default=False, + ) + + parser.add_argument( + '--md', + dest='md', + help='Move the dat files into the output folder.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--mx', + dest='mx', + help='Move the metabolic-reactions.xml file into the output folder.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--mo', + dest='mo', + help='Move owl files into the output folder.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--mc', + dest='mc', + help='Move tabular files into the output folder.', + required=False, + action='store_true', + default=False, + ) + + parser.add_argument( + '--clean', + dest='clean', + help='Clean ptools-local folder, before any other operations.', + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--delete', + dest='delete', + help="Give a PGDB name and mpwt will delete it (if multiple separe them with a ',', example: ecolicyc,athalianacyc).", + required=False, + ) + parser.add_argument( + '-r', + dest='r', + help="Will delete files in ptools-local and compress results files to reduce results size (use it with -o).", + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--log', + dest='log', + help="Create PathoLogic log files inside the given folder (use it with --patho).", + required=False, + ) + parser.add_argument( + '--list', + dest='list', + help="List all PGDBs inside the ptools-local folder.", + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + '--taxon-file', + dest='taxon_file', + help="For the use of the taxon_id.tsv file to find the taxon ID.", + required=False, + ) + parser.add_argument( + 
'-v', + dest='verbose', + help="Verbose.", + required=False, + action='store_true', + default=False, + ) + parser.add_argument( + 'topf', + help="Will convert Genbank and/or GFF files into PathoLogic Format file.", + nargs='?', + ) + + parser.add_argument( + '--version', + dest='version', + action='version', + default=False, + version='%(prog)s ' + VERSION + '\n' + LICENSE) + + parser.add_argument( + '--cpu', + help='Number of cpu to use for the multiprocessing (default=1). [default: 1]', + required=False, + type=int, + default=1) + parser.add_argument( + '--permission', + dest='permission', + help="Choose permission access to PGDB in ptools-local and output files, either 'all' or 'group' (by default it is user).", + required=False, + ) + + args = parser.parse_args() + + input_folder = args.input + output_folder = args.output + patho_inference = args.patho + patho_hole_filler = args.hf + patho_operon_predictor = args.op + patho_transporter_inference = args.tp + no_download_articles = args.nc + flat_creation = args.flat + move_dat = args.md + move_xml = args.mx + move_owl = args.mo + move_col = args.mc + size_reduction = args.r + number_cpu = args.cpu + patho_log = args.log + clean_arg = args.clean + pgdb_to_deletes = args.delete + pgdb_list = args.list + taxon_file = args.taxon_file + pathway_score = args.p + verbose = args.verbose + topf = args.topf + version = args.version + permission = args.permission + + # If no argument print the help. + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) if version: - print("Mpwt v" + VERSION + "\n" + LICENSE) + print('Mpwt v' + VERSION + '\n' + LICENSE) sys.exit() if verbose: - logging.getLogger("mpwt").setLevel(logging.DEBUG) + logging.getLogger('mpwt').setLevel(logging.DEBUG) logger.setLevel(logging.DEBUG) if pgdb_list: @@ -140,7 +330,7 @@ def run_mpwt(): # Use a set to remove redudant PGDB. 
if pgdb_to_deletes: utils.remove_pgdbs(list(set(pgdb_to_deletes.split(','))), number_cpu) - return + sys.exit() if clean_arg: if verbose: @@ -155,10 +345,15 @@ def run_mpwt(): if not patho_inference and not flat_creation and not move_dat and not output_folder: sys.exit() - if topf: - if input_folder and output_folder: - to_pathologic.create_pathologic_file(input_folder, output_folder, number_cpu) - sys.exit() + if topf is not None: + if topf == 'topf': + if input_folder and output_folder: + to_pathologic.create_pathologic_file(input_folder, output_folder, number_cpu) + sys.exit() + else: + sys.exit('topf argument needs input_folder (-f) and output_folder options (-o).') + else: + sys.exit(f'Wrong positional argument passed: {topf}, only "topf" is expected as a positional argument.') multiprocess_pwt(input_folder=input_folder, output_folder=output_folder, @@ -175,11 +370,11 @@ def run_mpwt(): size_reduction=size_reduction, number_cpu=number_cpu, patho_log=patho_log, - ignore_error=ignore_error, pathway_score=pathway_score, taxon_file=taxon_file, - verbose=verbose) + verbose=verbose, + permission=permission) -if __name__ == "__main__": +if __name__ == '__main__': run_mpwt() diff --git a/mpwt/mpwt_workflow.py b/mpwt/mpwt_workflow.py index 7666559..f5d3257 100755 --- a/mpwt/mpwt_workflow.py +++ b/mpwt/mpwt_workflow.py @@ -1,4 +1,4 @@ -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -21,16 +21,16 @@ - move the results to output folder (run_move_pgdb) """ -import csv import logging import os import shutil +import stat import sys import time from mpwt import utils from mpwt.pwt_wrapper import run_pwt, run_pwt_flat, run_move_pgdb -from mpwt.results_check import check_dat, check_pwt +from
mpwt.results_check import check_dat, check_mpwt_pathologic_runs from mpwt.pathologic_input import check_input_and_existing_pgdb, pwt_input_files, create_only_flat_lisp, create_flat_creation_script, read_taxon_id from multiprocessing import Pool @@ -42,8 +42,8 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None patho_transporter_inference=None, no_download_articles=None, flat_creation=None, dat_extraction=None, xml_extraction=None, owl_extraction=None, col_extraction=None, size_reduction=None, - number_cpu=None, patho_log=None, ignore_error=None, - pathway_score=None, taxon_file=None, verbose=None): + number_cpu=None, patho_log=None, pathway_score=None, + taxon_file=None, verbose=None, permission=None): """ Function managing all the workflow (from the creatin of the input files to the results). Use it when you import mpwt in a script. @@ -64,25 +64,14 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None size_reduction (bool): delete ptools-local data at the end (True/False) number_cpu (int): number of CPU used (default=1) patho_log (str): pathname to mpwt log folder - ignore_error (bool): Ignore error during PathoLogic inference (True/False) pathway_score (float): score between 0 and 1 to accept or reject pathway taxon_file (str): pathname to the mpwt taxon ID file verbose (bool): verbose argument + permission (str): Choose permission access to PGDB in ptools-local and output files, either 'all' or 'group' (by default it is user). """ if verbose: logger.setLevel(logging.DEBUG) - start_time = time.time() - times = [] - steps = [] - times.append(start_time) - steps.append('start') - - # Check if Pathway Tools is in the path. - # Find PGDB folder path. - ptools_local_path = utils.find_ptools_path() - pgdbs_folder_path = os.path.join(*[ptools_local_path, 'pgdbs', 'user']) - # Check if ptools-local is accessible. 
error = utils.check_ptools_local_pwt() if error: @@ -96,46 +85,56 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None if size_reduction and not output_folder: sys.exit('To use -r/size_reduction, you need to give an output folder (-o/output_folder).') - # Check if ignore_error is used with patho_inference. - if ignore_error and not patho_inference: - sys.exit('To use --ignore-error/ignore_error, you need to use the --patho/patho_inference argument.') - # Check if taxon_file is used with patho_inference. if (taxon_file and not patho_inference) and (taxon_file and not input_folder): sys.exit('To use --taxon-file/taxon_file, you need to use the --patho/patho_inference argument. Or you can use it with the -f argument to create the taxon file from data.') - #Check if patho_operon_predictor is used with patho_inference. + # Check if patho_operon_predictor is used with patho_inference. if patho_operon_predictor and not patho_inference: sys.exit('To use --op/patho_operon_predictor, you need to use the --patho/patho_inference argument.') - #Check if no_download_articles is used with patho_inference. + # Check if no_download_articles is used with patho_inference. if no_download_articles and not patho_inference: sys.exit('To use --nc/no_download_articles, you need to use the --patho/patho_inference argument.') - #Check if patho_transporter_inference is used with patho_inference. + # Check if patho_transporter_inference is used with patho_inference. if patho_transporter_inference and not patho_inference: sys.exit('To use --tp/patho_transporter_inference, you need to use the --patho/patho_inference argument.') - #Check if no_download_articles is used with patho_inference. + # Check if pathway_score is used with patho_inference. if pathway_score and not patho_inference: sys.exit('To use -p/pathway_score, you need to use the --patho/patho_inference argument.') - #Check if no_download_articles is used with output_folder. 
+ # Check if pathway_score is a float between 0 and 1. + if pathway_score: + try: + pathway_score = float(pathway_score) + except ValueError: + sys.exit("{0} is not a float. Pathway score prediction must be a float between 0.0 and 1.0.".format(pathway_score)) + + if pathway_score < 0.0 or pathway_score > 1.0: + sys.exit("{0} is not a float between 0.0 and 1.0.".format(pathway_score)) + + # Check if dat_extraction is used with output_folder. if dat_extraction and not output_folder: sys.exit('To use --md/dat_extraction, you need to use the -o/output_folder argument.') - #Check if xml_extraction is used with output_folder. + # Check if xml_extraction is used with output_folder. if xml_extraction and not output_folder: sys.exit('To use --mx/xml_extraction, you need to use the -o/output_folder argument.') - #Check if owl_extraction is used with output_folder. + # Check if owl_extraction is used with output_folder. if owl_extraction and not output_folder: sys.exit('To use --mo/owl_extraction, you need to use the -o/output_folder argument.') - #Check if col_extraction is used with output_folder. + # Check if col_extraction is used with output_folder. if col_extraction and not output_folder: sys.exit('To use --mc/col_extraction, you need to use the -o/output_folder argument.') + # Check permission. + if permission and permission not in ['all', 'group']: + sys.exit('--permission/permission must be either "group" or "all".') + # Use the number of cpu given by the user or 1 CPU.
if number_cpu: try: @@ -144,8 +143,154 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None raise ValueError('The number of CPU must be an integer.') else: number_cpu_to_use = 1 - mpwt_pool = Pool(processes=number_cpu_to_use) + + independent_mpwt(input_folder, output_folder, patho_inference, + patho_hole_filler, patho_operon_predictor, + patho_transporter_inference, no_download_articles, + flat_creation, dat_extraction, xml_extraction, + owl_extraction, col_extraction, size_reduction, + number_cpu_to_use, patho_log, pathway_score, + taxon_file, permission) + + +def close_mpwt(mpwt_pool, no_download_articles, pathway_score=None, old_pathway_score=None): + """End multiprocessing Pool and restore ptools-init.dat + + mpwt_pool (multiprocessing Pool): mpwt multiprocessing Pool + no_download_articles (bool): turning off loading of PubMed citations (True/False) + pathway_score (float): score between 0 and 1 to accept or reject pathway + """ + mpwt_pool.close() + mpwt_pool.join() + + # Turn on loading of pubmed entries. + if no_download_articles: + utils.pubmed_citations(activate_citations=True) + + # Remodify the pathway score to its original value. + if pathway_score: + utils.modify_pathway_score(old_pathway_score, comment_line=True) + + +def give_permission(folder, permission): + """ Give permission to group or all for folder. 
+ + Args: + folder (str): pathname to the input folder + permission (str): level of permission (either group or all) + """ + if permission == 'group': + for dirpath, _dirnames, filenames in os.walk(folder): + os.chmod(dirpath, stat.S_IRGRP | ~stat.S_IWGRP | ~stat.S_IXGRP) + for filename in filenames: + os.chmod(os.path.join(dirpath, filename), ~stat.S_IRGRP | ~stat.S_IWGRP | ~stat.S_IXGRP) + elif permission == 'all': + for dirpath, _dirnames, filenames in os.walk(folder): + os.chmod(dirpath, stat.S_IROTH | stat.S_IWOTH | stat.S_IXOTH) + for filename in filenames: + os.chmod(os.path.join(dirpath, filename), stat.S_IROTH | stat.S_IWOTH | stat.S_IXOTH) + else: + logger.critical(f'Invalid permission "{permission}", permission must be "group" or "all"') + + +def run_mpwt(run_folder=None, input_folder=None, run_input_files_creation=None, + run_output_folder=None, output_folder=None, + run_patho_inference=None, pathologic_options=None, + run_flat_creation=None, move_options=None, + taxon_file=None, permission=None): + """ Single run of mpwt on one folder. + Used in multiprocessing in independent_mpwt. 
+ + Args: + run_folder (str): name of the folder containing input files + input_folder (str): pathname to input folder + run_input_files_creation (bool): if True runs creation of PathoLogic input files + run_output_folder (bool): if True moves the output file to the output folder + output_folder (str): pathname to output folder + run_patho_inference (bool): if True PathoLogic is run on the input folder + pathologic_options (list): list of bool for: patho_hole_filler, patho_operon_predictor, patho_transporter_inference + run_flat_creation (bool): if True flat files will be created + move_options (list): list of bool for: dat_extraction, size_reduction, xml_extraction, owl_extraction, col_extraction + taxon_file (str): pathname to the mpwt taxon ID file + permission (str): Choose permission access to PGDB in ptools-local and output files, either 'all' or 'group' (by default it is user). + """ + ptools_local_path = utils.find_ptools_path() + pgdbs_folder_path = os.path.join(*[ptools_local_path, 'pgdbs', 'user']) + species_pgdb_folder = os.path.join(pgdbs_folder_path, run_folder.lower() + 'cyc') + + patho_error_status = False + flat_error_status = False + move_error_status = False + if input_folder: + run_folder_path = os.path.join(input_folder, run_folder) + + if run_input_files_creation: + pwt_input_files(run_folder_path, taxon_file) + + if run_patho_inference: + patho_error_status = run_pwt(run_folder_path, *pathologic_options) + if patho_error_status: + return run_folder, patho_error_status, flat_error_status, move_error_status + + if run_flat_creation: + flat_error_status = run_pwt_flat(run_folder_path) + check_dat(run_folder_path, species_pgdb_folder) + if flat_error_status: + return run_folder, patho_error_status, flat_error_status, move_error_status + + if permission: + give_permission(species_pgdb_folder, permission) + + if run_output_folder: + move_error_status = run_move_pgdb(run_folder, species_pgdb_folder, output_folder, *move_options) + if move_error_status:
+ return run_folder, patho_error_status, flat_error_status, move_error_status + + if permission and output_folder: + give_permission(output_folder, permission) + + return run_folder, patho_error_status, flat_error_status, move_error_status + + +def independent_mpwt(input_folder, output_folder=None, patho_inference=None, + patho_hole_filler=None, patho_operon_predictor=None, + patho_transporter_inference=None, no_download_articles=None, + flat_creation=None, dat_extraction=None, xml_extraction=None, + owl_extraction=None, col_extraction=None, size_reduction=None, + number_cpu_to_use=None, patho_log=None, pathway_score=None, + taxon_file=None, permission=None): + """ + Function managing the workflow for independent runs of mpwt. + Each Pathway Tools process on an organism is run separately, so if one fails the others can still succeed. + + Args: + input_folder (str): pathname to input folder + output_folder (str): pathname to output folder + patho_inference (bool): PathoLogic inference (True/False) + patho_hole_filler (bool): PathoLogic hole filler (True/False) + patho_operon_predictor (bool): PathoLogic operon predictor (True/False) + patho_transporter_inference (bool): PathoLogic Transport Inference Parser (True/False) + no_download_articles (bool): turning off loading of PubMed citations (True/False) + flat_creation (bool): BioPAX/attributes-values flat files creation (True/False) + dat_extraction (bool): move BioPAX/attributes-values files to output folder (True/False) + xml_extraction (bool): move metabolic-reactions.xml to output folder (True/False) + owl_extraction (bool): move owl files to output folder (True/False) + col_extraction (bool): move tabular files to output folder (True/False) + size_reduction (bool): delete ptools-local data at the end (True/False) + number_cpu_to_use (int): number of CPU used (default=1) + patho_log (str): pathname to mpwt log folder + pathway_score (float): score between 0 and 1 to accept or reject pathway + taxon_file (str):
pathname to the mpwt taxon ID file + permission (str): Choose permission access to PGDB in ptools-local and output files, either 'all' or 'group' (by default it is user). + """ + logger.info('---------- Launching mpwt ----------') + ptools_local_path = utils.find_ptools_path() + pgdbs_folder_path = os.path.join(*[ptools_local_path, 'pgdbs', 'user']) + + start_time = time.time() + + # Check if input folder exists and is a folder. if input_folder: if not os.path.exists(input_folder): logger.critical('mpwt can not run: ' + input_folder + ' does not exist.') @@ -154,18 +299,11 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None logger.critical('mpwt can not run: ' + input_folder + ' is not a directory.') return - # Create taxon file in the input folder. - if taxon_file and input_folder and not patho_inference: - taxon_file_pathname = os.path.join(input_folder, 'taxon_id.tsv') - if os.path.exists(taxon_file_pathname): - sys.exit('taxon ID file (' + taxon_file_pathname + ') already exists.') - else: - taxon_ids = read_taxon_id(input_folder) - with open(taxon_file_pathname, 'w', encoding='utf-8') as taxon_id_file: - taxon_id_writer = csv.writer(taxon_id_file, delimiter='\t') - taxon_id_writer.writerow(['species', 'taxon_id']) - for species, taxon_id in taxon_ids.items(): - taxon_id_writer.writerow([species, taxon_id]) + # If output_folder does not exist, create it. + if output_folder: + if not os.path.exists(output_folder): + logger.info('No output directory, it will be created.') + os.mkdir(output_folder) # Turn off loading of pubmed entries. if no_download_articles: @@ -173,65 +311,16 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None # Modify pathway prediction score. if pathway_score: + old_pathway_score = utils.extract_pathway_score() utils.modify_pathway_score(pathway_score) - # Check input folder and create input files for PathoLogic.
+ mpwt_pool = Pool(processes=number_cpu_to_use) + + # Check input subfolder. if input_folder: run_ids = [folder_id for folder_id in next(os.walk(input_folder))[1]] run_patho_flat_ids, run_flat_ids = check_input_and_existing_pgdb(run_ids, input_folder, output_folder, number_cpu_to_use) - # Launch PathoLogic inference on species with no PGDBs. - if run_patho_flat_ids: - # Create the list containing all the data used by the multiprocessing call. - multiprocess_pwt_input_files = [] - multiprocess_run_pwts = [] - multiprocess_run_pwt_flats = [] - multiprocess_run_move_pgdbs = [] - for run_patho_flat_id in run_patho_flat_ids: - input_folder_path = os.path.join(input_folder, run_patho_flat_id) - species_pgdb_folder = os.path.join(pgdbs_folder_path, run_patho_flat_id.lower() + 'cyc') - input_run_move_pgdbs = [run_patho_flat_id, species_pgdb_folder] - input_run_move_pgdbs.extend([dat_extraction, output_folder, size_reduction, xml_extraction, owl_extraction, col_extraction]) - multiprocess_pwt_input_files.append([input_folder_path, taxon_file]) - multiprocess_run_pwts.append([input_folder_path, patho_hole_filler, patho_operon_predictor, patho_transporter_inference]) - multiprocess_run_pwt_flats.append([input_folder_path]) - multiprocess_run_move_pgdbs.append(input_run_move_pgdbs) - - logger.info('~~~~~~~~~~Creation of input data from Genbank/GFF/PF~~~~~~~~~~') - input_error_status = mpwt_pool.starmap(pwt_input_files, multiprocess_pwt_input_files) - if any(input_error_status): - close_mpwt(mpwt_pool, no_download_articles, pathway_score) - sys.exit('Error during PathoLogic input files creation.') - - input_time = time.time() - times.append(input_time) - steps.append('pwt input creation') - logger.info('----------End of creation of input data from Genbank/GFF/PF: {0:.2f}s----------'.format(times[-1] - times[-2])) - - # Launch PathoLogic. 
- if patho_inference: - logger.info('~~~~~~~~~~Inference on the data~~~~~~~~~~') - error_status = mpwt_pool.starmap(run_pwt, multiprocess_run_pwts) - - # Check PathoLogic build. - logger.info('~~~~~~~~~~Check inference~~~~~~~~~~') - passed_inferences = check_pwt(multiprocess_run_pwts, patho_log) - if any(error_status): - if ignore_error: - logger.critical('Error during inference. Process stopped. Look at the command log. Also by using --log argument, you can have additional information.') - else: - close_mpwt(mpwt_pool, no_download_articles, pathway_score) - sys.exit('Error during inference. Process stopped. Look at the command log. Also by using --log argument, you can have additional information.') - - patho_time = time.time() - times.append(patho_time) - steps.append('PathoLogic inference') - logger.info('----------End of PathoLogic inference: {0:.2f}s----------'.format(times[-1] - times[-2])) - else: - multiprocess_run_pwt_flats = [] - multiprocess_run_move_pgdbs = [] - passed_inferences = [] - # Create path for lisp if there is no folder given. # Create the input for the creation of BioPAX/attribute-values files. if (flat_creation and not input_folder) or (output_folder and not input_folder): @@ -241,135 +330,107 @@ def multiprocess_pwt(input_folder=None, output_folder=None, patho_inference=None os.mkdir(tmp_folder) # Create a lisp script file for each PGDB in the ptools-local folder. 
- flat_run_ids = create_only_flat_lisp(pgdbs_folder_path, tmp_folder) - multiprocess_run_pwt_flats = [] - multiprocess_run_move_pgdbs = [] - for flat_run_id in flat_run_ids: - input_tmp_folder_path = os.path.join(tmp_folder, flat_run_id) - species_pgdb_folder = os.path.join(pgdbs_folder_path, flat_run_id.lower() + 'cyc') - input_run_move_pgdbs = [flat_run_id, species_pgdb_folder] - - input_run_move_pgdbs.extend([dat_extraction, output_folder, size_reduction, xml_extraction, owl_extraction, col_extraction]) - multiprocess_run_pwt_flats.append([input_tmp_folder_path]) - multiprocess_run_move_pgdbs.append(input_run_move_pgdbs) - - # Add species that have data in PGDB but are not present in output folder. - # Or if ignore_error has been used, select only PathoLogic build that have succeed + species in input with PGDB and not in output. - if input_folder: - if ignore_error: - multiprocess_run_pwt_flats = [] - multiprocess_run_move_pgdbs = [] - if run_patho_flat_ids: - if passed_inferences: - tmp_run_flat_ids = list(set(passed_inferences).intersection(set(run_patho_flat_ids))) + run_ids = list(create_only_flat_lisp(pgdbs_folder_path, tmp_folder)) + if flat_creation: + run_flat_ids = list(run_ids) + else: + run_flat_ids = None + run_patho_flat_ids = None + + pathologic_options = [patho_hole_filler, patho_operon_predictor, patho_transporter_inference] + move_options = [dat_extraction, size_reduction, xml_extraction, owl_extraction, col_extraction] + + # Create data for multiprocessing. + # For each organism, find if a PathoLogic inference must be run, if a flat creation must be performed and if the output files must be moved. 
+ multiprocess_run_mpwts = [] + run_input_files_creation = False + run_patho_inference = False + run_flat_creation = False + run_output_folder = False + tmp_folder = False + + for run_id in run_ids: + # For species without PGDB in ptools-local, launch input files creation, PathoLogic reconstruction, flat files creation and moving output files (according to user input) + if run_patho_flat_ids and run_id in run_patho_flat_ids: + run_input_files_creation = True + if patho_inference: + run_patho_inference = True + if flat_creation: + run_flat_creation = True + if output_folder: + run_output_folder = True + # For species with PGDB in ptools-local, launch only flat files creation and moving output files (according to user input) + if run_flat_ids and run_id in run_flat_ids: + if flat_creation: + run_flat_creation = True + if output_folder: + run_output_folder = True + + # If no input_folder, PGDBs from ptools-local will be used. + if not input_folder: + tmp_folder = True + # If flat_creation, flat files of these PGDBs will be created and moved to the output folder.
+ if flat_creation: + run_flat_creation = True + input_folder = os.path.join(ptools_local_path, 'tmp') + if output_folder: + if not os.path.exists(os.path.join(output_folder, run_id)): + run_output_folder = True else: - tmp_run_flat_ids = [] - else: - tmp_run_flat_ids = [] - if run_flat_ids: - tmp_run_flat_ids.extend(run_flat_ids) - run_flat_ids = tmp_run_flat_ids - if run_flat_ids: - for run_flat_id in run_flat_ids: - flat_creation_path = os.path.join(*[input_folder, run_flat_id, 'flat_files_creation.lisp']) - create_flat_creation_script(run_flat_id, flat_creation_path) - input_folder_path = os.path.join(input_folder, run_flat_id) - species_pgdb_folder = os.path.join(pgdbs_folder_path, run_flat_id.lower() + 'cyc') - multiprocess_run_pwt_flats.append([input_folder_path]) - input_run_move_pgdbs = [run_flat_id, species_pgdb_folder] - input_run_move_pgdbs.extend([dat_extraction, output_folder, size_reduction, xml_extraction, owl_extraction, col_extraction]) - multiprocess_run_move_pgdbs.append(input_run_move_pgdbs) - - if not multiprocess_run_pwt_flats: - close_mpwt(mpwt_pool, no_download_articles, pathway_score) - logger.critical('No PGDB to export to move to output folder.') - return - - if not multiprocess_run_move_pgdbs: - close_mpwt(mpwt_pool, no_download_articles, pathway_score) - logger.critical('No PGDB to export in flat files or to move to output folder.') - return - - # Create BioPAX/attributes-values flat files. - if (input_folder and flat_creation) or flat_creation: - logger.info('~~~~~~~~~~Creation of the flat files~~~~~~~~~~') - flat_error_status = mpwt_pool.starmap(run_pwt_flat, multiprocess_run_pwt_flats) - logger.info('~~~~~~~~~~Check .dat~~~~~~~~~~') - for multiprocess_run_move_pgdb in multiprocess_run_move_pgdbs: - check_dat(multiprocess_run_move_pgdb[0], multiprocess_run_move_pgdb[1]) - if any(flat_error_status): - if ignore_error: - logger.critical('Error during flat files creation. Process stopped. Look at the command log. 
Also by using --log argument, you can have additional information.') - else: - close_mpwt(mpwt_pool, no_download_articles, pathway_score) - sys.exit('Error during flat files creation. Process stopped. Look at the command log. Also by using --log argument, you can have additional information.') - - flat_time = time.time() - times.append(flat_time) - steps.append('BioPAX/attribute-value flat files creation') - logger.info('----------End of flat files creation: {0:.2f}s----------'.format(times[-1] - times[-2])) + logger.info('{0} contains already {1}, output files will not be moved'.format(output_folder, run_id)) - if (flat_creation and not input_folder) or (output_folder and not input_folder): - ptools_local_path = utils.find_ptools_path() - ptools_local_tmp_path = os.path.join(ptools_local_path, 'tmp') - shutil.rmtree(ptools_local_tmp_path) + multiprocess_run_mpwt = [run_id, input_folder, run_input_files_creation, run_output_folder, output_folder, run_patho_inference, pathologic_options, + run_flat_creation, move_options, taxon_file, permission] - logger.info('~~~~~~~~~~End of Pathway Tools~~~~~~~~~~') + multiprocess_run_mpwts.append(multiprocess_run_mpwt) - # Move PGDBs or attribute-values/dat files. 
- if output_folder: - logger.info('~~~~~~~~~~Moving result files~~~~~~~~~~') - if not os.path.exists(output_folder): - logger.info('No output directory, it will be created.') - os.mkdir(output_folder) - mpwt_pool.starmap(run_move_pgdb, multiprocess_run_move_pgdbs) + results = mpwt_pool.starmap(run_mpwt, multiprocess_run_mpwts) - move_time = time.time() - times.append(move_time) - steps.append('Moving results files') - logger.info('----------End of moving fimes: {0:.2f}s----------'.format(times[-1] - times[-2])) + if patho_log: + if not os.path.exists(patho_log): + os.mkdir(patho_log) + input_folders = [os.path.join(multiprocess_run_mpwt[1], multiprocess_run_mpwt[0]) for multiprocess_run_mpwt in multiprocess_run_mpwts] + check_mpwt_pathologic_runs(input_folders, patho_log) + + logger.info('-------------- Checking mpwt runs --------------') + nb_failed_runs = len([result for result in results if any(result[1:])]) + nb_total_runs = len(results) + if nb_failed_runs > 0: + if nb_failed_runs == 1: + failed_str = '{0} failed run'.format(nb_failed_runs) + else: + failed_str = '{0} failed runs'.format(nb_failed_runs) + if nb_total_runs == 1: + total_str = '{0} run'.format(nb_total_runs) + else: + total_str = '{0} runs'.format(nb_total_runs) + logger.info('/!\\ {0} on a total of {1}.'.format(failed_str, total_str)) + else: + logger.info('All runs are successful.') + + for result in results: + run_id = result[0] + if any(result[1:]): + if result[1]: + logger.info('/!\\ Error in {0} during PathoLogic inference step.'.format(run_id)) + if result[2]: + logger.info('/!\\ Error in {0} during Flat files creation step.'.format(run_id)) + if result[3]: + logger.info('/!\\ Error in {0} during Moving output files step.'.format(run_id)) + + # Remove tmp folder in ptools-local. 
+ if tmp_folder: + ptools_local_tmp_path = os.path.join(ptools_local_path, 'tmp') + shutil.rmtree(ptools_local_tmp_path) - close_mpwt(mpwt_pool, no_download_articles, pathway_score) + # Close multiprocessing Pool + if pathway_score: + close_mpwt(mpwt_pool, no_download_articles, pathway_score, old_pathway_score) + else: + close_mpwt(mpwt_pool, no_download_articles) end_time = time.time() - times.append(end_time) - steps.append('mpwt') - - # Write each step time in log file. - if patho_log: - if patho_log: - if not os.path.exists(patho_log): - logger.info('No log directory, it will be created.') - os.mkdir(patho_log) - patho_error_pathname = os.path.join(patho_log, 'log_error.txt') - with open(patho_error_pathname, 'a') as input_file: - input_file.write('\n\n---------Time---------\n') - for index, step_time in enumerate(times): - if index != 0: - if index + 1 == len(times): - step_duration = step_time - times[0] - else: - step_duration = step_time - times[index-1] - input_file.write('Step {0} takes: {1:.2f}s.\n'.format(steps[index] , step_duration)) - - logger.info('----------mpwt has finished in {0:.2f}s! Thank you for using it.'.format(end_time - start_time)) - - -def close_mpwt(mpwt_pool, no_download_articles, pathway_score): - """End multiprocessing Pool and restore ptools-init.dat - - mpwt_pool (multiprocessing Pool): mpwt multiprocessing Pool - no_download_articles (bool): turning off loading of PubMed citations (True/False) - pathway_score (float): score between 0 and 1 to accept or reject pathway - """ - mpwt_pool.close() - mpwt_pool.join() - # Turn on loading of pubmed entries. - if no_download_articles: - utils.pubmed_citations(activate_citations=True) + logger.info('-------------- mpwt has finished in {0:.2f}s! Thank you for using it. --------------'.format(end_time - start_time)) - # Remodify the pathway score to its original value. 
- if pathway_score: - utils.modify_pathway_score(0.35) diff --git a/mpwt/pathologic_input.py b/mpwt/pathologic_input.py index d6b64d6..e4b0280 100755 --- a/mpwt/pathologic_input.py +++ b/mpwt/pathologic_input.py @@ -1,4 +1,4 @@ -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -24,7 +24,6 @@ import csv import logging import os -import sys from Bio import SeqIO from gffutils.iterators import DataIterator @@ -75,7 +74,7 @@ def check_input_and_existing_pgdb(run_ids, input_folder, output_folder, number_c list: input IDs for BioPAX/flat files creation """ # Check if there are files/folders inside the input folder. - # And remove hidden folder/file (beginning with '.'). + # And do not use hidden folder/file (beginning with '.'). species_folders = [species_folder for species_folder in os.listdir(input_folder) if not species_folder.startswith('.')] if len(species_folders) == 0: logger.critical("No folder containing genbank/gff file. In {0} you must have sub-folders containing Genbank/GFF file.".format(input_folder)) @@ -128,6 +127,7 @@ def check_input_and_existing_pgdb(run_ids, input_folder, output_folder, number_c # Take run_ids and remove folder with error (with the intersection with check_species_folders) and if there is already present output. 
clean_run_ids = set(run_ids).intersection(set(check_species_folders)) + if output_folder: if os.path.exists(output_folder): if os.path.isdir(output_folder): @@ -474,8 +474,7 @@ def create_flats_and_lisp(run_folder, taxon_file): if os.path.exists(fasta_path): pf_fasta = fasta_species_name if not pf_fasta: - logger.critical('No fasta file (.fasta or .fsa) with the Pathologic file of {0}'.format(pgdb_id)) - return None + logger.critical('No fasta file (.fasta or .fsa) with the Pathologic file of {0}, this could lead to warnings in Pathway Tools.'.format(pgdb_id)) taxon_error, taxon_id, taxon_datas = extract_taxon_id(run_folder, pgdb_id, taxon_id, taxon_file) @@ -630,16 +629,16 @@ def pwt_input_files(run_folder, taxon_file): error_found = False missing_string = '' if required_files.issubset(files_in): - missing_string = 'no missing files' + missing_string = 'No missing files' else: - missing_string = 'missing {0}'.format('; '.join(required_files.difference(files_in))) + '. Inputs file created for {0}'.format(species_folder) + missing_string = 'Missing {0}'.format('; '.join(required_files.difference(files_in))) + '. 
Inputs file created for {0}'.format(species_folder) check_datas_lisp = create_flats_and_lisp(run_folder, taxon_file) if check_datas_lisp is None: logger.critical('Error with the creation of input files of {0}.'.format(run_folder)) error_found = True return error_found - logger.info('Checking inputs for {0}: {1}.'.format(species_folder, missing_string)) + logger.info('|Input Check|{0}| {1}'.format(species_folder, missing_string)) return error_found @@ -661,7 +660,8 @@ def create_only_flat_lisp(pgdbs_folder_path, tmp_folder): pgdb_id = species_pgdb[:-3] pgdb_pathname = os.path.join(tmp_folder, pgdb_id) tmp_pgdb_path = os.path.join(tmp_folder, pgdb_id) - os.mkdir(tmp_pgdb_path) + if not os.path.exists(tmp_pgdb_path): + os.mkdir(tmp_pgdb_path) lisp_pathname = os.path.join(pgdb_pathname, 'flat_files_creation.lisp') check_lisp_file = create_flat_creation_script(pgdb_id, lisp_pathname) if not check_lisp_file: diff --git a/mpwt/pwt_wrapper.py b/mpwt/pwt_wrapper.py index 5ee9798..bc9c93f 100755 --- a/mpwt/pwt_wrapper.py +++ b/mpwt/pwt_wrapper.py @@ -1,4 +1,4 @@ -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -22,7 +22,6 @@ import shutil import signal import subprocess -import sys logger = logging.getLogger(__name__) @@ -138,10 +137,12 @@ def run_pwt(species_input_folder_path, patho_hole_filler, patho_operon_predictor if patho_transporter_inference: cmd_pwt.append('-tip') - logger.info(' '.join(cmd_pwt)) + species_name = os.path.basename(species_input_folder_path) + logger.info('|PathoLogic|{}| '.format(species_name) + ' '.join(cmd_pwt)) error_status = None - errors = ['Restart actions (select using :continue):'] + # Errors are either a fatal error or opening the lisp listener. 
+ errors = ['fatal error', '[Current process: Initial Lisp Listener]'] patho_lines = [] # Name of the file containing the log from Pathway Tools terminal. @@ -207,11 +208,13 @@ def run_pwt_flat(species_input_folder_path): cmd_options = ['-no-patch-download', '-disable-metadata-saving', '-nologfile'] cmd_flat = ['pathway-tools', *cmd_options, '-load', lisp_path] - logger.info(' '.join(cmd_flat)) + species_name = os.path.basename(species_input_folder_path) + logger.info('|Flat files creation|{}| '.format(species_name) + ' '.join(cmd_flat)) error_status = None flat_creation_ends = ['Opening Navigator window.'] - load_errors = ['Error', 'fatal error', 'No protein-coding genes with sequence data found.', 'Cannot continue.'] + # Errors are either a fatal error or opening the lisp listener. + load_errors = ['fatal error', '[Current process: Initial Lisp Listener]'] load_lines = [] # Name of the file containing the log from Pathway Tools terminal. @@ -227,7 +230,7 @@ def run_pwt_flat(species_input_folder_path): load_line = load_line.decode(encoding, errors='replace') flat_file_writer.write(load_line) - # Lisp commnd has finished, kill Pathway Toosl trying to open navigator. + # Lisp command has finished, kill Pathway Toosl trying to open navigator. if any(flat_end in load_line for flat_end in flat_creation_ends): load_subprocess.stdout.close() load_subprocess.kill() @@ -260,7 +263,7 @@ def run_pwt_flat(species_input_folder_path): return error_status -def run_move_pgdb(pgdb_folder_dbname, pgdb_folder_path, dat_extraction, output_folder, size_reduction, xml_extraction, owl_extraction, col_extraction): +def run_move_pgdb(pgdb_folder_dbname, pgdb_folder_path, output_folder, dat_extraction, size_reduction, xml_extraction, owl_extraction, col_extraction): """ Move the result files inside the shared folder containing the input data. pgdb_folder_dbname: ID of the species. 
@@ -269,13 +272,15 @@ def run_move_pgdb(pgdb_folder_dbname, pgdb_folder_path, dat_extraction, output_f Args: pgdb_folder_dbname (str): species ID pgdb_folder_path (str): path to species PGDB folder - dat_extraction (bool): to extract or not the attribute-values files (.dat files) output_folder (str): path to output folder + dat_extraction (bool): to extract or not the attribute-values files (.dat files) size_reduction (bool): to compress or not the results xml_extraction (bool): to extract or not the metabolic-reactions.xml' owl_extraction (bool): to extract or not the owl files col_extraction (bool): to extract or not the tabular files (.col files) """ + logger.info('|Moving output files|{}| '.format(pgdb_folder_dbname)) + output_species = os.path.join(output_folder, pgdb_folder_dbname) keep_extensions = [] @@ -315,12 +320,13 @@ def run_move_pgdb(pgdb_folder_dbname, pgdb_folder_path, dat_extraction, output_f if len(keep_extensions) > 0: pgdb_tmp_folder_path = os.path.join(*[pgdb_folder_path, '1.0', 'data']) - if not os.path.exists(pgdb_tmp_folder_path): - logger.critical('Missing ' + pgdb_tmp_folder_path + ' folder.') - return else: pgdb_tmp_folder_path = pgdb_folder_path + if not os.path.exists(pgdb_tmp_folder_path): + logger.critical('Missing ' + pgdb_tmp_folder_path + ' folder.') + return True + # If size_reduction, mpwt will create a compressed version of the PGDB in output folder. # It will also delete the PGDB folder in ptools-local. 
if size_reduction: @@ -348,3 +354,5 @@ def run_move_pgdb(pgdb_folder_dbname, pgdb_folder_path, dat_extraction, output_f os.remove(pgdb_file_pathname) elif os.path.isdir(pgdb_file_pathname): shutil.rmtree(pgdb_file_pathname) + + return False \ No newline at end of file diff --git a/mpwt/results_check.py b/mpwt/results_check.py index 7c1204e..aed5fd2 100755 --- a/mpwt/results_check.py +++ b/mpwt/results_check.py @@ -1,4 +1,4 @@ -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -14,153 +14,190 @@ """ Check results from Pathway Tools command: --check_pwt: results from PathoLogic (by looking at pathologic.log) +-check_mpwt_pathologic_runs: results from PathoLogic (by looking at pathologic.log) -check_dat: attribute-values dat files """ import csv import os import logging -import sys +import re logger = logging.getLogger(__name__) -def check_pwt(multiprocess_run_pwts, patho_log_folder): +def extract_pathologic(patho_log): + """ Read PathoLogic's log and extract informations such as if the build has passed, if there is warning/error. + And also if the build has passed the number of gene/protein/reaction/compound in the draft network. + + Args: + patho_log (str): pathname to the pathologic.log file. + + Returns: + non_fatal_error_count (int): number of non fatal errors. + warning_count (int): number of warning message. 
+ fatal_error_index (None, int): either None (if no fatal error) or the index of the line in pathologic.log with fatal error + passed_inferences (None, bool): either None (if inference has no passed) or True otherwise + pgdb_build_done (None, bool): either None (if inference has no passed) or True otherwise + gene_number (None, int): either None (if build has failed) or int with the number of genes in network + protein_number (None, int): either None (if build has failed) or int with the number of proteins in network + pathway_number (None, int): either None (if build has failed) or int with the number of pathways in network + reaction_number (None, int): either None (if build has failed) or int with the number of reactions in network + compound_number (None, int): either None (if build has failed) or int with the number of compounds in network + log_str (str): string containing informaitons about the PathoLogic build behaves + log_resume_list (list): list with number of gene/protein/reaction/compound if build has passed + """ + non_fatal_error_count = 0 + warning_count = 0 + fatal_error_index = None + passed_inferences = False + pgdb_build_done = False + gene_number = None + protein_number = None + pathway_number = None + reaction_number = None + compound_number = None + + log_str = '' + log_resume_list = [] + + input_folder = os.path.dirname(patho_log) + organism_name = os.path.basename(input_folder) + + log_str += '------------ Species: ' + log_str += organism_name + '\n' + + if not os.path.exists(patho_log): + log_str += 'No pathologic log, an error occured before PathoLogic run.\n' + log_resume_list.append([organism_name, 'ERROR', '', '', '', '']) + logger.info('No pathologic log for {0}, an error occured before PathoLogic run.'.format(organism_name)) + return + + with open(patho_log, 'r') as input_file: + for index, line in enumerate(input_file): + if ';;; Error:' in line: + non_fatal_error_count += 1 + if 'Warning:' in line: + warning_count += 1 + if 'fatal 
error' in line: + fatal_error_index = index + log_str += line + log_resume_list.append([organism_name, 'ERROR', '', '', '', '']) + + if fatal_error_index: + if index > fatal_error_index: + log_str += line + + # Search for Build done line and its following line which look like (for Pathway Tools 25.0): + # PGDB contains XXXX genes, XXXX polypeptides, XXXX base pathways, XXXX reactions, XXXX compounds + # In Pathway Tools 25.5 it looks like this: + # PGDB contains XXXX classes and XXXX instances: XXXX genes, XXXX polypeptides, XXXX base pathways, XXXX reactions, XXXX compounds, XXXX publications + if 'Build done.' in line or 'PGDB build done.' in line: + log_str += line + if non_fatal_error_count > 0: + log_str += 'Number of non fatal errors: ' + str(non_fatal_error_count) + '. More information in ' + patho_log + '.\n' + if warning_count > 0: + log_str += 'Number of warning: ' + str(warning_count) + '. More information in ' + patho_log + '.\n' + + resume_inference_line = next(input_file) + log_str += resume_inference_line + pgdb_build_done = True + # Search the PGDB stat line and use regex to extract informations. + # # This is done by searching for association like (digit word) like(XXXX genes). 
+ pgdb_stats = {} + pgdb_line_re = r'(?P<stat_nb>[\d]+)\ (?P<variable_name>[\w]+(\ pathways)?)' + # Create a dictionary containing the stat as: {'genes': XXXX, 'reactions': XXXX, ...} + for match_found in re.finditer(pgdb_line_re, resume_inference_line): + pgdb_stats[match_found.group('variable_name')] = int(match_found.group('stat_nb')) + + gene_number = pgdb_stats['genes'] + # proteins is listed in pathologic.log for Pathway Tools inferior to 25.0 + # Since the 25.0 polypeptides replace proteins + if 'proteins' in pgdb_stats: + protein_number = pgdb_stats['proteins'] + elif 'polypeptides' in pgdb_stats: + protein_number = pgdb_stats['polypeptides'] + pathway_number = pgdb_stats['base pathways'] + reaction_number = pgdb_stats['reactions'] + compound_number = pgdb_stats['compounds'] + + if 'Done' in line: + passed_inferences = True + if pgdb_build_done: + log_resume_list.append([organism_name, gene_number, protein_number, + pathway_number, reaction_number, compound_number, + non_fatal_error_count, warning_count]) + + if passed_inferences is None and fatal_error_index is None: + log_str += 'No build in PathoLogic inference.' + log_resume_list.append([organism_name, 'ERROR', '', '', '', '']) + + log_str += '------------\n\n' + + return organism_name, non_fatal_error_count, warning_count, fatal_error_index, \ + passed_inferences, pgdb_build_done, gene_number, \ + protein_number, pathway_number, reaction_number, \ + compound_number, log_str, log_resume_list + + +def check_mpwt_pathologic_runs(species_input_folder_paths, patho_log_folder): """ Check PathoLogic's log. Create two log files (log_error.txt which contains Pathway Tools log and resume_inference.tsv which contains summary of metabolic networks). Args: - multiprocess_run_pwts (list): list of dictionary contaning multiprocess input data + species_input_folder_paths (list): list of input folder pathname patho_log_folder (str): pathname to the PathoLogic log folder. - - Returns: - list: Species with successful build.
""" - if patho_log_folder: - if not os.path.exists(patho_log_folder): - logger.info('No log directory, it will be created.') - os.mkdir(patho_log_folder) - - patho_error_pathname = os.path.join(patho_log_folder, 'log_error.txt') - patho_resume_pathname = os.path.join(patho_log_folder, 'resume_inference.tsv') - - patho_error_file = open(patho_error_pathname, 'w', encoding='utf-8') - patho_resume_file = open(patho_resume_pathname, 'w', encoding='utf-8') - patho_resume_writer = csv.writer(patho_resume_file, delimiter='\t', lineterminator='\n') - patho_resume_writer.writerow(['species', 'gene_number', 'protein_number', 'pathway_number', 'reaction_number', 'compound_number', 'pwt_non_fatal_error', 'pwt_warning']) + mpwt_pathologic_informations = [] failed_inferences = [] passed_inferences = [] - - for multiprocess_run_pwt in multiprocess_run_pwts: - species_input_folder_path = multiprocess_run_pwt[0] - species = os.path.basename(species_input_folder_path) + for species_input_folder_path in species_input_folder_paths: patho_log = os.path.join(species_input_folder_path, 'pathologic.log') + species_pathologic_informations = extract_pathologic(patho_log) - if patho_log_folder: - patho_error_file.write('------------ Species: ') - patho_error_file.write(species) - patho_error_file.write('\n') - - fatal_error_index = None - non_fatal_error_count = 0 - warning_count = 0 - - if os.path.exists(patho_log): - pgdb_build_done = False - with open(patho_log, 'r') as input_file: - for index, line in enumerate(input_file): - if ';;; Error:' in line: - non_fatal_error_count += 1 - if 'Warning:' in line: - warning_count += 1 - if 'fatal error' in line: - fatal_error_index = index - if species not in failed_inferences: - failed_inferences.append(species) - if patho_log_folder: - patho_error_file.write(line) - patho_resume_writer.writerow([species, 'ERROR', '', '', '', '']) - - if fatal_error_index: - if index > fatal_error_index: - if patho_log_folder: - patho_error_file.write(line) - - # 
Search for Build done line and its following line which look like (for Pathway Tools 25.0): - # PGDB contains XXXX genes, XXXX polypeptides, XXXX base pathways, XXXX reactions, XXXX compounds - if 'Build done.' in line or 'PGDB build done.' in line: - pgdb_build_done = True - if patho_log_folder: - patho_error_file.write(line) - resume_inference_line = next(input_file) - patho_error_file.write(resume_inference_line) - if non_fatal_error_count > 0: - non_fatal_error_line = 'Number of non fatal errors: ' + str(non_fatal_error_count) + '. More information in ' + patho_log + '.\n' - patho_error_file.write(non_fatal_error_line) - if warning_count > 0: - warning_line = 'Number of warning: ' + str(warning_count) + '. More information in ' + patho_log + '.\n' - patho_error_file.write(warning_line) - gene_number = int(resume_inference_line.split('PGDB contains ')[1].split(' genes')[0]) - # proteins is listed in pathologic.log for Pathway Tools inferior to 25.0 - # Since the 25.0 polypeptides replace proteins - if 'proteins' in resume_inference_line: - protein_number = int(resume_inference_line.split('genes, ')[1].split(' proteins')[0]) - pathway_number = int(resume_inference_line.split('proteins, ')[1].split(' base pathways')[0]) - elif 'polypeptides' in resume_inference_line: - protein_number = int(resume_inference_line.split('genes, ')[1].split(' polypeptides')[0]) - pathway_number = int(resume_inference_line.split('polypeptides, ')[1].split(' base pathways')[0]) - reaction_number = int(resume_inference_line.split('base pathways, ')[1].split(' reactions')[0]) - compound_number = int(resume_inference_line.split('reactions, ')[1].split(' compounds')[0]) - - if 'Done' in line: - passed_inferences.append(species) - if patho_log_folder and pgdb_build_done: - patho_resume_writer.writerow([species, gene_number, protein_number, pathway_number, reaction_number, compound_number, non_fatal_error_count, warning_count]) - - if species not in passed_inferences and species not in 
failed_inferences: - failed_inferences.append(species) - if patho_log_folder: - patho_error_file.write('No build in PathoLogic inference.') - patho_resume_writer.writerow([species, 'ERROR', '', '', '', '']) - else: - if patho_log_folder: - patho_error_file.write('No pathologic log, an error occured before PathoLogic run.\n') - patho_resume_writer.writerow([species, 'ERROR', '', '', '', '']) - logger.info('No pathologic log for {0}, an error occured before PathoLogic run.'.format(species)) - - if patho_log_folder: - patho_error_file.write('------------\n\n') + mpwt_pathologic_informations.append(species_pathologic_informations) + if species_pathologic_informations[3] is not None: + failed_inferences.append(species_pathologic_informations[0]) + elif species_pathologic_informations[4] is not None: + passed_inferences.append(species_pathologic_informations[0]) + elif species_pathologic_informations[3] is None and species_pathologic_informations[4] is not None: + failed_inferences.append(species_pathologic_informations[0]) number_passed_inference = len(passed_inferences) number_failed_inference = len(failed_inferences) + number_total_build = number_passed_inference + number_failed_inference string_passed_build = 'build has' if number_passed_inference == 1 else 'builds have' string_failed_build = 'build has' if number_failed_inference == 1 else 'builds have' if number_passed_inference > 0: - logger.info('\n{0} {1} passed!\n'.format(str(number_passed_inference), string_passed_build)) + logger.info('|Output Check| {0} on {1} {2} passed!'.format(str(number_passed_inference), str(number_total_build), string_passed_build)) if number_failed_inference > 0: - logger.critical('WARNING: {0} {1} failed! See the log for more information.\n'.format(str(number_failed_inference), string_failed_build)) + logger.critical('|Output Check|WARNING: {0} on {1} {2} failed: {3}! 
See the log for more information.'.format(str(number_failed_inference), str(number_total_build), string_failed_build, ' '.join(failed_inferences))) if patho_log_folder: - patho_error_file.close() - patho_resume_file.close() - with open(patho_error_pathname,'r') as contents: - save = contents.read() - with open(patho_error_pathname, 'w', encoding='utf-8') as output_file: - output_file.write('Inference statistics:\n') - if number_passed_inference > 0: - output_file.write('Build done: ' + str(number_passed_inference) + '\n') - output_file.write('Species: ' + ', '.join(passed_inferences) + '\n\n') - if number_failed_inference > 0: - output_file.write('Build failed: ' + str(number_failed_inference) + '\n') - output_file.write('Species: ' + ', '.join(failed_inferences) + '\n\n') - output_file.write(save) - - return passed_inferences + patho_error_pathname = os.path.join(patho_log_folder, 'log_error.txt') + patho_resume_pathname = os.path.join(patho_log_folder, 'resume_inference.tsv') + + patho_log_file = open(patho_error_pathname, 'w', encoding='utf-8') + patho_resume_file = open(patho_resume_pathname, 'w', encoding='utf-8') + + patho_resume_writer = csv.writer(patho_resume_file, delimiter='\t', lineterminator='\n') + patho_resume_writer.writerow(['species', 'gene_number', 'protein_number', 'pathway_number', 'reaction_number', 'compound_number', 'pwt_non_fatal_error', 'pwt_warning']) + + if number_passed_inference > 0: + patho_log_file.write('Build done: ' + str(number_passed_inference) + '\n') + patho_log_file.write('Species: ' + ', '.join(passed_inferences) + '\n\n') + if number_failed_inference > 0: + patho_log_file.write('Build failed: ' + str(number_failed_inference) + '\n') + patho_log_file.write('Species: ' + ', '.join(failed_inferences) + '\n\n') + + for species_pathologic_informations in mpwt_pathologic_informations: + patho_log_file.write(species_pathologic_informations[11]) + patho_resume_writer.writerow(*species_pathologic_informations[12]) + def 
check_dat(run_dat_id, species_pgdb_folder): """ @@ -170,8 +207,6 @@ def check_dat(run_dat_id, species_pgdb_folder): run_dat_id (str): species ID species_pgdb_folder (str): path to species PGDB folder """ - pgdb_folder_dbname = run_dat_id.lower() + 'cyc' - dats_path = os.path.join(*[species_pgdb_folder, '1.0', 'data']) dat_files = ["classes.dat", "compound-links.dat", "compounds.dat", "dnabindsites.dat", "enzrxns.dat", "gene-links.dat", "genes.dat", "pathway-links.dat", @@ -186,5 +221,4 @@ def check_dat(run_dat_id, species_pgdb_folder): expected_dat_number = str(len(dat_files)) found_dat_number = str(len(dat_checks)) - logger.info('{0}: {1} out of {2} dat files created.'.format(pgdb_folder_dbname, found_dat_number, expected_dat_number)) - + logger.info('|Output Check|{0}| {1} out of {2} dat files created.'.format(run_dat_id, found_dat_number, expected_dat_number)) diff --git a/mpwt/to_pathologic.py b/mpwt/to_pathologic.py index dd10973..310cd3e 100755 --- a/mpwt/to_pathologic.py +++ b/mpwt/to_pathologic.py @@ -1,4 +1,4 @@ -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -343,7 +343,7 @@ def run_create_pathologic_file(input_path, output_path, output_folder, input_nam if feature.featuretype == 'gene': if feature.chrom == region: if len(feature.id) > 40: - logger.critical('Critical warning: gene ID ' + gene_id + ' of ' + input_path + 'is too long (more than 40 characters), this will cause errors in Pathway Tools.') + logger.critical('Critical warning: gene ID ' + feature.id + ' of ' + input_path + 'is too long (more than 40 characters), this will cause errors in Pathway Tools.') element_file.write('ID\t' + feature.id + '\n') element_file.write('NAME\t' + feature.id + '\n') 
element_file.write('STARTBASE\t' + str(feature.start) + '\n') @@ -356,8 +356,13 @@ def run_create_pathologic_file(input_path, output_path, output_folder, input_nam for ec in child.attributes['ec_number']: element_file.write('EC\t' + ec + '\n') if 'db_xref' in child.attributes: - if ':' in db_xref: - element_file.write('DBLINK\t' + db_xref + '\n') + for dbxref in child.attributes['db_xref']: + if ':' in dbxref: + element_file.write('DBLINK\t' + dbxref + '\n') + if 'Dbxref' in child.attributes: + for dbxref in child.attributes['Dbxref']: + if ':' in dbxref: + element_file.write('DBLINK\t' + dbxref + '\n') if child.featuretype == 'CDS': element_file.write('PRODUCT-TYPE\tP' + '\n') element_file.write('PRODUCT-ID\tprot ' + feature.id + '\n') diff --git a/mpwt/utils.py b/mpwt/utils.py index 9eee504..091d4a3 100755 --- a/mpwt/utils.py +++ b/mpwt/utils.py @@ -1,4 +1,4 @@ -# Copyright (C) 2018-2021 Arnaud Belcour- Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour- Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -227,12 +227,34 @@ def pubmed_citations(activate_citations): ptools_init_file.write(new_ptools_file) -def modify_pathway_score(pathway_score): +def extract_pathway_score(): + """ + Get the Pathway-Prediction-Score-Cutoff of ptools-init.dat + + Returns: + pathway_score (float): score between 0 and 1 to accept or reject pathways + """ + ptools_init_filepath = os.path.join(find_ptools_path() ,'ptools-init.dat') + + pathway_prediction_score_cutoff = None + with open(ptools_init_filepath, 'r') as ptools_init_file: + for line in ptools_init_file.read().split('\n'): + if 'Pathway-Prediction-Score-Cutoff' in line: + pathway_prediction_score_cutoff = line.split(' ')[1] + + if not pathway_prediction_score_cutoff: + sys.exit('There is no Pathway-Prediction-Score-Cutoff parameter in ' + 
ptools_init_filepath +'.') + + return pathway_prediction_score_cutoff + + +def modify_pathway_score(pathway_score, comment_line=None): """ Modify the Pathway-Prediction-Score-Cutoff of ptools-init.dat Args: pathway_score (float): score between 0 and 1 to accept or reject pathways + comment_line (bool): boolean if True comment Pathway-Prediction-Score-Cutoff line """ ptools_init_filepath = os.path.join(find_ptools_path() ,'ptools-init.dat') new_ptools_file = "" @@ -245,7 +267,7 @@ def modify_pathway_score(pathway_score): if '#' in line: line = line.replace('#', '') if pathway_score: - if pathway_score == 0.35: + if comment_line: line = '###' + line.split(' ')[0] + ' ' + str(pathway_score) else: line = line.split(' ')[0] + ' ' + str(pathway_score) diff --git a/mpwt_pipeline.svg b/mpwt_pipeline.svg index 3f6a20b..a3e4259 100644 --- a/mpwt_pipeline.svg +++ b/mpwt_pipeline.svg @@ -1,16 +1,40 @@ + id="svg8" + sodipodi:docname="mpwt_pipeline.svg" + inkscape:version="1.1 (c68e22c387, 2021-05-23)" + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" + xmlns:xlink="http://www.w3.org/1999/xlink" + xmlns="http://www.w3.org/2000/svg" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" + xmlns:cc="http://creativecommons.org/ns#" + xmlns:dc="http://purl.org/dc/elements/1.1/"> + + image/svg+xml - diff --git a/mpwt_runs.svg b/mpwt_runs.svg new file mode 100644 index 0000000..ab54558 --- /dev/null +++ b/mpwt_runs.svg @@ -0,0 +1,1922 @@
[mpwt_runs.svg: SVG markup not recoverable from this extract. Text labels in the figure: PathoLogic, GenBank, GFF; taxon_id.tsv; Species A; Species B; Species C; input folder; --taxon-file; PathoLogic, GenbBank, GFF; PGDBs creation; PathoLogic input files; --patho; -f; Hole Filler: --hf; Operon Predictor: --op; Transport Inference: --tp; ptools-local folder; attribute-values flat files creation; --flat; output folder; attribute-values flat files; input folder; All files; Specific files:; -o; --md; PGDBs storage; Pathway Tools; metabolic-reactions.xml; --mx; biopax owl files; --mo; tabular files; --mc; mpwt]
diff --git a/requirements.txt b/requirements.txt index 0824678..a6e39f7 100755 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,3 @@ biopython>=1.70 chardet>=3.0.4 -docopt>=0.6.2 gffutils>=0.9 diff --git a/setup.py b/setup.py index f74ea33..2d013e2 100755 --- a/setup.py +++ b/setup.py @@ -1,7 +1,7 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- -# Copyright (C) 2018-2021 Arnaud Belcour - Inria Dyliss +# Copyright (C) 2018-2022 Arnaud Belcour - Inria Dyliss # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or @@ -27,7 +27,8 @@ description='Multiprocessing for Pathway Tools', long_description=long_description, url='https://github.com/AuReMe/mpwt', - author='Arnaud Belcour', + author='AuReMe', + license='GPLv3', python_requires='>=3.6', classifiers=[ # How mature is this project? Common values are @@ -47,7 +48,6 @@ install_requires=[ 'biopython>=1.70', 'chardet>=3.0.4', - 'docopt>=0.6.2', 'gffutils>=0.9', ], entry_points={
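Note on the results_check.py change: the regex-based parsing of the 'PGDB contains' line can be tried in isolation with the sketch below. This is a minimal illustration, not the code from the commit. The helper name `parse_pgdb_stats` and the sample log lines (with made-up counts) are assumptions; only the named groups `stat_nb` and `variable_name` and the two line layouts (Pathway Tools 25.0 and 25.5) are taken from the diff.

```python
import re

# Pattern modelled on the one in extract_pathologic: a number followed by a
# word, with an optional ' pathways' suffix so 'base pathways' is one key.
PGDB_LINE_RE = re.compile(r'(?P<stat_nb>\d+) (?P<variable_name>\w+( pathways)?)')


def parse_pgdb_stats(resume_inference_line):
    """Hypothetical helper: map each counted item to its number.

    Returns a dict such as {'genes': 10, 'reactions': 7, ...}.
    """
    pgdb_stats = {}
    for match_found in PGDB_LINE_RE.finditer(resume_inference_line):
        pgdb_stats[match_found.group('variable_name')] = int(match_found.group('stat_nb'))
    return pgdb_stats


# Illustrative lines following the two layouts quoted in the diff comments.
line_25_0 = 'PGDB contains 10 genes, 9 polypeptides, 3 base pathways, 7 reactions, 5 compounds'
line_25_5 = ('PGDB contains 120 classes and 300 instances: 10 genes, 9 polypeptides, '
             '3 base pathways, 7 reactions, 5 compounds, 2 publications')

for log_line in (line_25_0, line_25_5):
    stats = parse_pgdb_stats(log_line)
    # 'proteins' was used before Pathway Tools 25.0, 'polypeptides' since.
    protein_number = stats.get('proteins', stats.get('polypeptides'))
    print(stats['genes'], protein_number, stats['base pathways'],
          stats['reactions'], stats['compounds'])
```

Because the keys are looked up by name, extra fields such as classes, instances or publications do not shift the values, which is presumably why this approach holds up better than chained `str.split()` calls when the line layout changes between Pathway Tools versions.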