reset docs to current state

tschuelia · Dec 29, 2024 · commit 4e9e755 · 1 parent f16bd52
Showing 3 changed files with 69 additions and 59 deletions.
diff --git a/docs/install.md b/docs/install.md
@@ -1,3 +1,5 @@
+# Installing PyPythia
+
 ## Requirements
 To use the difficulty prediction, RAxML-NG must be installed on your system. See the [RAxML-NG repository](https://github.com/amkozlov/raxml-ng) for install instructions.
 

diff --git a/docs/usage.md b/docs/usage.md
@@ -1,96 +1,94 @@
+# Using PyPythia
+
 This library can be used in two ways: either directly as a command line tool, or by calling the prediction from other Python code.
 
-## Command Line Tool
+## Command Line Interface
+
 If you only want to predict the difficulty for a single MSA, you can query the predictor using the command line interface, for example like this:
 ```commandline
 pythia --msa examples/example.phy --raxmlng /path/to/raxml-ng
 ```
 Note that if you installed PyPythia using conda, you have to download `example.phy` separately and adjust the path accordingly.
 
-The output will be something like `The predicted difficulty for MSA examples/example.phy is: 0.16.`, telling us that example.phy is an easy dataset. In fact, this dataset exhibits a single likelihood peak. Depending on the predictor version you are using, the actual value might slightly differ. This is expected and nothing to worry about 🙂
+The output will be something like `The predicted difficulty for MSA examples/example.phy is: 0.02.`, telling us that `example.phy` is an easy dataset. In fact, this dataset exhibits a single likelihood peak. Depending on the predictor version you are using, the actual value might differ slightly. This is expected and nothing to worry about 🙂
 
 *Note that Pythia can also handle FASTA input files, see section Input Data below.*
 
 The following options are available:
 ```commandline
-PyPythia version 1.2.0 released by The Exelixis Lab
+PyPythia version 2.0.0 released by The Exelixis Lab
 Developed by: Julia Haag
 Latest version: https://github.com/tschuelia/PyPythia
 Questions/problems/suggestions? Please open an issue on GitHub.
 
-usage: pythia [-h] -m MSA -r RAXMLNG [-t THREADS] [-p PREDICTOR] [-o OUTPUT] [-prec PRECISION] [-sT] [--removeDuplicates] [--forceDuplicates]
-              [--shap] [-v] [-b] [-q]
+usage: pythia [-h] -m MSA -r RAXMLNG [-t THREADS] [-s SEED] [-p PREFIX]
+              [--predictor PREDICTOR] [-prec PRECISION] [-sT] [--forceDuplicates]
+              [--forceFullGaps] [--shap] [-v]
 
 Parser for Pythia command line options.
 
 options:
   -h, --help            show this help message and exit
-  -m MSA, --msa MSA     Multiple Sequence Alignment to predict the difficulty for. Must be in either phylip or fasta format.
+  -m MSA, --msa MSA     Multiple Sequence Alignment to predict the difficulty for.
+                        Must be in either phylip or fasta format.
   -r RAXMLNG, --raxmlng RAXMLNG
-                        Path to the binary of RAxML-NG. For install instructions see https://github.com/amkozlov/raxml-ng.
+                        Path to the binary of RAxML-NG. For install instructions
+                        see https://github.com/amkozlov/raxml-ng.(default: 'raxml-
+                        ng' if in $PATH, otherwise this option is mandatory).
   -t THREADS, --threads THREADS
-                        Number of threads to use for parallel parsimony tree inference. If none is set, Pythia uses the parallelization scheme
-                        of RAxML-NG that automatically detects the optimal number of threads for your machine.
-  -p PREDICTOR, --predictor PREDICTOR
-                        Filepath of the predictor to use. If not set, assume it is 'predictors/latest.pckl' in the project directory.
-  -o OUTPUT, --output OUTPUT
-                        Option to specify a filepath where the result will be written to. The file will contain a single line with only the
-                        difficulty.
+                        Number of threads to use for parallel parsimony tree
+                        inference (default: RAxML-NG autoconfig).
+  -s SEED, --seed SEED  Seed for the RAxML-NG parsimony tree inference (default:
+                        0).
+  -p PREFIX, --prefix PREFIX
+                        Prefix of the PyPythia log and result file (default: MSA
+                        file name).
+  --predictor PREDICTOR
+                        Filepath of the alternative predictor to use (default:
+                        latest Pythia).
   -prec PRECISION, --precision PRECISION
-                        Set the number of decimals the difficulty should be rounded to. Recommended and default is 2.
-  -sT, --storeTrees     If set, stores the parsimony trees as '{msa_name}.parsimony.trees' file.
-  --removeDuplicates    Pythia refuses to predict the difficulty for MSAs containing duplicate sequences. If this option is set, PyPythia
-                        removes the duplicate sequences, stores the reduced MSA as '{msa_name}.{phy/fasta}.pythia.reduced' and predicts the
-                        difficulty for the reduced alignment.
-  --forceDuplicates     Per default, Pythia refuses to predict the difficulty for MSAs containing duplicate sequences. Set this option if you
-                        are absolutely sure that you want to predict the difficulty for this MSA.
-  --shap                If set, computes the shapley values of the prediction as waterfall plot in '{msa_name}.shap.pdf'. When using this
-                        option, make sure you understand what shapley values are and how to interpret this plot.For details on shapley values
-                        refer to the wiki: https://github.com/tschuelia/PyPythia/wiki/Usage#shapley-values.
-  -v, --verbose         If set, additionally prints the MSA features.
-  -b, --benchmark       If set, time the runtime of the prediction.
-  -q, --quiet           If set, Pythia does not print progress updates and only prints the predicted difficulty.
+                        Set the number of decimals the difficulty should be rounded
+                        to (default: 2).
+  -sT, --storeTrees     If set, stores the parsimony trees as
+                        '{prefix}.pythia.trees' file (default: False).
+  --forceDuplicates     Per default, Pythia refuses to predict the difficulty for
+                        MSAs containing duplicate sequences. Only set this option
+                        if you are absolutely sure that you want to predict the
+                        difficulty for this MSA (default: False).
+  --forceFullGaps       Per default, Pythia refuses to predict the difficulty for
+                        MSAs containing sequences with only gaps. Only set this
+                        option if you are absolutely sure that you want to predict
+                        the difficulty for this MSA (default: False).
+  --shap                If set, computes the shapley values of the prediction as
+                        waterfall plot in '{prefix}.shap.pdf'. When using this
+                        option, make sure you understand what shapley values are
+                        and how to interpret this plot.For details on shapley
+                        values refer to the wiki:
+                        https://github.com/tschuelia/PyPythia/wiki/Usage#shapley-
+                        values (default: False).
+  -v, --verbose         If set, additionally prints the MSA features (default:
+                        False).
 ```
 
 
 ## From Code
+
 You can also use the library as a regular Python library by installing it in your current environment.
-Then you can query the prediction like this:
+The following code snippet shows how to predict the difficulty for an MSA using PyPythia:
 
 ```python
-from pypythia.predictor import DifficultyPredictor
-from pypythia.prediction import get_all_features
-from pypythia.raxmlng import RAxMLNG
-from pypythia.msa import MSA
-
-predictor = DifficultyPredictor(open("pypythia/predictors/latest.pckl", "rb"))
-raxmlng = RAxMLNG("/path/to/raxml-ng")
-msa = MSA("examples/example.phy")
-
-msa_features = get_all_features(raxmlng, msa)
-difficulty = predictor.predict(msa_features)
-print(difficulty)
+from pypythia.prediction import predict_difficulty
+import pathlib
+
+msa = pathlib.Path("examples/example.phy")
+difficulty = predict_difficulty(msa)
+print(f"The predicted difficulty for MSA {msa} is: {round(difficulty, 2)}.")
 ```
-*Note that Pythia can also handle FASTA input files, see section Input Data below.*
 
-#### Using Python multiprocessing
-There are reported issues with multiprocessing in Python and LightGBM based predictors (see for example the [LightGBM FAQ](https://lightgbm.readthedocs.io/en/latest/FAQ.html#lightgbm-hangs-when-multithreading-openmp-and-using-forking-in-linux-at-the-same-time)).
-We added a type check in the `predictor.py` prediction code that sets the number of threads to 1 for the prediction (`num_threads=1`) if the predictor is a LightGBM predictor.
-This should not affect the previous Pythia versions using the scikit-learn predictors. Since the multithreading issues do not occur consistently, this issue is hard to debug.
-If you encounter any issues with Python multiprocessing and Pythia please open a GitHub issue.
+The output is the same as for the CLI: `The predicted difficulty for MSA examples/example.phy is: 0.02.`
 
-## Usage Without Installation
-As of version 1.0.1, PyPythia includes a script `prediction_no_install.py` in the root directory. This script contains the single function `predict_difficulty`.
-Provided a path to an MSA, a path to a trained difficulty predictor (e.g. `pypythia/predictors/latest.pckl`), and a path to an executable of RAxML-NG, this fucntion
-returns the predicted difficulty without requiring an installation of PyPythia. Note that this script can only be called from PyPythia's root directory.
+If you want to get all features, or do more specific analyses of your MSA, see the API Reference for further details on all available classes and methods.
 
-To use this script, open it using your favorite text editor / python IDE and add the following at the end:
-```python
-msa_file = "path/to/your/msa"  # the file path of your MSA, can be either relative or absolute
-raxmlng_exe_path = "path/to/raxml-ng/bin/raxml-ng"  # path pointing to the RAxML-NG executable on your system
-predictor_path = "pypythia/predictors/latest.pckl"
-predict_difficulty(msa_file, predictor_path, raxmlng_exe_path)
-```
 
 # Input data
 ### Supported file types
@@ -108,6 +106,10 @@ Make sure that the MSA only contains RAxML-NG compatible taxon names.
 In particular, taxon labels with spaces, tabs, newlines, commas, colons, semicolons, and parentheses are invalid.
 
 ### MSAs with duplicate sequences
+Pythia does not predict the difficulty for an MSA as-is if it contains duplicate sequences or sequences consisting only of gaps.
+As of version 2.0.0, Pythia removes duplicate and full-gap sequences by default and predicts the difficulty for the reduced MSA.
+If you are sure you want to predict the difficulty for the original MSA, set the command line flags `--forceDuplicates` and `--forceFullGaps`.
+
 As of version 1.0.0, Pythia refuses to predict the difficulty for MSAs containing multiple exactly identical sequences (duplicate sequences).
 The reason for this is that duplicate sequences can have a substantial impact on the resulting topologies during the maximum parsimony tree inference
 and therefore on the topological distance measures.
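To illustrate what removing duplicate and full-gap sequences means for an MSA, here is a minimal sketch. It is not PyPythia's implementation: the function name `reduce_msa`, the dictionary representation of the alignment, and the choice of `-` and `?` as gap characters are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: these characters count as gaps; the real tool may use a different set.
GAP_CHARS = set("-?")


def reduce_msa(sequences: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Drop full-gap sequences and exact duplicates, keeping the first occurrence.

    `sequences` maps taxon names to aligned sequences. Returns the reduced
    alignment and the names of all dropped taxa, in input order.
    """
    kept: Dict[str, str] = {}
    seen = set()
    dropped: List[str] = []
    for name, seq in sequences.items():
        if set(seq) <= GAP_CHARS:
            # The sequence consists only of gap characters.
            dropped.append(name)
        elif seq in seen:
            # An exactly identical sequence was already kept.
            dropped.append(name)
        else:
            seen.add(seq)
            kept[name] = seq
    return kept, dropped


msa = {"t1": "ACGT", "t2": "ACGT", "t3": "----", "t4": "AC-T"}
reduced, removed = reduce_msa(msa)
print(reduced)
print(removed)
```

In this toy alignment, `t2` duplicates `t1` and `t3` contains only gaps, so both are dropped while `t4`, which has a single gap, is kept.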
@@ -143,7 +145,7 @@ The following figure shows an exemplary waterfall plot output for the MSA `examp
 The x-axis depicts the difficulty and the y-axis the features alongside the respective feature value. The features are sorted by their Shapley value with the highest contribution on top. You can read the plot as follows. The baseline difficulty that Pythia v1.1.0 learned is 0.35, as indicated by the `E[f(x)] = 0.35` on the x-axis. The `proportion_invariant` feature contributed to the overall prediction with a shift of `0.01` towards `1.0` (more difficult), so *in combination with the other features*, a `proportion_invariant` of `0.341` indicates that the MSA is slightly more difficult than the average difficulty in the training set. We emphasize the *in combination with the other features* part, since the same value for `proportion_invariant` with a different MSA and different feature values for the remaining features might lead to a substantially different contribution to the overall prediction.
 The feature with the highest impact for this example is the patterns-over-taxa ratio (`num_patterns/num_taxa`). The overall contribution is 0.23 towards `0.0`, meaning it shifts the overall prediction towards `easy`.
 
-<img src="https://github.com/tschuelia/PyPythia/blob/master/examples/example.phy.shap.png" width="700">
+<img src="../img/example.phy.shap.pdf" width="700">
 
 ## More Details
 For further information please refer to [this great book on interpretable ML](https://christophm.github.io/interpretable-ml-book/shapley.html), the [documentation of the `shap` package](https://shap.readthedocs.io/en/latest/index.html), especially [their notes on the interpretability of Shapley values](https://shap.readthedocs.io/en/latest/example_notebooks/overviews/Be%20careful%20when%20interpreting%20predictive%20models%20in%20search%20of%20causal%C2%A0insights.html#Be-careful-when-interpreting-predictive-models-in-search-of-causal%C2%A0insights).
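If you want to call the command line interface from a script, the options listed in the help text of `usage.md` can be assembled into an argument list. The following is a hedged sketch, not part of PyPythia: the helper names `build_pythia_command` and `run_pythia` are hypothetical, only flags that appear in the help text are used, and it assumes the `pythia` executable is available in `$PATH`.

```python
import subprocess
from typing import List, Optional


def build_pythia_command(
    msa: str,
    raxmlng: Optional[str] = None,
    threads: Optional[int] = None,
    seed: Optional[int] = None,
    prefix: Optional[str] = None,
) -> List[str]:
    """Assemble a `pythia` call; options left as None are omitted."""
    cmd = ["pythia", "--msa", msa]
    optional_flags = {
        "--raxmlng": raxmlng,
        "--threads": threads,
        "--seed": seed,
        "--prefix": prefix,
    }
    for flag, value in optional_flags.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_pythia(cmd: List[str]) -> str:
    """Run the command and return whatever it writes to standard output.

    Raises subprocess.CalledProcessError if the process exits with an error.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = build_pythia_command("examples/example.phy", raxmlng="/path/to/raxml-ng")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file paths that contain spaces. Call `run_pythia(cmd)` once the paths point to real files on your system.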
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -13,7 +13,13 @@ nav:
 - Home: index.md
 - Install: install.md
 - User Guide: usage.md
-
+- API Reference:
+  - msa: api/msa.md
+  - raxmlng: api/raxmlng.md
+  - prediction: api/prediction.md
+  - predictor: api/predictor.md
+  - custom_types: api/custom_types.md
+  - config: api/config.md
 plugins:
 - search
 - mkdocstrings