Commit

reset docs to current state
tschuelia committed Dec 29, 2024
1 parent f16bd52 commit 4e9e755
Showing 3 changed files with 69 additions and 59 deletions.
2 changes: 2 additions & 0 deletions docs/install.md
@@ -1,3 +1,5 @@
# Installing PyPythia

## Requirements
In order to use this difficulty prediction, you need RAxML-NG installed somewhere on your system. You can find the install instructions [here](https://github.com/amkozlov/raxml-ng).
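Before running a prediction, it can help to check whether RAxML-NG is reachable on your system. The following is a minimal sketch using only the Python standard library. The helper name `find_raxmlng` is hypothetical and not part of PyPythia, and it assumes the binary is named `raxml-ng`.

```python
import shutil
from typing import Optional


def find_raxmlng(executable: str = "raxml-ng") -> Optional[str]:
    """Return the path of the RAxML-NG binary if it is found on $PATH, else None.

    Hypothetical helper for illustration; not part of PyPythia.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_raxmlng()
    if path is None:
        print("raxml-ng was not found on $PATH; note down where you installed it.")
    else:
        print(f"Found RAxML-NG at {path}")
```

If the binary is not found, you will need the full path to the RAxML-NG executable when calling Pythia.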

118 changes: 60 additions & 58 deletions docs/usage.md
@@ -1,96 +1,94 @@
# Using PyPythia

This library can be used in two ways: either directly as a command line tool, or by calling the prediction from other Python code.

## Command Line Tool
## Command Line Interface

If you only want to predict the difficulty for a single MSA, you can query the predictor using the command line interface, for example like this:
```commandline
pythia --msa examples/example.phy --raxmlng /path/to/raxml-ng
```
Note that if you installed PyPythia using conda, you will have to download `example.phy` and adjust the path accordingly.

The output will be something like `The predicted difficulty for MSA examples/example.phy is: 0.16.`, telling us that example.phy is an easy dataset. In fact, this dataset exhibits a single likelihood peak. Depending on the predictor version you are using, the actual value might slightly differ. This is expected and nothing to worry about 🙂
The output will be something like `The predicted difficulty for MSA examples/example.phy is: 0.02.`, telling us that example.phy is an easy dataset. In fact, this dataset exhibits a single likelihood peak. Depending on the predictor version you are using, the actual value might slightly differ. This is expected and nothing to worry about 🙂
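If you call the command line tool from a script, you may want the difficulty as a number rather than as a sentence. The sketch below extracts the value from an output line of the form shown above. The function name `parse_difficulty` is hypothetical, and the pattern assumes the exact wording of the example output, which may differ between Pythia versions.

```python
import re
from typing import Optional

# Mirrors the example output line; the exact wording is an assumption
# and may change between Pythia versions.
_DIFFICULTY_RE = re.compile(
    r"The predicted difficulty for MSA (?P<msa>.+) is: (?P<value>\d+(?:\.\d+)?)\.?\s*$"
)


def parse_difficulty(line: str) -> Optional[float]:
    """Return the difficulty found in a Pythia output line, or None if absent."""
    match = _DIFFICULTY_RE.search(line.strip())
    return float(match.group("value")) if match else None


line = "The predicted difficulty for MSA examples/example.phy is: 0.02."
print(parse_difficulty(line))
```

Lines that do not match the pattern yield `None`, so progress messages can be skipped safely.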

*Note that Pythia can also handle FASTA input files, see section Input Data below.*

The following options are available:
```commandline
PyPythia version 1.2.0 released by The Exelixis Lab
PyPythia version 2.0.0 released by The Exelixis Lab
Developed by: Julia Haag
Latest version: https://github.com/tschuelia/PyPythia
Questions/problems/suggestions? Please open an issue on GitHub.
usage: pythia [-h] -m MSA -r RAXMLNG [-t THREADS] [-p PREDICTOR] [-o OUTPUT] [-prec PRECISION] [-sT] [--removeDuplicates] [--forceDuplicates]
[--shap] [-v] [-b] [-q]
usage: pythia [-h] -m MSA -r RAXMLNG [-t THREADS] [-s SEED] [-p PREFIX]
[--predictor PREDICTOR] [-prec PRECISION] [-sT] [--forceDuplicates]
[--forceFullGaps] [--shap] [-v]
Parser for Pythia command line options.
options:
-h, --help show this help message and exit
-m MSA, --msa MSA Multiple Sequence Alignment to predict the difficulty for. Must be in either phylip or fasta format.
-m MSA, --msa MSA Multiple Sequence Alignment to predict the difficulty for.
Must be in either phylip or fasta format.
-r RAXMLNG, --raxmlng RAXMLNG
Path to the binary of RAxML-NG. For install instructions see https://github.com/amkozlov/raxml-ng.
Path to the binary of RAxML-NG. For install instructions
see https://github.com/amkozlov/raxml-ng.(default: 'raxml-
ng' if in $PATH, otherwise this option is mandatory).
-t THREADS, --threads THREADS
Number of threads to use for parallel parsimony tree inference. If none is set, Pythia uses the parallelization scheme
of RAxML-NG that automatically detects the optimal number of threads for your machine.
-p PREDICTOR, --predictor PREDICTOR
Filepath of the predictor to use. If not set, assume it is 'predictors/latest.pckl' in the project directory.
-o OUTPUT, --output OUTPUT
Option to specify a filepath where the result will be written to. The file will contain a single line with only the
difficulty.
Number of threads to use for parallel parsimony tree
inference (default: RAxML-NG autoconfig).
-s SEED, --seed SEED Seed for the RAxML-NG parsimony tree inference (default:
0).
-p PREFIX, --prefix PREFIX
Prefix of the PyPythia log and result file (default: MSA
file name).
--predictor PREDICTOR
Filepath of the alternative predictor to use (default:
latest Pythia).
-prec PRECISION, --precision PRECISION
Set the number of decimals the difficulty should be rounded to. Recommended and default is 2.
-sT, --storeTrees If set, stores the parsimony trees as '{msa_name}.parsimony.trees' file.
--removeDuplicates Pythia refuses to predict the difficulty for MSAs containing duplicate sequences. If this option is set, PyPythia
removes the duplicate sequences, stores the reduced MSA as '{msa_name}.{phy/fasta}.pythia.reduced' and predicts the
difficulty for the reduced alignment.
--forceDuplicates Per default, Pythia refuses to predict the difficulty for MSAs containing duplicate sequences. Set this option if you
are absolutely sure that you want to predict the difficulty for this MSA.
--shap If set, computes the shapley values of the prediction as waterfall plot in '{msa_name}.shap.pdf'. When using this
option, make sure you understand what shapley values are and how to interpret this plot.For details on shapley values
refer to the wiki: https://github.com/tschuelia/PyPythia/wiki/Usage#shapley-values.
-v, --verbose If set, additionally prints the MSA features.
-b, --benchmark If set, time the runtime of the prediction.
-q, --quiet If set, Pythia does not print progress updates and only prints the predicted difficulty.
Set the number of decimals the difficulty should be rounded
to (default: 2).
-sT, --storeTrees If set, stores the parsimony trees as
'{prefix}.pythia.trees' file (default: False).
--forceDuplicates Per default, Pythia refuses to predict the difficulty for
MSAs containing duplicate sequences. Only set this option
if you are absolutely sure that you want to predict the
difficulty for this MSA (default: False).
--forceFullGaps Per default, Pythia refuses to predict the difficulty for
MSAs containing sequences with only gaps. Only set this
option if you are absolutely sure that you want to predict
the difficulty for this MSA (default: False).
--shap If set, computes the shapley values of the prediction as
waterfall plot in '{prefix}.shap.pdf'. When using this
option, make sure you understand what shapley values are
and how to interpret this plot.For details on shapley
values refer to the wiki:
https://github.com/tschuelia/PyPythia/wiki/Usage#shapley-
values (default: False).
-v, --verbose If set, additionally prints the MSA features (default:
False).
```
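When driving the command line interface from Python, assembling the argument list programmatically avoids quoting mistakes. The sketch below uses only flags listed in the version 2.0.0 help text above (`--msa`, `--raxmlng`, `--threads`, `--seed`, `--prefix`, `--storeTrees`). The helper `build_pythia_command` is hypothetical and not part of PyPythia.

```python
from typing import List, Optional


def build_pythia_command(
    msa: str,
    raxmlng: Optional[str] = None,
    threads: Optional[int] = None,
    seed: Optional[int] = None,
    prefix: Optional[str] = None,
    store_trees: bool = False,
) -> List[str]:
    """Assemble the argument list for a `pythia` call (hypothetical helper)."""
    cmd = ["pythia", "--msa", msa]
    if raxmlng is not None:
        cmd += ["--raxmlng", raxmlng]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    if store_trees:
        cmd.append("--storeTrees")
    return cmd


cmd = build_pythia_command("examples/example.phy", raxmlng="/path/to/raxml-ng", threads=4)
print(" ".join(cmd))
# To actually run the prediction: subprocess.run(cmd, check=True)
```

Options left as `None` are omitted, so Pythia falls back to the defaults documented in the help text.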


## From Code

You can also use the library as a regular python library by installing it in your current environment.
Then you can query the prediction like this:
The following code snippet shows how to predict the difficulty for an MSA using PyPythia:

```python
from pypythia.predictor import DifficultyPredictor
from pypythia.prediction import get_all_features
from pypythia.raxmlng import RAxMLNG
from pypythia.msa import MSA

predictor = DifficultyPredictor(open("pypythia/predictors/latest.pckl", "rb"))
raxmlng = RAxMLNG("/path/to/raxml-ng")
msa = MSA("examples/example.phy")

msa_features = get_all_features(raxmlng, msa)
difficulty = predictor.predict(msa_features)
print(difficulty)
from pypythia.prediction import predict_difficulty
import pathlib

msa = pathlib.Path("examples/example.phy")
difficulty = predict_difficulty(msa)
print(f"The predicted difficulty for MSA {msa} is: {round(difficulty, 2)}.")
```
*Note that Pythia can also handle FASTA input files, see section Input Data below.*

#### Using Python multiprocessing
There are reported issues with multiprocessing in Python and LightGBM based predictors (see for example the [LightGBM FAQ](https://lightgbm.readthedocs.io/en/latest/FAQ.html#lightgbm-hangs-when-multithreading-openmp-and-using-forking-in-linux-at-the-same-time)).
We added a type check in the `predictor.py` prediction code that sets the number of threads to 1 for the prediction (`num_threads=1`) if the predictor is a LightGBM predictor.
This should not affect the previous Pythia versions using the scikit-learn predictors. Since the multithreading issues do not occur consistently, this issue is hard to debug.
If you encounter any issues with Python multiprocessing and Pythia please open a GitHub issue.
And the output will be the same as for the CLI: `The predicted difficulty for MSA examples/example.phy is: 0.02.`.
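To predict the difficulty for several MSAs, the single call can be wrapped in a loop. The following sketch takes the prediction function as a parameter so that it runs without PyPythia installed. The helper `predict_many` is hypothetical, and it assumes the prediction function accepts a single path and returns a float, as in the snippet above.

```python
import pathlib
from typing import Callable, Dict, Iterable


def predict_many(
    msa_files: Iterable[pathlib.Path],
    predict_fn: Callable[[pathlib.Path], float],
    precision: int = 2,
) -> Dict[str, float]:
    """Apply `predict_fn` to every MSA file and collect the rounded difficulties."""
    results = {}
    for msa in msa_files:
        results[str(msa)] = round(predict_fn(msa), precision)
    return results


# Stand-in predictor for demonstration only; it always returns the same value.
demo = predict_many([pathlib.Path("examples/example.phy")], lambda msa: 0.0234)
print(demo)
```

With PyPythia installed, you would pass `predict_difficulty` from `pypythia.prediction` as `predict_fn` instead of the stand-in.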

## Usage Without Installation
As of version 1.0.1, PyPythia includes a script `prediction_no_install.py` in the root directory. This script contains the single function `predict_difficulty`.
Provided a path to an MSA, a path to a trained difficulty predictor (e.g. `pypythia/predictors/latest.pckl`), and a path to an executable of RAxML-NG, this function
returns the predicted difficulty without requiring an installation of PyPythia. Note that this script can only be called from PyPythia's root directory.
If you want to get all features, or do more specific analyses of your MSA, see the API Reference for further details on all available classes and methods.

To use this script, open it using your favorite text editor / python IDE and add the following at the end:
```python
msa_file = "path/to/your/msa" # the file path of your MSA, can be either relative or absolute
raxmlng_exe_path = "path/to/raxml-ng/bin/raxml-ng" # path pointing to the RAxML-NG executable on your system
predictor_path = "pypythia/predictors/latest.pckl"
predict_difficulty(msa_file, predictor_path, raxmlng_exe_path)
```

# Input data
### Supported file types
@@ -108,6 +106,10 @@ Make sure that the MSA only contains RAxML-NG compatible taxon names.
In particular, taxon labels with spaces, tabs, newlines, commas, colons, semicolons and parentheses are invalid.
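The rule above can be checked before running Pythia. This is a minimal sketch that flags labels containing any of the listed characters. The names `INVALID_TAXON_CHARS` and `invalid_taxon_labels` are hypothetical, and RAxML-NG itself may apply further checks not covered here.

```python
from typing import Iterable, List

# Characters listed as invalid in taxon labels: space, tab, newline,
# comma, colon, semicolon and parentheses.
INVALID_TAXON_CHARS = set(" \t\n,:;()")


def invalid_taxon_labels(labels: Iterable[str]) -> List[str]:
    """Return the labels that contain at least one invalid character."""
    return [
        label for label in labels
        if any(ch in INVALID_TAXON_CHARS for ch in label)
    ]


labels = ["Homo_sapiens", "Pan troglodytes", "Mus(musculus)", "Danio:rerio"]
print(invalid_taxon_labels(labels))
```

Replacing the offending characters with underscores is a common fix, but make sure the renamed labels remain unique.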

### MSAs with duplicate sequences
Pythia refuses to predict the difficulty for MSAs containing duplicate sequences or sequences consisting only of gaps.
As of version 2.0.0, Pythia removes duplicate and full-gap sequences by default and predicts the difficulty for this reduced MSA.
If you absolutely want to predict the difficulty for the original MSA, set the command line flags `--forceDuplicates` and `--forceFullGaps`.
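To see in advance which sequences are affected, you can scan the alignment yourself. The sketch below is not PyPythia's implementation. It assumes the MSA is already loaded as a dictionary mapping taxon names to sequences, that duplicates are exactly identical strings, and that `-` is the only gap character. The helper name `find_problem_sequences` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_problem_sequences(
    msa: Dict[str, str], gap_chars: str = "-"
) -> Tuple[List[List[str]], List[str]]:
    """Return (groups of taxa with identical sequences, taxa with only gaps)."""
    by_sequence = defaultdict(list)
    for name, seq in msa.items():
        by_sequence[seq].append(name)
    duplicates = [names for names in by_sequence.values() if len(names) > 1]
    full_gaps = [
        name for name, seq in msa.items()
        if seq and all(ch in gap_chars for ch in seq)
    ]
    return duplicates, full_gaps


msa = {"t1": "ACGT", "t2": "ACGT", "t3": "----", "t4": "AC-T"}
duplicates, full_gaps = find_problem_sequences(msa)
print(duplicates)  # groups of taxa sharing the same sequence
print(full_gaps)   # taxa whose sequence contains only gap characters
```

Pass a different `gap_chars` string if your alignment uses other characters for missing data.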

As of version 1.0.0 Pythia refuses to predict the difficulty for MSAs containing multiple exactly identical sequences (duplicate sequences).
The reason for this is that duplicate sequences can have a substantial impact on the resulting topologies during the maximum parsimony tree inference
and therefore on the topological distance measures.
@@ -143,7 +145,7 @@ The following figure shows an exemplary waterfall plot output for the MSA `examp
The x-axis depicts the difficulty and the y-axis the features alongside the respective feature value. The features are sorted by their Shapley value with the highest contribution on top. You can read the plot as follows. The baseline difficulty that Pythia v1.1.0 learned is 0.35, as indicated by the `E[f(x)] = 0.35` on the x-axis. The `proportion_invariant` feature contributed to the overall prediction with a shift towards `1.0` (more difficult) of `0.01`, so *in combination with the other features*, a `proportion_invariant` of `0.341` indicates that the MSA is slightly more difficult than the average difficulty in the training set. We emphasize the *in combination with the other features* part, since the same value for `proportion_invariant` with a different MSA and different feature values for the remaining features might lead to a substantially different contribution to the overall prediction.
The feature with the highest impact for this example is the patterns-over-taxa ratio (`num_patterns/num_taxa`). The overall contribution is 0.23 towards `0.0`, meaning it shifts the overall prediction towards `easy`.

<img src="https://github.com/tschuelia/PyPythia/blob/master/examples/example.phy.shap.png" width="700">
<img src="../img/example.phy.shap.pdf" width="700">
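The waterfall plot relies on the additivity of Shapley values: the prediction equals the baseline `E[f(x)]` plus the sum of all feature contributions. The sketch below illustrates this with the two contributions quoted in the text. The remaining features are not listed here, so the result is only a partial sum and not the final prediction. The function name `prediction_from_shap` is hypothetical.

```python
from typing import Dict


def prediction_from_shap(base_value: float, contributions: Dict[str, float]) -> float:
    """Add the per-feature Shapley contributions to the baseline value."""
    return base_value + sum(contributions.values())


# Only the two contributions mentioned in the text; all other features are omitted.
known_contributions = {
    "proportion_invariant": +0.01,    # shifts towards 1.0 (more difficult)
    "num_patterns/num_taxa": -0.23,   # shifts towards 0.0 (easier)
}
partial = prediction_from_shap(0.35, known_contributions)
print(round(partial, 2))
```

Adding the contributions of every remaining feature in the plot to this partial sum gives the predicted difficulty.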

## More Details
For further information please refer to [this great book on interpretable ML](https://christophm.github.io/interpretable-ml-book/shapley.html) and the [documentation of the `shap` package](https://shap.readthedocs.io/en/latest/index.html), especially [their notes on the interpretability of Shapley values](https://shap.readthedocs.io/en/latest/example_notebooks/overviews/Be%20careful%20when%20interpreting%20predictive%20models%20in%20search%20of%20causal%C2%A0insights.html#Be-careful-when-interpreting-predictive-models-in-search-of-causal%C2%A0insights).
8 changes: 7 additions & 1 deletion mkdocs.yml
@@ -13,7 +13,13 @@ nav:
- Home: index.md
- Install: install.md
- User Guide: usage.md

- API Reference:
- msa: api/msa.md
- raxmlng: api/raxmlng.md
- prediction: api/prediction.md
- predictor: api/predictor.md
- custom_types: api/custom_types.md
- config: api/config.md
plugins:
- search
- mkdocstrings
