Python Library
In addition to the command line, all of Pandora's functionality can be used as a Python library, with an additional interface for any kind of numpy data matrix. The idea is that many (computational) biologists use R or Python to perform their PCA and MDS analyses instead of smartpca. This enables the user to perform any kind of preprocessing before running the PCA/MDS analyses. In the following, I will provide an example of how to perform a bootstrap stability analysis using the numpy-based Pandora interface. Further down, I will also provide an example of how to export and import data from R in case you want to (re-)use your custom R data preprocessing scripts.
I will soon set up a documentation page covering all classes and methods implemented in Pandora.
You can run the same analyses as you would via the command line by using the Python EIGEN interface. This interface is specifically targeted at analyses with genotype data provided in EIGENSTRAT format.
import pathlib
import tempfile

from pandora.custom_types import EmbeddingAlgorithm
from pandora.dataset import EigenDataset, bootstrap_and_embed_multiple
from pandora.embedding_comparison import BatchEmbeddingComparison

# set up the variables for the dataset you want to analyze and provide a path to a smartpca executable
eigen_example_prefix = pathlib.Path("example/example")
smartpca = "path/to/smartpca"

dataset = EigenDataset(eigen_example_prefix)

with tempfile.TemporaryDirectory() as tmpdir:
    # for this toy example we won't store the actual bootstrap results and smartpca logs, so we do this computation in a TemporaryDirectory
    result_dir = pathlib.Path(tmpdir)

    # create 10 bootstrap replicates based on the dataset and compute the PCA embedding for each bootstrap replicate
    bootstrap_replicates = bootstrap_and_embed_multiple(
        dataset=dataset,
        n_bootstraps=10,
        result_dir=result_dir,
        smartpca=smartpca,
        embedding=EmbeddingAlgorithm.PCA,  # tell Pandora to compute PCA for each of the bootstrap replicates
        n_components=2,  # here we only compute 2 PCs
        seed=42,  # set the seed for full reproducibility of the bootstraps
        threads=2,  # compute the bootstraps in parallel using 2 threads
        smartpca_optional_settings=dict(numoutlieriters=0)  # set the number of outlier detection iterations to 0 for smartpca
    )

    # finally, using all bootstrap PCA objects, we create a container for comparing all replicates and getting the overall PS score
    batch_comparison = BatchEmbeddingComparison([b.pca for b in bootstrap_replicates])
    pandora_stability = batch_comparison.compare()
    print("Pandora Stability (PS): ", round(pandora_stability, 2))
This will print something like Pandora Stability (PS): 0.92.
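If you want to keep the per-replicate results and smartpca logs instead of discarding them, point result_dir at a persistent directory rather than a TemporaryDirectory. The following is a minimal variation of the call above; the directory name "bootstrap_results" is just an example, and the optional smartpca settings are omitted for brevity.
import pathlib

from pandora.custom_types import EmbeddingAlgorithm
from pandora.dataset import EigenDataset, bootstrap_and_embed_multiple

dataset = EigenDataset(pathlib.Path("example/example"))
smartpca = "path/to/smartpca"

# any persistent directory works; the replicates and smartpca logs remain here after the run
result_dir = pathlib.Path("bootstrap_results")
result_dir.mkdir(exist_ok=True)

bootstrap_replicates = bootstrap_and_embed_multiple(
    dataset=dataset,
    n_bootstraps=10,
    result_dir=result_dir,
    smartpca=smartpca,
    embedding=EmbeddingAlgorithm.PCA,
    n_components=2,
    seed=42,
    threads=2
)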
The following example demonstrates how to perform the same PCA bootstrap stability analysis using Pandora's alternative numpy interface. Instead of providing the file path to EIGEN files, you directly provide Pandora with a numpy data matrix.
import pandas as pd
import numpy as np

from pandora.custom_types import EmbeddingAlgorithm
from pandora.dataset import NumpyDataset, bootstrap_and_embed_multiple_numpy
from pandora.embedding_comparison import BatchEmbeddingComparison

# this is the same genotype data as we used above, but already typed out as a numpy matrix
geno_data = np.asarray([
    [1, 0, 2, 0, 2, 0, 2, 0, 1, 2],
    [1, 1, 1, 0, 1, 0, 2, 0, 2, 2],
    [1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
    [0, 1, 0, 2, 0, 1, 1, 1, 0, 0],
    [0, 2, 1, 2, 0, 1, 0, 2, 1, 1]]
)

# create a new pandas Series for the respective sample IDs and populations
sample_ids = pd.Series(["SAMPLE0", "SAMPLE1", "SAMPLE2", "SAMPLE3", "SAMPLE4"])
populations = pd.Series(["pop0", "pop1", "pop2", "pop3", "pop4"])

# finally initialize the dataset using the geno data and the sample metadata
dataset = NumpyDataset(geno_data, sample_ids, populations)

# create 10 bootstrap replicates based on the dataset and compute the PCA embedding for each bootstrap replicate
bootstrap_replicates = bootstrap_and_embed_multiple_numpy(
    dataset=dataset,
    n_bootstraps=10,
    embedding=EmbeddingAlgorithm.PCA,  # tell Pandora to compute PCA for each of the bootstrap replicates
    n_components=2,  # here we only compute 2 PCs
    seed=42,  # set the seed for full reproducibility of the bootstraps
    threads=2  # compute the bootstraps in parallel using 2 threads
)

# finally, using all bootstrap PCA objects, we create a container for comparing all replicates and getting the overall PS score
batch_comparison = BatchEmbeddingComparison([b.pca for b in bootstrap_replicates])
pandora_stability = batch_comparison.compare()
print("Pandora Stability (PS): ", round(pandora_stability, 2))
This will print something like Pandora Stability (PS): 0.96. Note that due to the different implementations of PCA in smartpca versus scikit-learn, the PS is slightly different for the numpy-based interface versus the EIGEN-based interface.
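For reference, here is a minimal, standalone sketch of running scikit-learn's PCA directly on the same toy matrix. This is plain scikit-learn, not a Pandora call, and it does not reproduce Pandora's internal preprocessing; it only illustrates the scikit-learn backend that, as noted above, underlies the numpy-based interface and explains the small PS difference compared to smartpca.
import numpy as np
from sklearn.decomposition import PCA

# same toy genotype matrix as above
geno_data = np.asarray([
    [1, 0, 2, 0, 2, 0, 2, 0, 1, 2],
    [1, 1, 1, 0, 1, 0, 2, 0, 2, 2],
    [1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
    [0, 1, 0, 2, 0, 1, 1, 1, 0, 0],
    [0, 2, 1, 2, 0, 1, 0, 2, 1, 1]]
)

# scikit-learn PCA with 2 components, analogous to n_components=2 above
pca_coords = PCA(n_components=2).fit_transform(geno_data)
print(pca_coords.shape)  # (5, 2): one 2D coordinate per sample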
This example shows how to perform an MDS analysis using the numpy-based interface. You can select the distance metric that is computed as input for the MDS analysis. The first example uses one of the pre-implemented distance metrics; afterwards, I will demonstrate how you can define your own custom distance metric.
Important: while the above examples also work in a Jupyter notebook, the following example will only run if you paste it into a Python file and run it from the command line. The reason for this is the distance metric we pass for MDS (a Python Callable) combined with the fact that bootstrap_and_embed_multiple_numpy uses multiprocessing, which causes trouble when the call is not wrapped in an if __name__ == "__main__": guard.
import pandas as pd
import numpy as np

from pandora.custom_types import EmbeddingAlgorithm
from pandora.dataset import NumpyDataset, bootstrap_and_embed_multiple_numpy
from pandora.distance_metrics import manhattan_population_distance
from pandora.embedding_comparison import BatchEmbeddingComparison

if __name__ == "__main__":
    # this is identical to the example with PCA above: define the data and init the dataset
    geno_data = np.asarray([
        [1, 0, 2, 0, 2, 0, 2, 0, 1, 2],
        [1, 1, 1, 0, 1, 0, 2, 0, 2, 2],
        [1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
        [0, 1, 0, 2, 0, 1, 1, 1, 0, 0],
        [0, 2, 1, 2, 0, 1, 0, 2, 1, 1]]
    )
    sample_ids = pd.Series(["SAMPLE0", "SAMPLE1", "SAMPLE2", "SAMPLE3", "SAMPLE4"])
    populations = pd.Series(["pop0", "pop1", "pop2", "pop3", "pop4"])
    dataset = NumpyDataset(geno_data, sample_ids, populations)

    # instead of PCA, this time we pass MDS as embedding method
    # in this case we also need to pass a distance metric Callable; here we use the pre-implemented Manhattan population distance
    bootstrap_replicates = bootstrap_and_embed_multiple_numpy(
        dataset=dataset,
        n_bootstraps=10,  # again compute 10 bootstrap datasets
        embedding=EmbeddingAlgorithm.MDS,  # and perform MDS analysis for each bootstrap
        distance_metric=manhattan_population_distance,  # use the Manhattan distance between populations for MDS computation
        n_components=2,
        seed=42,
        threads=2
    )

    batch_comparison = BatchEmbeddingComparison([b.mds for b in bootstrap_replicates])
    pandora_stability = batch_comparison.compare()
    print("Pandora Stability (PS): ", round(pandora_stability, 2))
Again, we will see an output like Pandora Stability (PS): 0.91.
Custom distance metric
If you want to use a distance metric that is not implemented in Pandora, you can define one very easily, as the following example shows, in which we use scikit-learn's pairwise cosine_distances function. You can define a per-sample and a per-population metric like this:
from typing import Tuple

import numpy.typing as npt
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances
from pandora.distance_metrics import population_distance

def cosine_sample_distance(input_data: npt.NDArray, populations: pd.Series) -> Tuple[npt.NDArray, pd.Series]:
    return cosine_distances(input_data, input_data), populations

def cosine_population_distance(input_data: npt.NDArray, populations: pd.Series) -> Tuple[npt.NDArray, pd.Series]:
    return population_distance(input_data, populations, cosine_distances)
For the per-population metric, we make use of Pandora's population_distance function. Provided a numpy data array, the respective populations, and the desired pairwise distance metric, population_distance takes care of the population grouping.
Of course, you can implement an arbitrarily complex distance metric suited to your needs.
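A custom metric defined this way can then be passed to bootstrap_and_embed_multiple_numpy exactly like the pre-implemented metrics. The following sketch simply reuses the toy dataset and the MDS bootstrap call from above with cosine_population_distance plugged in as distance_metric; it is an illustration, not additional Pandora API. As with the MDS example, run it from a Python file because of the multiprocessing/__main__ requirement.
from typing import Tuple

import numpy as np
import numpy.typing as npt
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

from pandora.custom_types import EmbeddingAlgorithm
from pandora.dataset import NumpyDataset, bootstrap_and_embed_multiple_numpy
from pandora.distance_metrics import population_distance
from pandora.embedding_comparison import BatchEmbeddingComparison

def cosine_population_distance(input_data: npt.NDArray, populations: pd.Series) -> Tuple[npt.NDArray, pd.Series]:
    # custom per-population metric as defined in the snippet above
    return population_distance(input_data, populations, cosine_distances)

if __name__ == "__main__":
    # same toy dataset as in the MDS example above
    geno_data = np.asarray([
        [1, 0, 2, 0, 2, 0, 2, 0, 1, 2],
        [1, 1, 1, 0, 1, 0, 2, 0, 2, 2],
        [1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
        [0, 1, 0, 2, 0, 1, 1, 1, 0, 0],
        [0, 2, 1, 2, 0, 1, 0, 2, 1, 1]]
    )
    sample_ids = pd.Series(["SAMPLE0", "SAMPLE1", "SAMPLE2", "SAMPLE3", "SAMPLE4"])
    populations = pd.Series(["pop0", "pop1", "pop2", "pop3", "pop4"])
    dataset = NumpyDataset(geno_data, sample_ids, populations)

    # same MDS bootstrap analysis as above, just with the custom metric as distance_metric
    bootstrap_replicates = bootstrap_and_embed_multiple_numpy(
        dataset=dataset,
        n_bootstraps=10,
        embedding=EmbeddingAlgorithm.MDS,
        distance_metric=cosine_population_distance,  # the custom metric defined above
        n_components=2,
        seed=42,
        threads=2
    )
    batch_comparison = BatchEmbeddingComparison([b.mds for b in bootstrap_replicates])
    print("Pandora Stability (PS): ", round(batch_comparison.compare(), 2))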