A package to cluster and visualise MS/MS spectral data
- Project owner: Catherine Rawlinson (PhD candidate)
- Email: [email protected]
BioDendro is a metabolomics package and workflow that enables analysts to flexibly cluster and interrogate thousands of MS/MS spectra and quickly identify the core fragment patterns causing groupings. This helps identify potential functional properties of components based on core chemical backbones of a larger class, even when the individual metabolite of interest is not found in public databases.
BioDendro takes raw MS/MS data in MGF format and a component list. The component list is the full set of analytes within your sample set, each named in the following format:
SampleID_userinfo_userinfo_m/z_RT
Retention time (RT) is given in minutes. For example:
Sample1_pos_C18_123.1234_5.60
Sample2_pos_C19_321.4321_10.60
This list can be generated using XCMS, MZmine2, or proprietary instrument vendor software.
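As a quick check that a component list follows this naming scheme, the names can be split from the right, since m/z and RT are always the last two underscore-separated fields. The sketch below is illustrative only: `Component` and `parse_component` are hypothetical names, not part of the BioDendro API.

```python
from collections import namedtuple

# Hypothetical container for illustration; BioDendro does not expose this type.
Component = namedtuple("Component", ["sample_info", "mz", "rt_minutes"])


def parse_component(name):
    """Split a component name into its free-form prefix, m/z, and RT (minutes)."""
    # Split from the right so that any number of user-info fields
    # in the prefix is left untouched.
    prefix, mz, rt = name.rsplit("_", 2)
    return Component(prefix, float(mz), float(rt))


for name in ["Sample1_pos_C18_123.1234_5.60", "Sample2_pos_C19_321.4321_10.60"]:
    print(parse_component(name))
```

A name whose last two fields are not numbers raises a `ValueError`, which makes malformed entries easy to spot before running the pipeline.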
BioDendro converts the MGF file and component list into a non-redundant list. The component-analyte list is then converted into a data matrix, and analytes are dynamically binned and clustered.
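To give a feel for what binning and distance-based grouping mean here, the following is a minimal pure-Python sketch of the general idea, not BioDendro's actual implementation. It assumes a simple gap-based binning rule and a Jaccard distance on shared bins; the function names, the `tolerance` value, and the example m/z values are all made up for illustration.

```python
def dynamic_bins(mz_values, tolerance=0.005):
    """Group sorted m/z values, opening a new bin whenever the gap
    to the previous value exceeds `tolerance` (an assumed threshold)."""
    bins = []
    for mz in sorted(mz_values):
        if bins and mz - bins[-1][-1] <= tolerance:
            bins[-1].append(mz)
        else:
            bins.append([mz])
    return bins


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two collections of bin indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Made-up fragment m/z values for three analytes.
spectra = {
    "A": [85.0284, 127.0390, 145.0495],
    "B": [85.0286, 127.0392, 163.0601],
    "C": [91.0542, 119.0491],
}

# Bin every fragment across all analytes, then describe each analyte
# by the set of bins its fragments fall into (presence/absence).
bins = dynamic_bins(mz for frags in spectra.values() for mz in frags)
bin_of = {mz: i for i, members in enumerate(bins) for mz in members}
profiles = {name: {bin_of[mz] for mz in frags} for name, frags in spectra.items()}

print(jaccard_distance(profiles["A"], profiles["B"]))  # A and B share two of four bins
print(jaccard_distance(profiles["A"], profiles["C"]))  # A and C share no bins
```

Analytes with small pairwise distances share most of their fragment bins, which is the kind of grouping a clustering step then makes explicit.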
- Python version 3.5 or more recent.
- The Python packages numpy, pandas, scipy, matplotlib, plotly, xlrd, xlsxwriter, and pillow (installed automatically).
- We recommend running the pipeline in Jupyter notebooks, and provide example notebooks.
BioDendro is tested to run with Python 3.5-3.7, Plotly 3.8 and 3.9, and Pandas 0.23 and 0.24. Other versions may work.
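If you are unsure whether your interpreter meets the minimum requirement, a short check like the one below can help. It is only a sketch: `MINIMUM_PYTHON` and `is_supported` are hypothetical names, and the check covers the stated minimum of Python 3.5 only, not the tested Plotly or Pandas versions.

```python
import sys

# Minimum version stated in the requirements above.
MINIMUM_PYTHON = (3, 5)


def is_supported(version_info=None):
    """Return True if the given (or running) interpreter is Python 3.5 or newer."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


print("Python version OK:", is_supported())
```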
A detailed guide to installing Anaconda Python, Jupyter, and BioDendro on Windows operating systems is provided in a separate pdf file. Advanced users with some knowledge of Python may also use the command line installation instructions below.
BioDendro can be installed from PyPI using pip, or from Anaconda using conda.
We recommend that users who are less familiar with Python and pip read our INSTALLING_WITH_PIP.md document, which explains things in more detail, including where things will be installed and how to use virtual environments. For details on installing and using conda, see their getting-started guide.
Assuming you have Python 3 installed, you can install BioDendro and its dependencies using pip:
python3 -m pip install --user BioDendro
# Only required if you're using the provided notebooks
python3 -m pip install --user jupyter
To install BioDendro and dependencies using conda (assuming you have installed Anaconda):
conda install -c darcyabjones biodendro
# Only required if you're using the provided notebooks
conda install jupyter
Both the BioDendro script and the Python package (which can be used with the notebooks) should now be available to use.
Note that the above commands will not download the example notebooks or data. You can download those files separately, or download the whole repository as recommended in the Windows install guide.
The pipeline is available as a Python package. To run the full pipeline in Python:
import BioDendro
tree = BioDendro.pipeline(
    "Fireflies_MSMS.mgf",
    "Fireflies_feature_list.txt",
    results_dir="my_results_dir"
)
From there you can analyse the results stored in the tree object.
The example jupyter notebooks contain more detailed explanations of different parameters.
quick-start-example.ipynb contains basic information about running the pipelines.
longer-workflow.ipynb contains more detailed information about how the pipeline works, and how you can modify parameters.
We suggest that beginners download the quick-start notebook available here (Right-click, save-as) and modify parameters and files as necessary.
The pipeline is also available as a command line script. This is useful if you're not planning on tweaking the parameters much and just want to run the darn thing.
A list of options can be obtained with the --help (or -h) flag.
BioDendro --help
The minimum options to run the pipeline are the MGF file and a components list. To run the basic pipeline with the same parameters as in the python quickstart:
BioDendro --results-dir my_results_dir MSMS.mgf component_list.txt
If the --results-dir parameter is omitted, the results will be stored in a directory with the current date and time added to the end of its name.
You can change the parameters by supplying additional flags. However, this will run the whole pipeline again, so if you just need to adjust the cutoff or decide to use braycurtis instead of jaccard distances, you might be better off using the Python API.
BioDendro --scaling --cluster-method braycurtis --cutoff 0.5 MSMS.mgf component_list.txt
This would be equivalent to running the following in Python:
tree = BioDendro.pipeline("MSMS.mgf", "component_list.txt", clustering_method="braycurtis", scaling=True, cutoff=0.5)
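If you want to launch the command line script from your own Python code (for example, to loop over several datasets), one option is to build the argument list programmatically. The sketch below uses only the flags shown in this document; `build_command` is a hypothetical helper, not part of BioDendro, and it only constructs the command without running it.

```python
def build_command(mgf, components, results_dir=None, scaling=False,
                  cluster_method=None, cutoff=None):
    """Assemble a BioDendro command line as a list of arguments."""
    cmd = ["BioDendro"]
    if results_dir is not None:
        cmd += ["--results-dir", results_dir]
    if scaling:
        cmd.append("--scaling")
    if cluster_method is not None:
        cmd += ["--cluster-method", cluster_method]
    if cutoff is not None:
        cmd += ["--cutoff", str(cutoff)]
    # Positional arguments: the MGF file, then the component list.
    cmd += [mgf, components]
    return cmd


cmd = build_command("MSMS.mgf", "component_list.txt",
                    scaling=True, cluster_method="braycurtis", cutoff=0.5)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once BioDendro is installed and on your PATH.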