
Commit

Merge branch 'main' of https://github.com/IBM/Hestia
Raúl Fernández Díaz committed May 3, 2024
2 parents ea7ccc1 + 731c036 commit 3f392ae
Showing 8 changed files with 502 additions and 80 deletions.
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
__pycache__/
build/
hestia.egg-info/
*.egg-info/
153 changes: 151 additions & 2 deletions README.md
@@ -1,2 +1,151 @@
# Hestia
Independent evaluation set construction for trustworthy ML models in biochemistry
<div align="center">
<h1>Hestia</h1>

<p>Computational tool for generating evaluation sets that measure how well ML models generalise.</p>

<a href="https://ibm.github.io/Hestia-OOD/"><img alt="Tutorials" src="https://img.shields.io/badge/docs-tutorials-green" /></a>
<a href="https://github.com/IBM/Hestia-OOD/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/IBM/Hestia-OOD" /></a>
<a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/v/hestia-ood" /></a>
<a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/dm/hestia-ood" /></a>

</div>

- **Documentation:** <a href="https://ibm.github.io/Hestia-OOD/" target="_blank">https://ibm.github.io/Hestia-OOD</a>
- **Source Code:** <a href="https://github.com/IBM/Hestia-OOD" target="_blank">https://github.com/IBM/Hestia-OOD</a>
- **Webserver:** <a href="http://peptide.ucd.ie/Hestia" target="_blank">http://peptide.ucd.ie/Hestia</a>
- **Paper Pre-print:** <a href="https://www.biorxiv.org/content/10.1101/2024.03.14.584508v1" target="_blank">https://www.biorxiv.org/content/10.1101/2024.03.14.584508v1</a>

## Contents

<details open markdown="1"><summary><b>Table of Contents</b></summary>

- [Installation Guide](#installation)
- [Documentation](#documentation)
- [Examples](#examples)
- [License](#license)
</details>


## Installation <a name="installation"></a>

Installing in a conda environment is recommended. To create the environment, please run:

```bash
conda create -n hestia python
conda activate hestia
```

### 1. Python Package

#### 1.1. From PyPI


```bash
pip install hestia-ood
```

#### 1.2. Directly from source

```bash
pip install git+https://github.com/IBM/Hestia-OOD
```

### 2. Third-party dependencies

To use MMseqs2 as the alignment algorithm, it needs to be installed in the environment:

```bash
conda install -c bioconda mmseqs2
```

To use Needleman-Wunsch alignment (EMBOSS):

```bash
conda install -c bioconda emboss
```

If you are not installing in a conda environment, please check the installation instructions for your platform:

- Linux:
```bash
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
```

```bash
sudo apt install emboss
```

- Windows: Download binaries from [EMBOSS](https://emboss.sourceforge.net/download/) and [MMSeqs2-latest](https://mmseqs.com/latest/mmseqs-win64.zip)

- Mac:
```bash
sudo port install emboss
brew install mmseqs2
```
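Regardless of the installation route, a quick sanity check confirms that the binaries are discoverable from Python. This is a minimal sketch, not part of Hestia itself; `check_alignment_tools` is a hypothetical helper name:

```python
import shutil

def check_alignment_tools():
    """Report which optional alignment binaries are on PATH."""
    # `needle` is the EMBOSS Needleman-Wunsch executable.
    return {tool: shutil.which(tool) for tool in ("mmseqs", "needle")}

for tool, path in check_alignment_tools().items():
    print(f"{tool}: {path or 'not found'}")
```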

## Documentation <a name="documentation"></a>

### 1. Similarity calculation

Calculating pairwise similarity between the entities within a DataFrame `df_query` or between two DataFrames `df_query` and `df_target` can be achieved through the `calculate_similarity` function:

```python
from hestia.similarity import calculate_similarity
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = calculate_similarity(df_query, species='protein',
                              similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
```

More details about similarity calculation can be found in the [Similarity calculation documentation](https://ibm.github.io/Hestia-OOD/similarity/).
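The returned `sim_df` is a long-format table of pairwise scores; judging from the clustering code changed in this commit, it carries `query`, `target`, and `metric` columns. A minimal sketch of filtering such a table by a similarity threshold, using mock data rather than a real Hestia output:

```python
import pandas as pd

# Mock of the long-format similarity table: one row per (query, target) pair.
sim_df = pd.DataFrame({
    'query':  [0, 0, 1, 2],
    'target': [1, 2, 2, 3],
    'metric': [0.95, 0.20, 0.85, 0.10],
})

# Keep only pairs above a similarity threshold, as the clustering step does.
threshold = 0.3
close_pairs = sim_df[sim_df['metric'] > threshold]
print(close_pairs)
```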

### 2. Clustering

Clustering the entities within a DataFrame `df` can be achieved through the `generate_clusters` function:

```python
from hestia.similarity import calculate_similarity
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = calculate_similarity(df, species='protein',
                              similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithms='CDHIT')
```

There are three clustering algorithms currently supported: `CDHIT`, `greedy_cover_set`, or `connected_components`. More details about clustering can be found in the [Clustering documentation](https://ibm.github.io/Hestia-OOD/clustering/).
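As an illustration of the `connected_components` option, the underlying idea (mirroring the `_connected_components_clustering` helper touched in this commit) is to treat above-threshold similarities as graph edges and take each connected component as one cluster. A minimal sketch with a mock adjacency matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Adjacency over 4 entities: only pairs 0-1 and 2-3 are similar above threshold.
rows, cols = [0, 2], [1, 3]
matrix = csr_matrix((np.ones(2), (rows, cols)), shape=(4, 4))

# Same call the commit's `_connected_components_clustering` makes.
n, labels = connected_components(matrix, directed=False, return_labels=True)
print(n, labels)  # two components: {0, 1} and {2, 3}
```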


### 3. Partitioning

Partitioning the entities within a DataFrame `df` into training and evaluation subsets can be achieved through 4 different functions: `cc_part`, `graph_part`, `reduction_partition`, and `random_partition`. An example of how `cc_part` would be used is:

```python
from hestia.partition import cc_part
import pandas as pd

df = pd.read_csv('example.csv')
train, test = cc_part(df, species='protein',
                      similarity_metric='mmseqs+prefilter',
                      field_name='sequence', threshold=0.3, test_size=0.2)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
```
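The partitioning functions share one underlying idea: similar entities (as grouped by the clustering step) should land on the same side of the train/test boundary. A toy sketch of that principle, assuming a `cluster id -> members` mapping; `cluster_aware_split` is a hypothetical helper for illustration, not Hestia's implementation:

```python
import random

def cluster_aware_split(clusters, test_size, seed=42):
    """Assign whole clusters to the test set until ~test_size of items is reached.

    `clusters` maps cluster id -> list of entity indices. Keeping clusters
    intact prevents near-duplicate entities from straddling the split.
    Illustrative only; not Hestia's actual `cc_part` code.
    """
    total = sum(len(members) for members in clusters.values())
    order = sorted(clusters)
    random.Random(seed).shuffle(order)
    train, test = [], []
    for cid in order:
        bucket = test if len(test) < test_size * total else train
        bucket.extend(clusters[cid])
    return train, test

train_idx, test_idx = cluster_aware_split({0: [0, 1], 1: [2], 2: [3, 4, 5]},
                                          test_size=0.2)
```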

License <a name="license"></a>
-------
Hestia is an open-source software licensed under the MIT License. See the details in the [LICENSE](https://github.com/IBM/Hestia/blob/master/LICENSE) file.

37 changes: 23 additions & 14 deletions hestia/clustering.py
@@ -2,6 +2,7 @@

import pandas as pd
from scipy.sparse.csgraph import connected_components
from tqdm import tqdm

from hestia.similarity import sim_df2mtx

@@ -78,16 +79,20 @@ def _greedy_incremental_clustering(
clustered = set()
sim_df = sim_df[sim_df['metric'] > threshold]

for i in df.index:
in_cluster = set(sim_df.loc[sim_df['query'] == i, 'target'])
in_cluster.update(set(sim_df.loc[sim_df['target'] == i, 'query']))
if verbose > 2:
pbar = tqdm(df.index)
else:
pbar = df.index

for i in pbar:
if i in clustered:
continue
in_cluster = set(sim_df.loc[sim_df['query'] == i, 'target'])
in_cluster.update(set(sim_df.loc[sim_df['target'] == i, 'query']))
in_cluster.update(set([i]))
in_cluster = in_cluster.difference(clustered)

for j in in_cluster:
if i == j:
continue
clusters.append({
'cluster': i,
'member': j
@@ -99,7 +104,7 @@
if verbose > 1:
print('Clustering has generated:',
f'{len(cluster_df.cluster.unique()):,d} clusters for',
f'{len(df):,} entities')
f'{len(cluster_df):,} entities')
return cluster_df


@@ -111,7 +116,7 @@ def _greedy_cover_set(
) -> pd.DataFrame:
def _find_connectivity(df, sim_df):
neighbours = []
for i in df.index:
for i in tqdm(df.index):
in_cluster = set(sim_df.loc[sim_df['query'] == i, 'target'])
in_cluster.update(set(sim_df.loc[sim_df['target'] == i, 'query']))
neighbours.append(in_cluster)
@@ -124,15 +129,19 @@ def _find_connectivity(df, sim_df):

clusters = []
clustered = set()
if verbose > 2:
pbar = tqdm(df.index)
else:
pbar = df.index

for i in df.index:
in_cluster = neighbours.pop(0)

for i in pbar:
if i in clustered:
continue
in_cluster = neighbours.pop(0)
in_cluster.update([i])
in_cluster = in_cluster.difference(clustered)

for j in in_cluster:
if i == j:
continue
clusters.append({
'cluster': i,
'member': j
@@ -144,7 +153,7 @@ def _find_connectivity(df, sim_df):
if verbose > 1:
print('Clustering has generated:',
f'{len(cluster_df.cluster.unique()):,d} clusters for',
f'{len(df):,} entities')
f'{len(cluster_df):,} entities')
return cluster_df


@@ -158,7 +167,7 @@ def _connected_components_clustering(
n, labels = connected_components(matrix, directed=False,
return_labels=True)
cluster_df = [{'cluster': labels[i],
'member': i} for i in df.index]
'member': i} for i in range(labels.shape[0])]
if verbose > 0:
print('Clustering has generated:',
f'{n:,d} connected components for',
