
Commit

Merge branch 'main' of https://github.com/IBM/Hestia
Raúl Fernández Díaz committed May 3, 2024
2 parents ea7ccc1 + 731c036 commit 3f392ae
Showing 8 changed files with 502 additions and 80 deletions.
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
__pycache__/
build/
hestia.egg-info/
*.egg-info/
153 changes: 151 additions & 2 deletions README.md
@@ -1,2 +1,151 @@
# Hestia
Independent evaluation set construction for trustworthy ML models in biochemistry
<div align="center">
<h1>Hestia</h1>

<p>Computational tool for generating evaluation sets that measure how well ML models generalise.</p>

<a href="https://ibm.github.io/Hestia-OOD/"><img alt="Tutorials" src="https://img.shields.io/badge/docs-tutorials-green" /></a>
<a href="https://github.com/IBM/Hestia-OOD/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/IBM/Hestia-OOD" /></a>
<a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/v/hestia-ood" /></a>
<a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/dm/hestia-ood" /></a>

</div>

- **Documentation:** <a href="https://ibm.github.io/Hestia-OOD/" target="_blank">https://ibm.github.io/Hestia-OOD</a>
- **Source Code:** <a href="https://github.com/IBM/Hestia-OOD" target="_blank">https://github.com/IBM/Hestia-OOD</a>
- **Webserver:** <a href="http://peptide.ucd.ie/Hestia" target="_blank">http://peptide.ucd.ie/Hestia</a>
- **Paper Pre-print:** <a href="https://www.biorxiv.org/content/10.1101/2024.03.14.584508v1" target="_blank">https://www.biorxiv.org/content/10.1101/2024.03.14.584508v1</a>

## Contents

<details open markdown="1"><summary><b>Table of Contents</b></summary>

- [Installation Guide](#installation)
- [Documentation](#documentation)
- [Examples](#examples)
- [License](#license)
</details>


## Installation <a name="installation"></a>

Installing in a conda environment is recommended. To create the environment, please run:

```bash
conda create -n hestia python
conda activate hestia
```

### 1. Python Package

#### 1.1. From PyPI


```bash
pip install hestia-ood
```

#### 1.2. Directly from source

```bash
pip install git+https://github.com/IBM/Hestia-OOD
```

### 2. Third-party dependencies

To use MMseqs2 as the alignment algorithm, it needs to be installed in the environment:

```bash
conda install -c bioconda mmseqs2
```

To use Needleman-Wunsch alignment (EMBOSS):

```bash
conda install -c bioconda emboss
```

If you are not installing in a conda environment, please check the installation instructions for your platform:

- Linux:
```bash
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
```

```bash
sudo apt install emboss
```

- Windows: Download binaries from [EMBOSS](https://emboss.sourceforge.net/download/) and [MMSeqs2-latest](https://mmseqs.com/latest/mmseqs-win64.zip)

- Mac:
```bash
sudo port install emboss
brew install mmseqs2
```
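Regardless of the installation route, a quick sanity check confirms that the binaries are discoverable from Python. This is a minimal sketch, not part of Hestia itself; `check_alignment_tools` is a hypothetical helper name:

```python
import shutil

def check_alignment_tools():
    """Report which optional alignment binaries are on PATH."""
    # `needle` is the EMBOSS Needleman-Wunsch executable.
    return {tool: shutil.which(tool) for tool in ("mmseqs", "needle")}

for tool, path in check_alignment_tools().items():
    print(f"{tool}: {path or 'not found'}")
```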

## Documentation <a name="documentation"></a>

### 1. Similarity calculation

Calculating pairwise similarity between the entities within a DataFrame `df_query` or between two DataFrames `df_query` and `df_target` can be achieved through the `calculate_similarity` function:

```python
from hestia.similarity import calculate_similarity
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = calculate_similarity(df_query, species='protein',
                              similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
```

More details about similarity calculation can be found in the [Similarity calculation documentation](https://ibm.github.io/Hestia-OOD/similarity/).
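The returned `sim_df` is a long-format table of pairwise scores; judging from the clustering code changed in this commit, it carries `query`, `target`, and `metric` columns. A minimal sketch of filtering such a table by a similarity threshold, using mock data rather than a real Hestia output:

```python
import pandas as pd

# Mock of the long-format similarity table: one row per (query, target) pair.
sim_df = pd.DataFrame({
    'query':  [0, 0, 1, 2],
    'target': [1, 2, 2, 3],
    'metric': [0.95, 0.20, 0.85, 0.10],
})

# Keep only pairs above a similarity threshold, as the clustering step does.
threshold = 0.3
close_pairs = sim_df[sim_df['metric'] > threshold]
print(close_pairs)
```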

### 2. Clustering

Clustering the entities within a DataFrame `df` can be achieved through the `generate_clusters` function:

```python
from hestia.similarity import calculate_similarity
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = calculate_similarity(df, species='protein',
                              similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithms='CDHIT')
```

There are three clustering algorithms currently supported: `CDHIT`, `greedy_cover_set`, or `connected_components`. More details about clustering can be found in the [Clustering documentation](https://ibm.github.io/Hestia-OOD/clustering/).
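As an illustration of the `connected_components` option, the underlying idea (mirroring the `_connected_components_clustering` helper touched in this commit) is to treat above-threshold similarities as graph edges and take each connected component as one cluster. A minimal sketch with a mock adjacency matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Adjacency over 4 entities: only pairs 0-1 and 2-3 are similar above threshold.
rows, cols = [0, 2], [1, 3]
matrix = csr_matrix((np.ones(2), (rows, cols)), shape=(4, 4))

# Same call the commit's `_connected_components_clustering` makes.
n, labels = connected_components(matrix, directed=False, return_labels=True)
print(n, labels)  # two components: {0, 1} and {2, 3}
```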


### 3. Partitioning

Partitioning the entities within a DataFrame `df` into training and evaluation subsets can be achieved through 4 different functions: `cc_part`, `graph_part`, `reduction_partition`, and `random_partition`. An example of how `cc_part` would be used is:

```python
from hestia.partition import cc_part
import pandas as pd

df = pd.read_csv('example.csv')
train, test = cc_part(df, species='protein',
                      similarity_metric='mmseqs+prefilter',
                      field_name='sequence', threshold=0.3, test_size=0.2)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
```
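The partitioning functions share one underlying idea: similar entities (as grouped by the clustering step) should land on the same side of the train/test boundary. A toy sketch of that principle, assuming a `cluster id -> members` mapping; `cluster_aware_split` is a hypothetical helper for illustration, not Hestia's implementation:

```python
import random

def cluster_aware_split(clusters, test_size, seed=42):
    """Assign whole clusters to the test set until ~test_size of items is reached.

    `clusters` maps cluster id -> list of entity indices. Keeping clusters
    intact prevents near-duplicate entities from straddling the split.
    Illustrative only; not Hestia's actual `cc_part` code.
    """
    total = sum(len(members) for members in clusters.values())
    order = sorted(clusters)
    random.Random(seed).shuffle(order)
    train, test = [], []
    for cid in order:
        bucket = test if len(test) < test_size * total else train
        bucket.extend(clusters[cid])
    return train, test

train_idx, test_idx = cluster_aware_split({0: [0, 1], 1: [2], 2: [3, 4, 5]},
                                          test_size=0.2)
```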

License <a name="license"></a>
-------
Hestia is an open-source software licensed under the MIT License. See the details in the [LICENSE](https://github.com/IBM/Hestia/blob/master/LICENSE) file.

37 changes: 23 additions & 14 deletions hestia/clustering.py
@@ -2,6 +2,7 @@

import pandas as pd
from scipy.sparse.csgraph import connected_components
from tqdm import tqdm

from hestia.similarity import sim_df2mtx

@@ -78,16 +79,20 @@ def _greedy_incremental_clustering(
clustered = set()
sim_df = sim_df[sim_df['metric'] > threshold]

for i in df.index:
in_cluster = set(sim_df.loc[sim_df['query'] == i, 'target'])
in_cluster.update(set(sim_df.loc[sim_df['target'] == i, 'query']))
if verbose > 2:
pbar = tqdm(df.index)
else:
pbar = df.index

for i in pbar:
if i in clustered:
continue
in_cluster = set(sim_df.loc[sim_df['query'] == i, 'target'])
in_cluster.update(set(sim_df.loc[sim_df['target'] == i, 'query']))
in_cluster.update(set([i]))
in_cluster = in_cluster.difference(clustered)

for j in in_cluster:
if i == j:
continue
clusters.append({
'cluster': i,
'member': j
@@ -99,7 +104,7 @@
if verbose > 1:
print('Clustering has generated:',
f'{len(cluster_df.cluster.unique()):,d} clusters for',
f'{len(df):,} entities')
f'{len(cluster_df):,} entities')
return cluster_df


@@ -111,7 +116,7 @@ def _greedy_cover_set(
) -> pd.DataFrame:
def _find_connectivity(df, sim_df):
neighbours = []
for i in df.index:
for i in tqdm(df.index):
in_cluster = set(sim_df.loc[sim_df['query'] == i, 'target'])
in_cluster.update(set(sim_df.loc[sim_df['target'] == i, 'query']))
neighbours.append(in_cluster)
@@ -124,15 +129,19 @@ def _find_connectivity(df, sim_df):

clusters = []
clustered = set()
if verbose > 2:
pbar = tqdm(df.index)
else:
pbar = df.index

for i in df.index:
in_cluster = neighbours.pop(0)

for i in pbar:
if i in clustered:
continue
in_cluster = neighbours.pop(0)
in_cluster.update([i])
in_cluster = in_cluster.difference(clustered)

for j in in_cluster:
if i == j:
continue
clusters.append({
'cluster': i,
'member': j
@@ -144,7 +153,7 @@ def _find_connectivity(df, sim_df):
if verbose > 1:
print('Clustering has generated:',
f'{len(cluster_df.cluster.unique()):,d} clusters for',
f'{len(df):,} entities')
f'{len(cluster_df):,} entities')
return cluster_df


@@ -158,7 +167,7 @@ def _connected_components_clustering(
n, labels = connected_components(matrix, directed=False,
return_labels=True)
cluster_df = [{'cluster': labels[i],
'member': i} for i in df.index]
'member': i} for i in range(labels.shape[0])]
if verbose > 0:
print('Clustering has generated:',
f'{n:,d} connected components for',
