Get input data for TCGA (MSAs and aflow conformations + ligands from our dbs) #99

Closed
Tracked by #95
jyaacoub opened this issue May 16, 2024 · 3 comments · Fixed by #119
jyaacoub commented May 16, 2024

As mentioned in #95, we should only focus on datasets that already have a GVPL dataset created.

  • Once the dataset is created, DO NOT retrain on a new dataset. Doing so would change which proteins count as "test set" proteins and would invalidate our analysis.
  • Specifically, this concerns the kiba dataset, since we are still getting aflow conformations for that one.
@jyaacoub jyaacoub mentioned this issue May 16, 2024
5 tasks
jyaacoub added a commit that referenced this issue May 16, 2024
Since we dont have a kiba GVPL dataset yet...
@jyaacoub

Getting the right protein sequences to focus on:

Using the TCGA analysis script (at this commit for playground.py), we only get protein sequences for the dbs that we have GVPL datasets for (not kiba).

This gives us the following stats for just **davis** and **pdbbind**:

Filter #1 (seq_len)     :   724 -   562 =   162
Filter #2 (ref_AA match):   160 -    42 =   118

 davis      104
PDBbind     14
Name: db, dtype: int64
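The two filters above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the logic in playground.py: the column names `seq_len_x` and `ref_AA`, as well as the exact filter conditions, are assumptions inferred from the filter labels and from the columns used later in this issue.

```python
import pandas as pd

# Toy stand-in for the merged TCGA/db dataframe (all values are made up).
df = pd.DataFrame({
    'db':        ['davis', 'davis', 'PDBbind'],
    'prot_seq':  ['MKTAY', 'MKTAY', 'GSHMA'],
    'seq_len_x': [5, 7, 5],        # assumed: protein length reported on the TCGA side
    'seq_len_y': [5, 5, 5],        # length of 'prot_seq' in our db
    'mt_loc':    [3, 3, 2],        # 1-indexed mutation position
    'ref_AA':    ['T', 'T', 'A'],  # assumed: reference residue reported by TCGA
})

# Filter #1 (assumed): both sources must agree on the sequence length.
f1 = df[df['seq_len_x'] == df['seq_len_y']]

# Filter #2 (assumed): the residue at mt_loc must match the reported reference residue.
matches = f1.apply(lambda r: r['prot_seq'][r['mt_loc'] - 1] == r['ref_AA'], axis=1)
f2 = f1[matches]

print(f"Filter #1 (seq_len)     : {len(df):5d} - {len(df) - len(f1):5d} = {len(f1):5d}")
print(f"Filter #2 (ref_AA match): {len(f1):5d} - {len(f1) - len(f2):5d} = {len(f2):5d}")
print(f2['db'].value_counts())
```

Checking the reference residue before mutating guards against position mismatches between the TCGA annotation and the sequence stored in our db.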


@jyaacoub

Get mutated sequences

This is simple: the code is a single .apply call on the resulting dataframe:

Code to get 'mt_seq' col

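def apply_mut(row):
    # 'mt_loc' is a 1-indexed position, so subtract 1 to index into the sequence
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc']-1] = row['mt_AA']
    return ''.join(ref_seq)

# add the mutated sequence as a new column
dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)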
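As a quick sanity check, the same function can be run on a toy dataframe. The sequences and mutations below are made up for illustration; only the column names come from the snippet above.

```python
import pandas as pd

def apply_mut(row):
    # 'mt_loc' is 1-indexed, hence the -1
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc'] - 1] = row['mt_AA']
    return ''.join(ref_seq)

# Toy rows (hypothetical data, not taken from TCGA).
dfm = pd.DataFrame({
    'prot_seq': ['MKTAYIAK', 'GSHMLEDP'],
    'mt_loc':   [3, 8],
    'mt_AA':    ['A', 'L'],
})
dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)

# Each mutated sequence should keep its length and differ at exactly one position.
for ref, mt in zip(dfm['prot_seq'], dfm['mt_seq']):
    diffs = [i + 1 for i, (a, b) in enumerate(zip(ref, mt)) if a != b]
    assert len(ref) == len(mt) and len(diffs) == 1
    print(ref, '->', mt, 'changed positions:', diffs)
```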


jyaacoub commented May 16, 2024

Run MSA

Using HHblits via our MSARunner class, we can get sequence alignments against our reference UniRef30_2020_06 database.
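For orientation, a direct HHblits call could look roughly like the sketch below. This is not MSARunner's code: `hhblits_cmd` is a hypothetical helper, the file names and database path are placeholders, and the number of iterations is an assumed value. Only the `hhblits` command-line flags themselves are standard.

```python
import shlex

def hhblits_cmd(in_fp, out_fp, db_prefix, n_cpus=6, n_iter=2):
    """Build an hhblits command line (hypothetical helper, not MSARunner's code)."""
    return [
        'hhblits',
        '-i', in_fp,          # query FASTA
        '-oa3m', out_fp,      # write the resulting MSA in a3m format
        '-o', '/dev/null',    # discard the .hhr hit list
        '-d', db_prefix,      # database prefix (placeholder path below)
        '-cpu', str(n_cpus),
        '-n', str(n_iter),    # number of search iterations (assumed value)
    ]

# Placeholder file names and database location.
cmd = hhblits_cmd('0-PROT.fasta', '0-PROT.a3m', '/path/to/UniRef30_2020_06')
print(shlex.join(cmd))
# To actually run it (requires hhblits on PATH): subprocess.run(cmd, check=True)
```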

Submit as a batch array job:


# %%
from src.utils.seq_alignment import MSARunner
from tqdm import tqdm
import pandas as pd
import os

DATA_DIR = '/cluster/home/t122995uhn/projects/data/tcga'
CSV = f'{DATA_DIR}/tcga_maf_davis_pdbbind.csv'
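N_CPUS = 6
NUM_ARRAYS = 10
# index of this array task; falls back to 0 when run outside of a SLURM array job
array_idx = int(os.environ.get('SLURM_ARRAY_TASK_ID', 0))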

df = pd.read_csv(CSV, index_col=0)
df.sort_values(by='seq_len_y', inplace=True)


# %%
for DB in df.db.unique():
    print('DB', DB)
    RAW_DIR = f'{DATA_DIR}/{DB}'
    # should already be unique if these are proteins mapped from tcga!
    unique_df = df[df['db'] == DB]
    ########################## Get job partition
    partition_size = len(unique_df) / NUM_ARRAYS
    start, end = int(array_idx*partition_size), int((array_idx+1)*partition_size)

    unique_df = unique_df[start:end]

    #################################### create fastas
    fa_dir = os.path.join(RAW_DIR, f'{DB}_fa')
    fasta_fp = lambda idx,pid: os.path.join(fa_dir, f"{idx}-{pid}.fasta")
    os.makedirs(fa_dir, exist_ok=True)
    for idx, (prot_id, pro_seq) in tqdm(
                    unique_df[['prot_id', 'prot_seq']].iterrows(), 
                    desc='Creating fastas',
                    total=len(unique_df)):
        with open(fasta_fp(idx,prot_id), "w") as f:
            f.write(f">{prot_id},{idx},{DB}\n{pro_seq}")

    ##################################### Run hhblits
    aln_dir = os.path.join(RAW_DIR, f'{DB}_aln')
    aln_fp = lambda idx,pid: os.path.join(aln_dir, f"{idx}-{pid}.a3m")
    os.makedirs(aln_dir, exist_ok=True)

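    # Finally, run hhblits on the fasta files written above. Note that those files
    # contain 'prot_seq'; the 'mt_seq' value unpacked in this loop is never used.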
    for idx, (prot_id, pro_seq) in tqdm(
                    unique_df[['prot_id', 'mt_seq']].iterrows(), 
                    desc='Running hhblits',
                    total=len(unique_df)):
        in_fp = fasta_fp(idx,prot_id)
        out_fp = aln_fp(idx,prot_id)
        
        if not os.path.isfile(out_fp):
            MSARunner.hhblits(in_fp, out_fp, n_cpus=N_CPUS)

print('DONE!')
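The partitioning arithmetic in the script can be checked in isolation. The sketch below reuses the same start/end computation and confirms that, for 104 rows (the davis count reported earlier) split across 10 array tasks, every row is assigned to exactly one task. `partition_bounds` is a hypothetical helper introduced only for this check.

```python
NUM_ARRAYS = 10

def partition_bounds(n_rows, array_idx, num_arrays=NUM_ARRAYS):
    # Same arithmetic as in the script above.
    partition_size = n_rows / num_arrays
    return int(array_idx * partition_size), int((array_idx + 1) * partition_size)

n_rows = 104  # number of davis rows reported earlier in this issue
bounds = [partition_bounds(n_rows, i) for i in range(NUM_ARRAYS)]

# Flatten the row positions handled by each task and compare against the full range.
covered = [j for start, end in bounds for j in range(start, end)]
assert covered == list(range(n_rows)), "some rows are skipped or duplicated"
print(bounds)
```

Because the bounds are truncated with int(), it is worth re-running this check for each dataset size. The array job itself would need to be submitted with task IDs 0 through NUM_ARRAYS - 1 for all partitions to be processed.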

jyaacoub added a commit that referenced this issue May 16, 2024
jyaacoub added a commit that referenced this issue Jun 19, 2024
@jyaacoub jyaacoub linked a pull request Jul 11, 2024 that will close this issue