Get input data for TCGA (MSAs and aflow conformations + ligands from our dbs) #99

jyaacoub · 2024-05-16T17:56:08Z

As mentioned in #95, we should only focus on datasets that already have a GVPL dataset created.

Once the dataset is created, DO NOT retrain on a new dataset. Doing so would change which proteins are considered "test set" proteins, which would invalidate our analysis.
Specifically, this applies to the kiba dataset, since we are still getting aflow conformations for that one.
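
One possible way to keep the test set stable is to persist the test protein IDs once and reload them on every later run. This is only a sketch: `load_or_create_test_ids`, the JSON file, and the `frac`/`seed` parameters are hypothetical names for illustration, not the project's actual split code.

```python
import json
import os
import random


def load_or_create_test_ids(path, prot_ids, frac=0.1, seed=0):
    """Return the saved test-set protein IDs, creating them only on first use."""
    # If a split was already saved, always reuse it (never re-sample).
    if os.path.isfile(path):
        with open(path) as f:
            return set(json.load(f))

    # First run: sample a fixed fraction of the unique protein IDs.
    rng = random.Random(seed)
    ids = sorted(set(prot_ids))
    n_test = max(1, int(len(ids) * frac))
    test_ids = rng.sample(ids, n_test)

    with open(path, "w") as f:
        json.dump(sorted(test_ids), f)
    return set(test_ids)
```

With this pattern, recreating a dataset later (e.g. once more conformations arrive) would not silently move proteins between train and test, because the saved file takes precedence over any new sampling.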

Since we don't have a kiba GVPL dataset yet...

jyaacoub · 2024-05-16T18:25:46Z

Getting the right protein sequences to focus on:

Using the TCGA analysis script (at this commit for playground.py), we only get protein sequences for the dbs that we have GVPL datasets for (not kiba).

This gives us the following stats for just **davis** and **pdbbind**:

Filter #1 (seq_len)     :   724 -   562 =   162
Filter #2 (ref_AA match):   160 -    42 =   118

davis      104
PDBbind     14
Name: db, dtype: int64
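
For reference, here is a rough sketch of what the two filters could look like. The actual filtering lives in playground.py and is not shown in this thread, so the column names `seq_len_x`, `seq_len_y` and `ref_AA`, as well as the helper `filter_mutations`, are assumptions made for illustration only.

```python
import pandas as pd


def filter_mutations(df):
    """Apply the two filters in order and return both intermediate frames."""
    # Filter #1 (seq_len): keep rows where the two reported sequence lengths agree.
    f1 = df[df["seq_len_x"] == df["seq_len_y"]]

    # Filter #2 (ref_AA match): the residue at mt_loc (1-indexed) in the
    # dataset sequence must equal the reference amino acid of the mutation.
    def ref_matches(row):
        i = row["mt_loc"] - 1
        seq = row["prot_seq"]
        return 0 <= i < len(seq) and seq[i] == row["ref_AA"]

    mask = pd.Series([ref_matches(r) for _, r in f1.iterrows()],
                     index=f1.index, dtype=bool)
    f2 = f1.loc[mask]
    return f1, f2


toy = pd.DataFrame({
    "prot_seq": ["MKTAY", "MKTAY", "MKT"],
    "seq_len_x": [5, 5, 4],
    "seq_len_y": [5, 5, 3],
    "mt_loc": [2, 3, 1],
    "ref_AA": ["K", "A", "M"],
})
f1, f2 = filter_mutations(toy)
print(len(toy), len(f1), len(f2))
```

In the toy frame, the third row fails the length filter and the second row fails the reference-residue check, so only the first row survives both.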

jyaacoub · 2024-05-16T18:34:04Z

Get mutated sequences

This is simple: the code is a single .apply call on the dataframe produced by the filtering step above:

Code to get 'mt_seq' col

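def apply_mut(row):
    # substitute the mutant residue into a copy of the reference sequence
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc']-1] = row['mt_AA']  # mt_loc is treated as 1-indexed
    return ''.join(ref_seq)

# one mutated sequence per row
dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)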
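
A quick worked example on a toy dataframe, to make the 1-indexed `mt_loc` convention concrete. The sequences and mutations below are made up for illustration.

```python
import pandas as pd


def apply_mut(row):
    # Same logic as above: replace the residue at position mt_loc (1-indexed).
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc'] - 1] = row['mt_AA']
    return ''.join(ref_seq)


dfm = pd.DataFrame({
    'prot_seq': ['MKTAYIAK', 'MKTAYIAK'],
    'mt_loc': [1, 4],      # first residue, then fourth residue
    'mt_AA': ['V', 'G'],
})
dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)
print(dfm['mt_seq'].tolist())  # ['VKTAYIAK', 'MKTGYIAK']
```

Note that a `mt_loc` of 0 would index position -1 and silently overwrite the last residue, so this relies on the ref_AA filter having already removed bad locations.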

jyaacoub · 2024-05-16T18:37:27Z

Run MSA

Using HHBlits via our MSARunner class, we can get sequence alignments against our reference UniRef30_2020_06 database.

See Improve davis and kiba MSA input files #78 for examples of how this was run on davis.
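
For context, a sketch of what an HHblits call typically looks like when built by hand. The internals of `MSARunner.hhblits` are not shown in this thread, so the helper `hhblits_cmd`, the database path, and the iteration count here are assumptions; only the `hhblits` flags themselves (`-i`, `-oa3m`, `-d`, `-cpu`, `-n`) are standard HH-suite options.

```python
import shlex


def hhblits_cmd(in_fp, out_a3m, db_prefix, n_cpus=6, n_iter=2):
    """Build (but do not run) an hhblits command line as a list of arguments."""
    return [
        "hhblits",
        "-i", in_fp,          # input FASTA with a single query sequence
        "-oa3m", out_a3m,     # write the resulting MSA in a3m format
        "-d", db_prefix,      # database prefix, e.g. a UniRef30 release
        "-cpu", str(n_cpus),  # number of threads
        "-n", str(n_iter),    # number of search iterations
    ]


# Hypothetical paths, for illustration only.
cmd = hhblits_cmd("0-P00533.fasta", "0-P00533.a3m", "/path/to/UniRef30_2020_06")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` on a machine where HH-suite is installed.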

Submit as a batch array job

Details

# %%
from src.utils.seq_alignment import MSARunner
from tqdm import tqdm
import pandas as pd
import os

DATA_DIR = '/cluster/home/t122995uhn/projects/data/tcga'
CSV = f'{DATA_DIR}/tcga_maf_davis_pdbbind.csv'
N_CPUS=6
NUM_ARRAYS = 10
array_idx = int(os.environ.get('SLURM_ARRAY_TASK_ID', 0))  # defaults to 0 outside a SLURM array job

df = pd.read_csv(CSV, index_col=0)
df.sort_values(by='seq_len_y', inplace=True)


# %%
for DB in df.db.unique():
    print('DB', DB)
    RAW_DIR = f'{DATA_DIR}/{DB}'
    # should already be unique if these are proteins mapped from tcga!
    unique_df = df[df['db'] == DB]
    ########################## Get job partition
    partition_size = len(unique_df) / NUM_ARRAYS
    start, end = int(array_idx*partition_size), int((array_idx+1)*partition_size)

    unique_df = unique_df[start:end]

    #################################### create fastas
    fa_dir = os.path.join(RAW_DIR, f'{DB}_fa')
    fasta_fp = lambda idx,pid: os.path.join(fa_dir, f"{idx}-{pid}.fasta")
    os.makedirs(fa_dir, exist_ok=True)
    # write the mutated sequence ('mt_seq') so that hhblits aligns the mutant protein
    for idx, (prot_id, pro_seq) in tqdm(
                    unique_df[['prot_id', 'mt_seq']].iterrows(),
                    desc='Creating fastas',
                    total=len(unique_df)):
        with open(fasta_fp(idx,prot_id), "w") as f:
            f.write(f">{prot_id},{idx},{DB}\n{pro_seq}")

    ##################################### Run hhblits
    aln_dir = os.path.join(RAW_DIR, f'{DB}_aln')
    aln_fp = lambda idx,pid: os.path.join(aln_dir, f"{idx}-{pid}.a3m")
    os.makedirs(aln_dir, exist_ok=True)

    # finally running
    for idx, (prot_id, pro_seq) in tqdm(
                    unique_df[['prot_id', 'mt_seq']].iterrows(), 
                    desc='Running hhblits',
                    total=len(unique_df)):
        in_fp = fasta_fp(idx,prot_id)
        out_fp = aln_fp(idx,prot_id)
        
        if not os.path.isfile(out_fp):
            MSARunner.hhblits(in_fp, out_fp, n_cpus=N_CPUS)

print('DONE!')
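
To sanity check the job partitioning used in the script, the start/end arithmetic can be pulled into a small function and verified on the row counts reported earlier (104 davis and 14 PDBbind). `partition_bounds` is a hypothetical helper name; the arithmetic mirrors the script.

```python
def partition_bounds(n_rows, num_arrays, array_idx):
    """Return the [start, end) row slice handled by one array task."""
    size = n_rows / num_arrays  # float on purpose, as in the script
    return int(array_idx * size), int((array_idx + 1) * size)


for n_rows in (14, 104):
    bounds = [partition_bounds(n_rows, 10, i) for i in range(10)]
    # Each task's end is the next task's start, so no row is skipped or repeated.
    contiguous = all(bounds[i][1] == bounds[i + 1][0] for i in range(9))
    print(n_rows, bounds[0], bounds[-1], contiguous)
```

Because the partition size is a float, the slices are uneven for small inputs (some tasks get one PDBbind row, others two), but together they still cover every row exactly once for these counts.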

See mention here: #99 (comment)

jyaacoub mentioned this issue May 16, 2024

TCGA analysis #95

Open

5 tasks

jyaacoub added a commit that referenced this issue May 16, 2024

fix(tcga): focus on pdbbind davis #95 #99

985d684

Since we don't have a kiba GVPL dataset yet...

jyaacoub added a commit that referenced this issue May 16, 2024

feat(tcga): msa for tcga muts #99 #95

5340ec8

jyaacoub added a commit that referenced this issue Jun 19, 2024

fix(split): ensuring consistent test set #99

e195864

See mention here: #99 (comment)

jyaacoub added the analysis label Jul 2, 2024

jyaacoub linked a pull request Jul 11, 2024 that will close this issue

development #119

Merged

jyaacoub closed this as completed in #119 Jul 17, 2024
