Get input data for TCGA (MSAs and aflow confirmations + ligands from our dbs) #99
Closed · Tracked by #95
jyaacoub added a commit that referenced this issue on May 16, 2024
Since we don't have a kiba GVPL dataset yet...

**Getting the right protein sequences to focus on:** Using the TCGA analysis script (at this commit for playground.py), we only get protein sequences for the dbs that we have GVPL datasets for (not kiba).
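A minimal sketch of this filtering step, assuming the merged TCGA dataframe carries a `db` column marking the source dataset (the toy values and the `GVPL_DBS` set here are hypothetical, not from the actual pipeline):

```python
import pandas as pd

# Hypothetical merged TCGA dataframe; 'db' marks the source dataset.
df = pd.DataFrame({
    'prot_id': ['P1', 'P2', 'P3'],
    'db': ['davis', 'kiba', 'pdbbind'],
})

# Keep only dbs that already have a GVPL dataset (kiba excluded).
GVPL_DBS = {'davis', 'pdbbind'}
df = df[df['db'].isin(GVPL_DBS)].reset_index(drop=True)
print(df['db'].tolist())  # ['davis', 'pdbbind']
```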
**Get mutated sequences:** This is simple; the code for it is just one `.apply` method on the dataframe that comes out.

Code to get the `mt_seq` column:

```python
def apply_mut(row):
    # Swap the reference residue at the (1-indexed) mutation site for the mutant AA.
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc'] - 1] = row['mt_AA']
    return ''.join(ref_seq)

dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)
```
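As a quick sanity check, the mutation apply can be exercised on a toy row (the sequence and mutation values below are made up for illustration):

```python
import pandas as pd

def apply_mut(row):
    # Replace the reference residue at the (1-indexed) mutation site.
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc'] - 1] = row['mt_AA']
    return ''.join(ref_seq)

# Toy example: mutate position 3 of 'MKTAY' from T to R.
dfm = pd.DataFrame([{'prot_seq': 'MKTAY', 'mt_loc': 3, 'mt_AA': 'R'}])
dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)
print(dfm['mt_seq'][0])  # MKRAY
```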
**Run MSA:** Using HHblits via our `MSARunner`, submit as a batch array job.

Details:
```python
# %%
from src.utils.seq_alignment import MSARunner
from tqdm import tqdm
import pandas as pd
import os

DATA_DIR = '/cluster/home/t122995uhn/projects/data/tcga'
CSV = f'{DATA_DIR}/tcga_maf_davis_pdbbind.csv'
N_CPUS = 6
NUM_ARRAYS = 10
array_idx = 0  # ${SLURM_ARRAY_TASK_ID}

df = pd.read_csv(CSV, index_col=0)
df.sort_values(by='seq_len_y', inplace=True)

# %%
for DB in df.db.unique():
    print('DB', DB)
    RAW_DIR = f'{DATA_DIR}/{DB}'
    # should already be unique if these are proteins mapped from tcga!
    unique_df = df[df['db'] == DB]

    ########################## Get job partition
    partition_size = len(unique_df) / NUM_ARRAYS
    start, end = int(array_idx*partition_size), int((array_idx+1)*partition_size)
    unique_df = unique_df[start:end]

    #################################### create fastas
    fa_dir = os.path.join(RAW_DIR, f'{DB}_fa')
    fasta_fp = lambda idx, pid: os.path.join(fa_dir, f"{idx}-{pid}.fasta")
    os.makedirs(fa_dir, exist_ok=True)
    for idx, (prot_id, pro_seq) in tqdm(
            unique_df[['prot_id', 'prot_seq']].iterrows(),
            desc='Creating fastas',
            total=len(unique_df)):
        with open(fasta_fp(idx, prot_id), "w") as f:
            f.write(f">{prot_id},{idx},{DB}\n{pro_seq}")

    ##################################### Run hhblits
    aln_dir = os.path.join(RAW_DIR, f'{DB}_aln')
    aln_fp = lambda idx, pid: os.path.join(aln_dir, f"{idx}-{pid}.a3m")
    os.makedirs(aln_dir, exist_ok=True)

    # finally running (mt_seq is selected but only prot_id/idx are needed for the paths)
    for idx, (prot_id, mt_seq) in tqdm(
            unique_df[['prot_id', 'mt_seq']].iterrows(),
            desc='Running hhblits',
            total=len(unique_df)):
        in_fp = fasta_fp(idx, prot_id)
        out_fp = aln_fp(idx, prot_id)
        if not os.path.isfile(out_fp):
            MSARunner.hhblits(in_fp, out_fp, n_cpus=N_CPUS)

print('DONE!')
```
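The job-partition arithmetic above gives each SLURM array task a contiguous slice of the sorted dataframe. A standalone sketch of that logic (the row count here is hypothetical) shows the slices tile the dataframe with no gaps or overlaps:

```python
# Sketch of the partition arithmetic: NUM_ARRAYS array tasks each take
# a contiguous slice of the dataframe, indexed by ${SLURM_ARRAY_TASK_ID}.
NUM_ARRAYS = 10
n_rows = 95  # hypothetical dataframe length

def partition(array_idx, n_rows, num_arrays=NUM_ARRAYS):
    size = n_rows / num_arrays
    return int(array_idx * size), int((array_idx + 1) * size)

# Together the slices cover every row exactly once.
bounds = [partition(i, n_rows) for i in range(NUM_ARRAYS)]
assert bounds[0][0] == 0 and bounds[-1][1] == n_rows
assert all(bounds[i][1] == bounds[i + 1][0] for i in range(NUM_ARRAYS - 1))
```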
jyaacoub added a commit that referenced this issue on May 16, 2024 (Merged)
As mentioned in #95, we should only focus on datasets that already have a GVPL dataset created.