TCGA analysis #95
Matching mutations in TCGA to davis
Best way is to use Biomart to get UniProt IDs.
Using Biomart to get UniProtIDs
Map davis protein names to UniProt IDs with Biomart. Since the davis proteins are all human, we can just get the entire database from Biomart and filter it using pandas.
This matches 266 proteins from davis!
Code
#%%
import os
import pandas as pd
from src.data_prep.processors import Processor
root_dir = '../data/DavisKibaDataset/davis'
df = pd.read_csv(f"{root_dir}/nomsa_binary_original_binary/full/XY.csv")
df_unique = df.loc[df[['code']].drop_duplicates().index]
df_unique.drop(['SMILE', 'pkd', 'prot_id'], axis=1, inplace=True)
df_unique['code'] = df_unique['code'].str.upper()
df_unique.columns = ['Gene name', 'prot_seq']
#%%
df_mart = pd.read_csv('../downloads/biomart_hsapiens.tsv', sep='\t')
df_mart = df_mart.loc[df_mart[['Gene name', 'UniProtKB/Swiss-Prot ID']].dropna().index]
df_mart['Gene name'] = df_mart['Gene name'].str.upper()
df_mart = df_mart.drop_duplicates(subset=['UniProtKB/Swiss-Prot ID'])
#%%
dfm = df_unique.merge(df_mart, on='Gene name', how='left')
dfm[['Gene name', 'UniProtKB/Swiss-Prot ID']].to_csv('../downloads/davis_biomart_matches.csv')
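As a quick sanity check on that 266 count (a sketch, assuming the dfm from the merge above and its column names):
#%% how many unique davis proteins ended up with a Swiss-Prot ID
n_total = dfm['Gene name'].nunique()
n_matched = dfm.dropna(subset=['UniProtKB/Swiss-Prot ID'])['Gene name'].nunique()
print(f"matched {n_matched}/{n_total} davis proteins to UniProt IDs")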
Using `Hugo_Symbol`
Using Hugo_Symbol: TCGA MAF files have a "Hugo_Symbol" column, which corresponds to the HUGO gene name for that mutation...
Code
#%%
import pandas as pd
df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna()
df_pid.columns = ['code', 'uniprot']  # CSV was saved with 'Gene name'/'UniProtKB/Swiss-Prot ID' columns; rename to the merge keys used below
#%%
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
df = df.loc[df['SWISSPROT'].dropna().index]
df[['Gene', 'SWISSPROT']].head()
df['uniprot'] = df['SWISSPROT'].str.split('.').str[0]
#%%
dfm = df_pid.merge(df, on='uniprot', how='inner')
# %%
dfm[~(dfm.code == dfm.Hugo_Symbol)]
# %%
df_pid = pd.read_csv('../downloads/davis_pids.csv')[['code']]
dfm_cd = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='left')
Difference:
Code
#%%
import pandas as pd
df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna()
df_pid.columns = ['code', 'uniprot']  # CSV was saved with 'Gene name'/'UniProtKB/Swiss-Prot ID' columns; rename to the merge keys used below
#%%
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
df = df.loc[df['SWISSPROT'].dropna().index]
df[['Gene', 'SWISSPROT']].head()
df['uniprot'] = df['SWISSPROT'].str.split('.').str[0]
#%%
dfm = df_pid.merge(df, on='uniprot', how='inner')
# %%
df_pid = pd.read_csv('../downloads/davis_pids.csv').drop_duplicates(subset='code')[['code']]
dfmh = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='inner')
# %%
print(f"using biomart matched UniProts: {len(dfm)} total")
print(f"Using raw davis HUGO protein name: {len(dfmh)} total")
print(f"biomart finds {len(dfm[~(dfm.code == dfm.Hugo_Symbol)]):3d} extra") |
Things to consider for matching:
Columns for validating results: these come from pathogenicity predictions from Ensembl.
Not as useful:
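A sketch of how those Ensembl/VEP pathogenicity columns could be inspected, assuming this MAF export carries the usual SIFT and PolyPhen strings (e.g. deleterious(0.01), probably_damaging(0.98)):
#%% peek at pathogenicity predictions (assumes 'SIFT' and 'PolyPhen' columns exist in this MAF)
import pandas as pd
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
patho = df[['Hugo_Symbol', 'SIFT', 'PolyPhen']].dropna()
# split the "prediction(score)" strings into prediction and numeric score
patho['SIFT_pred'] = patho['SIFT'].str.extract(r'^(\w+)\(', expand=False)
patho['SIFT_score'] = pd.to_numeric(patho['SIFT'].str.extract(r'\(([\d.]+)\)', expand=False))
patho['PolyPhen_pred'] = patho['PolyPhen'].str.extract(r'^(\w+)\(', expand=False)
print(patho.head())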
Steps for matching dataset UniProt IDs to TCGA mutations
TCGAbiolinks uses https://gdc.cancer.gov/about-data/publications/mc3-2017 to get the MAF files.
1. Gather data for davis, kiba and PDBbind datasets. Need the following for matching, see above comment.
Code to combine all datasets into a single csv: 5696a7a
2. Download ALL TCGA projects as a single MAF
3. Prefiltering TCGA
4. Match UniProt IDs with mutations
5. Post filtering
Code
#%% 1. Gather data for davis, kiba and pdbbind datasets
import os
import pandas as pd
import matplotlib.pyplot as plt
from src.analysis.utils import combine_dataset_pids
from src import config as cfg
df_prots = combine_dataset_pids(dbs=[cfg.DATA_OPT.davis, cfg.DATA_OPT.PDBbind], # WARNING: just davis and pdbbind for now
subset='test')
#%% 2. Load TCGA data
df_tcga = pd.read_csv('../downloads/TCGA_ALL.maf', sep='\t')
#%% 3. Pre filtering
df_tcga = df_tcga[df_tcga['Variant_Classification'] == 'Missense_Mutation']
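# Protein_position in VEP-annotated MAFs is "position/total_protein_length", so the part after '/' is the protein length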
df_tcga['seq_len'] = pd.to_numeric(df_tcga['Protein_position'].str.split('/').str[1])
df_tcga = df_tcga[df_tcga['seq_len'] < 5000]
df_tcga['seq_len'].plot.hist(bins=100, title="sequence length histogram capped at 5K")
plt.show()
df_tcga = df_tcga[df_tcga['seq_len'] < 1200]
df_tcga['seq_len'].plot.hist(bins=100, title="sequence length after capped at 1.2K")
#%% 4. Merging df_prots with TCGA
df_tcga['uniprot'] = df_tcga['SWISSPROT'].str.split('.').str[0]
dfm = df_tcga.merge(df_prots[df_prots.db != 'davis'],
left_on='uniprot', right_on='prot_id', how='inner')
# for davis we have to merge on HUGO_SYMBOLS
dfm_davis = df_tcga.merge(df_prots[df_prots.db == 'davis'],
left_on='Hugo_Symbol', right_on='prot_id', how='inner')
dfm = pd.concat([dfm,dfm_davis], axis=0)
del dfm_davis # to save mem
# %% 5. Post filtering step
# 5.1. Filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms)
# seq_len_x is from tcga, seq_len_y is from our dataset
tmp = len(dfm)
# allow for some error due to missing amino acids from pdb file in PDBbind dataset
# - assumption here is that isoforms will differ by more than 50 amino acids
dfm = dfm[(dfm.seq_len_y <= dfm.seq_len_x) & (dfm.seq_len_x<= dfm.seq_len_y+50)]
print(f"Filter #1 (seq_len) : {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}")
# 5.2. Filter out those that dont have the same reference seq according to the "Protein_position" and "Amino_acids" col
# Extract mutation location and reference amino acid from 'Protein_position' and 'Amino_acids' columns
dfm['mt_loc'] = pd.to_numeric(dfm['Protein_position'].str.split('/').str[0])
dfm = dfm[dfm['mt_loc'] < dfm['seq_len_y']]
dfm[['ref_AA', 'mt_AA']] = dfm['Amino_acids'].str.split('/', expand=True)
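# Protein_position is 1-based, so mt_loc-1 indexes the reference residue in our dataset's sequence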
dfm['db_AA'] = dfm.apply(lambda row: row['prot_seq'][row['mt_loc']-1], axis=1)
# Filter #2: Match proteins with the same reference amino acid at the mutation location
tmp = len(dfm)
dfm = dfm[dfm['db_AA'] == dfm['ref_AA']]
print(f"Filter #2 (ref_AA match): {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}")
print('\n',dfm.db.value_counts())
# %% final seq len distribution
n_bins = 25
lengths = dfm.seq_len_x
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
# Plot histogram
n, bins, patches = ax.hist(lengths, bins=n_bins, color='blue', alpha=0.7)
ax.set_title('TCGA final filtering for db matches')
# Add counts to each bin
for count, x, patch in zip(n, bins, patches):
ax.text(x + 0.5, count, str(int(count)), ha='center', va='bottom')
ax.set_xlabel('Sequence Length')
ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# %% Getting updated sequences
def apply_mut(row):
ref_seq = list(row['prot_seq'])
ref_seq[row['mt_loc']-1] = row['mt_AA']
return ''.join(ref_seq)
dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)
# %%
dfm.to_csv("/cluster/home/t122995uhn/projects/data/tcga/tcga_maf_davis_pdbbind.csv")
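As a small sanity check on apply_mut (a sketch, assuming dfm from the cell above), each mutated sequence should differ from its reference at exactly one position:
# %% verify mt_seq differs from prot_seq only at the mutation site
n_diff = dfm.apply(lambda r: sum(a != b for a, b in zip(r['prot_seq'], r['mt_seq'])), axis=1)
print(n_diff.value_counts())  # expect everything in the 1 bin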
Mapped TCGA mutation counts:
1. Raw TCGA counts (from prefiltering step 3)
2. Post-filtering counts (step 5 from above)
- All proteins:
- Test set:
3. Final counts after all filters:
Since we don't have a kiba GVPL dataset yet...
primary tasks
Downloading and getting TCGA MAF files
Downloading using *TCGAbiolinks*
What project to use?
"TCGA projects are organized by cancer type or subtype."
Updated projects can be found here, but let's just focus on TCGA-BRCA for now.
How to download TCGA-BRCA MAFs?
Update sys packages
Install R
README
Add apt repo:
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/"
Install R
Install sys packages required by R
Install TCGAbiolinks package
Make sure to run in sudo mode.
Then install (e.g. `BiocManager::install("TCGAbiolinks")` from an R session):
Download TCGA-BRCA
To exit R, use `q()`.
Save the TCGAbiolinks result as a CSV:
Another way is to just use the TCGA portal and download the entire cohort for each project