TCGA analysis #95

Open
3 of 5 tasks
jyaacoub opened this issue May 4, 2024 · 4 comments

jyaacoub commented May 4, 2024

primary tasks

Downloading and getting TCGA MAF files

Downloading using *TCGAbiolinks*

What project to use?

"TCGA projects are organized by cancer type or subtype."
Updated projects can be found here, but let's just focus on TCGA-BRCA for now.

  • Using the legacy version of the data portal, we can gain access to the open version of TCGA-BRCA instead of the newer but closed version.

How to download TCGA-BRCA mafs?

Update sys packages

sudo apt update
sudo apt upgrade -y

Install R

README
Add apt repo:

sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/"

Install R

sudo apt update
sudo apt install r-base

Install sys packages required by R

sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev -y

Install TCGAbiolinks package

Make sure to run R with sudo:

sudo -i R

Then install:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")

download TCGA-BRCA

sort(harmonized.data.type)
Aggregated Somatic Mutation
...
Masked Somatic Mutation
Masked Somatic Mutation
...
Methylation Beta Value
Splice Junction Quantification

library(TCGAbiolinks)
query <- GDCquery(project = "TCGA-BRCA", 
				  data.category = "Simple Nucleotide Variation",
				  data.type = "Masked Somatic Mutation",
				  file.type = "maf.gz", 
				  access = "open")

GDCdownload(query)
data <- GDCprepare(query)

Save the TCGAbiolinks data frame as CSV (do this before quitting R):

write.csv(data, "TCGA_BRCA_Mutations.csv", row.names = FALSE)

To exit R:

q()

Another way is to just use the TCGA portal and download the entire cohort for each project

jyaacoub added a commit that referenced this issue May 7, 2024
jyaacoub added a commit that referenced this issue May 8, 2024
Platinum analysis figures and TCGA init #94 and #95

jyaacoub commented May 9, 2024

Matching mutations in TCGA to davis

Best way is to use Hugo_Symbol from the MAF file which is the gene name.

Using Biomart to get UniProtIDs

Map davis protein names to uniprot IDs with Biomart

  • Can't map any mutated proteins, since those are specific variants and don't have their own UniProt IDs to identify them.

Since all davis proteins are human, we can just download the entire human dataset from Biomart and filter it using pandas.
The Biomart URL query for our case would be like:

http://useast.ensembl.org/biomart/martview/bd870a98ae9d3290205dae6651366761?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id_version|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id_version|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.uniprotswissprot|hsapiens_gene_ensembl.default.feature_page.uniprot_gn_symbol&FILTERS=&VISIBLEPANEL=resultspanel

This matches 266 proteins from davis!

Code

#%%
import pandas as pd

root_dir = '../data/DavisKibaDataset/davis'
df = pd.read_csv(f"{root_dir}/nomsa_binary_original_binary/full/XY.csv")

# keep one row per unique protein code, then drop the ligand/affinity columns
df_unique = df.drop_duplicates(subset='code').drop(columns=['SMILE', 'pkd', 'prot_id'])
df_unique['code'] = df_unique['code'].str.upper()
df_unique.columns = ['Gene name', 'prot_seq']  # rename to match the Biomart header
#%%
df_mart = pd.read_csv('../downloads/biomart_hsapiens.tsv', sep='\t')
# keep only rows that have both a gene name and a Swiss-Prot ID
df_mart = df_mart.dropna(subset=['Gene name', 'UniProtKB/Swiss-Prot ID'])
df_mart['Gene name'] = df_mart['Gene name'].str.upper()
df_mart = df_mart.drop_duplicates(subset=['UniProtKB/Swiss-Prot ID'])

#%%
# left merge keeps every davis protein; unmatched ones get NaN for the UniProt ID
dfm = df_unique.merge(df_mart, on='Gene name', how='left')

dfm[['Gene name', 'UniProtKB/Swiss-Prot ID']].to_csv('../downloads/davis_biomart_matches.csv')

  • proteins with no matches were mutated or phosphorylated.
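
One way to list which davis entries ended up without a match is the `indicator` option of `DataFrame.merge`. The following is only a minimal sketch on made-up rows (the three codes and two IDs are illustrative stand-ins, not taken from the real XY.csv or Biomart export):

```python
import pandas as pd

# Toy stand-ins for df_unique and df_mart; values are illustrative only.
df_unique = pd.DataFrame({"Gene name": ["ABL1", "ABL1(T315I)", "EGFR"]})
df_mart = pd.DataFrame({
    "Gene name": ["ABL1", "EGFR"],
    "UniProtKB/Swiss-Prot ID": ["P00519", "P00533"],
})

# indicator=True adds a `_merge` column saying which side each row came from
dfm = df_unique.merge(df_mart, on="Gene name", how="left", indicator=True)

# rows marked "left_only" exist in davis but have no Biomart entry
unmatched = dfm.loc[dfm["_merge"] == "left_only", "Gene name"].tolist()
print(unmatched)
```

In this toy case only the mutant-style code is left unmatched, which mirrors the observation above.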

Using `Hugo_Symbol`

TCGA MAF files have a "Hugo_Symbol" column, which corresponds to the gene name for that mutation (HUGO)...

Code

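#%%
import pandas as pd
df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna()
# the matches file was saved with the Biomart headers; rename to the names used below
df_pid = df_pid.rename(columns={'Gene name': 'code', 'UniProtKB/Swiss-Prot ID': 'uniprot'})

#%%
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
df = df.dropna(subset=['SWISSPROT']).copy()
df[['Gene', 'SWISSPROT']].head()
# keep only the part of the accession before the first '.'
df['uniprot'] = df['SWISSPROT'].str.split('.').str[0]

#%%
dfm = df_pid.merge(df, on='uniprot', how='inner')

# %%
# rows where the davis code and the TCGA Hugo_Symbol disagree
dfm[~(dfm.code == dfm.Hugo_Symbol)]

# %%
df_pid = pd.read_csv('../downloads/davis_pids.csv')[['code']]
dfm_cd = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='left')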

Difference:

using biomart matched UniProts:    2324 total
Using raw davis HUGO protein name: 2321 total
biomart finds   3 extra

Code

#%%
import pandas as pd
df_pid = pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0).dropna()
# the matches file was saved with the Biomart headers; rename to the names used below
df_pid = df_pid.rename(columns={'Gene name': 'code', 'UniProtKB/Swiss-Prot ID': 'uniprot'})

#%%
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
df = df.dropna(subset=['SWISSPROT']).copy()
df[['Gene', 'SWISSPROT']].head()
# keep only the part of the accession before the first '.'
df['uniprot'] = df['SWISSPROT'].str.split('.').str[0]

#%%
# match via the Biomart-derived UniProt IDs
dfm = df_pid.merge(df, on='uniprot', how='inner')

# %%
# match via the raw davis protein name against Hugo_Symbol
df_pid = pd.read_csv('../downloads/davis_pids.csv').drop_duplicates(subset='code')[['code']]
dfmh = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='inner')

# %%

print(f"using biomart matched UniProts:    {len(dfm)} total")
print(f"Using raw davis HUGO protein name: {len(dfmh)} total")

print(f"biomart finds {len(dfm[~(dfm.code == dfm.Hugo_Symbol)]):3d} extra")

@jyaacoub

Things to consider for matching:

  • 10-Variant_Classification: only focus on missense mutations
  • 56-Protein_position: tells us <mutation location>/<seqlen>
    • Match only proteins that have a matching seqlen
  • 57-Amino_acids: tells us <reference AA>/<mutated AA>
    • Match only proteins with the same reference AA at the mutation location (see above)
      NOTE: If there are not enough matches, we need to get the canonical sequence from UniProt!
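
The two checks above can be sketched for a single MAF row as follows. This is a minimal illustration, assuming `Protein_position` looks like `"3/5"` and `Amino_acids` like `"T/A"` as described; `check_reference` is a hypothetical helper name and the sequence is a toy string:

```python
def check_reference(prot_seq: str, protein_position: str, amino_acids: str) -> bool:
    """True if both the sequence length and the reference residue agree with the MAF row."""
    loc_str, len_str = protein_position.split("/")   # "<mutation location>/<seqlen>"
    ref_aa, _mt_aa = amino_acids.split("/")          # "<reference AA>/<mutated AA>"
    loc, seq_len = int(loc_str), int(len_str)

    # sequence lengths must match and the position must fall inside the sequence
    if seq_len != len(prot_seq) or not (1 <= loc <= seq_len):
        return False
    # MAF positions are 1-based, Python indexing is 0-based
    return prot_seq[loc - 1] == ref_aa


print(check_reference("MKTAY", "3/5", "T/A"))  # length and reference residue agree
print(check_reference("MKTAY", "3/6", "T/A"))  # length differs (e.g. another isoform)
print(check_reference("MKTAY", "2/5", "T/A"))  # reference residue differs
```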

Columns for validating results:

These come from pathogenicity predictions from Ensembl:

  • 77-SIFT: a predictive model that returns the impact of a mutation on protein function (SIFT).
  • 78-PolyPhen: another predictive model, similar to SIFT (PolyPhen-2).

Not as useful:

  • 123-IMPACT: only looks at the Variant_Classification to deduce impact (see docs).
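
To use the SIFT and PolyPhen columns for validation, the values first need to be split into a label and a score. A rough sketch, assuming the entries follow the usual `label(score)` pattern such as `deleterious(0.02)` (this format is an assumption about the file, and `parse_prediction` is a hypothetical helper, so check a few real rows first):

```python
import re
from typing import Optional, Tuple

# matches strings like "deleterious(0.02)" or "probably_damaging(0.998)"
_PRED_RE = re.compile(r"^\s*([A-Za-z_]+)\((\d*\.?\d+)\)\s*$")


def parse_prediction(value: object) -> Tuple[Optional[str], Optional[float]]:
    """Split a 'label(score)' string; returns (None, None) for NaN or unexpected input."""
    if not isinstance(value, str):
        return None, None
    m = _PRED_RE.match(value)
    if m is None:
        return None, None
    return m.group(1), float(m.group(2))


print(parse_prediction("deleterious(0.02)"))
print(parse_prediction("probably_damaging(0.998)"))
print(parse_prediction(float("nan")))
```

With pandas this could be applied per column, e.g. `df['SIFT'].apply(parse_prediction)`, and the resulting tuples expanded into two columns.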

jyaacoub added a commit that referenced this issue May 10, 2024

jyaacoub commented May 13, 2024

steps for matching dataset uniprot ids to TCGA mutations

TCGAbiolinks uses https://gdc.cancer.gov/about-data/publications/mc3-2017 to get the MAF files.

1. Gather data for davis, kiba and PDBbind datasets

Need the following for matching, see above comment

  • Original index id from XY.csv
  • UniprotID
  • Protein Sequence

Code to combine all datasets into a single csv: 5696a7a

  • cols are db_idx,db,code,prot_id,seq_len,prot_seq

2. Download ALL TCGA projects as a single MAF

  1. Go to GDC portal
  2. Click dropdown for cases
  3. Enter selection view for "program"
  4. Check only TCGA
  5. Close out of the dropdown
  6. Scroll to the bottom and hit Download 49.25 MB compressed MAF data
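
Once downloaded, the MAF can be loaded with pandas. GDC MAF files often start with `#`-prefixed metadata lines; whether the cohort-level download has them is an assumption here, but `comment="#"` is harmless if they are absent (as long as no field itself contains a `#`). Reading only the needed columns also keeps memory down. A minimal sketch on a tiny in-memory stand-in (rows are made up for illustration):

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a MAF file; rows are made up for illustration.
maf_text = (
    "#version gdc-1.0.0\n"
    "Hugo_Symbol\tVariant_Classification\tProtein_position\n"
    "TP53\tMissense_Mutation\t175/393\n"
    "BRCA1\tSilent\t10/1863\n"
)

cols = ["Hugo_Symbol", "Variant_Classification", "Protein_position"]
df_tcga = pd.read_csv(
    io.StringIO(maf_text),  # replace with the path to the downloaded MAF
    sep="\t",
    comment="#",            # skip metadata lines that start with '#'
    usecols=cols,           # only load the columns needed for matching
    low_memory=False,       # avoid mixed-dtype guessing on large files
)
print(df_tcga.shape)
```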

3. Prefiltering TCGA

  • Filter out by Variant_Classification (only focus on Missense_Mutation for now)
    • Maybe also filter by Variant_Type to focus only on single nucleotide variants (SNP) -> doesn't matter, since there are only 2 other rows with non-SNP variants.
  • Filter out sequences longer than 1200. Practically speaking, any sequence longer than this is not useful since it would take forever to run.

4. Match Uniprot IDs with Mutations

  • For davis we have to match on Hugo_Symbol, but the other datasets should be fine matching on UniProt ID.

5. Post filtering

  • filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms)
    • Filter #1 (seq_len) : 7495 - 5054 = 2441
  • Filter out those that don't have the same reference seq according to the "Protein_position" and "Amino_acids" col
    • Filter #2 (ref_AA match): 2441 - 4 = 2437

Code

#%% 1.Gather data for davis,kiba and pdbbind datasets
import os
import pandas as pd
import matplotlib.pyplot as plt
from src.analysis.utils import combine_dataset_pids
from src import config as cfg
df_prots = combine_dataset_pids(dbs=[cfg.DATA_OPT.davis, cfg.DATA_OPT.PDBbind], # WARNING: just davis and pdbbind for now
                                subset='test')


#%% 2. Load TCGA data
df_tcga = pd.read_csv('../downloads/TCGA_ALL.maf', sep='\t')

#%% 3. Pre filtering
# .copy() so that new columns can be added without a SettingWithCopyWarning
df_tcga = df_tcga[df_tcga['Variant_Classification'] == 'Missense_Mutation'].copy()
df_tcga['seq_len'] = pd.to_numeric(df_tcga['Protein_position'].str.split('/').str[1])
df_tcga = df_tcga[df_tcga['seq_len'] < 5000]
df_tcga['seq_len'].plot.hist(bins=100, title="sequence length histogram capped at 5K")
plt.show()
df_tcga = df_tcga[df_tcga['seq_len'] < 1200]
df_tcga['seq_len'].plot.hist(bins=100, title="sequence length after capped at 1.2K")

#%% 4. Merging df_prots with TCGA
df_tcga['uniprot'] = df_tcga['SWISSPROT'].str.split('.').str[0]

dfm = df_tcga.merge(df_prots[df_prots.db != 'davis'], 
                    left_on='uniprot', right_on='prot_id', how='inner')

# for davis we have to merge on HUGO_SYMBOLS
dfm_davis = df_tcga.merge(df_prots[df_prots.db == 'davis'], 
                          left_on='Hugo_Symbol', right_on='prot_id', how='inner')

dfm = pd.concat([dfm,dfm_davis], axis=0)

del dfm_davis # to save mem

# %% 5. Post filtering step
# 5.1. Filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms)
# seq_len_x is from tcga, seq_len_y is from our dataset 
tmp = len(dfm)
# allow for some error due to missing amino acids from pdb file in PDBbind dataset
#   - assumption here is that isoforms will differ by more than 50 amino acids
dfm = dfm[(dfm.seq_len_y <= dfm.seq_len_x) & (dfm.seq_len_x<= dfm.seq_len_y+50)]
print(f"Filter #1 (seq_len)     : {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}")

# 5.2. Filter out those that dont have the same reference seq according to the "Protein_position" and "Amino_acids" col
 
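# Extract mutation location and reference amino acid from 'Protein_position' and 'Amino_acids' columns
dfm = dfm.copy()  # work on a copy so new columns can be added safely
dfm['mt_loc'] = pd.to_numeric(dfm['Protein_position'].str.split('/').str[0])
# mt_loc is 1-based, so the last valid position equals seq_len_y
dfm = dfm[dfm['mt_loc'] <= dfm['seq_len_y']].copy()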
dfm[['ref_AA', 'mt_AA']] = dfm['Amino_acids'].str.split('/', expand=True)

dfm['db_AA'] = dfm.apply(lambda row: row['prot_seq'][row['mt_loc']-1], axis=1)
                         
# Filter #2: Match proteins with the same reference amino acid at the mutation location
tmp = len(dfm)
dfm = dfm[dfm['db_AA'] == dfm['ref_AA']]
print(f"Filter #2 (ref_AA match): {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}")
print('\n',dfm.db.value_counts())

# %% final seq len distribution

n_bins = 25
lengths = dfm.seq_len_x
fig, ax = plt.subplots(1, 1, figsize=(10, 5))

# Plot histogram
n, bins, patches = ax.hist(lengths, bins=n_bins, color='blue', alpha=0.7)
ax.set_title('TCGA final filtering for db matches')

# Add counts above each bar, centred on the bin
for count, patch in zip(n, patches):
    ax.text(patch.get_x() + patch.get_width() / 2, count, str(int(count)), ha='center', va='bottom')

ax.set_xlabel('Sequence Length')
ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# %% Getting updated sequences
def apply_mut(row):
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc']-1] = row['mt_AA']
    return ''.join(ref_seq)

dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)


# %%
dfm.to_csv("/cluster/home/t122995uhn/projects/data/tcga/tcga_maf_davis_pdbbind.csv")


jyaacoub commented May 13, 2024

Mapped TCGA mutations counts:

  • Note: PDBbind doesn't match any proteins from TCGA when limited to just the test set!
    • This is likely because of how we got the sequences for PDBbind (from the PDB files).
  • To get the correct sequences, do we need to use the FASTA sequence?
    • Even this has its own issues, since some entries will not include the full sequence and so would not pass the check we do at the mutation location.
  • mt_loc doesn't point to the same amino acid in the reference and in our db sequence!
    • This means that even with some wiggle room for PDBbind (allowing at most 50 missing AA), we would still fail to match if the mutation location comes after those missing AA.
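
The index shift described above can be seen on a toy sequence. The sketch below also shows one possible workaround, which is only an assumption and not what the pipeline currently does: if the structure-derived sequence is a single contiguous slice of the canonical one, the offset can be recovered with `str.find`. `map_position` is a hypothetical helper, and gaps in the middle of the chain are not handled:

```python
from typing import Optional


def map_position(canonical: str, observed: str, mt_loc: int) -> Optional[int]:
    """Map a 1-based position on `canonical` to a 1-based position on `observed`.

    Only handles the simple case where `observed` is one contiguous slice
    of `canonical`; returns None otherwise.
    """
    offset = canonical.find(observed)  # number of residues missing at the start
    if offset < 0:
        return None
    new_loc = mt_loc - offset
    return new_loc if 1 <= new_loc <= len(observed) else None


canonical = "MKTAYIAKQR"   # toy canonical sequence
observed = canonical[3:]   # first three residues missing, as in some PDB files
mt_loc = 6                 # 1-based mutation location in canonical numbering

# naive lookup on the truncated sequence hits the wrong residue
print(canonical[mt_loc - 1], observed[mt_loc - 1])

# shifted lookup recovers the same residue
new_loc = map_position(canonical, observed, mt_loc)
print(new_loc, observed[new_loc - 1])
```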

1. RAW TCGA counts (from prefiltering step 3)

Sequence length histogram capped at 5K and 1.2K

[images: sequence length histograms, capped at 5K and at 1.2K]

2. Post filtering counts (step 5 from above)

All proteins:

Filter #1 (seq_len)     :  7495 -  5054 =  2441
Filter #2 (ref_AA match):  2441 -     4 =  2437

Test set:

Filter #1 (seq_len)     :  1047 -   791 =   256
Filter #2 (ref_AA match):   256 -     0 =   256

3. Final counts after all filters:

ALL PROTEINS - Sequence length histogram

[image: sequence length histogram, all proteins]

TEST SET - Sequence length histogram

[image: sequence length histogram, test set]

jyaacoub added a commit that referenced this issue May 14, 2024
@jyaacoub jyaacoub changed the title TCGA mapping + analysis TCGA analysis May 14, 2024
jyaacoub added a commit that referenced this issue May 16, 2024
Since we dont have a kiba GVPL dataset yet...
jyaacoub added a commit that referenced this issue May 16, 2024
jyaacoub added a commit that referenced this issue Jun 21, 2024