TCGA analysis #95

jyaacoub · 2024-05-04T02:00:47Z

primary tasks

download TCGA MAF files
Map TCGA rows to proteins from our datasets
- DONT INCLUDE KIBA UNTIL WE GET ALL AFLOW CONFS (to ensure consistent test sets and so we don't have to redo the subsequent tasks)
Get input data for TCGA (MSAs and aflow confirmations + ligands from our dbs) #99
Run the same script as Platinum analysis #94 to get results for davis, kiba, and pdbbind pretrained models.
Distribution level analysis for TCGA with mapped prots #111

Downloading and getting TCGA MAF files

Downloading using *TCGAbiolinks*

What project to use?

"TCGA projects are organized by cancer type or subtype."
Updated projects can be found here, but let's just focus on TCGA-BRCA for now.

Using the legacy version of the data portal, we can access the open version of TCGA-BRCA instead of the newer but closed version.

How to download TCGA-BRCA mafs?

Update sys packages

sudo apt update
sudo apt upgrade -y

Install R

README
Add apt repo:

sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/"

Install R

sudo apt update
sudo apt install r-base

Install sys packages required by R

sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev -y

Install TCGAbiolinks package

Make sure to run R as root (sudo):

sudo -i R

Then install:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")

download TCGA-BRCA

sort(harmonized.data.type)
Aggregated Somatic Mutation
...
Masked Somatic Mutation
Masked Somatic Mutation
...
Methylation Beta Value
Splice Junction Quantification

library(TCGAbiolinks)
query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Simple Nucleotide Variation",
                  data.type = "Masked Somatic Mutation",
                  file.type = "maf.gz",
                  access = "open")

GDCdownload(query)
data <- GDCprepare(query)

to exit R:

q()

Save the TCGAbiolinks data as CSV (run this before exiting R):

write.csv(data, "TCGA_BRCA_Mutations.csv", row.names = FALSE)

Another way is to just use the TCGA portal and download the entire cohort for each project


jyaacoub · 2024-05-09T20:00:25Z

Matching mutations in TCGA to davis

Best way is to use Hugo_Symbol from the MAF file which is the gene name.

Using Biomart to get UniProtIDs

Map davis protein names to uniprot IDs with Biomart

Mutated proteins can't be mapped, since each is a unique variant without its own UniProt ID.

Since davis proteins are all human, we can just download the entire human dataset from Biomart and filter it with pandas.

Biomart URL query for our case would be like:

http://useast.ensembl.org/biomart/martview/bd870a98ae9d3290205dae6651366761?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id_version|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id_version|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.uniprotswissprot|hsapiens_gene_ensembl.default.feature_page.uniprot_gn_symbol&FILTERS=&VISIBLEPANEL=resultspanel

This matches 266 proteins from davis!

Code

#%%
import pandas as pd

root_dir = '../data/DavisKibaDataset/davis'
df = pd.read_csv(f"{root_dir}/nomsa_binary_original_binary/full/XY.csv")

# keep one row per protein code and only the columns needed for matching
df_unique = df.drop_duplicates(subset='code').drop(columns=['SMILE', 'pkd', 'prot_id'])
df_unique['code'] = df_unique['code'].str.upper()
# assumes the remaining columns are, in order, the protein code and its sequence
df_unique.columns = ['Gene name', 'prot_seq']
#%%
df_mart = pd.read_csv('../downloads/biomart_hsapiens.tsv', sep='\t')
# drop rows without a gene name or Swiss-Prot ID, then keep one row per Swiss-Prot ID
df_mart = df_mart.dropna(subset=['Gene name', 'UniProtKB/Swiss-Prot ID'])
df_mart['Gene name'] = df_mart['Gene name'].str.upper()
df_mart = df_mart.drop_duplicates(subset=['UniProtKB/Swiss-Prot ID'])

#%%
dfm = df_unique.merge(df_mart, on='Gene name', how='left')

dfm[['Gene name', 'UniProtKB/Swiss-Prot ID']].to_csv('../downloads/davis_biomart_matches.csv')

Proteins with no matches were mutated or phosphorylated.

Using `Hugo_Symbol`

using Hugo_symbol:

TCGA MAF files have "Hugo_Symbol" which corresponds to the gene name for that mutation (HUGO)...

Code

#%%
import pandas as pd
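# rename the biomart columns to the names used by the merges below ('code', 'uniprot')
df_pid = (pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0)
            .dropna()
            .rename(columns={'Gene name': 'code', 'UniProtKB/Swiss-Prot ID': 'uniprot'}))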

#%%
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
df = df.loc[df['SWISSPROT'].dropna().index]
df[['Gene', 'SWISSPROT']].head()
df['uniprot'] = df['SWISSPROT'].str.split('.').str[0]

#%%
dfm = df_pid.merge(df, on='uniprot', how='inner')

# %%
dfm[~(dfm.code == dfm.Hugo_Symbol)]

# %%
df_pid = pd.read_csv('../downloads/davis_pids.csv')[['code']]
dfm_cd = df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='left')

Difference:

using biomart matched UniProts:    2324 total
Using raw davis HUGO protein name: 2321 total
biomart finds   3 extra

Code

#%%
import pandas as pd
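# rename the biomart columns to the names used by the merges below ('code', 'uniprot')
df_pid = (pd.read_csv("../downloads/davis_biomart_matches.csv", index_col=0)
            .dropna()
            .rename(columns={'Gene name': 'code', 'UniProtKB/Swiss-Prot ID': 'uniprot'}))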

#%%
df = pd.read_csv("../downloads/TCGA_BRCA_Mutations.csv")
df = df.loc[df['SWISSPROT'].dropna().index]
df[['Gene', 'SWISSPROT']].head()
df['uniprot'] = df['SWISSPROT'].str.split('.').str[0]

#%%
dfm = df_pid.merge(df, on='uniprot', how='inner')

# %%
df_pid = pd.read_csv('../downloads/davis_pids.csv').drop_duplicates(subset='code')[['code']]
dfmh= df_pid.merge(df, left_on='code', right_on='Hugo_Symbol', how='inner')

# %%

print(f"using biomart matched UniProts:    {len(dfm)} total")
print(f"Using raw davis HUGO protein name: {len(dfmh)} total")

print(f"biomart finds {len(dfm[~(dfm.code == dfm.Hugo_Symbol)]):3d} extra")

jyaacoub · 2024-05-10T17:34:20Z

Things to consider for matching:

10-Variant_Classification: only focus on missense mutations.
56-Protein_position: tells us <mutation location>/<seqlen>.
- Match only proteins that have a matching seqlen.
57-Amino_acids: tells us <reference AA>/<mutated AA>.
- Match only proteins with the same reference AA at the mutation location (see above).
  NOTE: If there are not enough matches, then we need to get the canonical sequence from UniProt!
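As a rough illustration of the three columns above, here is a minimal sketch that parses them on a toy frame. The gene names and values are made up; only the column names and the `<a>/<b>` layouts come from the notes above.

```python
import pandas as pd

# Toy rows using the MAF column names discussed above; the values are made up.
toy = pd.DataFrame({
    "Hugo_Symbol": ["GENE_A", "GENE_B"],
    "Variant_Classification": ["Missense_Mutation", "Silent"],
    "Protein_position": ["12/189", "40/250"],
    "Amino_acids": ["G/D", "L"],
})

# Keep missense mutations only.
missense = toy[toy["Variant_Classification"] == "Missense_Mutation"].copy()

# Protein_position is "<mutation location>/<sequence length>".
pos = missense["Protein_position"].str.split("/", expand=True)
missense["mt_loc"] = pd.to_numeric(pos[0])
missense["seq_len"] = pd.to_numeric(pos[1])

# Amino_acids is "<reference AA>/<mutated AA>".
missense[["ref_AA", "mt_AA"]] = missense["Amino_acids"].str.split("/", expand=True)

print(missense[["Hugo_Symbol", "mt_loc", "seq_len", "ref_AA", "mt_AA"]])
```

Note that `mt_loc` is 1-based, so the reference residue in a Python string sits at index `mt_loc - 1`.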

Columns for validating results:

These come from pathogenicity predictions from Ensembl:

77-SIFT: a predictive model that returns the impact of a mutation on protein function (SIFT).
78-PolyPhen: another predictive model similar to SIFT (PolyPhen-2).

Not as useful:

123-IMPACT: only looks at the Variant_Classification to deduce impact (see docs).
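If these columns are used for validation, the label and score probably need to be separated first. The sketch below assumes the values look like `label(score)`, e.g. `deleterious(0.01)`, which is the usual VEP-style layout; this is an assumption that should be checked against the actual MAF, and the example values are made up.

```python
import pandas as pd

# Made-up example values in the assumed "<label>(<score>)" format.
toy = pd.DataFrame({
    "SIFT": ["deleterious(0.01)", "tolerated(0.35)", None],
    "PolyPhen": ["probably_damaging(0.998)", "benign(0.02)", None],
})

# One named group for the text label, one for the numeric score.
pattern = r"^(?P<label>[A-Za-z_]+)\((?P<score>[0-9.]+)\)$"

for col in ["SIFT", "PolyPhen"]:
    parsed = toy[col].str.extract(pattern)
    toy[f"{col}_label"] = parsed["label"]
    toy[f"{col}_score"] = pd.to_numeric(parsed["score"])

print(toy)
```

Rows with no prediction stay as NaN in both derived columns, so they can be dropped or kept explicitly later.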

jyaacoub · 2024-05-13T14:35:28Z

steps for matching dataset uniprot ids to TCGA mutations

According to the TCGAbiolinks docs (https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/mutation.html), TCGAbiolinks uses https://gdc.cancer.gov/about-data/publications/mc3-2017 to get the MAF files.

1. Gather data for davis, kiba and PDBbind datasets

Need the following for matching, see above comment

Original index id from XY.csv
UniprotID
Protein Sequence

Code to combine all datasets into a single csv: `5696a7a`

Columns are: db_idx, db, code, prot_id, seq_len, prot_seq

2. Download ALL TCGA projects as a single MAF

Go to GDC portal
Click dropdown for cases
Enter selection view for "program"
Check only TCGA
Close out of the dropdown
Scroll to the bottom and hit Download 49.25 MB compressed MAF data
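Once the file is downloaded and extracted, it can be loaded as a tab-separated table. The sketch below uses an in-memory stand-in so it runs on its own. The `#version` header line and the `comment="#"` option are assumptions about how the downloaded MAF is laid out, so check the first few lines of the real file before relying on them.

```python
import io
import pandas as pd

# Tiny stand-in for a downloaded MAF; a real file would be passed by path instead.
maf_text = (
    "#version gdc-1.0.0\n"
    "Hugo_Symbol\tVariant_Classification\tProtein_position\n"
    "GENE_A\tMissense_Mutation\t12/189\n"
    "GENE_B\tSilent\t40/250\n"
)

df_tcga = pd.read_csv(
    io.StringIO(maf_text),
    sep="\t",
    comment="#",       # skip "#..." comment lines, if the file has any
    low_memory=False,  # read in one pass to avoid mixed-dtype guessing on wide files
)
print(df_tcga["Variant_Classification"].value_counts())
```

One caveat: `comment="#"` also truncates any data line that contains a `#`, so if a field can legitimately hold that character, skipping the header lines with `skiprows` is the safer choice.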

3. Prefiltering TCGA

Filter by Variant_Classification (only focus on Missense_Mutation for now)
- Maybe also filter by Variant_Type to focus only on single nucleotide variants (SNP). This doesn't matter in practice, since there are only 2 other rows with non-SNP variants.
Filter out sequences longer than 1200 residues; practically speaking, any sequence longer than this is not useful since it would take too long to run.

4. Match Uniprot IDs with Mutations

For davis we have to match on Hugo_Symbol, but the other datasets can be matched on UniProt ID.

5. Post filtering

filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms)
- Filter #1 (seq_len) : 7495 - 5054 = 2441
Filter out those that don't have the same reference seq according to the "Protein_position" and "Amino_acids" col
- Filter #2 (ref_AA match): 2441 - 4 = 2437

Code

#%% 1.Gather data for davis,kiba and pdbbind datasets
import os
import pandas as pd
import matplotlib.pyplot as plt
from src.analysis.utils import combine_dataset_pids
from src import config as cfg
df_prots = combine_dataset_pids(dbs=[cfg.DATA_OPT.davis, cfg.DATA_OPT.PDBbind], # WARNING: just davis and pdbbind for now
                                subset='test')


#%% 2. Load TCGA data
df_tcga = pd.read_csv('../downloads/TCGA_ALL.maf', sep='\t')

#%% 3. Pre filtering
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
df_tcga = df_tcga[df_tcga['Variant_Classification'] == 'Missense_Mutation'].copy()
df_tcga['seq_len'] = pd.to_numeric(df_tcga['Protein_position'].str.split('/').str[1])
df_tcga = df_tcga[df_tcga['seq_len'] < 5000]
df_tcga['seq_len'].plot.hist(bins=100, title="sequence length histogram capped at 5K")
plt.show()
df_tcga = df_tcga[df_tcga['seq_len'] < 1200]
df_tcga['seq_len'].plot.hist(bins=100, title="sequence length after capped at 1.2K")

#%% 4. Merging df_prots with TCGA
df_tcga['uniprot'] = df_tcga['SWISSPROT'].str.split('.').str[0]

dfm = df_tcga.merge(df_prots[df_prots.db != 'davis'], 
                    left_on='uniprot', right_on='prot_id', how='inner')

# for davis we have to merge on HUGO_SYMBOLS
dfm_davis = df_tcga.merge(df_prots[df_prots.db == 'davis'], 
                          left_on='Hugo_Symbol', right_on='prot_id', how='inner')

dfm = pd.concat([dfm,dfm_davis], axis=0)

del dfm_davis # to save mem

# %% 5. Post filtering step
# 5.1. Filter for only those sequences with matching sequence length (to get rid of nonmatched isoforms)
# seq_len_x is from tcga, seq_len_y is from our dataset 
tmp = len(dfm)
# allow for some error due to missing amino acids from pdb file in PDBbind dataset
#   - assumption here is that isoforms will differ by more than 50 amino acids
dfm = dfm[(dfm.seq_len_y <= dfm.seq_len_x) & (dfm.seq_len_x<= dfm.seq_len_y+50)]
print(f"Filter #1 (seq_len)     : {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}")

# 5.2. Filter out those that dont have the same reference seq according to the "Protein_position" and "Amino_acids" col
 
# Extract mutation location and reference amino acid from 'Protein_position' and 'Amino_acids' columns
dfm['mt_loc'] = pd.to_numeric(dfm['Protein_position'].str.split('/').str[0])
# mt_loc is 1-based, so a mutation at the last residue (mt_loc == seq_len_y) is still in range
dfm = dfm[dfm['mt_loc'] <= dfm['seq_len_y']]
dfm[['ref_AA', 'mt_AA']] = dfm['Amino_acids'].str.split('/', expand=True)

dfm['db_AA'] = dfm.apply(lambda row: row['prot_seq'][row['mt_loc']-1], axis=1)
                         
# Filter #2: Match proteins with the same reference amino acid at the mutation location
tmp = len(dfm)
dfm = dfm[dfm['db_AA'] == dfm['ref_AA']]
print(f"Filter #2 (ref_AA match): {tmp:5d} - {tmp-len(dfm):5d} = {len(dfm):5d}")
print('\n',dfm.db.value_counts())

# %% final seq len distribution

n_bins = 25
lengths = dfm.seq_len_x
fig, ax = plt.subplots(1, 1, figsize=(10, 5))

# Plot histogram
n, bins, patches = ax.hist(lengths, bins=n_bins, color='blue', alpha=0.7)
ax.set_title('TCGA final filtering for db matches')

# Add counts above the center of each bar
for count, patch in zip(n, patches):
    ax.text(patch.get_x() + patch.get_width() / 2, count, str(int(count)), ha='center', va='bottom')

ax.set_xlabel('Sequence Length')
ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# %% Getting updated sequences
def apply_mut(row):
    ref_seq = list(row['prot_seq'])
    ref_seq[row['mt_loc']-1] = row['mt_AA']
    return ''.join(ref_seq)

dfm['mt_seq'] = dfm.apply(apply_mut, axis=1)


# %%
dfm.to_csv("/cluster/home/t122995uhn/projects/data/tcga/tcga_maf_davis_pdbbind.csv")
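To sanity-check the reference-residue filter and the mutation step on a single sequence, the same logic can be written as a small standalone function. This is only a sketch: `apply_point_mutation` and the toy sequence are hypothetical and not part of the repo.

```python
def apply_point_mutation(seq: str, mt_loc: int, ref_aa: str, mt_aa: str) -> str:
    """Return seq with the 1-based position mt_loc changed from ref_aa to mt_aa."""
    if not 1 <= mt_loc <= len(seq):
        raise ValueError(f"mt_loc {mt_loc} is outside a sequence of length {len(seq)}")
    if seq[mt_loc - 1] != ref_aa:
        raise ValueError(f"expected {ref_aa} at position {mt_loc}, found {seq[mt_loc - 1]}")
    # Slice around the mutated position instead of mutating a list in place.
    return seq[: mt_loc - 1] + mt_aa + seq[mt_loc:]


toy_seq = "MKTAYIAKQR"  # made-up sequence; position 4 is "A"
mutated = apply_point_mutation(toy_seq, mt_loc=4, ref_aa="A", mt_aa="V")
print(mutated)
```

Raising on a reference mismatch, rather than silently mutating, mirrors what Filter #2 does at the DataFrame level.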

jyaacoub · 2024-05-13T15:08:38Z

Mapped TCGA mutations counts:

Note: PDBbind doesn't match any proteins from TCGA when limited to just the test set!
- This is likely because of how we got the sequences for PDBbind (from the PDB files).
To get the correct sequences, do we need to use the FASTA seq?
- Even this has its own issues, since some will not include the full sequence and so would not pass the check we do at the mutation location.
mt_loc doesn't line up between the reference and db amino acid!
- This means that even if we add some wiggle room for PDBbind by allowing at most 50 missing AA, we would still fail to match if the mutation location comes after those missing AA.
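The index shift described above can be reproduced on a toy example. The sequences below are made up, and the three missing residues are just an illustration of a structure with unresolved leading residues.

```python
full_seq = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up "canonical" sequence
offset = 3
pdb_seq = full_seq[offset:]           # same protein with the first 3 residues missing

# A mutation reported against the canonical sequence (1-based position).
mt_loc = 10
ref_aa = full_seq[mt_loc - 1]

# The length difference (3) is well within the 50-residue tolerance of Filter #1 ...
print(len(full_seq) - len(pdb_seq))

# ... but the reference check only passes on the canonical sequence.
print(full_seq[mt_loc - 1] == ref_aa)
print(pdb_seq[mt_loc - 1] == ref_aa)

# With a known offset the residue can be recovered.
print(pdb_seq[mt_loc - 1 - offset] == ref_aa)
```

So the seq_len tolerance alone is not enough: without the per-structure offset, any mutation downstream of a gap is compared against the wrong residue.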

1. RAW TCGA counts (from prefiltering step 3)

Sequence length histogram capped at 5K and 1.2K

2. Post filtering counts (step 5 from above)

All proteins:

Filter #1 (seq_len)     :  7495 -  5054 =  2441
Filter #2 (ref_AA match):  2441 -     4 =  2437

Test set:

Filter #1 (seq_len)     :  1047 -   791 =   256
Filter #2 (ref_AA match):   256 -     0 =   256
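For a quick sense of scale, the fraction of merged rows that survives both filters can be computed directly from the logged counts. The numbers are copied from the output above; nothing else is assumed.

```python
# (rows after merging, rows left after both filters), copied from the logs above
counts = {
    "all proteins": (7495, 2437),
    "test set": (1047, 256),
}

retained = {name: after / before for name, (before, after) in counts.items()}
for name, frac in retained.items():
    print(f"{name}: {frac:.1%} of merged rows kept")
```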

3. Final counts after all filters:

ALL PROTEINS - Sequence length histogram

TEST SET - Sequence length histogram

Since we dont have a kiba GVPL dataset yet...

jyaacoub added the analysis label May 4, 2024

jyaacoub added a commit that referenced this issue May 7, 2024

feat(utils): mapping fn pdb2uniprot for #95

a0883f1

jyaacoub added a commit that referenced this issue May 7, 2024

fix: map MAF mutations to PDB ids from platinum #95

996df97

jyaacoub added a commit that referenced this issue May 8, 2024

Merge pull request #96 from jyaacoub/development

2500ca8

Platinum analysis figures and TCGA init #94 and #95

jyaacoub added a commit that referenced this issue May 10, 2024

chore: TCGA filtering/merging code #95

c54b67a

jyaacoub added a commit that referenced this issue May 13, 2024

feat(analysis): combine_dataset_pids #95

5696a7a

jyaacoub added a commit that referenced this issue May 14, 2024

style(tcga): fig histogram plot #95

946b62d

jyaacoub changed the title ~~TCGA mapping + analysis~~ TCGA analysis May 14, 2024

jyaacoub mentioned this issue May 16, 2024

Get input data for TCGA (MSAs and aflow confirmations + ligands from our dbs) #99

Closed

jyaacoub added a commit that referenced this issue May 16, 2024

fix(tcga): focus on pdbbind davis #95 #99

985d684

Since we dont have a kiba GVPL dataset yet...

jyaacoub added a commit that referenced this issue May 16, 2024

feat(tcga): msa for tcga muts #99 #95

5340ec8

jyaacoub added a commit that referenced this issue Jun 21, 2024

feat(tcga): analysis script #95 #111

1c9c441

jyaacoub mentioned this issue Jul 2, 2024

Unify cross validation splits to use consistent sets #113

Closed
