-
Notifications
You must be signed in to change notification settings - Fork 0
/
pseudocode.txt
64 lines (39 loc) · 3.23 KB
/
pseudocode.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
##Pseudocode Sept 21, 2020
testing
TCGA VCF file Subset into breast, ovarian, colo-rectal:
Rows:
If chrom in range of genes in pathway
If pos in range of genes in pathway
Include the row in new file
Cols:
Use GDC website to make a map file to translate from TCGA ID to Tissue type
If tissue is x, use map file to filter out columns that correspond to sample numbers in tissue cohort x.
Include the columns that correspond to sample numbers in tissue cohort x in the final file
Final file: tissue type subset where rows are the variants in the ranges of the genes in any given pathway (in this case, hrr), and columns are only the samples in any given tissue type (in this case, ovarian)
TSV File organized by individual, where each variant has its own row including TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info:
Loop through columns in tissue type subset (above)
If genotype is not an empty field
Add a row of TCGA-ID (from column), and also Chrom, Pos, Ref, Alt, Genotype Info to the final file
Final file: TSV file organized by individual, where each variant has its own row (with columns TCGA-ID, Chrom, Pos, Ref, Alt, Genotype Info)
Test to determine success:
Sort the TCGA ids in the final file in alphabetical order
Count the number of times a unique TCGA comes up in the sorted list
Compare the number of unique TCGA IDs to the number of IDs in the tissue cohort, ideally counts should be the same
VCF File with set of unique variants:
Loop through rows in product of above tissue type subset
If genotype information is not blank
If chrom, pos, ref, and alt (combination of them all) have not been added to the final vcf file
Add row to final file
Final File: VCF file indexed by variant, where each row is a unique variant with traditional vcf columns (Chrom, Pos, ID, Ref, Alt, Qual, Filter, Info, Format) Each variant is unique and only shows up in this file once regardless of whether it shows up in multiple individuals
Test to determine success:
In the tissue type subset, go through line by line and count the number of non-blank genotype fields
Compare this count with the number of lines in the final vcf file of unique variants, ideally counts should be the same
Purpose of these files:
Subset - to cut down the VCF to a smaller size (more manageable)
Indexed by individual - so that it is easier to put the data together to ask: how many variants or lof mutations does this individual have, and do they overlap or can we tell that they are on opposite strands?
does this individual have two LOF variants/mutations that (knockout two different LOF variants, each knockout the same copy of the gene)
File with unique set of variants - to be able to run the variants through cravat or BaysDel in order to annotate them (pinpoint LOF mutations in order to analyze by individual)
First subset into three different sample groups by tissue type
Then use the unique set of variants as a way to run everything through cravat or bayesdel.
That way we will have annotated variants, to identify LOF
From there we can analyze variants per individual and group them into categories (those who have per gene 0 LOF, 1 LOF, 2 LOF on same copy of gene, 2 LOF one on each copy of the gene)