## Commands
This section lists the commands run by the Xenoclassify workflow.
* Running the Xenoclassify workflow
Xenoclassify aligns data to the host and graft genomes using the imported bwaMem (or STAR) workflow and then classifies reads according to their alignment scores.
Multi-lane data are supported with STAR. WG/EX/TS data are aligned as single-lane data only.
### Sort bam files by read name, optionally remove supplementary (chimeric) alignments
```
if [[ "~{filterSupAlignments}" == "true" ]]; then
samtools sort -n ~{inBam} ~{'-T ' + tmpDir} | samtools view -F 2048 - -bh > ~{basename(inBam, '.bam')}_sorted.bam
else
samtools sort -n ~{inBam} ~{'-T ' + tmpDir} -o ~{basename(inBam, '.bam')}_sorted.bam
fi
```
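The `-F 2048` filter above excludes records whose SAM flag has the supplementary-alignment bit (0x800) set. A minimal Python sketch of that bit test follows; the helper names `is_supplementary` and `keep_record` are hypothetical and exist only to illustrate what the flag value means.

```python
# SAM flag bit for a supplementary alignment (0x800 == 2048), per the SAM specification.
SUPPLEMENTARY = 0x800


def is_supplementary(flag: int) -> bool:
    """Return True when the supplementary-alignment bit is set in a SAM flag."""
    return bool(flag & SUPPLEMENTARY)


def keep_record(flag: int) -> bool:
    """Mimic `samtools view -F 2048`: keep only records without the bit set."""
    return not is_supplementary(flag)


# 99 = paired + proper pair + mate reverse + first in pair (a primary alignment).
# 2147 = 99 + 2048, the same kind of read flagged as supplementary.
for flag in (99, 2147):
    print(flag, "kept" if keep_record(flag) else "dropped")
```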
### Classify reads with xenoclassify.py script:
```
# xenoclassify.py is run on name-sorted bams from host and graft
python3 $XENOCLASSIFY_ROOT/bin/xenoclassify/xenoclassify.py -H HOST_BAM -G GRAFT_BAM -O . -b -p PREFIX \
  -n NEITHER_THRESHOLD -t TOLERANCE -d DIFFERENCE
...
samtools view PREFIX_output.bam | awk \'{print \$NF}\' | sort | uniq -c"
counts = os.popen(command).read().splitlines()
# The following Python code looks for host/graft (plus neither, both) classifications
# and writes a summary into a .json file
for line in counts:
    if line.find('CL:Z:') == 0:
        continue
    line = line.rstrip()
    line = re.sub('CL:Z:', '', line)
    tmp = line.split()
    jsonDict[tmp[1]] = tmp[0]
with open(json_name, 'w') as json_file:
    json.dump(jsonDict, json_file)
```
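The summary step above can be sketched as a small standalone function. This is an illustration rather than the workflow's own code: `summarize_counts` and the example lines are hypothetical, and the sketch is stricter than the snippet above in that it skips any line that is not a `<count> CL:Z:<class>` pair. Counts are kept as strings, matching the JSON report format.

```python
import json
import re


def summarize_counts(lines):
    """Turn `uniq -c` output such as '  1140 CL:Z:neither' into {'neither': '1140'}."""
    summary = {}
    for line in lines:
        fields = line.split()
        # Skip anything that is not a '<count> CL:Z:<class>' pair.
        if len(fields) != 2 or not fields[1].startswith("CL:Z:"):
            continue
        count, tag = fields
        summary[re.sub(r"^CL:Z:", "", tag)] = count
    return summary


# Hypothetical input lines, shaped like the output of `sort | uniq -c`.
example = ["   1140 CL:Z:neither", " 348954 CL:Z:host", "5607744 CL:Z:graft"]
print(json.dumps(summarize_counts(example), indent=2))
```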
### Filter reads matching the supplied classification(s) and provision filtered .bam along with its index
```
# The following Python code splits the supplied tags and then
# removes all matching reads from the bam file
...
inputTags = "~{sep=' ' filterTags}"
tags = inputTags.split()
command = "samtools view -h XENOCLASSIFY_BAM"
for t in tags:
    command = command + " | grep -v \'CL:Z:" + t + "\'"
command = command + " | samtools sort -O bam -T TMP_DIR -o OUTPUT_PREFIX_filtered.bam -"
os.system(command)
# Back in the shell: index the filtered bam
samtools index OUTPUT_PREFIX_filtered.bam OUTPUT_PREFIX_filtered.bai
```
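The effect of the `grep -v` chain above can be illustrated in plain Python on SAM text lines. This is a sketch under assumptions: `filter_sam_lines` and the example records are hypothetical, and unlike `grep -v`, which matches the pattern anywhere on a line, the sketch compares only the optional tag columns and always keeps header lines.

```python
def filter_sam_lines(lines, tags):
    """Drop alignment records carrying any of the given CL:Z: classifications."""
    unwanted = {"CL:Z:" + t for t in tags}
    kept = []
    for line in lines:
        # Header lines start with '@' and are always kept.
        if line.startswith("@"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at column 12 of a SAM record.
        if unwanted.isdisjoint(fields[11:]):
            kept.append(line)
    return kept


# Hypothetical SAM content: one header line and three classified reads.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tCL:Z:graft",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tFFFF\tCL:Z:host",
    "r3\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tFFFF\tCL:Z:both",
]
for record in filter_sam_lines(sam, ["host", "both"]):
    print(record)
```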
### Extract reads from filtered file into fastq format with picard:
```
set -euo pipefail
unset _JAVA_OPTIONS
java -Xmx32G -jar picard.jar SamToFastq I=FILTERED_BAM F=FILTERED_1.fastq F2=FILTERED_2.fastq ADDITIONAL_PARAMETERS
gzip -c FILTERED_1.fastq > OUTPUT_PREFIX_part_1.fastq.gz
gzip -c FILTERED_2.fastq > OUTPUT_PREFIX_part_2.fastq.gz
```
### In addition, we run second-pass STAR alignments with reads from the filtered bam extracted into fastq
Merging reports: this is needed only for multi-lane transcriptome data processing.
For multi-lane data we also add lane identifiers.
```
import json
import re
r = "~{sep=' ' inputReports}"
inputJsons = r.split()
inputRgs = "~{sep=' ' inputRgs}"
data = {}
# Read one JSON report from disk
def jsonRead(fileName):
    with open(fileName, "r") as f:
        jsonText = f.readlines()
        jsonText = "".join(jsonText)
        jsonText = jsonText.strip()
        return json.loads(jsonText)
# Extract the lane identifiers (ID: values) from the read-group strings
matches = re.findall(r'(?<=[ID]:)([\S]*)', inputRgs)
if len(inputJsons) > 1:
    # Multi-lane: key each report by its lane identifier
    for j in range(len(inputJsons)):
        if matches[j]:
            data[matches[j]] = jsonRead(inputJsons[j])
else:
    # Single-lane: keep the report as is
    data = jsonRead(inputJsons[0])
metrics_file = "~{outputPrefix}_tagReport.json"
with open(metrics_file, "w") as m:
    m.write(json.dumps(data, indent=2))
```
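The merging logic above can be exercised without any files. In the sketch below, `merge_reports`, the read-group string, and the lane names are hypothetical; the actual format of `inputRgs` is not shown in this document, so the `ID:<lane> SM:<sample>` layout is an assumption. The sketch pairs IDs and reports with `zip`, so a missing ID shortens the result instead of raising an error.

```python
import re


def merge_reports(read_groups, reports):
    """Key each per-lane report by the ID found in the read-group string."""
    # Same pattern as the workflow snippet, written as a raw string.
    ids = re.findall(r"(?<=[ID]:)([\S]*)", read_groups)
    if len(reports) > 1:
        # Multi-lane: one entry per lane, skipping empty identifiers.
        return {lane: rep for lane, rep in zip(ids, reports) if lane}
    # Single-lane: return the only report unchanged.
    return reports[0]


# Hypothetical read groups and already-parsed per-lane reports.
rgs = "ID:laneA SM:sample1 ID:laneB SM:sample1"
per_lane = [{"host": "5184", "graft": "233018"}, {"host": "5076", "graft": "217000"}]
print(merge_reports(rgs, per_lane))
```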
### Merging and indexing bam files
```
set -euo pipefail
samtools merge -o ~{outputPrefix}_filtered.bam ~{sep=" " inputBams}
samtools index ~{outputPrefix}_filtered.bam ~{outputPrefix}_filtered.bai
```
### Examples of json report for single-lane and multi-lane data:
single-lane:
{
  "both": "286794",
  "host": "348954",
  "neither": "1140",
  "graft": "5607744"
}
multi-lane:
{
  "210601_A00469_0179_BHCKFVDRXY_1_CTGTTGAC-ACCTCAGT": {
    "both": "13188",
    "host": "5184",
    "graft": "233018"
  },
  "210430_A00469_0173_AH7NV2DRXY_2_CTGTTGAC-ACCTCAGT": {
    "both": "11904",
    "host": "5076",
    "graft": "217000"
  }
}
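
Because the counts in these reports are stored as strings, downstream code has to convert them before doing arithmetic. A minimal sketch of reading either report shape follows; `lane_totals` and `graft_fraction` are hypothetical helpers and not part of the workflow, and the input values are copied from the examples above.

```python
def lane_totals(report):
    """Return total read counts per lane; a single-lane report gets the key 'all'."""
    # Multi-lane reports nest one dict per lane; single-lane reports are flat.
    if all(isinstance(v, dict) for v in report.values()):
        lanes = report
    else:
        lanes = {"all": report}
    # Counts are stored as strings in the report, so convert before summing.
    return {lane: sum(int(n) for n in counts.values()) for lane, counts in lanes.items()}


def graft_fraction(counts):
    """Fraction of classified reads assigned to the graft in one flat report."""
    total = sum(int(n) for n in counts.values())
    return int(counts.get("graft", 0)) / total if total else 0.0


single = {"both": "286794", "host": "348954", "neither": "1140", "graft": "5607744"}
multi = {
    "210601_A00469_0179_BHCKFVDRXY_1_CTGTTGAC-ACCTCAGT": {
        "both": "13188", "host": "5184", "graft": "233018"},
    "210430_A00469_0173_AH7NV2DRXY_2_CTGTTGAC-ACCTCAGT": {
        "both": "11904", "host": "5076", "graft": "217000"},
}
print(lane_totals(single))
print(lane_totals(multi))
print(round(graft_fraction(single), 3))
```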