
## Commands
This section lists the commands run by the Xenoclassify workflow.

* Running Xenoclassify workflow

Xenoclassify aligns data to the host and graft genomes using the imported bwaMem (or STAR) workflow and then classifies reads according to their alignment scores.
With STAR, multi-lane data are supported. WG/EX/TS data are aligned as single-lane data only.

### Sort bam files by read name, optionally remove supplementary (chimeric) alignments

```
    if [[ "~{filterSupAlignments}" == "true" ]]; then
        samtools sort -n ~{inBam} ~{'-T ' + tmpDir} | samtools view -F 2048 - -bh > ~{basename(inBam, '.bam')}_sorted.bam
    else
        samtools sort -n ~{inBam} ~{'-T ' + tmpDir} -o ~{basename(inBam, '.bam')}_sorted.bam
    fi
```
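The `-F 2048` option above excludes records whose SAM FLAG has the supplementary-alignment bit (0x800) set. As a rough illustration of that bit test, here is a minimal Python sketch; the helper names `is_supplementary` and `passes_exclude_filter` are hypothetical and are not part of the workflow.

```python
# SAM FLAG bit for supplementary alignments (0x800 == 2048).
SUPPLEMENTARY = 0x800


def is_supplementary(flag: int) -> bool:
    """Return True when the supplementary-alignment bit is set in a SAM FLAG."""
    return bool(flag & SUPPLEMENTARY)


def passes_exclude_filter(flag: int, exclude_mask: int = SUPPLEMENTARY) -> bool:
    """Mimic `samtools view -F <mask>`: keep a record only if no mask bit is set."""
    return (flag & exclude_mask) == 0


# A properly paired primary read (flag 99) is kept; the same flag with
# the supplementary bit added (99 + 2048) is dropped.
print(passes_exclude_filter(99), passes_exclude_filter(99 + 2048))
```

Because the test is a bit mask, a read is dropped whenever the 0x800 bit is set, regardless of which other flag bits accompany it.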

### Classify reads with xenoclassify.py script:

```
 
 # xenoclassify.py is run on name-sorted bams from host and graft

 python3 $XENOCLASSIFY_ROOT/bin/xenoclassify/xenoclassify.py -H HOST_BAM -G GRAFT_BAM -O . -b -p PREFIX \
                                                             -n NEITHER_THRESHOLD -t TOLERANCE -d DIFFERENCE
 ...

 command = "samtools view PREFIX_output.bam | awk '{print $NF}' | sort | uniq -c"
 counts = os.popen(command).read().splitlines()

 # The following Python code looks for host/graft (+ neither, both) classifications
 # and writes a summary into a .json file

 for line in counts:
     # Skip lines that start with the tag itself (no count column)
     if line.find('CL:Z:') == 0:
         continue
     line = line.rstrip()
     line = re.sub('CL:Z:', '', line)
     # tmp[0] is the read count, tmp[1] is the classification
     tmp = line.split()
     jsonDict[tmp[1]] = tmp[0]

 with open(json_name, 'w') as json_file:
     json.dump(jsonDict, json_file)
 
```
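To make the summary step easier to follow, the sketch below parses `uniq -c` style lines into the same kind of dictionary. It is an illustration only: the function name `parse_tag_counts` is hypothetical, the sample counts are borrowed from the single-lane example report in this document, and unlike the workflow code it skips any line that does not have exactly a count and a `CL:Z:` tag.

```python
import json
import re


def parse_tag_counts(lines):
    """Turn `uniq -c` output such as '  286794 CL:Z:both' into {'both': '286794'}."""
    report = {}
    for line in lines:
        fields = line.split()
        # Expect exactly two columns: a count and a CL:Z:<class> tag.
        if len(fields) != 2 or not fields[1].startswith("CL:Z:"):
            continue
        count, tag = fields
        # Counts stay as strings, matching the report format shown in this document.
        report[re.sub(r"^CL:Z:", "", tag)] = count
    return report


# Illustrative input only.
sample = [
    " 286794 CL:Z:both",
    "5607744 CL:Z:graft",
    " 348954 CL:Z:host",
    "   1140 CL:Z:neither",
]
print(json.dumps(parse_tag_counts(sample), indent=2))
```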

### Filter reads matching the supplied classification(s) and provision filtered .bam along with its index

```
 
  # The following Python code splits the supplied tags and then
  # removes all matching reads from the bam file

  ...

  inputTags = "~{sep=' ' filterTags}"
  tags = inputTags.split()

  command = "samtools view -h XENOCLASSIFY_BAM"
  for t in tags:
    command = command + " | grep -v 'CL:Z:" + t + "'"

  command = command + " | samtools sort -O bam -T TMP_DIR -o OUTPUT_PREFIX_filtered.bam -"
  os.system(command)

  # The filtered bam is then indexed (shell command)
  samtools index OUTPUT_PREFIX_filtered.bam OUTPUT_PREFIX_filtered.bai

```
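The chain of `grep -v` calls above drops every SAM line that contains one of the unwanted `CL:Z:` tags. The same logic can be sketched in plain Python as below; `drop_tagged_reads` is a hypothetical helper and the SAM records are made-up minimal examples, not workflow output.

```python
def drop_tagged_reads(sam_lines, tags):
    """Yield SAM lines, skipping any line that carries CL:Z:<tag> for a tag in `tags`."""
    needles = ["CL:Z:" + t for t in tags]
    for line in sam_lines:
        # Same substring test as `grep -v 'CL:Z:<tag>'`, applied once per tag.
        if any(needle in line for needle in needles):
            continue
        yield line


# Made-up records: one header line and three reads with different classifications.
sample = [
    "@HD\tVN:1.6\tSO:queryname",
    "r1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*\tCL:Z:graft",
    "r2\t99\tchr1\t300\t60\t50M\t=\t400\t150\t*\t*\tCL:Z:host",
    "r3\t99\tchr1\t500\t60\t50M\t=\t600\t150\t*\t*\tCL:Z:both",
]
kept = list(drop_tagged_reads(sample, ["host", "both"]))
print(len(kept))
```

As with `grep -v`, this is a substring match on the whole line, so a header line that happened to contain the same text would also be removed.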

### Extract reads from filtered file into fastq format with picard:

```
 set -euo pipefail
 unset _JAVA_OPTIONS
 java -Xmx32G -jar picard.jar SamToFastq I=FILTERED_BAM F=FILTERED_1.fastq F2=FILTERED_2.fastq ADDITIONAL_PARAMETERS
 gzip -c FILTERED_1.fastq > OUTPUT_PREFIX_part_1.fastq.gz
 gzip -c FILTERED_2.fastq > OUTPUT_PREFIX_part_2.fastq.gz

```

### In addition, we run second pass STAR alignments with reads from the filtered bam extracted into fastq


Merging reports is needed only for multi-lane transcriptome data processing.
For multi-lane data we also add lane identifiers.

```
   import json
   import re

   r = "~{sep=' ' inputReports}"
   inputJsons = r.split()
   inputRgs = "~{sep=' ' inputRgs}"

   data = {}

   def jsonRead(fileName):
       with open(fileName, "r") as f:
           jsonText = f.readlines()
           jsonText = "".join(jsonText)
           jsonText = jsonText.strip()
       return json.loads(jsonText)

   # Extract the value that follows each "ID:" field of the read-group strings
   matches = re.findall(r'(?<=ID:)(\S*)', inputRgs)

   if len(inputJsons) > 1:
       for j in range(len(inputJsons)):
           if matches[j]:
               data[matches[j]] = jsonRead(inputJsons[j])
   else:
       data = jsonRead(inputJsons[0])

   metrics_file = "~{outputPrefix}_tagReport.json"
   with open(metrics_file, "w") as m:
       m.write(json.dumps(data, indent=2))

```
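The merge step keys each per-lane report by the `ID:` value found in the read-group string. The sketch below shows that idea on in-memory dictionaries. It assumes the read groups arrive as whitespace-separated `KEY:value` fields; the `SM:SAMPLE` fields and the function name `merge_lane_reports` are hypothetical. The lane identifiers and counts are taken from the multi-lane example in this document.

```python
import re


def merge_lane_reports(reports, read_groups):
    """Key each per-lane report by the ID: value found in the read-group string."""
    ids = re.findall(r"(?<=ID:)(\S*)", read_groups)
    if len(reports) == 1:
        # Single-lane data: the report is passed through unchanged.
        return reports[0]
    # Pair the n-th identifier with the n-th report, skipping empty identifiers.
    return {lane_id: rep for lane_id, rep in zip(ids, reports) if lane_id}


read_groups = (
    "ID:210601_A00469_0179_BHCKFVDRXY_1_CTGTTGAC-ACCTCAGT SM:SAMPLE "
    "ID:210430_A00469_0173_AH7NV2DRXY_2_CTGTTGAC-ACCTCAGT SM:SAMPLE"
)
reports = [
    {"both": "13188", "host": "5184", "graft": "233018"},
    {"both": "11904", "host": "5076", "graft": "217000"},
]
merged = merge_lane_reports(reports, read_groups)
print(sorted(merged))
```

The pairing relies on the reports and read groups being supplied in the same order.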

### Merging and indexing bam files

```
  set -euo pipefail
  samtools merge -o ~{outputPrefix}_filtered.bam ~{sep=" " inputBams}
  samtools index ~{outputPrefix}_filtered.bam ~{outputPrefix}_filtered.bai
```

### Examples of json report for single-lane and multi-lane data:

single-lane:

{
  "both": "286794",
  "host": "348954",
  "neither": "1140",
  "graft": "5607744"
}

multi-lane:

{
  "210601_A00469_0179_BHCKFVDRXY_1_CTGTTGAC-ACCTCAGT": {
    "both": "13188",
    "host": "5184",
    "graft": "233018"
  },
  "210430_A00469_0173_AH7NV2DRXY_2_CTGTTGAC-ACCTCAGT": {
    "both": "11904",
    "host": "5076",
    "graft": "217000"
  }
}
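
Note that the counts in these reports are stored as strings. A downstream consumer would need to convert them before doing arithmetic; the following is a minimal sketch of one way to do that for a single-lane report, with `class_fractions` being a hypothetical helper that is not part of the workflow.

```python
def class_fractions(report):
    """Convert string counts from a single-lane report into fractions of all classified reads."""
    counts = {name: int(value) for name, value in report.items()}
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for an empty or all-zero report.
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


# Counts copied from the single-lane example above.
single_lane = {
    "both": "286794",
    "host": "348954",
    "neither": "1140",
    "graft": "5607744",
}
fractions = class_fractions(single_lane)
for name, value in fractions.items():
    print(f"{name}: {value:.4f}")
```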