Bacterial assemblies using Nextera Mate pairs
Nextera Mate-Pair libraries are a popular assay for scaffolding genome assemblies. They can also be used as a standalone assay to generate very nice assemblies (without the need for a separate paired-end library).
These are some assembly results for 9 common bacteria with 2 repeated libraries of each sample (so 18 libraries in total), all sequenced in a single multiplexed MiSeq run. I tried assembling them with both Velvet 1.2.10 and SPAdes 3.7.0. Adapter trimming was performed with nxtrim v0.4.0. The data are available at https://basespace.illumina.com/s/TXv32Ve6wTl9 (free registration required). I evaluated the assemblies with QUAST 3.2.
Velvet commands were:
nxtrim -1 EcMG1_R1.fastq.gz -2 EcMG1_R2.fastq.gz -O EcMG1
velveth output $kmer -short -fastq.gz EcMG1.se.fastq.gz -shortPaired2 -fastq.gz EcMG1.pe.fastq.gz -shortPaired3 -fastq.gz EcMG1.mp.fastq.gz -shortPaired4 -fastq.gz EcMG1.unknown.fastq.gz
velvetg output -exp_cov auto -cov_cutoff auto -shortMatePaired4 yes -very_clean yes
I assembled with multiple k-mer sizes and selected the one with the largest contig N50. In my experience, Velvet rarely makes assembly errors within contigs (this is not the case for its scaffolding), and contig N50 correlates strongly with the number of genes recovered. So this seemed like a reasonable parameter selection routine.
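The selection step can be sketched in shell. This is a minimal sketch, not the script actually used here: `contig_n50` and `best_by_n50` are hypothetical helper names, and the per-k output directories in the usage comment are assumed. N50 is computed over the FASTA records as written, so if an assembler writes scaffolded sequences into the file, each one is counted as a single sequence.

```shell
# Print the N50 of a FASTA file: the length L such that sequences of
# length >= L together contain at least half of all bases.
contig_n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (2 * run >= total) { print lens[i]; exit }
             }
         }'
}

# Given several FASTA files (one per k-mer size), print the file with the
# largest N50 and that N50, separated by a tab.
best_by_n50() {
    best_file=""
    best_n50=0
    for f in "$@"; do
        n50=$(contig_n50 "$f")
        n50=${n50:-0}              # an empty assembly counts as N50 = 0
        if [ "$n50" -gt "$best_n50" ]; then
            best_n50=$n50
            best_file=$f
        fi
    done
    printf '%s\t%s\n' "$best_file" "$best_n50"
}

# Usage (assumed directory layout, one Velvet run per k):
#   best_by_n50 output_k53/contigs.fa output_k55/contigs.fa output_k57/contigs.fa
```

Keeping the N50 calculation separate from the assembler calls means the same helper can be pointed at any set of FASTA files.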
SPAdes commands were:
nxtrim -1 EcMG1_R1.fastq.gz -2 EcMG1_R2.fastq.gz -O EcMG1 --justmp
cat EcMG1.unknown.fastq.gz EcMG1.mp.fastq.gz > EcMG1.allmp.fastq.gz
spades.py -t 10 --hqmp1-12 EcMG1.allmp.fastq.gz -o EcMG1-spades
SPAdes cleverly assembles using multiple k-mer sizes (default 21,33,55,77), meaning k-mer selection is not required.
The NGA50 metric for each library (in kb) is in the table below. The assemblies are quite nice, having correct contigs in the 100s of kb (and sometimes over a megabase). Notably, SPAdes produces far larger contigs than Velvet for every sample except Listeria monocytogenes.
Sample | Coverage | Velvet k-mer | Velvet Scaffold NGA50 | Velvet Contig NGA50 | SPAdes Scaffold NGA50 | SPAdes Contig NGA50 |
---|---|---|---|---|---|---|
Bacillus cereus ATCC 10987 | 19.97 | 53 | 602 | 80 | 1201 | 578 |
Bacillus cereus ATCC 10987 | 22.69 | 53 | 1695 | 106 | 1514 | 582 |
Escherichia coli K-12 DH10B | 40.58 | 61 | 1034 | 210 | 700 | 459 |
Escherichia coli K-12 DH10B | 29.47 | 63 | 1430 | 166 | 698 | 425 |
Escherichia coli K-12 MG1655 | 28.63 | 61 | 2809 | 203 | 651 | 487 |
Escherichia coli K-12 MG1655 | 30.23 | 57 | 1718 | 192 | 694 | 530 |
Listeria monocytogenes | 57.88 | 97 | 2195 | 1578 | 1497 | 1497 |
Listeria monocytogenes | 45.24 | 81 | 2927 | 2424 | 1496 | 1496 |
Meiothermus ruber DSM 1279 | 45.79 | 79 | 2117 | 260 | 724 | 724 |
Meiothermus ruber DSM 1279 | 40.81 | 63 | 1487 | 203 | 1379 | 1379 |
Pedobacter heparinus DSM 2366 | 30.34 | 71 | 3348 | 334 | 1264 | 959 |
Pedobacter heparinus DSM 2366 | 22.89 | 53 | 2809 | 291 | 957 | 902 |
Klebsiella pneumoniae MGH 78578 | 27.66 | 61 | 4107 | 169 | 643 | 377 |
Klebsiella pneumoniae MGH 78578 | 24.97 | 59 | 1319 | 176 | 579 | 357 |
Rhodobacter sphaeroides 2.4.1 | 32.52 | 53 | 2737 | 201 | 3182 | 617 |
Rhodobacter sphaeroides 2.4.1 | 37.87 | 63 | 3185 | 267 | 3183 | 699 |
Mycobacterium tuberculosis H37Ra | 39.14 | 71 | 239 | 81 | 258 | 143 |
Mycobacterium tuberculosis H37Ra | 32.57 | 53 | 171 | 84 | 186 | 145 |
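Coverage figures like those in the table can be approximated as total sequenced bases divided by genome size. The sketch below is one rough way to do that, not necessarily how the numbers above were obtained (they may have been computed after trimming or from mapped reads). `estimate_coverage` is a hypothetical helper, it assumes standard 4-line FASTQ records, and the genome size in the usage comment (about 4.64 Mb for E. coli K-12 MG1655) is supplied by the user.

```shell
# Estimate coverage as (total bases in the reads) / (genome size in bp).
# $1: genome size in bp; remaining arguments: gzipped FASTQ files.
estimate_coverage() {
    genome_size=$1
    shift
    # Concatenated gzip streams decompress to concatenated FASTQ, so the
    # sequence is always the 2nd line of each 4-line record.
    gzip -dc "$@" |
    awk -v g="$genome_size" 'NR % 4 == 2 { bases += length($0) }
                             END { printf "%.2f\n", bases / g }'
}

# Usage (file names follow the commands in this page):
#   estimate_coverage 4641652 EcMG1_R1.fastq.gz EcMG1_R2.fastq.gz
```

Passing the files of several libraries at once sums their bases, which gives the coverage of the combined data.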
Some of the libraries have quite low coverage. Indeed, the purpose of this experiment was to see how many bugs we could cram onto one flowcell and still obtain respectable assemblies. We might be able to further improve assemblies by combining the libraries and hence doubling the coverage. I tried doing this with SPAdes in two different ways.
A: Treating all read pairs as definite mate-pairs. This is the same as before: we treat the uncertain read pairs the same as the known mate-pairs. The risk with this approach is that some of the unknowns may in fact be paired-end.
spades.py -t 10 -o EcMG-merged.v1 --hqmp1-12 EcMG1.allmp.fastq.gz --hqmp2-12 EcMG2.allmp.fastq.gz
B: Ignoring pairing for uncertain pairs. Here we still use the unknown library, but feed it to SPAdes as a single-ended library, so pairing is ignored. The idea is that we still leverage the coverage of these data without the risk of PE contaminants perturbing our assembly.
spades.py -t 10 -o EcMG-merged.v2 --hqmp1-12 EcMG1.mp.fastq.gz --hqmp2-12 EcMG2.mp.fastq.gz --s1 EcMG1.unknown.fastq.gz --s2 EcMG2.unknown.fastq.gz
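To repeat either configuration across samples, the two command lines can be templated on a sample prefix. This is only a sketch: `merged_spades_cmd` is a hypothetical helper, it assumes every sample's files follow the same `<prefix>1.*` / `<prefix>2.*` naming as the EcMG example, and it prints the command rather than running it so the result can be inspected first.

```shell
# Print (do not run) the merged-library SPAdes command for one sample.
# $1: configuration (A or B); $2: sample prefix, e.g. EcMG
merged_spades_cmd() {
    case "$1" in
        A)  # all pairs, including the uncertain ones, treated as mate-pairs
            echo "spades.py -t 10 -o $2-merged.v1 --hqmp1-12 ${2}1.allmp.fastq.gz --hqmp2-12 ${2}2.allmp.fastq.gz" ;;
        B)  # uncertain pairs passed as single-ended reads
            echo "spades.py -t 10 -o $2-merged.v2 --hqmp1-12 ${2}1.mp.fastq.gz --hqmp2-12 ${2}2.mp.fastq.gz --s1 ${2}1.unknown.fastq.gz --s2 ${2}2.unknown.fastq.gz" ;;
        *)
            echo "unknown configuration: $1" >&2
            return 1 ;;
    esac
}

# Usage: print both commands for the EcMG libraries
merged_spades_cmd A EcMG
merged_spades_cmd B EcMG
```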
SPAdes configuration A performs better than configuration B on this data: 6/9 samples have a higher contig NGA50 with A, one is tied (Listeria monocytogenes), and two are higher with B (Meiothermus ruber and Mycobacterium tuberculosis). These libraries have very low rates of paired-end contamination, so your mileage may vary. The higher coverage improves some of the assemblies, but not all.
Sample | Coverage | SPAdes A Scaffold NGA50 | SPAdes A Contig NGA50 | SPAdes B Scaffold NGA50 | SPAdes B Contig NGA50 |
---|---|---|---|---|---|
Bacillus cereus ATCC 10987 | 42.66 | 1660 | 578 | 1654 | 354 |
Escherichia coli K-12 DH10B | 70.05 | 700 | 459 | 592 | 374 |
Escherichia coli K-12 MG1655 | 58.86 | 694 | 530 | 541 | 197 |
Listeria monocytogenes | 103.12 | 1497 | 1497 | 1497 | 1497 |
Meiothermus ruber DSM 1279 | 86.6 | 1339 | 1339 | 1716 | 1360 |
Pedobacter heparinus DSM 2366 | 53.23 | 1264 | 959 | 1262 | 902 |
Klebsiella pneumoniae MGH 78578 | 52.63 | 698 | 393 | 393 | 235 |
Rhodobacter sphaeroides 2.4.1 | 70.39 | 3183 | 3183 | 3180 | 461 |
Mycobacterium tuberculosis H37Ra | 71.71 | 186 | 161 | 198 | 185 |