
Bacterial assemblies using Nextera Mate pairs

Jared O'Connell edited this page Mar 28, 2017 · 2 revisions

Nextera Mate-Pair libraries are a popular assay for scaffolding genome assemblies. They can also be used as a standalone assay to generate very nice assemblies, without the need for a separate paired-end library.

These are assembly results for 9 common bacteria, with 2 replicate libraries per sample (18 libraries in total), all sequenced in a single multiplexed MiSeq run. I assembled them with both Velvet 1.2.10 and SPAdes 3.7.0. Adapter trimming was performed with nxtrim v0.4.0. The data are available here (free registration required). I evaluated the assemblies with QUAST 3.2.

Lower-coverage, high-throughput example

Velvet commands were:

nxtrim  -1 EcMG1_R1.fastq.gz -2  EcMG1_R2.fastq.gz -O EcMG1
velveth output $kmer -short -fastq.gz EcMG1.se.fastq.gz -shortPaired2 -fastq.gz EcMG1.pe.fastq.gz -shortPaired3 -fastq.gz EcMG1.mp.fastq.gz -shortPaired4 -fastq.gz EcMG1.unknown.fastq.gz
velvetg output -exp_cov auto -cov_cutoff auto -shortMatePaired4 yes -very_clean yes

I assembled with multiple k-mer sizes and selected the one with the largest contig N50. In my experience, Velvet rarely makes assembly errors within contigs (this is not the case for its scaffolding), and contig N50 correlates strongly with the number of genes recovered. So this seemed like a reasonable parameter-selection routine.
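
The selection routine above can be sketched in shell. This is a minimal sketch, not the script actually used for these results. It assumes each Velvet run wrote its output to a hypothetical directory named `<prefix>_k<kmer>` containing `contigs.fa`. It computes N50 over the sequences in that file as-is; with scaffolding enabled, velvetg writes scaffolds to `contigs.fa`, so a true contig N50 would first need scaffolds split at runs of N. The function names `fasta_n50` and `best_kmer` are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: compute the N50 of a FASTA file and pick the k-mer whose assembly
# has the largest N50. The directory layout (<prefix>_k<kmer>/contigs.fa)
# is an assumption made for this example.

# Print the N50 of all sequences in the FASTA file given as $1.
fasta_n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             # Walk lengths from longest to shortest until half the total is covered.
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { print lens[i]; exit }
             }
         }'
}

# Print "kmer<TAB>N50" for the best assembly among directories <prefix>_k*.
best_kmer() {
    local prefix=$1 best_k="" best_n50=0 dir k n50
    for dir in "${prefix}"_k*; do
        [ -f "$dir/contigs.fa" ] || continue
        k=${dir##*_k}
        n50=$(fasta_n50 "$dir/contigs.fa")
        if [ "${n50:-0}" -gt "$best_n50" ]; then
            best_n50=$n50
            best_k=$k
        fi
    done
    printf '%s\t%s\n' "$best_k" "$best_n50"
}
```

For example, after running velveth and velvetg once per k-mer into `output_k53`, `output_k61`, and so on, `best_kmer output` prints the winning k-mer and its N50, separated by a tab.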

SPAdes commands were:

nxtrim  -1 EcMG1_R1.fastq.gz -2  EcMG1_R2.fastq.gz -O EcMG1 --justmp
cat EcMG1.mp.fastq.gz EcMG1.unknown.fastq.gz > EcMG1.allmp.fastq.gz
spades.py -t 10 --hqmp1-12 EcMG1.allmp.fastq.gz -o EcMG1-spades

SPAdes cleverly assembles using multiple k-mer sizes (default 21,33,55,77) meaning parameter selection is not required.

The NGA50 metric for each library is in the table below. The assemblies are quite nice, with correct contigs in the 100s of kb (and sometimes over a megabase). Notably, SPAdes produces larger contigs than Velvet for every sample except Listeria monocytogenes.

| Sample | Coverage | Velvet k-mer | Velvet scaffold NGA50 (kb) | Velvet contig NGA50 (kb) | SPAdes scaffold NGA50 (kb) | SPAdes contig NGA50 (kb) |
|---|---|---|---|---|---|---|
| Bacillus cereus ATCC 10987 | 19.97 | 53 | 602 | 80 | 1201 | 578 |
| | 22.69 | 53 | 1695 | 106 | 1514 | 582 |
| Escherichia coli K-12 DH10B | 40.58 | 61 | 1034 | 210 | 700 | 459 |
| | 29.47 | 63 | 1430 | 166 | 698 | 425 |
| Escherichia coli K-12 MG1655 | 28.63 | 61 | 2809 | 203 | 651 | 487 |
| | 30.23 | 57 | 1718 | 192 | 694 | 530 |
| Listeria monocytogenes | 57.88 | 97 | 2195 | 1578 | 1497 | 1497 |
| | 45.24 | 81 | 2927 | 2424 | 1496 | 1496 |
| Meiothermus ruber DSM 1279 | 45.79 | 79 | 2117 | 260 | 724 | 724 |
| | 40.81 | 63 | 1487 | 203 | 1379 | 1379 |
| Pedobacter heparinus DSM 2366 | 30.34 | 71 | 3348 | 334 | 1264 | 959 |
| | 22.89 | 53 | 2809 | 291 | 957 | 902 |
| Klebsiella pneumoniae MGH 78578 | 27.66 | 61 | 4107 | 169 | 643 | 377 |
| | 24.97 | 59 | 1319 | 176 | 579 | 357 |
| Rhodobacter sphaeroides 2.4.1 | 32.52 | 53 | 2737 | 201 | 3182 | 617 |
| | 37.87 | 63 | 3185 | 267 | 3183 | 699 |
| Mycobacterium tuberculosis H37Ra | 39.14 | 71 | 239 | 81 | 258 | 143 |
| | 32.57 | 53 | 171 | 84 | 186 | 145 |
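
This page does not say how the coverage column was computed. One rough way to estimate depth for a library is total sequenced bases divided by genome size, sketched below. The function name `estimate_coverage` is illustrative, and the genome size must be supplied by the user. Counting raw bases before trimming will overstate the coverage actually available to the assembler.

```bash
#!/usr/bin/env bash
# Sketch: estimate sequencing depth as (total bases in reads) / (genome size).
# Usage: estimate_coverage GENOME_SIZE_BP reads1.fastq.gz [reads2.fastq.gz ...]
# Assumes standard 4-line FASTQ records in gzip-compressed files.
estimate_coverage() {
    local genome_size=$1
    shift
    gzip -dc "$@" |
    awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }   # the sequence is line 2 of each record
        END { printf "%.2f\n", bases / g }'
}
```

For instance, `estimate_coverage 4641652 EcMG1_R1.fastq.gz EcMG1_R2.fastq.gz` would give a raw-read estimate for the first E. coli K-12 MG1655 library, taking 4,641,652 bp as the reference genome length.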

Increasing coverage for better assemblies

Some of the libraries have quite low coverage. Indeed, the purpose of this experiment was to see how many bugs we could cram onto one flowcell and still obtain respectable assemblies. We might be able to further improve assemblies by combining the libraries and hence doubling the coverage. I tried doing this with SPAdes in two different ways.

A: Treating all read pairs as definite mate-pairs: this is the same as before; we treat the uncertain read pairs the same as the known mate-pairs. The risk with this approach is that some of the unknowns may in fact be paired-end.

spades.py -t 10 -o EcMG-merged.v1 --hqmp1-12 EcMG1.allmp.fastq.gz --hqmp2-12 EcMG2.allmp.fastq.gz

B: Ignoring pairing for uncertain pairs: here we still use the unknown library, but feed it to SPAdes as a single-ended library, so pairing is ignored. The idea is that we still leverage the coverage of these data, without the risk of PE contaminants perturbing our assembly.

spades.py -t 10 -o EcMG-merged.v2 --hqmp1-12 EcMG1.mp.fastq.gz --hqmp2-12 EcMG2.mp.fastq.gz --s1 EcMG1.unknown.fastq.gz --s2 EcMG2.unknown.fastq.gz

SPAdes configuration A performs better than configuration B on these data: 6 of 9 samples have a higher contig NGA50 under A, one is tied (Listeria monocytogenes), and two are higher under B. These libraries have very low rates of paired-end contamination, so your mileage may vary. The higher coverage improves some of the assemblies, but not all.

| Sample | Coverage | SPAdes A scaffold NGA50 (kb) | SPAdes A contig NGA50 (kb) | SPAdes B scaffold NGA50 (kb) | SPAdes B contig NGA50 (kb) |
|---|---|---|---|---|---|
| Bacillus cereus ATCC 10987 | 42.66 | 1660 | 578 | 1654 | 354 |
| Escherichia coli K-12 DH10B | 70.05 | 700 | 459 | 592 | 374 |
| Escherichia coli K-12 MG1655 | 58.86 | 694 | 530 | 541 | 197 |
| Listeria monocytogenes | 103.12 | 1497 | 1497 | 1497 | 1497 |
| Meiothermus ruber DSM 1279 | 86.6 | 1339 | 1339 | 1716 | 1360 |
| Pedobacter heparinus DSM 2366 | 53.23 | 1264 | 959 | 1262 | 902 |
| Klebsiella pneumoniae MGH 78578 | 52.63 | 698 | 393 | 393 | 235 |
| Rhodobacter sphaeroides 2.4.1 | 70.39 | 3183 | 3183 | 3180 | 461 |
| Mycobacterium tuberculosis H37Ra | 71.71 | 186 | 161 | 198 | 185 |
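
A per-sample comparison of the two configurations can be tallied directly from the contig NGA50 columns. The sketch below is one way to do it. The function name `tally_contig_nga50` and the comma-separated input layout are assumptions made for this example; the values are copied from the merged-library table.

```bash
#!/usr/bin/env bash
# Sketch: tally which configuration has the higher contig NGA50 per sample.
# Input on stdin: comma-separated lines of "sample,contig_A,contig_B".
tally_contig_nga50() {
    awk -F ',' '
        $2 >  $3 { a++ }
        $2 == $3 { tie++ }
        $2 <  $3 { b++ }
        END { printf "A higher: %d, tie: %d, B higher: %d\n", a, tie, b }'
}

# Contig NGA50 (kb) for SPAdes A and SPAdes B, from the merged-library table.
tally_contig_nga50 <<'EOF'
Bacillus cereus ATCC 10987,578,354
Escherichia coli K-12 DH10B,459,374
Escherichia coli K-12 MG1655,530,197
Listeria monocytogenes,1497,1497
Meiothermus ruber DSM 1279,1339,1360
Pedobacter heparinus DSM 2366,959,902
Klebsiella pneumoniae MGH 78578,393,235
Rhodobacter sphaeroides 2.4.1,3183,461
Mycobacterium tuberculosis H37Ra,161,185
EOF
```

Swapping in the scaffold NGA50 columns gives the same tally for scaffolds.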

SPAdes paper:

Bankevich, Anton, et al. "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing." Journal of Computational Biology 19.5 (2012): 455-477.

SPAdes mate-pair assembly publication:

Prjibelski, Andrey D., et al. "ExSPAnder: a universal repeat resolver for DNA fragment assembly." Bioinformatics 30.12 (2014): i293-i301.

Velvet papers:

Zerbino, Daniel R., and Ewan Birney. "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome research 18.5 (2008): 821-829.

Zerbino, Daniel R., et al. "Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler." PloS one 4.12 (2009): e8407.

NxTrim paper:

O’Connell, Jared, et al. "NxTrim: optimized trimming of Illumina mate pair reads." Bioinformatics 31.12 (2015): 2035-2037.