Bacterial assemblies using Nextera Mate pairs

Nextera Mate-Pair libraries are a popular assay for scaffolding genome assemblies. They can also be used as a standalone assay to generate very nice assemblies, without the need for a separate paired-end library.

These are assembly results for 9 common bacteria, with 2 replicate libraries per sample (18 libraries in total), all sequenced in a single multiplexed MiSeq run. I assembled them with both Velvet 1.2.10 and SPAdes 3.7.0. Adapter trimming was performed with nxtrim v0.4.0. The data are available at https://basespace.illumina.com/s/TXv32Ve6wTl9 (free registration required). I evaluated the assemblies with QUAST 3.2.

Lower-coverage, high throughput example

Velvet commands were:

nxtrim  -1 EcMG1_R1.fastq.gz -2  EcMG1_R2.fastq.gz -O EcMG1
velveth output $kmer -short -fastq.gz EcMG1.se.fastq.gz -shortPaired2 -fastq.gz EcMG1.pe.fastq.gz -shortPaired3 -fastq.gz EcMG1.mp.fastq.gz -shortPaired4 -fastq.gz EcMG1.unknown.fastq.gz
velvetg output -exp_cov auto -cov_cutoff auto -shortMatePaired4 yes -very_clean yes

I assembled with multiple k-mer sizes and selected the one with the largest contig N50. In my experience, Velvet rarely makes assembly errors within contigs (this is not the case for its scaffolding), and contig N50 correlates strongly with the number of genes recovered, so this seemed like a reasonable parameter selection routine.
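
The selection routine described above could be scripted roughly as follows. This is a minimal sketch, not the script actually used for these results: the helper names (contig_n50, assemble_k, best_k), the output directory naming and the k-mer range in the usage comment are assumptions made for illustration. It also assumes Velvet was compiled with enough CATEGORIES and a large enough MAXKMERLENGTH for four read libraries and the k-mer sizes tried.

```shell
# Contig N50 of a FASTA file: the length L such that sequences of length >= L
# contain at least half of all assembled bases.
contig_n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -nr |
    awk '{ lens[NR] = $1; total += $1 }
         END { for (i = 1; i <= NR; i++) {
                   run += lens[i]
                   if (2 * run >= total) { print lens[i]; exit }
               } }'
}

# Run the Velvet commands from this page for one k-mer size and print
# "k<TAB>N50". The output directory name is a hypothetical convention.
assemble_k() {
    prefix=$1
    k=$2
    out="${prefix}-velvet-k${k}"
    velveth "$out" "$k" -short -fastq.gz "${prefix}.se.fastq.gz" \
        -shortPaired2 -fastq.gz "${prefix}.pe.fastq.gz" \
        -shortPaired3 -fastq.gz "${prefix}.mp.fastq.gz" \
        -shortPaired4 -fastq.gz "${prefix}.unknown.fastq.gz" > "${out}.log" 2>&1
    velvetg "$out" -exp_cov auto -cov_cutoff auto -shortMatePaired4 yes \
        -very_clean yes >> "${out}.log" 2>&1
    printf '%s\t%s\n' "$k" "$(contig_n50 "${out}/contigs.fa")"
}

# Read "k<TAB>N50" lines on stdin and print the k with the largest N50.
best_k() {
    sort -k2,2nr | head -n 1 | cut -f1
}

# Usage (the k-mer range here is an assumption, not the one used above):
# for k in $(seq 53 4 97); do assemble_k EcMG1 "$k"; done | best_k
```

Note that contig_n50 measures whatever sequences are in contigs.fa, which may include scaffold gaps when paired libraries are used, so its values need not match the QUAST figures reported on this page.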

SPAdes commands were:

nxtrim  -1 EcMG1_R1.fastq.gz -2  EcMG1_R2.fastq.gz -O EcMG1 --justmp
cat EcMG1.unknown.fastq.gz EcMG1.mp.fastq.gz > EcMG1.allmp.fastq.gz
spades.py -t 10 --hqmp1-12 EcMG1.allmp.fastq.gz -o EcMG1-spades

SPAdes cleverly assembles using multiple k-mer sizes (default 21,33,55,77) meaning parameter selection is not required.

The NGA50 metric (in kb) for each library is in the table below. The assemblies are quite nice, with correct contigs in the hundreds of kb (and sometimes over a megabase). Notably, SPAdes produces far larger contigs than Velvet for most samples.

		Velvet			SPAdes
Sample	Coverage (x)	k-mer	Scaffold NGA50 (kb)	Contig NGA50 (kb)	Scaffold NGA50 (kb)	Contig NGA50 (kb)
Bacillus cereus ATCC 10987	19.97	53	602	80	1201	578
Bacillus cereus ATCC 10987	22.69	53	1695	106	1514	582
Escherichia coli K-12 DH10B	40.58	61	1034	210	700	459
Escherichia coli K-12 DH10B	29.47	63	1430	166	698	425
Escherichia coli K-12 MG1655	28.63	61	2809	203	651	487
Escherichia coli K-12 MG1655	30.23	57	1718	192	694	530
Listeria monocytogenes	57.88	97	2195	1578	1497	1497
Listeria monocytogenes	45.24	81	2927	2424	1496	1496
Meiothermus ruber DSM 1279	45.79	79	2117	260	724	724
Meiothermus ruber DSM 1279	40.81	63	1487	203	1379	1379
Pedobacter heparinus DSM 2366	30.34	71	3348	334	1264	959
Pedobacter heparinus DSM 2366	22.89	53	2809	291	957	902
Klebsiella pneumoniae MGH 78578	27.66	61	4107	169	643	377
Klebsiella pneumoniae MGH 78578	24.97	59	1319	176	579	357
Rhodobacter sphaeroides 2.4.1	32.52	53	2737	201	3182	617
Rhodobacter sphaeroides 2.4.1	37.87	63	3185	267	3183	699
Mycobacterium tuberculosis H37Ra	39.14	71	239	81	258	143
Mycobacterium tuberculosis H37Ra	32.57	53	171	84	186	145

Increasing coverage for better assemblies

Some of the libraries have quite low coverage. Indeed, the purpose of this experiment was to see how many bugs we could cram onto one flowcell and still obtain respectable assemblies. We might be able to further improve assemblies by combining the libraries and hence doubling the coverage. I tried doing this with SPAdes in two different ways.

A: Treating all read pairs as definite mate-pairs: this is the same as before; we treat the uncertain read pairs the same as the known mate-pairs. The risk with this approach is that some of the unknowns may in fact be paired-end.

spades.py -t 10 -o EcMG-merged.v1 --hqmp1-12 EcMG1.allmp.fastq.gz --hqmp2-12 EcMG2.allmp.fastq.gz

B: Ignoring pairing for uncertain pairs: here we still use the unknown library, but feed it to SPAdes as a single-end library, so pairing is ignored. The idea is that we still leverage the coverage of these data, without the risk of PE contaminants perturbing our assembly.

spades.py -t 10 -o EcMG-merged.v2 --hqmp1-12 EcMG1.mp.fastq.gz --hqmp2-12 EcMG2.mp.fastq.gz --s1 EcMG1.unknown.fastq.gz --s2 EcMG2.unknown.fastq.gz
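
To make the difference between the two configurations explicit, here is a small dry-run sketch that only prints the SPAdes command for either mode rather than running it. The function name spades_merge_cmd and its argument order are hypothetical; the SPAdes options are the ones used in the commands above. Mode A assumes the *.allmp.fastq.gz files were already created with cat as in the single-library example.

```shell
# Print (do not run) the SPAdes command that merges two replicate libraries.
#   mode A: unknown pairs pooled with the mate pairs (*.allmp.fastq.gz)
#   mode B: unknown pairs passed as single reads, so pairing is ignored
spades_merge_cmd() {
    mode=$1
    lib1=$2
    lib2=$3
    out=$4
    case "$mode" in
        A) echo "spades.py -t 10 -o $out --hqmp1-12 $lib1.allmp.fastq.gz --hqmp2-12 $lib2.allmp.fastq.gz" ;;
        B) echo "spades.py -t 10 -o $out --hqmp1-12 $lib1.mp.fastq.gz --hqmp2-12 $lib2.mp.fastq.gz --s1 $lib1.unknown.fastq.gz --s2 $lib2.unknown.fastq.gz" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Usage: print both variants for the E. coli MG1655 replicates.
spades_merge_cmd A EcMG1 EcMG2 EcMG-merged.v1
spades_merge_cmd B EcMG1 EcMG2 EcMG-merged.v2
```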

On these data, SPAdes configuration A performs better than configuration B: contig NGA50 is higher with A in 6/9 samples, tied in one (Listeria monocytogenes) and higher with B in two. These libraries have very low rates of paired-end contamination, so your mileage may vary. The higher coverage improves some of the assemblies, but not all.

		SPAdes A		SPAdes B
Sample	Coverage (x)	Scaffold NGA50 (kb)	Contig NGA50 (kb)	Scaffold NGA50 (kb)	Contig NGA50 (kb)
Bacillus cereus ATCC 10987	42.66	1660	578	1654	354
Escherichia coli K-12 DH10B	70.05	700	459	592	374
Escherichia coli K-12 MG1655	58.86	694	530	541	197
Listeria monocytogenes	103.12	1497	1497	1497	1497
Meiothermus ruber DSM 1279	86.6	1339	1339	1716	1360
Pedobacter heparinus DSM 2366	53.23	1264	959	1262	902
Klebsiella pneumoniae MGH 78578	52.63	698	393	393	235
Rhodobacter sphaeroides 2.4.1	70.39	3183	3183	3180	461
Mycobacterium tuberculosis H37Ra	71.71	186	161	198	185
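
The contig NGA50 comparison between configurations A and B can be tallied directly from the table above. The snippet below is only an illustrative sketch: the function name tally_a_vs_b and the comma-separated layout (sample, contig NGA50 for A, contig NGA50 for B) are assumptions, while the numbers are copied from the table.

```shell
# Count samples where contig NGA50 is higher for A, higher for B, or tied.
# Input on stdin: sample,contig_NGA50_A,contig_NGA50_B
tally_a_vs_b() {
    awk -F',' '
        $2 >  $3 { a++ }
        $2 <  $3 { b++ }
        $2 == $3 { tie++ }
        END { printf "A higher: %d, B higher: %d, tied: %d\n", a, b, tie }'
}

tally_a_vs_b <<'EOF'
Bacillus cereus ATCC 10987,578,354
Escherichia coli K-12 DH10B,459,374
Escherichia coli K-12 MG1655,530,197
Listeria monocytogenes,1497,1497
Meiothermus ruber DSM 1279,1339,1360
Pedobacter heparinus DSM 2366,959,902
Klebsiella pneumoniae MGH 78578,393,235
Rhodobacter sphaeroides 2.4.1,3183,461
Mycobacterium tuberculosis H37Ra,161,185
EOF
# prints: A higher: 6, B higher: 2, tied: 1
```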

References

SPAdes:

Bankevich, Anton, et al. "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing." Journal of Computational Biology 19.5 (2012): 455-477.

SPAdes mate-pair assembly publication:

Prjibelski, Andrey D., et al. "ExSPAnder: a universal repeat resolver for DNA fragment assembly." Bioinformatics 30.12 (2014): i293-i301.

Velvet papers:

Zerbino, Daniel R., and Ewan Birney. "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Research 18.5 (2008): 821-829.

Zerbino, Daniel R., et al. "Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler." PLoS ONE 4.12 (2009): e8407.

NxTrim:

O’Connell, Jared, et al. "NxTrim: optimized trimming of Illumina mate pair reads." Bioinformatics 31.12 (2015): 2035-2037.
