Skip to content

Multiplicity mistake

Ryan Wick edited this page Sep 5, 2017 · 6 revisions

This hybrid assembly resulted in a nice looking chromosome and three little plasmids, but then there's a linear contig and odd loop. This one is a doozy - you've been warned! 😬 images/miscalculated_multiplicity_01.png

Illumina graph

There's two parts of the final assembly I want to understand:

  • The linear contig: Why is it linear? Is it actually a linear plasmid or is an incompletely-assembled sequence?
  • The eight-contig-loop: Is it a separate circular sequence or does it belong somewhere else? With the linear contig perhaps?

I wanted to see what these sequences were doing in the Illumina graph (001_best_spades_graph.gfa), so I BLASTed them in Bandage: images/miscalculated_multiplicity_02.png

The Illumina graph looks pretty good (no dead ends) but it's quite tangled around that region. To get my head around it, I had Bandage visualise just the contigs which had good BLAST hits. I did this by drawing the graph and then progressively hiding contigs (with delete) until I could only see what I wanted. Increasing the 'Node length per Megabase' setting makes this part of the graph a bit more palatable: images/miscalculated_multiplicity_03.png

The blue contigs align to the final assembly's linear contig and the green contigs BLAST to the eight-contig-loop.

Visualise multiplicity

Unicycler tries to figure out the multiplicity of each contig in the Illumina graph (see more here). This is because it uses the single-copy contigs as the anchors for scaffolding. So the important distinction is single-copy vs repeat.

You can view Unicycler's multiplicity calls in Bandage by switching to the custom colours: images/miscalculated_multiplicity_04.png

Green = single-copy, yellow = 2x, orange = 3x and red = 4x or more. Now we can examine this to see if it makes sense. Did Unicycler make a mistake?

🦅 The eagle-eyed among you may notice that the larger contigs in this part of the graph seem to fall into three groups of depth: 1.4x, 1.9x and 3.2x (roughly speaking, there's a lot of variation around those numbers). And it's no coincidence that 1.4 + 1.9 ≈ 3.2. It seem like we have two very similar sequences, one at about 1.4x depth and one at about 1.9x depth. The fact that these depths are a bit more than 1x (the chromosomal depth) strongly hints that we are looking at plasmids. Where the two plasmids are the same, the sequences collapse together in the assembly to make 3.2x depth contigs (which are 2x repeats): images/miscalculated_multiplicity_05.png

Some of these common sequences are very long: images/miscalculated_multiplicity_06.png

23 kbp is a very long repeat for a bacterial genome! That may be part of the reason Unicycler had trouble. However, the real culprit is here: images/miscalculated_multiplicity_07.png

For some of these 3.2x contigs, Unicycler wrongly decided that they were single-copy. They are actually 2x repeats! This is a problem because Unicycler only uses them once in scaffolding and so cannot correctly scaffold the genome together.

Manual multiplicity

To remedy the situation, I went back to the 001_best_spades_graph.gfa file and manually added ML tags to specify multiplicity. For example:

S	52	TTTTT...TCGAT	LN:i:21445	dp:f:3.253	LB:z:3.253	CL:z:forestgreen	ML:i:2
S	66	CGCAA...CGTCA	LN:i:10064	dp:f:3.219	LB:z:3.219	CL:z:forestgreen	ML:i:2

I then re-ran the Unicycler assembly. This time, Unicycler takes the manually-assigned values into account (read more here) and doesn't make the same mistake. I also used bold mode on this re-assembly to more aggressively merge contigs. Now we get this for our final graph: images/miscalculated_multiplicity_08.png

Better, but still not complete 😩

Final steps

Focusing on the two incomplete components shows us this: images/miscalculated_multiplicity_09.png

The small bit in the bottom-left is relatively easy - it is two tangled small plasmids which we can resolve as described here.

The big part is our previously-described two plasmids of 1.4x and 1.9x depth. Their shared region totals over 60 kbp, so resolving that with long reads may not be possible. But it's not too much of a stretch to separate them using depth: images/miscalculated_multiplicity_10.png

The final issue is that the shared sequence between the two big plasmids has an inverted repeat (a repeat in the repeat!) of about 1 kbp (an insertion sequence). There are two possible orientations for a sequence between two inverted repeats:

option 1: start -> repeat -> middle -> taeper -> end
option 2: start -> repeat -> elddim -> taeper -> end

Unicycler didn't scaffold it out because it's in the middle of a huge repeat, so we'll have to do it ourselves by aligning long reads in Bandage. After all of this manual work, we can finally get to a nice completed assembly: images/miscalculated_multiplicity_11.png

The lesson to be learned from this nightmare is that Unicycler doesn't cope well with very long repeats. It can confuse them for single-copy contigs and that causes problems when scaffolding.