-
Notifications
You must be signed in to change notification settings - Fork 135
Multiplicity mistake
This hybrid assembly resulted in a nice looking chromosome and three little plasmids, but then there's a linear contig and odd loop. This one is a doozy - you've been warned! 😬 images/miscalculated_multiplicity_01.png
There's two parts of the final assembly I want to understand:
- The linear contig: Why is it linear? Is it actually a linear plasmid or is an incompletely-assembled sequence?
- The eight-contig-loop: Is it a separate circular sequence or does it belong somewhere else? With the linear contig perhaps?
I wanted to see what these sequences were doing in the Illumina graph (001_best_spades_graph.gfa
), so I BLASTed them in Bandage:
images/miscalculated_multiplicity_02.png
The Illumina graph looks pretty good (no dead ends) but it's quite tangled around that region. To get my head around it, I had Bandage visualise just the contigs which had good BLAST hits. I did this by drawing the graph and then progressively hiding contigs (with delete) until I could only see what I wanted. Increasing the 'Node length per Megabase' setting makes this part of the graph a bit more palatable: images/miscalculated_multiplicity_03.png
The blue contigs align to the final assembly's linear contig and the green contigs BLAST to the eight-contig-loop.
Unicycler tries to figure out the multiplicity of each contig in the Illumina graph (see more here). This is because it uses the single-copy contigs as the anchors for scaffolding. So the important distinction is single-copy vs repeat.
You can view Unicycler's multiplicity calls in Bandage by switching to the custom colours: images/miscalculated_multiplicity_04.png
Green = single-copy, yellow = 2x, orange = 3x and red = 4x or more. Now we can examine this to see if it makes sense. Did Unicycler make a mistake?
🦅 The eagle-eyed among you may notice that the larger contigs in this part of the graph seem to fall into three groups of depth: 1.4x, 1.9x and 3.2x (roughly speaking, there's a lot of variation around those numbers). And it's no coincidence that 1.4 + 1.9 ≈ 3.2. It seem like we have two very similar sequences, one at about 1.4x depth and one at about 1.9x depth. The fact that these depths are a bit more than 1x (the chromosomal depth) strongly hints that we are looking at plasmids. Where the two plasmids are the same, the sequences collapse together in the assembly to make 3.2x depth contigs (which are 2x repeats): images/miscalculated_multiplicity_05.png
Some of these common sequences are very long: images/miscalculated_multiplicity_06.png
23 kbp is a very long repeat for a bacterial genome! That may be part of the reason Unicycler had trouble. However, the real culprit is here: images/miscalculated_multiplicity_07.png
For some of these 3.2x contigs, Unicycler wrongly decided that they were single-copy. They are actually 2x repeats! This is a problem because Unicycler only uses them once in scaffolding and so cannot correctly scaffold the genome together.
To remedy the situation, I went back to the 001_best_spades_graph.gfa
file and manually added ML
tags to specify multiplicity. For example:
S 52 TTTTT...TCGAT LN:i:21445 dp:f:3.253 LB:z:3.253 CL:z:forestgreen ML:i:2
S 66 CGCAA...CGTCA LN:i:10064 dp:f:3.219 LB:z:3.219 CL:z:forestgreen ML:i:2
I then re-ran the Unicycler assembly. This time, Unicycler takes the manually-assigned values into account (read more here) and doesn't make the same mistake. I also used bold mode on this re-assembly to more aggressively merge contigs. Now we get this for our final graph: images/miscalculated_multiplicity_08.png
Better, but still not complete 😩
Focusing on the two incomplete components shows us this: images/miscalculated_multiplicity_09.png
The small bit in the bottom-left is relatively easy - it is two tangled small plasmids which we can resolve as described here.
The big part is our previously-described two plasmids of 1.4x and 1.9x depth. Their shared region totals over 60 kbp, so resolving that with long reads may not be possible. But it's not too much of a stretch to separate them using depth: images/miscalculated_multiplicity_10.png
The final issue is that the shared sequence between the two big plasmids has an inverted repeat (a repeat in the repeat!) of about 1 kbp (an insertion sequence). There are two possible orientations for a sequence between two inverted repeats:
option 1: start -> repeat -> middle -> taeper -> end
option 2: start -> repeat -> elddim -> taeper -> end
Unicycler didn't scaffold it out because it's in the middle of a huge repeat, so we'll have to do it ourselves by aligning long reads in Bandage. After all of this manual work, we can finally get to a nice completed assembly: images/miscalculated_multiplicity_11.png
The lesson to be learned from this nightmare is that Unicycler doesn't cope well with very long repeats. It can confuse them for single-copy contigs and that causes problems when scaffolding.