-
Notifications
You must be signed in to change notification settings - Fork 136
Multiplicity mistake
This hybrid assembly resulted in a nice looking chromosome and three little plasmids, but there's also a linear contig and odd loop. This one is a doozy - you've been warned! 😬
There are two parts of the final assembly I want to understand:
- The linear contig: Why is it linear? Is it actually a linear plasmid or is an incompletely-assembled sequence?
- The eight-contig-loop: Is it a one or more separate circular sequences - plasmids maybe? Or does it belong somewhere else - perhaps with the linear contig?
I wanted to see what these sequences were doing in the Illumina graph (001_best_spades_graph.gfa
), so I BLASTed them in Bandage:
The Illumina graph looks pretty good (no dead ends) but it's quite tangled around that region. To get my head around it, I had Bandage visualise just the contigs which had good BLAST hits. I did this by drawing the graph and then progressively hiding contigs (with delete) until I could only see what I wanted. Increasing the 'Node length per Megabase' setting makes this part of the graph a bit more palatable:
The blue contigs align to the final assembly's linear contig and the green contigs BLAST to the eight-contig-loop.
Unicycler tries to figure out the multiplicity of each contig in the Illumina graph (see more here). It uses the single-copy (multiplicity = 1) contigs as the anchors for scaffolding, so the most important distinction is single-copy vs repeat.
You can view Unicycler's multiplicity calls in Bandage by switching to the custom colours. Green = single-copy, yellow = 2x, orange = 3x and red = 4x or more. Now we can examine this to see if it makes sense. Did Unicycler make a mistake?
🦅 The eagle-eyed among you may notice that the larger contigs in this part of the graph seem to fall into three groups of depth: 1.4x, 1.9x and 3.2x (roughly speaking, there's variation around those numbers). And it's no coincidence that 1.4 + 1.9 ≈ 3.2. It seem like we have two very similar sequences, one at about 1.4x depth and one at about 1.9x depth. The fact that these depths are a bit more than 1x (the chromosomal depth) strongly hints that we are looking at plasmids. Where the two plasmids are the same, the sequences collapse together in the assembly to make 3.2x depth contigs (which are 2x repeats):
Some of these common sequences are very long:
23 kbp is a very long repeat for a bacterial genome! That may be part of the reason Unicycler had trouble. However, the real culprit is here:
For some of these 3.2x contigs, Unicycler wrongly decided that they were single-copy. They are actually 2x repeats! This is a problem because Unicycler only uses them once in scaffolding and so cannot correctly scaffold the genome together.
To remedy the situation, I went back to the 001_best_spades_graph.gfa
file and manually added ML
tags to specify multiplicity. For example:
S 52 TTTTT...TCGAT LN:i:21445 dp:f:3.253 LB:z:3.253 CL:z:forestgreen ML:i:2
S 66 CGCAA...CGTCA LN:i:10064 dp:f:3.219 LB:z:3.219 CL:z:forestgreen ML:i:2
I then re-ran the Unicycler assembly. This time, Unicycler takes the manually-assigned values into account (read more here) and doesn't make the same mistake. I also used bold mode on this re-assembly to more aggressively merge contigs. Now we get this for our final graph:
Better, but still not complete 😩
Focusing on the two incomplete components shows us this:
The small bit in the bottom-left is relatively easy - it is two tangled small plasmids which we can resolve as described here.
The big part is our previously-described two plasmids of 1.4x and 1.9x depth. Their shared region totals over 60 kbp, so resolving that with long reads may not be possible. But it's not too much of a stretch to separate them using depth:
The final issue is that the shared sequence between the two big plasmids has an inverted repeat (a repeat in the repeat!) of about 1 kbp (an insertion sequence):
There are two possible orientations for a sequence between two inverted repeats:
option 1: start -> repeat -> middle -> taeper -> end
option 2: start -> repeat -> elddim -> taeper -> end
Unicycler didn't scaffold it out because it's in the middle of a huge repeat, so we'll have to do it ourselves by aligning long reads in Bandage. After all of this manual work, we can finally get to a nice completed assembly:
The lesson to be learned from this nightmare is that Unicycler doesn't cope well with very long repeats. It can confuse them for single-copy contigs and that causes problems when scaffolding.