-
Notifications
You must be signed in to change notification settings - Fork 21
2. Overall k mer evaluation
Overall k-mer evaluation is performed using the k-mer multiplicity of illumina WGS reads.
Here, we re-evaluate the a. thaliana trio binned assembly (Koren et al., 2018), downloadable here: COL and CVI.
The FALCON-Unzip (Chin et al., 2016) assembly (F1 Col-0–Cvi-0) is also useful to see how the k-mer based evaluation reports pseudo-haplotype assemblies. For convenience, I renamed the fasta files to col.fasta / cvi.fasta for the trio binned assembly, and pri.fasta / alt.fasta for the FALCON-Unzip assembly.
Let’s make a dir for each assembly of interest.
mkdir triocanu_clr
mv col.fasta triocanu_clr/
mv cvi.fasta triocanu_clr/
mkdir falcon_unzip
mv pri.fasta falcon_unzip/
mv alt.fasta falcon_unzip/
Download the pre-built F1.k18.meryl.tar and untar. This 18-mer db was generated from the F1 illumina reads from the FALCON-Unzip paper.
Link F1.k18.meryl
and merqury.sh
. Run it.
cd triocanu_clr
ln -s ../F1.k18.meryl
nohup ./merqury.sh F1.k18.meryl col.fasta cvi.fasta triocanu_clr > triocanu.log &
It takes a bit time (at most 8 hours on a 3Gb genome, 24 cores) when running on a single cluster.
Use _submit_merqury.sh
for submitting to a cluster. Everything finishes within 2~4 hours.
Assuming the k-mers found only in the assembly being bp errors, we can use the k-mer survival rate to estimate base level QV.
head */*.qv
==> triocanu_clr/triocanu_.qv <==
col 682359 123511626 35.1183 0.000307729
cvi 506306 122349407 36.3761 0.00023035
Both 1188665 245861033 35.6991 0.00026921
Each column is showing:
- assembly of interest. Both is combined of the above two.
- k-mers uniquely found only in the assembly
- k-mers found in both assembly and the read set
- QV
- Error rate
These results suggest the cvi assembly has slightly higher QV (36.4) than the col assembly (35.1), but still are very similar.
This provides the merqury version of spectra-cn plots, first introduced by KAT. In addition to the stacked histogram, Merqury generates unstacked line and filled histograms to enhance visibility of the shape. All runs at once, no need to re-name and concatenate fasta files for diploid assemblies.
Here are the copy number analysis outputs.
For both col and cvi assemblies:
triocanu_clr.spectra-cn.st.png # st: stacked
triocanu_clr.spectra-cn.fl.png # fl: filled
triocanu_clr.spectra-cn.ln.png # ln: line
COL haplotype assembly:
triocanu_clr.col.spectra-cn.st.png
triocanu_clr.col.spectra-cn.fl.png
triocanu_clr.col.spectra-cn.ln.png
CVI haplotype assembly:
triocanu_clr.cvi.spectra-cn.st.png
triocanu_clr.cvi.spectra-cn.fl.png
triocanu_clr.cvi.spectra-cn.ln.png
The spectra-asm
shows the distinct k-mers shared between col.fasta and cvi.fasta.
triocanu_clr.spectra-asm.st.png
triocanu_clr.spectra-asm.fl.png
triocanu_clr.spectra-asm.ln.png
The k-mer completeness can be measured by the fraction of recovered solid k-mer. Solid k-mers are distinct k-mers filtered for erroneous low copy k-mers.
k-mer completeness = found solid k-mers in an assembly / solid k-mers in a read set
This is how the report looks like:
col all 104975080 125303808 83.7764
cvi all 104809523 125303808 83.6443
both all 123134729 125303808 98.2689
Columns are
- Assembly
- k-mer set used for measuring completeness - all = read set (This gets expended with hap-mers later)
- solid k-mers in the assembly
- Total solid k-mers in the read set
- Completeness (%)
Here, both
is a higher because each col and cvi assemblies contain haplotype specific part of the genome, as we can see from the spectra-asm plots.
asm_only.bed
and asm_only.tdf
(tiled data format to visualize on IGV) files are provided to trace the bp errors. This is useful for finding mis-assembly sites, which usually involves low bp quality region.