xfengnefx/hifiasm-meta: hifiasm_meta, a de novo metagenome assembler based on hifiasm, a haplotype-resolved de novo assembler for PacBio HiFi reads.

A hifiasm fork for metagenome assembly using HiFi reads.

Getting Started

# Install hifiasm-meta (g++ and zlib required)
git clone https://github.com/xfengnefx/hifiasm-meta.git
cd hifiasm-meta && make

# Run
hifiasm_meta -t32 -o asm reads.fq.gz 2>asm.log
hifiasm_meta -t32 --force-rs -o asm reads.fq.gz 2>asm.log  # if the dataset has high redundancy

A test dataset and the assembled results are available at Zenodo. It is downsampled from SRR13128014 (zymoBIOMICS D6331 mock community) and contains only the 5 E. coli strains. Hifiasm-meta r57 takes roughly 5 minutes and a peak memory of 18GB to assemble it.

About this fork

Hifiasm_meta comes with a read selection module, which enables the assembly of highly redundant datasets without compromising overall assembly quality, and with meta-centric graph cleaning modules. In the post-assembly stage, hifiasm_meta traverses the primary assembly graph and tries to rescue some genome bins that would be overlooked by traditional binners. Currently hifiasm_meta does not take binning info as input.

Output files

Contig graph: asm.p_ctg*.gfa and asm.a_ctg*.gfa

Raw unitig graph: asm.r_utg*.gfa

Cleaned unitig graph: asm.p_utg*.gfa
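Downstream tools such as binners usually expect FASTA rather than GFA. The sketch below shows one way to pull sequences out of a graph file, relying only on the GFA convention that segment records are tab-separated lines starting with `S`, followed by the segment name and its sequence. The function names `gfa_segments` and `write_fasta` are illustrative and not part of hifiasm_meta.

```python
import io


def gfa_segments(lines):
    """Yield (name, sequence) for every segment (S) record in a GFA stream."""
    for line in lines:
        if not line.startswith("S\t"):
            continue  # skip links (L), headers (H) and other record types
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # GFA allows '*' when the sequence is not stored
        yield name, seq


def write_fasta(records, out, width=80):
    """Write (name, sequence) pairs as FASTA, wrapping sequence lines at `width`."""
    for name, seq in records:
        out.write(f">{name}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


# Tiny in-memory example; a real run would open one of the .gfa outputs instead.
demo_gfa = (
    "S\ts1.ctg000001l\tACGTACGT\tLN:i:8\n"
    "L\ts1.ctg000001l\t+\ts1.ctg000002c\t+\t0M\n"
    "S\ts1.ctg000002c\t*\tLN:i:4\n"
)
fasta = io.StringIO()
write_fasta(gfa_segments(io.StringIO(demo_gfa)), fasta)
print(fasta.getvalue(), end="")
```

For real output, replace the in-memory strings with file handles opened on a graph file and on the FASTA destination.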

Contig name format: ^s[0-9]+\.[uc]tg[0-9]{6}[lc], where s[0-9]+ is the label of the disconnected subgraph that contains the contig. This makes it easy to quickly check whether two contigs are in the same disconnected subgraph (i.e. a haplotype that wasn't assembled into a single contig, or tangled haplotypes).
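As a minimal sketch of how the name format can be used, the snippet below parses contig names with the regular expression given above and groups them by subgraph label. The helper names are hypothetical. The reading of the trailing `l`/`c` as linear/circular follows upstream hifiasm naming and is an assumption here, since this document does not spell it out.

```python
import re
from collections import defaultdict

# Same pattern as in the text, with capture groups added.
CONTIG_NAME = re.compile(r"^(s[0-9]+)\.([uc])tg([0-9]{6})([lc])")


def parse_contig_name(name):
    """Split a contig name into its parts, or return None if it does not match."""
    m = CONTIG_NAME.match(name)
    if m is None:
        return None
    subgraph, kind, number, flag = m.groups()
    return {
        "subgraph": subgraph,                               # e.g. "s12"
        "kind": "unitig" if kind == "u" else "contig",
        "number": int(number),
        "flag": flag,                                       # assumed: l = linear, c = circular
    }


def same_subgraph(a, b):
    """True if both names parse and carry the same subgraph label."""
    pa, pb = parse_contig_name(a), parse_contig_name(b)
    return pa is not None and pb is not None and pa["subgraph"] == pb["subgraph"]


def group_by_subgraph(names):
    """Map subgraph label -> list of contig names; non-matching names are skipped."""
    groups = defaultdict(list)
    for name in names:
        parsed = parse_contig_name(name)
        if parsed is not None:
            groups[parsed["subgraph"]].append(name)
    return dict(groups)


names = ["s12.ctg000034l", "s12.ctg000035c", "s7.utg000001l"]
print(same_subgraph(names[0], names[1]))  # True: both carry label s12
print(group_by_subgraph(names))
```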

Special Notes

Based on the limited available test data, real datasets are unlikely to require read selection; mock datasets, however, might need it.

Bin files are one-way compatible with the stable hifiasm for now: stable hifiasm can use hifiasm_meta's bin files, but not vice versa, because hifiasm_meta needs to store extra info from the overlap and error correction steps.

Switches

See also README_ha.md, the stable hifiasm doc.

# General options
-o              Prefix of output files [hifiasm_meta.asm]. 
                For detailed description of all assembly graphs, 
                 see above or manpage.
-B              Use bin files under a different prefix than the 
                 one specified by -o.
-t              Number of CPU threads used by hifiasm_meta [1].
-h              Show help information and exit. Returns 0.
--version       Show version number and exit. Returns 0.

# Read selection options
-S              Enable read selection.
                If enabled, hifiasm_meta estimates the total number of 
                 read overlaps. If the estimate is within an acceptable range, 
                 no reads are dropped; otherwise, reads are dropped, starting 
                 with the most redundant ones, until the criteria are satisfied.
--force-rs      Force read selection. Reads are dropped according to the 
                 runtime k-mer frequency thresholds described below.
--lowq-10       Runtime 10% quantile k-mer frequency threshold.
                Lower values mean fewer reads are kept if read selection is triggered. [50]
--lowq-5        Runtime 5% quantile k-mer frequency threshold.
                Lower values mean fewer reads are kept if read selection is triggered. [50]
--lowq-3        Runtime 3% quantile k-mer frequency threshold.
                Lower values mean fewer reads are kept if read selection is triggered. [disabled]

# Error correction options
-k              K-mer length [51]. This option must be less than 64.
-w              Minimizer window size [51].
-f              Number of bits for bloom filter; 0 to disable [37]. 
                This bloom filter is used to filter out singleton k-mers 
                 when counting all k-mers. 
-r              Rounds of haplotype-aware error correction [3]. 
                This option affects all outputs of hifiasm_meta.
--min-hist-cnt  When analyzing the k-mer spectrum, ignore counts below INT [5].

# Assembly options
-z              Length of adapters that should be removed [0]. 
                This option removes INT bases from both ends of each read.
-i              Ignore error corrected reads and overlaps saved in bin files.

# Debugging options
--dbg-gfa       Use extra bin files to speed up the debugging of graph cleaning.
                If set and the extra bin files do not already exist, 
                 assembly runs normally (i.e. from scratch or resume from bin files) 
                 and writes the extra bin files.
                If set and bin files as well as extra bin files are present, 
                 assembly will resume from raw unitig graph stage.
--dump-all-ovlp Dump all overlaps ever calculated during the final overlapping.
--write-paf     Dump overlaps into 2 files: one contains the intra-haplotype 
                 or unphased overlaps, the other contains the inter-haplotype 
                 overlaps. If coverage is very high, this might not be the 
                 full set of overlaps.
--write-ec      Dump error corrected reads.
-e              Ban assembly, i.e. terminate before generating string graph.
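For scripted runs, the invocations from Getting Started can be put together programmatically. The sketch below is a hypothetical helper (`build_command` is not part of hifiasm_meta) that uses only switches documented above: `-t`, `-o`, `-S` and `--force-rs`. It only builds the argument list and does not launch the assembler.

```python
import shlex


def build_command(reads, prefix="asm", threads=32, read_selection=None):
    """Build a hifiasm_meta argument list.

    read_selection: None (no read selection switch), "auto" (-S) or
    "force" (--force-rs). These mode names are illustrative, not tool options.
    """
    if read_selection not in (None, "auto", "force"):
        raise ValueError("read_selection must be None, 'auto' or 'force'")
    cmd = ["hifiasm_meta", f"-t{threads}"]
    if read_selection == "auto":
        cmd.append("-S")
    elif read_selection == "force":
        cmd.append("--force-rs")
    cmd += ["-o", prefix, reads]
    return cmd


cmd = build_command("reads.fq.gz", read_selection="force")
print(shlex.join(cmd))
# To actually run it and capture the log on stderr, as in Getting Started:
#   import subprocess
#   with open("asm.log", "w") as log:
#       subprocess.run(cmd, stderr=log, check=True)
```

Keeping the command as a list avoids shell quoting problems when read paths contain spaces.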

Preliminary results (r49)

We evaluated hifiasm-meta on the following public datasets:

Dataset	Accession	#bases (Gb)	N50 read length (kb)	Median read QV	Sample description
ATCC	SRR11606871	59.2	12.0	36	Mock, ATCC MSA-1003
zymoBIOMICS	SRR13128014	18.0	10.6	40	Mock, ZymoBIOMICS D6331
sheepA	SRR10963010	51.9	14.3	25	Sheep gut microbiome
sheepB	SRR14289618	206.4	11.8	N/A*	Sheep gut microbiome
humanO1	SRR15275213	18.5	11.4	40	Human gut, pool of 4 omnivore samples
humanO2	SRR15275212	15.5	10.3	41	Human gut, pool of 4 omnivore samples
humanV1	SRR15275211	18.8	11.0	39	Human gut, pool of 4 vegan samples
humanV2	SRR15275210	15.2	9.6	40	Human gut, pool of 4 vegan samples
chicken	SRR15214153	33.6	17.6	30	Chicken gut microbiome

*Base quality was not available for this dataset.

For the empirical datasets, we evaluated assemblies with CheckM. Following convention, we define near-complete as more than 90% CheckM completeness and less than 5% contamination. High-quality is defined as >70% complete and <10% contaminated. Medium-quality is defined as >50% complete and QS>50, where QS (quality score) is given by completeness-(5*contamination). Binning was performed with MetaBAT2. Additionally, we split any >1Mb circular contigs out of their genome bins and let them form their own bins.
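The quality tiers above can be written down as a small classifier. This is a sketch of the stated thresholds only, not the evaluation script used for the results. It assumes the tiers are mutually exclusive and checked from best to worst, and the "low-quality" label for everything else is a name introduced here.

```python
def quality_score(completeness, contamination):
    """QS = completeness - 5 * contamination, both in percent."""
    return completeness - 5.0 * contamination


def classify_mag(completeness, contamination):
    """Assign a quality tier from CheckM completeness and contamination (percent)."""
    if completeness > 90 and contamination < 5:
        return "near-complete"
    if completeness > 70 and contamination < 10:
        return "high-quality"
    if completeness > 50 and quality_score(completeness, contamination) > 50:
        return "medium-quality"
    return "low-quality"  # assumed label for bins below all three tiers


# Made-up (completeness, contamination) pairs, for illustration only.
for comp, cont in [(95.0, 2.0), (80.0, 6.0), (60.0, 1.0), (60.0, 3.0)]:
    print(comp, cont, classify_mag(comp, cont))
```

For example, a bin that is 60% complete with 3% contamination has QS = 60 - 15 = 45, so it falls below the medium-quality threshold even though its completeness exceeds 50%.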

Dataset	>1Mb circular contigs	>1Mb circular contigs, near-complete	Near-complete MAGs	High-quality MAGs	Medium-quality MAGs
sheepA	139	125	186	42	33
sheepB	245	219	377	55	47
chicken	69	57	87	20	15
humanO1	33	27	53	20	19
humanO2	26	23	48	17	16
humanV1	38	33	73	23	15
humanV2	34	27	53	22	17
humanPooled	75	62	109	39	41

A Bandage plot of sheepA's primary contig graph (screenshot omitted some small unconnected contigs at the bottom):

ATCC contained 20 species and zymoBIOMICS contained 21 strains of 17 species. Hifiasm-meta recovered 14 of the 15 abundant (0.18%-18%) species in ATCC as single complete contigs. The other 5 rare species had insufficient coverage to be fully assembled. The challenge of the zymoBIOMICS dataset is its mixture of 5 E. coli strains (8% abundance each). Hifiasm-meta assembled strain B766 into a complete circular contig, strain B3008 into 2 contigs, and the rest into fragmented contigs.

The two mock datasets were assembled with --force-rs -A; the rest used default settings. Performance with 48 threads (-t48):

Dataset	Wall clock (h)	Peak RSS (GB)
ATCC	22	323
zymoBIOMICS	5.3	131
sheepA	17.8	208
sheepB	214	724
chicken	15.8	201
humanO1	3	70
humanO2	2.3	69
humanV1	3.4	76
humanV2	2.2	62
humanPooled	18	224
