This repository contains the code associated with the paper:
The Australian Microbiome dataset for accurate detection and ecological modelling of fungi
Authors:
Luke Florence1, Sean Tomlinson2,3, Marc Freestone4, John W. Morgan1, Jen L. Wood5, Camille Truong6
Affiliations:
- Department of Environment and Genetics, La Trobe University, Bundoora, VIC, 3083, Australia.
- Biodiversity and Conservation Science, Department of Biodiversity, Conservation and Attractions, Kensington, WA, 6151, Australia.
- School of Biological Sciences, University of Adelaide, Adelaide, SA, 5000, Australia.
- The Biodiversity Consultancy, Cambridge, CB2 1SJ, United Kingdom.
- Department of Microbiology, Anatomy, Physiology and Pharmacology, La Trobe University, Bundoora, VIC, 3083, Australia.
- Royal Botanic Gardens Victoria, Melbourne, VIC, 3004, Australia.
Corresponding author: Luke Florence ([email protected])
DNA metabarcoding has played a pivotal role in advancing our understanding of the diversity and function of soil-inhabiting fungi. The Australian Microbiome Initiative has produced an extensive soil fungal metabarcoding dataset of more than 2000 plots across a breadth of ecosystems in Australia and Antarctica. Sequence data requires rigorous approaches for the integration of species occurrences into biodiversity platforms, addressing biases due to false positives and overinflated diversity estimates, among others. To address such biases, we conducted a rigorous analysis of the fungal dataset following best practices in fungal metabarcoding and paired this dataset with over 100 predictor variables to fast-track data exploration. We carefully validated our methodology based on studies conducted on historical versions of the dataset. Our approach generated robust information on Australian soil fungi that can be leveraged by end-users interested in biodiversity, biogeography, and conservation.
The primary data products from this analysis are available on the associated figshare repository.
Notes:
- The reproducible code for the bioinformatics pipeline is available in the `bioinformatics` directory and is run from the source directory.
- The associated raw data files can be obtained from the Bioplatforms Australia data portal by using the search terms “sample_type:Soil & amplicon:ITS & depth_lower:0.1”.
- The dependencies required to reproduce this research can be installed via mamba using the `env/dynamic_cluster.yml` file.
- This pipeline has been designed to be reproducible. However, this is a first draft of the taxonomically informed dynamic clustering, and some troubleshooting will be required for replicability (i.e. to run the pipeline on new datasets).
- After organising the raw data, run the scripts with numeric prefixes in numeric order to reproduce the results. Scripts without numeric prefixes are auxiliary code.
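The run order described in the last note can be sketched in a few lines of Python. This is only an illustration of the naming convention, not part of the repository: the function name `ordered_pipeline_scripts` and the auxiliary file name `helper_functions.R` are hypothetical.

```python
import re


def ordered_pipeline_scripts(filenames):
    """Return scripts that carry a numeric prefix (e.g. '01.'), sorted by it."""
    prefixed = []
    for name in filenames:
        match = re.match(r"^(\d+)\.", name)
        if match:  # files without a numeric prefix are auxiliary and skipped
            prefixed.append((int(match.group(1)), name))
    return [name for _, name in sorted(prefixed)]


scripts = ordered_pipeline_scripts([
    "03.Chimera_detection.sh",
    "helper_functions.R",  # hypothetical auxiliary file, not from the repository
    "01.Extract_ITS.sh",
    "02.Denoise.R",
])
print(scripts)
```

Sorting on the parsed integer rather than the raw string keeps the order correct even if a prefix were written without a leading zero.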
01.Extract_ITS.sh performs five functions:
- Quality truncate reads with Trimmomatic
- Extract the ITS region with ITSxpress
- Quality filter reads with VSEARCH
- Check the quality of processed reads with FastQC and MultiQC
- Track reads across the pipeline (library-wise approach)
02.Denoise.R performs the following tasks:
- Denoise reads with DADA2
- Merge sequence tables from multiple sequencing runs
- Convert the merged sequence table to a FASTA sequence file formatted for chimera detection in VSEARCH
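As a rough illustration of the merge and conversion steps above, the Python sketch below sums read counts per sequence across samples and writes FASTA headers with a `;size=` abundance annotation, which is the annotation VSEARCH reads for abundance-aware chimera detection. It is not the code in `02.Denoise.R`: the function name, the `ASV_` identifiers, and the dictionary input are assumptions made for the example.

```python
def sequence_table_to_fasta(seq_table):
    """Build abundance-annotated FASTA text from a sequence table.

    seq_table maps sample name -> {sequence: read count}; tables from
    several sequencing runs can be combined into one such mapping first.
    """
    totals = {}
    for counts in seq_table.values():
        for seq, n in counts.items():
            totals[seq] = totals.get(seq, 0) + n

    # Most abundant first; ties broken by sequence so output is deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

    lines = []
    for i, (seq, n) in enumerate(ranked, start=1):
        lines.append(f">ASV_{i};size={n}")  # hypothetical identifier scheme
        lines.append(seq)
    return "\n".join(lines) + "\n"


fasta = sequence_table_to_fasta({
    "sample_1": {"ACGT": 5, "TTGA": 1},
    "sample_2": {"ACGT": 2},
})
print(fasta, end="")
```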
03.Chimera_detection.sh performs two main tasks:
- De novo chimera detection with VSEARCH
- Reference-based chimera detection with VSEARCH
Note: There are two distinct outputs from this and subsequent steps: (1) an `ASVs` output and (2) an `OTUs` output. The `ASVs` are clustered after taxonomic assignment using a taxonomically informed dynamic clustering approach, whereas the `OTUs` are clustered in this step at 97% similarity using an abundance-based centroid approach in VSEARCH. The main data output from this project uses the dynamic clusters (i.e. the ASV files from this step); the OTU files are intended for comparative analyses between the conventional 97% OTU approach and the dynamic clustering approach we chose to use.
04.Abundance_filter_OTUs performs the following tasks:
- Sample-wise abundance filter: remove OTUs from a sample when their relative abundance is <0.1% of the total sequence count of that sample
- Library-wise abundance filter: remove occurrences of an OTU with a relative sequence abundance <0.5% of that OTU's total within a given library
- Positive control filter: remove positive control OTUs with a relative sequence abundance <3% of the total positive control OTU abundance within a given library
- Dereplicate samples and remove samples with sequencing depth <5,000 reads
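The sample-wise filter and the sequencing depth cut-off can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the repository's implementation: the OTU table is a plain dictionary, depth is calculated after the abundance filter, and the library-wise and positive control filters are not shown. The function names and example counts are made up.

```python
MIN_SAMPLE_FRACTION = 0.001  # 0.1% of the total sequence count per sample
MIN_DEPTH = 5000             # minimum sequencing depth per sample


def sample_wise_filter(otu_table, min_fraction=MIN_SAMPLE_FRACTION):
    """Drop OTUs whose within-sample relative abundance is below min_fraction."""
    filtered = {}
    for sample, counts in otu_table.items():
        total = sum(counts.values())
        filtered[sample] = {
            otu: n for otu, n in counts.items()
            if total > 0 and n / total >= min_fraction
        }
    return filtered


def depth_filter(otu_table, min_depth=MIN_DEPTH):
    """Keep only samples whose total read count reaches min_depth."""
    return {s: c for s, c in otu_table.items() if sum(c.values()) >= min_depth}


# Made-up counts for illustration only.
table = {
    "plot_A": {"OTU_1": 9990, "OTU_2": 6, "OTU_3": 4},
    "plot_B": {"OTU_1": 3000, "OTU_2": 20},
}
result = depth_filter(sample_wise_filter(table))
print(result)
```

In this example `plot_A` loses its two rare OTUs (0.06% and 0.04% of 10,000 reads) and `plot_B` is removed because its depth is below 5,000 reads.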
05.Predict_cutoffs.sh downloads the UNITE+INSD reference dataset used in this study and extracts the ITS1 region for use in chimera detection and taxonomic assignment.
06.BLAST.sh assigns the top five BLAST hits to ASVs and OTUs using the UNITE+INSD reference dataset.
07.Classify_OTUs.sh filters the top five BLAST hits based on taxon-specific thresholds and coverage thresholds for genus (90%) and species (95%), and then assigns ASVs and OTUs the taxonomy of the best hit at each rank using a 66.67% consensus threshold across all remaining hits.
08.Dynamic_clustering.sh clusters OTUs using a taxonomically informed dynamic clustering approach.
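The consensus step can be illustrated with a short Python sketch. It assumes that the 66.67% threshold is a rounded two-thirds majority and that the consensus is taken over the taxon labels of the hits that survive filtering at a given rank; the function name and the example hits are hypothetical, and the taxon-specific and coverage filters are not shown.

```python
from collections import Counter

CONSENSUS_THRESHOLD = 2 / 3  # assumption: 66.67% read as a two-thirds majority


def consensus_taxon(hit_taxa, threshold=CONSENSUS_THRESHOLD):
    """Return the most frequent label if its share of hits meets the threshold.

    hit_taxa holds the taxon label of each remaining BLAST hit at one rank.
    Returns None when there are no hits or no label reaches the threshold.
    """
    if not hit_taxa:
        return None
    label, count = Counter(hit_taxa).most_common(1)[0]
    return label if count / len(hit_taxa) >= threshold else None


# Made-up genus-level hits for illustration only.
agreed = consensus_taxon(["Cortinarius"] * 4 + ["Amanita"])
split = consensus_taxon(["Cortinarius", "Cortinarius", "Cortinarius", "Amanita", "Russula"])
print(agreed, split)
```

With five hits, four must agree to reach the threshold (80%), since three of five (60%) falls short.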
Note:
- Code associated with the technical validation is available in the `technical_validation` directory and is executable from within the R project.
01.Assess_length_bias.R: Assess the sequence length distribution of the ITS region in the Australian Microbiome dataset against the UNITE+INSD reference dataset.
02.Impact_of_ITS_extraction.R: Assess the impact of ITS extraction on biases against taxa with long ITS regions. We processed 300 bp sequences targeting the fungal ITS1 region, removing conserved SSU and 5.8S rRNA sequences to enhance OTU clustering and taxonomic accuracy. Using ITSxpress for ITS1 extraction, we noted that the inability to merge paired-end reads in the Australian Microbiome fungal dataset biases against taxa with long ITS1 regions (>230 bp), leading to potential false negatives. This script identifies taxa with long ITS1 regions that are underrepresented in the Australian Microbiome dataset.
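One simple way to flag taxa at risk from this length bias is to calculate, per taxon, the share of reference ITS1 sequences longer than 230 bp. The Python sketch below shows only that idea and is not the method used in `02.Impact_of_ITS_extraction.R`; the function name, the taxon names, and the lengths are invented for the example.

```python
LONG_ITS1_BP = 230  # ITS1 regions longer than this are treated as "long"


def fraction_long_its1(lengths_by_taxon, cutoff=LONG_ITS1_BP):
    """Return, per taxon, the fraction of ITS1 lengths above the cutoff.

    lengths_by_taxon maps taxon -> list of ITS1 lengths (bp), for example
    taken from reference sequences. Taxa with no lengths are skipped.
    """
    result = {}
    for taxon, lengths in lengths_by_taxon.items():
        if lengths:
            result[taxon] = sum(1 for n in lengths if n > cutoff) / len(lengths)
    return result


# Invented lengths for two placeholder genera.
fractions = fraction_long_its1({
    "Genus_A": [180, 200, 210, 190],
    "Genus_B": [240, 260, 220, 300],
})
print(fractions)
```

A taxon with a high fraction here, such as the placeholder `Genus_B`, would be a candidate for checking against its observed occurrences in the dataset.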
03.Quantify_ECM_in_antarctica.R: Evaluate the occurrence of ectomycorrhizal OTUs in Antarctica.
04.Map_cortinarius.R: Map the distribution of Cortinarius species in Australia.
05.Map_amanita.R: Map the distribution of Amanita species in Australia.
06.Example_niche_analysis.R: Test differences in the soil total nitrogen and phosphorus niches of two hypogeous (belowground) ectomycorrhizal genera from the order Pezizales: Ruhlandiella and Sphaerosoma.
07.Assess_dataset_diversity.R:
Technical validation dependencies
The technical validation was conducted using R version 4.3.3 -- "Angel Food Cake" and the following packages:
- Biostrings version 2.70.3
- BLAST version 2.16.0
- data.table version 1.15.4
- emmeans version 1.10.1
- ggbeeswarm version 0.7.2
- ggpubr version 0.6.0
- parameters version 0.21.7
- patchwork version 1.2.0
- performance version 0.11.0
- rnaturalearth version 1.0.1
- rnaturalearthdata version 1.0.0
- SRS version 0.2.3
- sf version 1.0-16
- terra version 1.7-71
- tidyterra version 0.6.0
- tidyverse version 2.0.0
- vegan version 2.6-8