Population genomics identifies genetic signatures of carrot domestication and improvement and uncovers the origin of high carotenoid orange carrots
This repository contains scripts used for the following publication:
Placeholder for reference to publication
-
FALCON (v.0.3.0)
source: https://github.com/PacificBiosciences/FALCON
parameters:length_cutoff=5000 length_cutoff_pr=1000 --max_diff=50 --max_cov=50 --min_cov=4
reference: Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nature Methods 13, 1050-1054, doi:10.1038/nmeth.4035 (2016). -
CANU (V.1.5)
source: https://github.com/marbl/canu/releases
reference: Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722-736, doi:10.1101/gr.215087.116 (2017).
-
CANU (V.1.5)
source: https://github.com/marbl/canu/releases
reference: Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722-736, doi:10.1101/gr.215087.116 (2017). -
Arrow as implemented in GenomicConsensus
source: https://github.com/PacificBiosciences/GenomicConsensus -
Pilon (V1.23)
parameters: using default parameters
reference: Walker, B. J. et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLOS ONE 9, e112963, doi:10.1371/journal.pone.0112963 (2014).
-
BWA mem (0.7.17)
reference: Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009). -
SALSA2 (V2.3)
200 iterations and GATC as the restriction site
reference: Ghurye, J. et al. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLOS Computational Biology 15, e1007273, doi:10.1371/journal.pcbi.1007273 (2019).
- PBJelly (V15.8.24)
parameters:--minMatch 8 --minPctIdentity 80 --bestn 1 --nCandidates 20 --maxScore -500 --nproc 39 --noSplitSubreads
reference: English, A. C. et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7, e47768, doi:10.1371/journal.pone.0047768 (2012).
-
BWA mem (0.7.17)
filtered for MAPQ>=10 and marker sequences with mapping frequency of >1 were discarded
reference: Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760, [ doi:10.1093/bioinformatics/btp324 (2009). -
STAR (V 2.7.10a)
parameters:--outSAMstrandField intronMotif --outSAMattrIHstart 0 --outFilterMismatchNmax 2 --outSAMtype BAM SortedByCoordinate
reference: Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21, doi:10.1093/bioinformatics/bts635 (2012). -
BWA mem (0.7.17)
parameters:-5 -t 44
-
sort
parameters:-k3,3d -k7,7d
-
juicer_tools.1.8.9_jcuda.0.8.jar
parameters:pre -q 10
reference: Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst 3, 95-98, doi:10.1016/j.cels.2016.07.002 (2016). -
A custom program was used to visualize and identify connections of PE sequences (PE BACs, 10, 20 and 40 kb MPE) to other scaffolds or contigs
-
RaGOO (V1.1)
parameters:-C -t 43 -i 0.3
reference: Alonge, M. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biology 20, 224, doi:10.1186/s13059-019-1829-6 (2019). -
LR-Gapcloser (Version 3, 29 June 2007)
parameters:-s p -t 40 -r 5
reference: Xu, G.-C. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience 8, doi:10.1093/gigascience/giy157 (2018).
-
MUMmer (v4.0.0beta2)
script: gene_prediction/15.dcarv3analysis.sh
reference: Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology 14, e1005944, doi:10.1371/journal.pcbi.1005944 (2018). -
Blat (v35)
source: https://hgwdev.gi.ucsc.edu/~kent
reference: Kent, W. J. et al. The human genome browser at UCSC. Genome Res 12, 996-1006, doi:10.1101/gr.229102 (2002). -
UCSC Liftover (v 2006-04-26)
script: gene_prediction/08.lift.sh
source: https://hgwdev.gi.ucsc.edu/~kent
reference: Kent, W. J. et al. The human genome browser at UCSC. Genome Res 12, 996-1006, doi:10.1101/gr.229102 (2002).
-
Fastq-Screen (V0.4.14)
reference: Wingett, S. & Andrews, S. FastQ Screen: A tool for multi-genome mapping and quality control [version 2; peer review: 4 approved]. F1000Research 7, doi:10.12688/f1000research.15931.2 (2018). -
GC content distribution estimates also done by Fastq-Screen
- BWA mem (0.7.17)
reference: Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009).
-
BWA mem (EST mapping) (0.7.17)
source for EST sequences: https://www.carrotomics.org/bio_data/291864 -
StringTie (transcriptome mapping) (V1.3.5)
parameters:-m 50 -c 1 -f 0.01
source: https://ccb.jhu.edu/software/stringtie/
reference: Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology 33, 290-295, doi:10.1038/nbt.3122 (2015). -
GMAP (IsoSeq mapping) (V2021-08-25)
reference: Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859-1875, doi:10.1093/bioinformatics/bti310 (2005).
- RepeatModeler (v.2.0.1)
source: http://www.repeatmasker.org/RepeatModeler/
- RepeatMasker (v.4.1.0)
source: http://www.repeatmasker.org
-
LTRharvest (GenomeTools v.1.6.1)
parameters:-seed 80 -maxlenltr 4000 -mindistltr 3000 -mintsd 2 -maxtsd 20 -motif tgca
reference: Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18, doi:10.1186/1471-2105-9-18 (2008). -
SiLiX (v1.2.9)
parameters:--ident 0.6 --overlap 0.7
reference: Miele, V., Penel, S. & Duret, L. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics 12, 116, https://doi.org/10.1186/1471-2105-12-116 (2011). -
DANTE (v.1.1.0)
reference: Novák, P., Neumann, P. & Macas, J. (2010) Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics, 11, 378.
reference: Novák, P., Neumann, P. & Macas, J. (2020b) Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2. Nature Protocols, 15(11), 3745–3776. -
Mafft (v.7.471)
parameters: default settings
reference: Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30, 772-780, doi:10.1093/molbev/mst010 (2013). -
R ‘ape’ package (v.5.4-1)
parameters: “K80” model
reference: Paradis, E. & Schliep, K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526-528, doi:10.1093/bioinformatics/bty633 (2018). -
HelitronScanner
parameters: default settings
reference: Xiong, W., He, L., Lai, J., Dooner, H. K. & Du, C. HelitronScanner uncovers a large overlooked cache of Helitron transposons in many plant genomes. Proceedings of the National Academy of Sciences 111, 10263-10268, doi:10.1073/pnas.1410068111 (2014).
- LTR Assembly Index (LAI)
reference: Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Research 46, e126-e126, doi:10.1093/nar/gky730 (2018).
- Centromere motif used
TCAACTTCAAACGAGTCTGGAATGAAGCTACAAGTTGTTT
CCAACTTCAATCTAGYCAATAATGAAGCWACAAGTTGTTT
CCRGGCTCAAACGAGTCAAGAATGAAACTACAAGTTGATT
CYAATTTYAAAC AACCGAAAATGAAGCTACAAGTTGTTT
- custom script /centromere/31.centromere.location.sh
image files: /centromere/
-
STAR (V 2.7.10a)
parameters:--outSAMstrandField intronMotif --outSAMattrIHstart 0 --outFilterMismatchNmax 2 --outSAMtype BAM SortedByCoordinate
reference: Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21, doi:10.1093/bioinformatics/bts635 (2012). -
StringTie (V1.3.5)
parameters:-m 50 -c 1 -f 0.01
source: https://ccb.jhu.edu/software/stringtie/
reference: Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology 33, 290-295, doi:10.1038/nbt.3122 (2015). -
Cytoscape (v3.9)
reference: Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498-2504, doi:10.1101/gr.1239303 (2003).
reference: Szklarczyk, D. et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 43, D447-452, doi:10.1093/nar/gku1003 (2015). -
BioConductor (v4.2.1)
source: https://www.bioconductor.org/ -
MCODE (v?)
reference: Bader, G. D. & Hogue, C. W. V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2, doi:10.1186/1471-2105-4-2 (2003).
-
IsoSeq3 (v3.3.0)
source: https://github.com/PacificBiosciences/IsoSeq -
GMAP (V2021-08-25)
parameters: -f=2; and minimum identity and coverage=99%
reference: Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859-1875, doi:10.1093/bioinformatics/bti310 (2005).
- MAKER (v3.01.03)
reference: Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491, doi:10.1186/1471-2105-12-491 (2011).
parameters:
est_pass=0
altest_pass=0
protein_pass=0
rm_pass=0
model_pass=0
pred_pass=0
other_pass=0
prok_rm=0
softmask=1
est2genome=0
protein2genome=0
trna=0
unmask=1
cpus=1
max_dna_len=500000
min_contig=1
pred_flank=200
pred_stats=0
AED_threshold=1
min_protein=0
alt_splice=0
always_complete=0
map_forward=0
keep_preds=0
split_hit=2500
single_exon=0
single_length=100
correct_est_fusion=0
tries=2
clean_try=0
clean_up=0
-
AUGUSTUS (v2.5.5)
-
SNAP (Commit of June 3, 2019)
source: https://github.com/KorfLab/SNAP
- GeMoMa (v1.6)
reference: Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods Mol Biol 1962, 161-177, doi:10.1007/978-1-4939-9173-0_9 (2019).
-
GMAP (V2021-08-25)
reference: Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859-1875, doi:10.1093/bioinformatics/bti310 (2005). -
GenomeThreader (V2021-08-25)
reference: Gremme, G., Brendel, V., Sparks, M. E. & Kurtz, S. Engineering a software tool for gene structure prediction in higher organisms. Information and Software Technology 47, 965-978, doi:https://doi.org/10.1016/j.infsof.2005.09.005 (2005).
-
Blast2Go (OmicsBox 3.0.29)
reference: Conesa, A. et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674-3676, doi:10.1093/bioinformatics/bti610 (2005). -
PlantTFcat (Downloaded on Dec 2020)
reference: Dai, X., Sinharoy, S., Udvardi, M. & Zhao, P. X. PlantTFcat: an online plant transcription factor and transcriptional regulator categorization and analysis tool. BMC Bioinformatics 14, 321, doi:10.1186/1471-2105-14-321 (2013). -
PRGdb (v3.0)
reference: Osuna-Cruz, C. M. et al. PRGdb 3.0: a comprehensive platform for prediction and analysis of plant disease resistance genes. Nucleic Acids Res 46, D1197-d1201, doi:10.1093/nar/gkx1119 (2018). -
Verification when submitting to NCBI
A list of genes manually curated based on the validation during the submission to NCBI: gene_prediction/24.report.20220211.tsv
script: gene_prediction/24.makereport.pl
- INFERNAL (v1.1.2)
parameters: Rfam database (v13.0) and the p-value of <0.05
reference: Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933-2935, doi:10.1093/bioinformatics/btt509 (2013).
- Sqanti3 (v4.1)
reference: Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res 28, 396-411, doi:10.1101/gr.222976.117 (2018).
- Scripts and datafiles related to phenotyping can be found in /phenotype/
-
BWA (v 0.7.17-r1188)
parameters:-a -M -t 2 -R
readgroup
reference: Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009). -
The Genome Analysis Toolkit (GATK) v4.0.7.0
example script for one genotype: variant_detection/job1800.sh
example joint genotyping script for one region: variant_detection/template.jointgenodcarv3d.sh
reference: McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-1303, doi:10.1101/gr.107524.110 (2010). -
BCFtools (v1.9)
parameters:view -e 'F_MISSING >= 0.1' -Oz
reference: Narasimhan, V. et al. BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics 32, 1749-1751, doi:10.1093/bioinformatics/btw044 (2016).
- A heavily modified version of TRACKPOSON
scripts: /y2_genotyping/
-
Plink (v1.90b3.44)
parameters:--indep 50 5 2
reference: Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559-575, doi:10.1086/519795 (2007). -
PHYLIP (v3.696)
reference: Felsenstein, J. PHYLIP (phylogeny inference package), version 3.5 c. (Joseph Felsenstein., 1993). -
ggtree
reference: Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T.-Y. ggtree: an r package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 8, 28-36, doi:https://doi.org/10.1111/2041-210X.12628 (2017).
- snpgdsPCA implemented in the R package SNPRelate (v1.20.1)
reference: Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326-3328, doi:10.1093/bioinformatics/bts606 (2012).
-
TreeMix (v1.12)
reference: Pickrell, J. K. & Pritchard, J. K. Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data. PLOS Genetics 8, e1002967, doi:10.1371/journal.pgen.1002967 (2012). -
OptM (v0.1.6)
reference: Fitak, R. R. OptM: estimating the optimal number of migration edges on population trees using Treemix. Biology Methods and Protocols 6, bpab017, doi:10.1093/biomethods/bpab017 (2021).
- VCFtools (v0.1.14)
reference: Weir, B. S. & Cockerham, C. C. Estimating F-Statistics for the Analysis of Population Structure. Evolution 38, 1358-1370, doi:10.2307/2408641 (1984).
-
bcftools (v1.10.2)
call
parameters:-m -Oz -f GQ
reference: Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008, doi:10.1093/gigascience/giab008 (2021). -
The Genome Analysis Toolkit (GATK) v4.4.0.0
SelectVariants
parameters:-select-type NO_VARIATION
reference: McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-1303, doi:10.1101/gr.107524.110 (2010). -
VCFtools (v0.1.14)
parameters:--remove-indels --max-missing 0.8 --minDP 800 --maxDP 8000
reference: Danecek, P. et al. The variant call format and VCFtools. Bioinformatics (Oxford, England) 27, 2156-2158, doi:10.1093/bioinformatics/btr330 (2011). -
Pixy (v. 1.2.7.beta1) https://zenodo.org/record/6551490
reference: Korunes, K.L. and Samuk, K. (2021), pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data. Molecular Ecology Resources. Accepted Author Manuscript. https://doi.org/10.1111/1755-0998.13326 -
Scripts
Merge with: nucleotide_diversity/23.filter.kevinG.sh
Run pixy with: nucleotide_diversity/23.pixy.G.sh
Summarize with: nucleotide_diversity/23.summarize.G.pl
Summary file: nucleotide_diversity/23.pixy.G.summary.tsv
- PopLDdecay (v3.31)
parameters:-OutStat
reference: Zhang, C., Dong, S. S., Xu, J. Y., He, W. M. & Yang, T. L. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics 35, 1786-1788, doi:10.1093/bioinformatics/bty875 (2019).
-
bcftools (v1.10.2)
parameters:view -e 'F_MISSING >= 0.1' -Oz
reference: Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008, doi:10.1093/gigascience/giab008 (2021). -
SMC++ software (v.1.15.2)
parameters and code: Code for the estimation of effective population size and divergence times using SMC++ is available at https://github.com/mishaploid/carrot-demography.
source: https://github.com/popgenmethods/smcpp
reference: Terhorst, J., Kamm, J. A. & Song, Y. S. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet 49, 303-309, doi:10.1038/ng.3748 (2017).
-
VCFtools (v0.1.14)
reference: Danecek, P. et al. The variant call format and VCFtools. Bioinformatics (Oxford, England) 27, 2156-2158, doi:10.1093/bioinformatics/btr330 (2011). -
XP-CLR (v1.0)
reference: Chen, H., Patterson, N. & Reich, D. Population differentiation as a test for selective sweeps. Genome Research 20, 393-402 (2010).
Scripts and parameters for this section can be found in /genotype/
-
vcftools (v0.1.16)
reference: Weir, B. S. & Cockerham, C. C. Estimating F-Statistics for the Analysis of Population Structure. Evolution 38, 1358-1370, doi:10.2307/2408641 (1984). -
Beagle (v5.0)
parameters: default settings
reference: Browning, S. R. & Browning, B. L. Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering. The American Journal of Human Genetics 81, 1084-1097, doi:10.1086/521987 (2007). -
Tassel (v5)
reference: Bradbury, P. J. et al. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633-2635, doi:10.1093/bioinformatics/btm308 (2007). -
GAPIT (custom modified version of 2020.10.24 Gapit 3.0)
script: GWA/GWA.R.txt
reference: Lipka, A. E. et al. GAPIT: genome association and prediction integrated tool. Bioinformatics 28, 2397-2399, doi:10.1093/bioinformatics/bts444 (2012). -
simpleM
reference: Gao, X., Starmer, J. & Martin, E. R. A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genet Epidemiol 32, 361-369, doi:10.1002/gepi.20310 (2008). -
R package qqman (v0.1.8)
reference: Turner, S. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. (2014).
- simpleM (version July 2, 2009)
script: GWA/simpleMSigThreshold.R.txt
reference: Gao, X., Starmer, J. & Martin, E. R. A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genet Epidemiol 32, 361-369, doi:10.1002/gepi.20310 (2008).
- qqman (v0.1.8)
script: GWA/qqman4Manhattanplots.R
reference: Turner, S. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. (2014).
-
TRIMMOMATIC (v0.36)
reference: Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120, doi:10.1093/bioinformatics/btu170 (2014). -
Rsubread (v2.14.1) in R 3.5.0
reference: Liao, Y., Smyth, G. K. & Shi, W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res 47, e47, doi:10.1093/nar/gkz114 (2019). -
FeatureCounts (v2.14.1) in R 3.5.0
reference: Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923-930, doi:10.1093/bioinformatics/btt656 (2014). -
Limma (v3.56.1)
reference: Weir, B. S. & Cockerham, C. C. Estimating F-Statistics for the Analysis of Population Structure. Evolution 38, 1358-1370, doi:10.2307/2408641 (1984).
- lm in R 3.5.0
reference: Everitt, B. Book reviews: Chambers JM, Hastie TJ eds 1992: Statistical models in S. California: Wadsworth and Brooks/Cole. ISBN 0 534 16765-9. Statistical Methods in Medical Research 1, 220-221, doi:10.1177/096228029200100208 (1992).