- The first part describes the dataset. It is given as a two-column key/value format. The keys are case-insensitive, but the values are case-sensitive. The order of rows is unimportant. (A parsing sketch for the whole spreadsheet appears after this list.)
- Organism. Usually genus and species, but there is no hard rule at this time.
- Outbreak. This is usually an outbreak code but can be some other descriptor of the dataset.
- pmid. Any publications associated with this dataset should be listed as PubMed IDs.
- tree. This is a URL to the Newick-formatted tree. This tree serves as a guide for future analyses.
- source. Where did this dataset come from?
- intendedUsage. How do you think others will use this dataset?
- Blank row - separates the two parts of the dataset
- Header row with these names (case-insensitive, in any order): biosample_acc, ...
- Sample info. Each row represents a genome and must have the following fields. Use a dash (-) for any missing data.
- biosample_acc - The BioSample accession
- strain - Its genome name
- genbankAssembly - GenBank accession number
- SRArun_acc - SRR accession number
- outbreak - The name of the outbreak clade, usually named after an outbreak code. If the genome is not part of an important clade, the field can be filled in with 'outgroup'.
- dataSetName - This should be redundant with the outbreak field in the first part of the spreadsheet.
- suggestedReference - The suggested reference genome for analysis, e.g., SNP analysis.
- sha256sumAssembly - A checksum for the GenBank file
- sha256sumRead1 - A checksum for the first read from the SRR accession
- sha256sumRead2 - A checksum for the second read from the SRR accession
- nucleotide - A single nucleotide accession. This is sometimes an alternative to an assembly, especially for one-contig genomes.
- sha256sumnucleotide - A checksum for the single nucleotide accession.
- amplicon_strategy - Which amplicon strategy was used, e.g., ARTIC V3.
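
To make the layout concrete, here is a minimal Python sketch that splits a spreadsheet in this format into its two parts: the dataset-level key/value pairs and the per-sample rows. It assumes a tab-delimited file, as implied by the `.tsv` extension used below, and the file name in the usage comment is only a placeholder.

```python
import csv

def read_dataset_spreadsheet(path):
    """Split a dataset TSV into (metadata dict, list of sample dicts)."""
    metadata, samples, header = {}, [], None
    past_blank = False
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not any(field.strip() for field in row):
                past_blank = True  # the blank row separates the two parts
                continue
            if not past_blank:
                # Part one: two-column key/value rows; keys are case-insensitive
                key = row[0].strip().lower()
                value = row[1].strip() if len(row) > 1 else ""
                metadata[key] = value
            elif header is None:
                # Part two starts with the header row (biosample_acc, strain, ...)
                header = [field.strip().lower() for field in row]
            else:
                # One row per genome; missing data is a dash (-)
                samples.append(dict(zip(header, row)))
    return metadata, samples

# Example usage with a placeholder file name:
# metadata, samples = read_dataset_spreadsheet("dataset.tsv")
# print(metadata.get("organism"), len(samples))
```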
This specification uses sha256 to calculate hashsums. To create a hashsum for a file, e.g., `file.fastq.gz`, run the following:

    sha256sum file.fastq.gz

We include a script, `adjustHashsums.pl`, to help create hashsums in the spreadsheet automatically.
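
If the `sha256sum` utility is not available, the same digest can be computed programmatically. Below is a minimal sketch using Python's standard `hashlib` module; the file name is only a placeholder.

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Should print the same digest as `sha256sum file.fastq.gz`
print(sha256sum("file.fastq.gz"))
```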
Here are the suggested steps:
- Create the spreadsheet with the detailed fields described above, but do not include hashsum values in the relevant fields.
- Run `GenFSGopher.pl` using your new spreadsheet. It will err due to incorrect hashsums.
- A file `in.tsv`, identical to the input file, should be in the output directory.
- Run `adjustHashsums.pl` on `in.tsv` to create a file `out.tsv`. `out.tsv` will have the correct hashsums. (A rough sketch of the idea behind this step follows.)
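
The scripts above handle this workflow for you; purely to illustrate the idea behind the hashsum-filling step, here is a rough Python sketch. It is not the logic of `adjustHashsums.pl`: for simplicity it assumes a file containing only the sample table (the second part of the spreadsheet), and the `locate_file` callback that maps a hashsum column to its downloaded file is a hypothetical stand-in supplied by the caller.

```python
import csv
import hashlib

def sha256sum(path):
    """Hex SHA-256 digest of a file (same helper as in the earlier sketch)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fill_hashsums(sample_tsv, out_tsv, locate_file):
    """Copy a sample table, filling in empty or dashed sha256sum* columns.

    `locate_file(row, column)` is a hypothetical callback that returns the
    path of the downloaded file the given hashsum column refers to, or
    None if that file is not present.
    """
    with open(sample_tsv, newline="") as src, open(out_tsv, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            for column in reader.fieldnames:
                if column.lower().startswith("sha256sum") and row[column] in ("", "-"):
                    path = locate_file(row, column)
                    if path is not None:
                        row[column] = sha256sum(path)
            writer.writerow(row)
```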