
Eval.nf input data formatting

Thomas Krannich edited this page May 7, 2024 · 3 revisions

The eval.nf input options

The eval.nf variant callset evaluation workflow has two distinct and mutually exclusive parameters to provide input data. Both are explained in greater detail below.

Input sample sheet

(🟢 Likely the easier option to understand.) With the --sample_sheet parameter the user can specify a sample sheet (a CSV file) as the eval.nf workflow's input data. The first line of the table must be a header of the format "index,truthset,callset". Each following line contains an index and the corresponding truth- and callset. The column order is not strict but must be consistent across all lines. In general, the sample sheet looks like:

index,truthset,callset
1,/path/to/truthsetOne.vcf,/path/to/callsetOne.vcf
2,/path/to/truthsetTwo.vcf,/path/to/callsetTwo.vcf
3,/path/to/truthsetThree.vcf,/path/to/callsetThree.vcf
4,/path/to/truthsetFour.vcf,/path/to/callsetFour.vcf
...
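A quick way to sanity-check a sample sheet before launching the workflow is a short validation script. The following is an illustrative sketch (not part of CIEVaD) that assumes the header format described above:

```python
import csv

def check_sample_sheet(path):
    """Sanity-check a CIEVaD-style sample sheet: the header must contain
    exactly index, truthset and callset (in any order), and every row
    must have a non-empty value in each column."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        expected = {"index", "truthset", "callset"}
        if set(reader.fieldnames or []) != expected:
            raise ValueError(f"header must contain exactly {expected}")
        for row in reader:
            if any(not (value or "").strip() for value in row.values()):
                raise ValueError(f"incomplete row: {row}")
    return True
```

Running such a check before `nextflow run` catches malformed sheets early, instead of failing mid-workflow.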

Here is a more concrete example of the sample sheet using CIEVaD's test data from the aux/ci_data/ subdirectory, with truthset data from a default hap.nf run and assuming CIEVaD was downloaded to your home (~) directory:

index,truthset,callset
1,~/cievad/results/simulated_hap1.vcf,~/cievad/aux/ci_data/callset_1.vcf.gz
2,~/cievad/results/simulated_hap2.vcf,~/cievad/aux/ci_data/callset_2.vcf.gz
3,~/cievad/results/simulated_hap3.vcf,~/cievad/aux/ci_data/callset_3.vcf.gz

As of May 7th, 2024 (version 0.3.0), functionality has only been tested and confirmed with absolute file paths. In contrast to the other input option below (section "Input directory"), the truth- and callsets do not need to comply with a naming convention. It is up to the user to verify that the truthset and callset in each line correspond to each other for the evaluation.
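Since only absolute paths are confirmed to work, it can help to rewrite a sheet that contains relative or `~`-prefixed paths. This is a hypothetical helper (not part of CIEVaD), assuming the column names shown above:

```python
import csv
import os

def absolutize_sheet(src, dst):
    """Rewrite a sample sheet so that all truthset/callset paths are
    absolute; a leading '~' is expanded to the user's home directory."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in ("truthset", "callset"):
                row[col] = os.path.abspath(os.path.expanduser(row[col]))
            writer.writerow(row)
```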

Finally, with a correctly formatted sample sheet (e.g. my_samples.csv in the CIEVaD root directory), a run command of the evaluation workflow simply looks like:

nextflow run eval.nf -profile local,conda --sample_sheet my_samples.csv

Input directory

With the --callsets_dir parameter the user can specify a directory for the workflow to automatically detect variant callsets (VCF files). Each VCF file has to comply with the naming format callset_<X>.vcf[.gz], where <X> is the index of the corresponding truthset. Callsets can optionally be gzip compressed. For example, CIEVaD comes with some test data in the aux/ci_data/ subdirectory:

$ tree aux/ci_data/
aux/ci_data/
├── callset_1.vcf.gz
├── callset_2.vcf.gz
├── callset_3.vcf.gz
└── README.md
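Whether a directory's filenames follow the callset_<X>.vcf[.gz] convention can be checked with a small regular expression. This sketch (an illustration, not part of the workflow) extracts the truthset indices and silently ignores non-matching files such as README.md:

```python
import re

# Matches callset_<X>.vcf with an optional .gz suffix; group 1 is <X>.
CALLSET_RE = re.compile(r"^callset_(\d+)\.vcf(\.gz)?$")

def callset_indices(filenames):
    """Return the sorted truthset indices encoded in valid callset
    filenames, skipping files that do not match the convention."""
    return sorted(int(m.group(1)) for f in filenames
                  if (m := CALLSET_RE.match(f)))
```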

These callsets were generated from the NGS data of three simulated haplotypes produced by the hap.nf workflow. Hence, the indices [1-3] in the filenames callset_[1-3].vcf.gz correspond to the indices of the truthsets results/simulated_hap[1-3].vcf. To use the test data for eval.nf, the run command would simply look like:

nextflow run eval.nf -profile local,conda --callsets_dir aux/ci_data/

Here, the corresponding truthsets are assumed to be in the default location (the results directory) and are found automatically. Tip: callsets can also be UNIX symlinks, which comes in handy when dealing with larger numbers of truth- and callsets.
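The symlink tip above can be automated: given a list of existing (here assumed gzip-compressed) VCF files, create correctly named links in a callset directory. A hypothetical sketch, not part of CIEVaD:

```python
import os

def link_callsets(vcf_paths, target_dir):
    """Create symlinks named callset_<X>.vcf.gz (X = 1, 2, ...) in
    target_dir, each pointing at one of the given VCF files.
    Assumes the source files are gzip-compressed VCFs."""
    os.makedirs(target_dir, exist_ok=True)
    for i, src in enumerate(vcf_paths, start=1):
        link = os.path.join(target_dir, f"callset_{i}.vcf.gz")
        if not os.path.lexists(link):
            os.symlink(os.path.abspath(src), link)
```

The resulting directory can then be passed directly via --callsets_dir without copying any data.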


