Roadmap NLeSC/Naturalis collaboration

This roadmap tentatively proposes enhancements to the pipeline to be implemented in an envisioned collaboration between e-Science experts at NLeSC and Naturalis biodiversity researchers. The collaboration is based on the following background:

Research Idea

DNA barcoding is an application of high-throughput DNA sequencing aimed at identifying species and the boundaries
between them at the molecular level, thereby massively scaling up our potential for rapid biodiversity assessment in
the environment, a key source of data on how the biodiversity crisis unfolds around us. DNA barcoding data can, in
principle, be used for reconstructing the topology of the tree of life. Such an estimate of tree shape is useful at
multiple points of the value chain in biodiversity assessment: firstly, in curating barcode sequence records, and
secondly, in allowing for the calculation of phylogenetic diversity metrics, which key intergovernmental bodies such
as IPBES consider more informative than flat lists of identified species. Such metrics are, however, hard to obtain in
the absence of available reference trees. Here, we propose to remedy this by developing a robust pipeline for inferring the
tree of life for the cytochrome oxidase subunit 1 gene, a key barcoding marker for which Naturalis is the custodian of
the European instance of the central barcoding database BOLD. We furthermore propose to implement this through a
divide-and-conquer algorithm applying topological constraints from the Open Tree of Life. This is necessary because
phylogenetic inference is NP-hard and consequently out of reach at the scale of BOLD (which has 10^6 records). Initial
prototypes suggest the approach is feasible. What we request is the support to turn these prototypes into a robust,
scalable application.

eScience and technological challenge

The current prototype is a Snakemake pipeline. The approach it takes is to partition the BOLD
(Ratnasingham & Hebert, 2007) sequences of the target group of species into their constituent taxonomic families. For
each family, the tree is then inferred using a constraint set from the Open Tree of Life (Hinchliff et al., 2015).
Subsequently, two exemplar species on opposite sides of the tree root are selected from each family; these
participate in the construction of the high-level backbone topology. The family trees are then grafted onto the
backbone. This approach works for smaller taxonomic groups and, because it parallelizes trivially, should also work for
larger groups. However, the parallelization strategy needs to know the expected number of families ahead of time (this
is simply due to poor Snakemake practices) and generates large numbers of intermediate data and log files in a
relatively unstructured fashion. We suspect that the expertise at NLeSC in Snakemake pipeline development can address
this. As further eScience challenges, we anticipate that the input data can be partitioned more evenly (families differ
widely in size) and that the pipeline could be integrated with a facility for publishing periodic releases of the output
in accordance with the FAIR principles.
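
To make the partitioning and exemplar-selection steps concrete, here is a minimal sketch in plain Python. It is not the prototype's code. The function names, the nested-tuple tree representation, and the toy records are all assumptions made for illustration, and the rule of taking the first tip found on each side of the root is a placeholder for whatever criterion the pipeline actually applies.

```python
from collections import defaultdict


def partition_by_family(records):
    """Group (sequence_id, family) pairs into a dict: family -> list of ids."""
    families = defaultdict(list)
    for seq_id, family in records:
        families[family].append(seq_id)
    return dict(families)


def tips(node):
    """Return all tip labels below a node (a tip is a str, a clade a tuple)."""
    if isinstance(node, str):
        return [node]
    return [label for child in node for label in tips(child)]


def pick_exemplars(rooted_tree):
    """Pick one tip from each of the two clades that descend from the root."""
    if isinstance(rooted_tree, str) or len(rooted_tree) != 2:
        raise ValueError("expected a root with exactly two child clades")
    left, right = rooted_tree
    # Assumption: take the first tip encountered on each side of the root.
    return tips(left)[0], tips(right)[0]


# Toy input, hypothetical identifiers and family names.
records = [
    ("s1", "FamilyA"), ("s2", "FamilyA"), ("s3", "FamilyA"),
    ("s4", "FamilyB"), ("s5", "FamilyB"),
]
families = partition_by_family(records)

# Hypothetical rooted tree for FamilyA: ((s1, s2), s3).
family_tree = (("s1", "s2"), "s3")
exemplars = pick_exemplars(family_tree)

print(families)
print(exemplars)
```

Because the set of families is derived from the data here rather than fixed in advance, the same grouping could in principle drive the per-family jobs. How that maps onto Snakemake rules is left open in this sketch.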
