genomehubs/goat-nlp

Google Summer of Code at the Tree of Life at the Wellcome Sanger Institute

Accepting proposals for Google Summer of Code 2024

GoaT-NLP

Natural language search across the tree of life

The Tree of Life at the Wellcome Sanger Institute is generating high-quality genome assemblies as part of the Earth BioGenome Project (EBP), a global initiative to generate reference-quality genome sequences for all species on earth. Given the scale of this initiative, we need ready access to metadata relevant to sample collection, sequencing and assembly and a platform to coordinate our efforts with those of other projects under the EBP umbrella. To meet this need, we have developed Genomes on a Tree (GoaT), an Elasticsearch-based datastore, search engine, and reporting platform, with directly-measured or estimated values for a suite of attributes across all known species.

This project is about bridging the gap between the potential of GoaT to perform queries relevant to all stages of the biodiversity genomics projects within the EBP and users' ability to formulate these queries using the syntax that the existing API, CLI and front end UI require. The aim is to be able to directly answer questions like:

  • Which plant families do not yet have a reference-quality genome assembly for any species? [UI result table]
  • How many butterfly species without an assembly have an expected genome size greater than 1 billion base pairs? [API /count endpoint]
  • Which species on a project target list are already being sequenced by another EBP partner project? [API /search endpoint]
  • What proportion of reference-quality genome assemblies have been produced by EBP vs non-EBP projects in each of the last 5 years? [UI report view for 1 year]

GenomeHubs project

GoaT is part of a broader collection of tools developed under the GenomeHubs project. A closely related tool, BoaT, indexes data within assemblies, and it is anticipated that development of GoaT-NLP will also benefit BoaT and further GenomeHubs projects still in development. All GenomeHubs source code is open source under the MIT license and available from the GenomeHubs GitHub organisation, primarily in the genomehubs/genomehubs repository. Configuration files to define the source data and customise the UI for GoaT are in the genomehubs/goat-data and the genomehubs/goat-ui repositories, respectively.

Data structure

To support queries like the examples above, GoaT stores directly measured and estimated values for a range of attributes alongside taxonomic information, including rank and lineage, as one document per taxon in the datastore. The data structure for the taxon index is summarised below; other data types, including assembly, sample and features, are stored in separate indexes.

A processed taxon document can be obtained from the /record endpoint of the API, e.g. /api/v2/record?recordId=9612&result=taxon, or viewed in the UI by visiting the corresponding taxon record page, e.g. /record?recordId=9612&result=taxon.
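As a minimal sketch of how the API path above could be assembled in a script, the helper below builds the /record URL without making a request. The host is taken from the goat.genomehubs.org address mentioned elsewhere in this README; the `https` scheme and the function name `record_url` are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumption: the public GoaT instance is served over HTTPS from this host.
BASE_URL = "https://goat.genomehubs.org"


def record_url(record_id, result="taxon", base_url=BASE_URL):
    """Return the /record API URL for one record (no request is made)."""
    params = urlencode({"recordId": record_id, "result": result})
    return f"{base_url}/api/v2/record?{params}"


print(record_url(9612))
```

The returned string can be passed to any HTTP client to fetch the processed document.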

Each document has a core set of keyword fields:

  • taxon_id - unique taxon ID in the current taxonomy (default NCBI taxonomy)
  • parent - taxon ID of the parent taxon
  • scientific_name - scientific name of the taxon
  • taxon_rank - rank of the taxon, e.g. species, genus, family, etc.

Additional fields are divided into three groups:

taxon_names:

A set of nested fields for each name of the taxon:

  • name - the taxon name
  • class - the taxon name class, e.g. scientific name, common name, etc.
  • source - the source of the taxon name, e.g. NCBI, GBIF, etc.

lineage:

An ordered set of nested fields for each ancestor of the taxon:

  • taxon_id - the unique ID of the ancestral taxon
  • taxon_rank - the rank of the ancestral taxon
  • scientific_name - the scientific name of the ancestral taxon
  • node_depth - the depth of the ancestral taxon in the taxonomic tree

attributes:

A set of nested fields for each attribute:

  • key - the unique attribute name
  • *_value - the summary value of the attribute, where * is the attribute type, which largely corresponds to the Elasticsearch field data types
  • source - the source of the attribute, e.g. NCBI, GBIF, etc.
  • min, max, mean, median - summary statistics for the attribute
  • aggregation_method - the aggregation method used to generate a summary value
  • aggregation_source - the source of the attribute used to generate a summary value (direct, ancestor, descendant)

Each attribute value in the taxon index can be derived from one or more raw values, which are stored as a nested set of values in the values field.

The full mapping used is defined in taxon.json. Similar mappings are used for the other document types.
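The structure described above can be pictured as a nested dictionary. The sketch below is illustrative only: every ID, name and number is an invented placeholder, `example_size` is a hypothetical attribute, and the layout of the raw `values` entries is an assumption rather than a copy of taxon.json. The two helper functions are likewise hypothetical.

```python
# Illustrative taxon document: every ID, name and number is a made-up placeholder.
taxon_doc = {
    "taxon_id": "900001",
    "parent": "900000",
    "scientific_name": "Examplus demonstratus",
    "taxon_rank": "species",
    "taxon_names": [
        {"name": "Examplus demonstratus", "class": "scientific name", "source": "NCBI"},
        {"name": "example beetle", "class": "common name", "source": "NCBI"},
    ],
    # Ordered nearest ancestor first (the ordering direction is an assumption).
    "lineage": [
        {"taxon_id": "900000", "taxon_rank": "genus",
         "scientific_name": "Examplus", "node_depth": 1},
        {"taxon_id": "899999", "taxon_rank": "family",
         "scientific_name": "Examplidae", "node_depth": 2},
    ],
    "attributes": [
        {
            "key": "example_size",   # hypothetical attribute name
            "long_value": 1200,      # "*_value" with * = long
            "source": "NCBI",
            "min": 1000, "max": 1400, "mean": 1200, "median": 1200,
            "aggregation_method": "median",
            "aggregation_source": "direct",
            # Raw values behind the summary (inner layout is an assumption).
            "values": [{"long_value": 1000}, {"long_value": 1400}],
        }
    ],
}


def summary_value(doc, key):
    """Return the *_value entry of the attribute called `key`, or None."""
    for attribute in doc["attributes"]:
        if attribute["key"] == key:
            for field, value in attribute.items():
                if field.endswith("_value"):
                    return value
    return None


def ancestor_at_rank(doc, rank):
    """Return the first lineage entry with the given taxon_rank, or None."""
    for ancestor in doc["lineage"]:
        if ancestor["taxon_rank"] == rank:
            return ancestor
    return None


print(summary_value(taxon_doc, "example_size"))
print(ancestor_at_rank(taxon_doc, "family")["scientific_name"])
```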

Query syntax

The query syntax currently used by GoaT is tied to this structured data model. It supports simple and highly-specific queries, but takes time to learn and presents a barrier to wider data access.

GoaT query syntax allows any combination of tax_ filters and <attribute> <operator> <value> clauses to be joined with AND operators.

tax_ filters are used to restrict the taxonomic scope of a query as follows:

  • tax_name(<value>) - return results where any taxon name or ID at the top-level or in taxon_names matches value
  • tax_tree(<value>) - return results where the name or taxon ID of any taxon in the lineage matches value
  • tax_rank(<value>) - return results where the taxon_rank matches value
  • tax_depth(<value>) - return results where the node_depth of any taxon in the lineage < value
  • tax_lineage(<value>) - return results for each ancestral taxon in the lineage of a record where any taxon name or ID at the top-level or in taxon_names matches value

The operators supported are: =, !=, >, >=, <, and <=. A full list of available attribute names, types and value constraints is available at goat.genomehubs.org/types.
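To illustrate the <attribute> <operator> <value> shape, here is a small sketch that splits a single clause into its three parts using a regular expression. It is not GoaT's own parser: the function name `parse_clause` and the rule that attribute names are plain identifiers are assumptions, and the attribute names in the example are borrowed from examples elsewhere in this README.

```python
import re

# Two-character operators are listed first so ">=" is not read as ">" then "=".
CLAUSE_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*(!=|>=|<=|=|>|<)\s*(.+?)\s*$")


def parse_clause(clause):
    """Split '<attribute> <operator> <value>' into a 3-tuple of strings."""
    match = CLAUSE_RE.match(clause)
    if match is None:
        raise ValueError(f"not an <attribute> <operator> <value> clause: {clause!r}")
    return match.groups()


print(parse_clause("assembly_date >= 2023-01-01"))
print(parse_clause("long_list=DTOL"))
```

A check like this could serve as a first validation step for machine-generated queries, for example by comparing the parsed attribute name against the list published at goat.genomehubs.org/types.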

Support for logical OR operators is currently limited to the ability to provide a comma-separated list of values for an attribute or tax_ filter, in which case results will be returned if at least one value matches the query. For example, tax_tree(fungi,metazoa) AND long_list(DTOL,GAGA) will return results for taxa in either the fungi or metazoa lineage and where the long_list attribute contains either DTOL or GAGA.

Values of summary statistics can also be queried using the min, max, mean, median modifiers, e.g. using min(assembly_date)>=2023-01-01 to find taxa by the earliest assembly date.
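The pieces of the syntax described in this section can be combined programmatically. The following is a minimal sketch, assuming hypothetical helper names (`tax_filter`, `clause`, `build_query`); it only formats strings in the forms shown above and does not check attribute names or values against the GoaT type list.

```python
TAX_FILTERS = {"name", "tree", "rank", "depth", "lineage"}
OPERATORS = {"=", "!=", ">", ">=", "<", "<="}
SUMMARIES = {"min", "max", "mean", "median"}


def tax_filter(kind, *values):
    """Format a tax_ filter; several values act as a logical OR."""
    if kind not in TAX_FILTERS:
        raise ValueError(f"unknown tax_ filter: {kind!r}")
    return f"tax_{kind}({','.join(str(v) for v in values)})"


def clause(attribute, operator, value, summary=None):
    """Format an attribute clause, optionally wrapped in a summary modifier."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    if summary is not None:
        if summary not in SUMMARIES:
            raise ValueError(f"unknown summary modifier: {summary!r}")
        attribute = f"{summary}({attribute})"
    return f"{attribute}{operator}{value}"


def build_query(*parts):
    """Join filters and clauses with AND, the only joining operator."""
    return " AND ".join(parts)


query = build_query(
    tax_filter("tree", "fungi", "metazoa"),
    tax_filter("rank", "species"),
    clause("assembly_date", ">=", "2023-01-01", summary="min"),
)
print(query)
# tax_tree(fungi,metazoa) AND tax_rank(species) AND min(assembly_date)>=2023-01-01
```

Keeping the allowed filters, operators and modifiers in explicit sets makes it straightforward to reject malformed output before a query is sent to GoaT.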

Natural language search

GoaT-NLP aims to extend the capabilities of GoaT to support natural language queries. The project aims to:

  • Take natural language queries and convert them to structured queries using the GoaT query syntax.

  • Automatically select the most appropriate type of search to perform and return results as a natural language statement.

  • Augment GoaT search results with extracts from unstructured text.

  • Extract information from text using machine learning models for indexing.

We are open to suggestions for further directions in which to develop this project. Validation will be as important as information retrieval, to ensure that the results presented accurately reflect the intended query.

Contributing

We are proposing the GoaT-NLP project as a Google Summer of Code project for 2024. If you are interested in contributing to GoaT-NLP, please read the information provided in the ToL+PaM GSoC 2024 Google Doc and use the information in that document to get in touch with any questions you may have.

Proposals

We will assess applications from potential GSoC contributors on the basis of the proposal. Again, see the ToL+PaM GSoC 2024 Google Doc for more, but broadly, we want to know:

  • how would you approach this project?
  • which technologies would you use and why?
  • what would be the key milestones and when would you reach them?
  • how would you ensure the sustainability of your code beyond the end of the GSoC term?

You should follow the GSoC contributor guidelines to help structure your proposal. Note that we'd like to see a diagram of your suggested implementation. While we have no fixed length limit, we value the ability to identify and focus on the core elements of your proposal and to write concisely.

Resources

GoaT videos

GoaT was part of Biodiversity Genomics Academy 2023 (BGA23); watch the video tutorial or view the slides.
