GoaT-NLP

Google Summer of Code at the Tree of Life at the Wellcome Sanger Institute

Accepting proposals for Google Summer of Code 2024

Natural language search across the tree of life

The Tree of Life at the Wellcome Sanger Institute is generating high-quality genome assemblies as part of the Earth BioGenome Project (EBP), a global initiative to generate reference-quality genome sequences for all species on Earth. Given the scale of this initiative, we need ready access to metadata relevant to sample collection, sequencing and assembly, and a platform to coordinate our efforts with those of other projects under the EBP umbrella. To meet this need, we have developed Genomes on a Tree (GoaT), an Elasticsearch-based datastore, search engine, and reporting platform, with directly measured or estimated values for a suite of attributes across all known species.

This project aims to bridge the gap between GoaT's potential to answer queries relevant to all stages of the biodiversity genomics projects within the EBP and users' ability to formulate these queries in the syntax that the existing API, CLI and front-end UI require. The goal is to directly answer questions like:

Which plant families do not yet have a reference-quality genome assembly for any species? [UI result table]
How many butterfly species without an assembly have an expected genome size greater than 1 billion base pairs? [API /count endpoint]
Which species on a project target list are already being sequenced by another EBP partner project? [API /search endpoint]
What proportion of reference-quality genome assemblies have been produced by EBP vs non-EBP projects in each of the last 5 years? [UI report view for 1 year]

GenomeHubs project

GoaT is part of a broader collection of tools developed under the GenomeHubs project. A closely related tool, BoaT, indexes data within assemblies, and it is anticipated that development of GoaT-NLP will also benefit BoaT and further GenomeHubs projects still in development. All GenomeHubs source code is open source under the MIT license and available from the GenomeHubs GitHub organisation, primarily in the genomehubs/genomehubs repository. Configuration files to define the source data and customise the UI for GoaT are in the genomehubs/goat-data and genomehubs/goat-ui repositories, respectively.

Data structure

To support queries like the examples above, GoaT stores directly measured and estimated values for a range of attributes alongside taxonomic information, including rank and lineage, as one document per taxon in the datastore. The data structure for the taxon index is summarised below. Other data types, including assembly, sample and features, are stored in separate indexes.

A processed taxon document can be obtained from the /record endpoint of the API, e.g. /api/v2/record?recordId=9612&result=taxon, or viewed in the UI by visiting the corresponding taxon record page, e.g. /record?recordId=9612&result=taxon.
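As a minimal sketch, both URL forms above can be assembled with Python's standard library. The host in `BASE_URL` is an assumption taken from the goat.genomehubs.org address mentioned elsewhere in this README, and `record_url` is a hypothetical helper rather than part of any GoaT client.

```python
from urllib.parse import urlencode

# Assumption: the public GoaT instance is served from this host.
BASE_URL = "https://goat.genomehubs.org"


def record_url(record_id: str, result: str = "taxon", api: bool = True) -> str:
    """Build a /record URL for the API (api=True) or the UI page (api=False)."""
    path = "/api/v2/record" if api else "/record"
    params = urlencode({"recordId": record_id, "result": result})
    return f"{BASE_URL}{path}?{params}"


print(record_url("9612"))             # API form
print(record_url("9612", api=False))  # UI form
```

The helper only builds the URL; fetching and decoding the response is left to whichever HTTP client you prefer.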

Each document has a core set of keyword fields:

taxon_id - unique taxon ID in the current taxonomy (default NCBI taxonomy)
parent - taxon ID of the parent taxon
scientific_name - scientific name of the taxon
taxon_rank - rank of the taxon, e.g. species, genus, family, etc.

Additional fields are divided into three groups:

`taxon_names:`

A set of nested fields for each name of the taxon:

name - the taxon name
class - the taxon name class, e.g. scientific name, common name, etc.
source - the source of the taxon name, e.g. NCBI, GBIF, etc.

`lineage:`

An ordered set of nested fields for each ancestor of the taxon:

taxon_id - the unique ID of the ancestral taxon
taxon_rank - the rank of the ancestral taxon
scientific_name - the scientific name of the ancestral taxon
node_depth - the depth of the ancestral taxon in the taxonomic tree

`attributes:`

A set of nested fields for each attribute:

key - the unique attribute name
*_value - the summary value of the attribute, where * is the attribute type, which largely corresponds to the list of Elasticsearch field data types
source - the source of the attribute, e.g. NCBI, GBIF, etc.
min, max, mean, median - summary statistics for the attribute
aggregation_method - the aggregation method used to generate a summary value
aggregation_source - the source of the attribute used to generate a summary value (direct, ancestor, descendant)

Each attribute value in the taxon index can be derived from one or more raw values, which are stored as a nested set of values in the values field.

The full mapping used is defined in taxon.json. Similar mappings are used for the other document types.
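To make the field groups above concrete, here is a hand-written sketch of a taxon document as a Python dictionary, with two small lookup helpers. Every ID, name, depth and value in `doc` is a placeholder invented for illustration, and `ancestor_at_rank` and `summary_value` are hypothetical helpers. The authoritative field definitions remain those in taxon.json.

```python
from typing import Any, Optional

# Hypothetical document: all IDs, names and values are placeholders.
doc: dict[str, Any] = {
    "taxon_id": "sp1",
    "parent": "gen1",
    "scientific_name": "Examplus primus",
    "taxon_rank": "species",
    "taxon_names": [
        {"name": "Examplus primus", "class": "scientific name", "source": "NCBI"},
        {"name": "first example", "class": "common name", "source": "NCBI"},
    ],
    "lineage": [
        {"taxon_id": "gen1", "taxon_rank": "genus",
         "scientific_name": "Examplus", "node_depth": 1},
        {"taxon_id": "fam1", "taxon_rank": "family",
         "scientific_name": "Examplidae", "node_depth": 2},
    ],
    "attributes": [
        {"key": "assembly_date", "date_value": "2023-05-01",
         "min": "2023-05-01", "max": "2024-02-10",
         "aggregation_method": "min", "aggregation_source": "direct"},
    ],
}


def ancestor_at_rank(document: dict[str, Any], rank: str) -> Optional[str]:
    """Return the scientific name of the first ancestor with the given rank."""
    for ancestor in document["lineage"]:
        if ancestor["taxon_rank"] == rank:
            return ancestor["scientific_name"]
    return None


def summary_value(document: dict[str, Any], key: str) -> Any:
    """Return the *_value field of the attribute named key, whatever its type."""
    for attribute in document["attributes"]:
        if attribute["key"] == key:
            for field, value in attribute.items():
                if field.endswith("_value"):
                    return value
    return None


print(ancestor_at_rank(doc, "family"))
print(summary_value(doc, "assembly_date"))
```

Looking up the summary value by its `_value` suffix avoids hard-coding the attribute type in calling code.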

Query syntax

The query syntax currently used by GoaT is tied to this structured data model. It supports simple and highly specific queries, but takes time to learn and presents a barrier to wider data access.

GoaT query syntax allows any combination of tax_ filters and <attribute> <operator> <value> clauses to be joined with AND operators.

tax_ filters are used to restrict the taxonomic scope of a query as follows:

tax_name(<value>) - return results where any taxon name or ID at the top-level or in taxon_names matches value
tax_tree(<value>) - return results where the name or taxon ID of any taxon in the lineage matches value
tax_rank(<value>) - return results where the taxon_rank matches value
tax_depth(<value>) - return results where the node_depth of any taxon in the lineage < value
tax_lineage(<value>) - return results for each ancestral taxon in the lineage of a record where any taxon name or ID at the top-level or in taxon_names matches value

The operators supported are: =, !=, >, >=, <, and <=. A full list of available attribute names, types and value constraints is available at goat.genomehubs.org/types.
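The pieces described above can be composed programmatically. The following is a small sketch, not a GoaT client: `tax_filter`, `clause` and `build_query` are hypothetical helper names, and the operator is written without surrounding spaces on the assumption that both spacings are accepted.

```python
# The six comparison operators listed in this README.
OPERATORS = ("=", "!=", ">", ">=", "<", "<=")


def tax_filter(name: str, value: str) -> str:
    """Format a tax_ filter, e.g. tax_filter("tree", "fungi") -> tax_tree(fungi)."""
    return f"tax_{name}({value})"


def clause(attribute: str, operator: str, value: object) -> str:
    """Format an <attribute><operator><value> clause, rejecting unknown operators."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"{attribute}{operator}{value}"


def build_query(*parts: str) -> str:
    """Join filters and clauses with AND, the only joining operator described."""
    return " AND ".join(parts)


query = build_query(
    tax_filter("tree", "fungi"),
    tax_filter("rank", "species"),
    clause("assembly_date", ">=", "2023-01-01"),
)
print(query)  # tax_tree(fungi) AND tax_rank(species) AND assembly_date>=2023-01-01
```

Rejecting unknown operators at construction time surfaces mistakes before a query is ever sent to the API.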

Support for logical OR operators is currently limited to the ability to provide a comma-separated list of values for an attribute or tax_ filter, in which case results will be returned if at least one value matches the query. For example, tax_tree(fungi,metazoa) AND long_list(DTOL,GAGA) will return results for taxa in either the fungi or metazoa lineage and where the long_list attribute contains either DTOL or GAGA.
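The "at least one value matches" behaviour can be illustrated with a toy in-memory model. This sketch only mimics the semantics described above and says nothing about how GoaT evaluates queries in Elasticsearch. `parse_terms`, `matches` and the flattened record layout are assumptions made for the example.

```python
import re

# Matches name(v1,v2,...) terms such as tax_tree(fungi,metazoa).
TERM = re.compile(r"^(\w+)\(([^()]*)\)$")


def parse_terms(query: str) -> dict[str, list[str]]:
    """Split a query on AND and return {name: [values]} for name(values) terms."""
    terms: dict[str, list[str]] = {}
    for part in query.split(" AND "):
        match = TERM.match(part.strip())
        if match is None:
            raise ValueError(f"not a name(values) term: {part!r}")
        name, values = match.groups()
        terms[name] = [v.strip() for v in values.split(",") if v.strip()]
    return terms


def matches(terms: dict[str, list[str]], record: dict[str, set[str]]) -> bool:
    """True when every term has at least one listed value present in the record."""
    return all(
        any(value in record.get(name, set()) for value in values)
        for name, values in terms.items()
    )


terms = parse_terms("tax_tree(fungi,metazoa) AND long_list(DTOL,GAGA)")

# Hypothetical flattened records: each key maps to the set of values it holds.
fungus = {"tax_tree": {"fungi"}, "long_list": {"DTOL"}}
plant = {"tax_tree": {"viridiplantae"}, "long_list": {"GAGA"}}

print(matches(terms, fungus))  # True: fungi and DTOL both match
print(matches(terms, plant))   # False: no tax_tree value matches
```

Values within a term are combined with OR, while the terms themselves are still combined with AND.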

Values of summary statistics can also be queried using the min, max, mean, median modifiers, e.g. using min(assembly_date)>=2023-01-01 to find taxa by the earliest assembly date.
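A clause with an optional summary-statistic modifier can be taken apart with a single regular expression. This is an illustrative sketch under the syntax described in this README. `parse_clause` and the `Clause` tuple are hypothetical names, and the pattern has not been checked against GoaT's own parser.

```python
import re
from typing import NamedTuple, Optional

# Two-character operators are listed first so ">=" is not read as ">" then "=".
CLAUSE = re.compile(
    r"^\s*(?:(?P<modifier>min|max|mean|median)\((?P<wrapped>\w+)\)|(?P<plain>\w+))"
    r"\s*(?P<operator>!=|>=|<=|=|>|<)\s*(?P<value>.+?)\s*$"
)


class Clause(NamedTuple):
    modifier: Optional[str]  # min, max, mean, median, or None if absent
    attribute: str
    operator: str
    value: str


def parse_clause(text: str) -> Clause:
    """Parse '<attribute><operator><value>' with an optional modifier wrapper."""
    match = CLAUSE.match(text)
    if match is None:
        raise ValueError(f"not an attribute clause: {text!r}")
    return Clause(
        modifier=match["modifier"],
        attribute=match["wrapped"] or match["plain"],
        operator=match["operator"],
        value=match["value"],
    )


print(parse_clause("min(assembly_date)>=2023-01-01"))
print(parse_clause("assembly_date < 2020-01-01"))
```

Keeping the modifier separate from the attribute name makes it straightforward to check each against its own list of allowed values.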

Natural language search

GoaT-NLP aims to extend the capabilities of GoaT to support natural language queries. The project aims to:

Take natural language queries and convert them to structured queries using the GoaT query syntax.
Automatically select the most appropriate type of search to perform and return results as a natural language statement.
Augment GoaT search results with extracts from unstructured text.
Extract information from text using machine learning models for indexing.

We are open to suggestions for further directions to develop this project. Validation will be as important as information retrieval, to ensure that the results presented accurately reflect the intended query.
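One possible form of validation is a syntactic check on a generated query before it is run. The sketch below assumes that some earlier step (not shown, and not specified by this project) has produced a candidate query string. `validate` and the `known_attributes` set are hypothetical, and a real implementation would take attribute names from the published types list.

```python
import re

# tax_ filters named in the query syntax section of this README.
TAX_FILTERS = {"tax_name", "tax_tree", "tax_rank", "tax_depth", "tax_lineage"}

# name(values) terms, e.g. tax_tree(fungi) or long_list(DTOL,GAGA).
TERM = re.compile(r"^(\w+)\(([^()]+)\)$")
# attribute clauses with an optional min/max/mean/median wrapper.
CLAUSE = re.compile(
    r"^(?:(?:min|max|mean|median)\((\w+)\)|(\w+))\s*(?:!=|>=|<=|=|>|<)\s*\S.*$"
)


def validate(query: str, known_attributes: set[str]) -> list[str]:
    """Return a list of problems found in query; an empty list means none found."""
    problems = []
    for part in (p.strip() for p in query.split(" AND ")):
        term = TERM.match(part)
        if term:
            name = term.group(1)
            if name.startswith("tax_"):
                if name not in TAX_FILTERS:
                    problems.append(f"unknown filter: {name}")
            elif name not in known_attributes:
                problems.append(f"unknown attribute: {name}")
            continue
        clause = CLAUSE.match(part)
        if clause is None:
            problems.append(f"unparseable clause: {part}")
            continue
        attribute = clause.group(1) or clause.group(2)
        if attribute not in known_attributes:
            problems.append(f"unknown attribute: {attribute}")
    return problems


known = {"assembly_date", "long_list"}  # hypothetical subset of attribute names
print(validate("tax_tree(fungi) AND assembly_date>=2023-01-01", known))
print(validate("tax_branch(fungi) AND colour=red", known))
```

A check like this only confirms that a query is well formed. Whether it reflects what the user meant still needs separate evaluation.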

Contributing

We are proposing the GoaT-NLP project as a Google Summer of Code project for 2024. If you are interested in contributing to GoaT-NLP, please read the information provided in the ToL+PaM GSoC 2024 Google Doc and use the information in that document to get in touch with any questions you may have.

Proposals

We will assess applications from potential GSoC contributors on the basis of the proposal. Again, see the ToL+PaM GSoC 2024 Google Doc for more, but broadly, we want to know:

how would you approach this project?
which technologies would you use and why?
what would be the key milestones and when would you reach them?
how would you ensure the sustainability of your code beyond the end of the GSoC term?

You should follow the GSoC contributor guidelines to help structure your proposal. Note that we'd like to see a diagram of your suggested implementation. While we have no fixed length limit, we value the ability to identify and focus on the core elements of your proposal and to write concisely.

Resources

GoaT videos

GoaT was part of Biodiversity Genomics Academy 2023 (BGA23). Watch the video tutorial or view the slides.
