Skip to content

Latest commit

 

History

History
209 lines (147 loc) · 6.54 KB

octofludb.asc

File metadata and controls

209 lines (147 loc) · 6.54 KB

The OctoFLU Database

The OctoFLU Database is a platform for storing, curating, and integrating all our data. It is built on the GraphDB database (http://graphdb.ontotext.com/).

The database can be accessed either through the graphical user interface or the octofludb command line utility. The latter approach is strongly recommended and will be the only approach I discuss in the future sections.

Installation

The free version of GraphDB can be downloaded and installed from the website.

octofludb can be installed through pip:

pip install octofludb

You may need to replace pip with pip3, depending on your setup.

Database management with the octofludb utility

Documentation for all subcommands in octofludb can be accessed by calling octofludb -h. Subcommand documentation can be accessed with octofludb <subcommand> -h.

Once GraphDB is installed on your system, you can initialize an empty repo with octofludb init.

Getting data

A local database can be accessed by submitting SPARQL queries. A SPARQL script specifies a pattern over a graph and when given to the database will print a row of data for every occurrence of the pattern.

Below is a very simple script that returns the strain names of every gamma:

PREFIX f: <https://flu-crew.org/term/>

SELECT ?name
WHERE {
  ?sid f:strain_name ?name .
  ?sid f:host "swine" .
  ?sid f:has_segment ?gid .
  ?gid f:clade "gamma" .
}

The PREFIX line defines an abbreviatin, f, for the long URI https://flu-crew.org/term/. While the URI looks like a web link, it does not need to point to a real web page (although it may, and that is a good place to put documentation). Every edge type defined specifically for OctofluDB has this prefix.

The SELECT term is followed by a list of all the columns we want printed out for us, in this case just one.

?sid f:strain_name ?name . is a graph pattern restricting the returned data to things related to eachother by the f:strain_name predicate. Words that begin with ? are variables. There is nothing special about ?sid and ?name, these variables could be named anything. By convention, I normally use ?sid for a *s*train *id*entifier and ?gid for a *g*enbank *id*enditifer. Every statement must end with a ., otherwise GraphDB will spill out a very ugly error statement.

?sid f:has_segment ?gid . links our strain node to its child segment node. This particular link will appear in many queries.

?gid f:clade "gamma" links the segment to the literal string value "gamma" by the f:clade relation.

The query can be sent to the database with the command:

octofludb query fetch-gammas.rq

The extension .rq is convention for SPARQL scripts.

Deleting data

I will describe how to delete data with a few examples.

To understand the need for deleting data, you should know a little about how graph databases handle conflicting data. Data uploaded to GraphDB consists of many "triples", where a triple consists of a subject, a predicate (relationship), and an object. That is, a triple specifies a single edge in the graph.

For example, suppose we have an edge in the database linking the HA segment EPI_123456 to the clade "gamma". Then suppose we want to instead link it to the more specific clade "gamma.3". If we upload the new triple, however, the original link "gamma" is not removed, instead we end up with the segment being linked to both "gamma" and "gamma.3". The database doesn’t magically know that the relation "clade" should always be one-to-one. So if we want only the clade "gamma.3", we will need to delete the original clade.

The following SPARQL command will do what we want:

PREFIX f: <https://flu-crew.org/term/>

DELETE {
    ?gid f:epi_id "EPI_123456" .
    ?gid f:clade "gamma" .
}

This deletion script can be written to a file, delete-clade.rq, and executed on the local database with octofludb update delete-clade.rq.

This example expands on the prior example and introduces a more powerful technique for deleting data.

In the August OFFLU report process, we updated the octoFLU reference files to better represent the global H3 clades. If we directly upload the new H3 clades to octofludb, then we will end up with duplicate records. So we must first delete all prior non-USA H3 clades. The SPARQL deletion statement for this is shown below.

PREFIX f: <https://flu-crew.org/term/>

DELETE {
    ?gid f:clade ?clade .
    ?gid f:gl_clade ?gl_clade .
}
WHERE {
  ?sid f:strain_name ?strain_name .
  ?sid f:country/f:code ?country .
  ?sid f:host "swine" .

  FILTER (?country != "USA") .

  ?sid f:subtype ?subtype .
  FILTER REGEX(?subtype, "H3") .

  ?sid f:has_segment ?gid .
  ?gid f:segment_name "HA" .

  OPTIONAL { ?gid f:clade ?clade . }
  OPTIONAL { ?gid f:gl_clade ?gl_clade . }
}

The WHERE clause defines a set of triples and the DELETE clause specifies which of these triples should be removed.

This command should be written to a file (say, delete-global-H3-clades.rq) and then can be applied to the local database with the command:

octofludb update delete-global-H3-clades.rq

It is generally good practice to check before we delete. For example, if we accidentally added the triple ?sid f:has_segment ?gid to the DELETE clause, then we would delete the links from strains to segments. An easy way to check what will be deleted is to create the corresponding query statement:

PREFIX f: <https://flu-crew.org/term/>

SELECT *
WHERE {
   ....
}

With the WHERE clause being the same as before. The calling octofludb query to get a table showing all cases where clade info will be deleted.

Uploading data

There are three basic steps to uploading data to the database.

The first is to prepare the raw data in the "proper" way.

The second step is to convert the raw data to the turtle-format GraphDB understands. This is done through the mk_* commands in octofludb. These commands do more than just translate the data, they use specialized parsers to extract flu specific information from the data. For example, if

The third step is to upload the turtle files to the database. If you are working with a local database, this step is easy: octofludb upload <turtle-fiels>.

recipes

Recipe functions are data synthesis functions that are built into octofludb. They will be added on an as-needed database.

  • masterlist - generate a table of data describing all A0 strains. This is the base table used by octoflushow and to generate quarterly reports.