The OctoFLU Database is a platform for storing, curating, and integrating all our data. It is built on the GraphDB database (http://graphdb.ontotext.com/).
The database can be accessed either through the graphical user interface or the
octofludb
command line utility. The latter approach is strongly recommended
and will be the only approach I discuss in the future sections.
The free version of GraphDB can be downloaded and installed from the website.
octofludb
can be installed through pip:
pip install octofludb
You may need to replace pip
with pip3
, depending on your setup.
Documentation for all subcommands in octofludb
can be accessed by calling
octofludb -h
. Subcommand documentation can be accessed with octofludb
<subcommand> -h
.
Once GraphDB is installed on your system, you can initialize an empty repo with
octofludb init
.
A local database can be accessed by submitting SPARQL queries. A SPARQL script specifies a pattern over a graph and when given to the database will print a row of data for every occurrence of the pattern.
Below is a very simple script that returns the strain names of every gamma:
PREFIX f: <https://flu-crew.org/term/>
SELECT ?name
WHERE {
?sid f:strain_name ?name .
?sid f:host "swine" .
?sid f:has_segment ?gid .
?gid f:clade "gamma" .
}
The PREFIX line defines an abbreviatin, f
, for the long URI
https://flu-crew.org/term/
. While the URI looks like a web link, it does not
need to point to a real web page (although it may, and that is a good place to
put documentation). Every edge type defined specifically for OctofluDB has this
prefix.
The SELECT term is followed by a list of all the columns we want printed out for us, in this case just one.
?sid f:strain_name ?name .
is a graph pattern restricting the returned data
to things related to eachother by the f:strain_name
predicate. Words that
begin with ?
are variables. There is nothing special about ?sid
and
?name
, these variables could be named anything. By convention, I normally use
?sid
for a *s*train *id*entifier and ?gid
for a *g*enbank *id*enditifer.
Every statement must end with a .
, otherwise GraphDB will spill out a very
ugly error statement.
?sid f:has_segment ?gid .
links our strain node to its child segment node.
This particular link will appear in many queries.
?gid f:clade "gamma"
links the segment to the literal string value "gamma" by the f:clade
relation.
The query can be sent to the database with the command:
octofludb query fetch-gammas.rq
The extension .rq
is convention for SPARQL scripts.
I will describe how to delete data with a few examples.
To understand the need for deleting data, you should know a little about how graph databases handle conflicting data. Data uploaded to GraphDB consists of many "triples", where a triple consists of a subject, a predicate (relationship), and an object. That is, a triple specifies a single edge in the graph.
For example, suppose we have an edge in the database linking the HA segment EPI_123456 to the clade "gamma". Then suppose we want to instead link it to the more specific clade "gamma.3". If we upload the new triple, however, the original link "gamma" is not removed, instead we end up with the segment being linked to both "gamma" and "gamma.3". The database doesn’t magically know that the relation "clade" should always be one-to-one. So if we want only the clade "gamma.3", we will need to delete the original clade.
The following SPARQL command will do what we want:
PREFIX f: <https://flu-crew.org/term/>
DELETE {
?gid f:epi_id "EPI_123456" .
?gid f:clade "gamma" .
}
This deletion script can be written to a file, delete-clade.rq
, and executed
on the local database with octofludb update delete-clade.rq
.
This example expands on the prior example and introduces a more powerful technique for deleting data.
In the August OFFLU report process, we updated the octoFLU reference files to better represent the global H3 clades. If we directly upload the new H3 clades to octofludb, then we will end up with duplicate records. So we must first delete all prior non-USA H3 clades. The SPARQL deletion statement for this is shown below.
PREFIX f: <https://flu-crew.org/term/>
DELETE {
?gid f:clade ?clade .
?gid f:gl_clade ?gl_clade .
}
WHERE {
?sid f:strain_name ?strain_name .
?sid f:country/f:code ?country .
?sid f:host "swine" .
FILTER (?country != "USA") .
?sid f:subtype ?subtype .
FILTER REGEX(?subtype, "H3") .
?sid f:has_segment ?gid .
?gid f:segment_name "HA" .
OPTIONAL { ?gid f:clade ?clade . }
OPTIONAL { ?gid f:gl_clade ?gl_clade . }
}
The WHERE clause defines a set of triples and the DELETE clause specifies which of these triples should be removed.
This command should be written to a file (say, delete-global-H3-clades.rq
)
and then can be applied to the local database with the command:
octofludb update delete-global-H3-clades.rq
It is generally good practice to check before we delete. For example, if we
accidentally added the triple ?sid f:has_segment ?gid
to the DELETE clause,
then we would delete the links from strains to segments. An easy way to check
what will be deleted is to create the corresponding query statement:
PREFIX f: <https://flu-crew.org/term/>
SELECT *
WHERE {
....
}
With the WHERE clause being the same as before. The calling octofludb query
to get a table showing all cases where clade info will be deleted.
There are three basic steps to uploading data to the database.
The first is to prepare the raw data in the "proper" way.
The second step is to convert the raw data to the turtle-format GraphDB
understands. This is done through the mk_*
commands in octofludb
. These
commands do more than just translate the data, they use specialized parsers to
extract flu specific information from the data. For example, if
The third step is to upload the turtle files to the database. If you are
working with a local database, this step is easy: octofludb upload
<turtle-fiels>
.