
Users Guide

Carl Kesselman edited this page Oct 2, 2017 · 19 revisions

DERIVA-py provides a streamlined interface to ERMRest, the DERIVA asset catalog. Under the covers, ERMRest is a RESTful web service with a rich, resource-focused API. DERIVA-py provides access to that interface from Python.

The deriva-py interface has been designed to integrate DERIVA into the Python ecosystem, for example with pandas and CSV tools, and to enable navigation through interlinked entity/relationship (ER) models, with the ability to retrieve and update both entities and relationships. The library also contains features to explore the underlying data model, and has been designed to work with command completion in interactive Python sessions to facilitate interactive exploration of a DERIVA data model.

The ERMRest web service API is rich and supports many different features. DERIVA-py was created to streamline access to the most commonly used of these features. If a feature is not directly exposed by deriva-py, it can still be accessed through deriva-py by providing the underlying ERMRest resource specification.

Assets and the ERMrest Entity/Relation Model

DERIVA is built around the idea that an investigation will consist of many different types of data assets that may have complex relationships with one another. For example, in a gene expression experiment, we might have multiple tissue samples from an organism with a specific genetic sequence, each with histology (i.e. image) data. We might take multiple pieces of tissue from each sample, run multiple RNA-Seq experiments on each (known as replicates), and then analyze each of those to determine expression levels for a set of genes. So in addition to having the RNA-Seq, imaging, sequence, and expression data, we need to know how all of these data files, which we call assets, are related to one another. In DERIVA, we can represent these connections using an Entity/Relationship (ER) model in which an entity such as an Experiment is related to other entities, such as RNA-Seq-Run, or to assets, such as an image file.

In DERIVA, we refer to the ER model that describes the connections as a data model. Data models are stored in a web service called ERMRest. Within ERMRest, each entity and relation is a web resource and we can create, query, update, or delete elements of the data model, or specific entities or relations within that model via web operations.

A frequent need is to find a set of entities that are related to a known entity. For example, we may want to find all of the RNA-Seq files that are associated with a specific experiment, which requires finding all of the RNA-Seq runs, then all of the replicates for each run, and then all of the RNA-Seq files for each replicate. Alternatively, for a given RNA-Seq file, we might want to find the images that we have for the sample from which the RNA came. One of the powerful features of ERMRest is that it can introspect the data model and, in many cases, locate the desired entities by simply specifying the path through the model, e.g.

RNA-Seq-File -> RNA-Seq-Replicate -> RNA-Seq-Run -> Experiment -> Images

Another powerful feature of ERMRest is that it can navigate through a relationship in either direction. So

Images -> Experiment -> RNA-Seq-Run -> RNA-Seq-Replicate -> RNA-Seq-File

works as well.
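
As a rough illustration of why both orderings are acceptable, the sketch below models the example path in plain Python. It does not use deriva-py or ERMRest; the `FOREIGN_KEYS` list and the helper names are hypothetical, and the child-to-parent direction of each foreign key is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical toy model: each pair is a foreign key from a referring
# entity to the entity it references (directions are assumed).
FOREIGN_KEYS = [
    ("RNA-Seq-File", "RNA-Seq-Replicate"),
    ("RNA-Seq-Replicate", "RNA-Seq-Run"),
    ("RNA-Seq-Run", "Experiment"),
    ("Images", "Experiment"),
]


def build_links(foreign_keys):
    """Index every relationship under both endpoints so it can be walked either way."""
    links = defaultdict(set)
    for referring, referenced in foreign_keys:
        links[referring].add(referenced)
        links[referenced].add(referring)
    return links


def is_valid_path(path, links):
    """Return True if every consecutive pair of entities shares a relationship."""
    return all(b in links[a] for a, b in zip(path, path[1:]))


links = build_links(FOREIGN_KEYS)
forward = ["RNA-Seq-File", "RNA-Seq-Replicate", "RNA-Seq-Run", "Experiment", "Images"]
backward = forward[::-1]

print(is_valid_path(forward, links))   # True
print(is_valid_path(backward, links))  # True
print(is_valid_path(["RNA-Seq-File", "Images"], links))  # False: no direct relationship
```

Because each relationship is indexed under both of its endpoints, the direction in which the foreign key was declared does not restrict the direction in which the path is written.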

ERMRest Resources

All data in ERMRest is organized into a four-level hierarchy. At the top is a catalog, which is a container that holds the ER model, all entity instances, and other related information. An ERMRest server may house multiple catalogs. Each catalog is identified by a catalog number, which is an integer.

Each catalog can contain one or more schemas. A schema contains information that describes an ER model, including the types of entities, the attributes associated with each entity, which attributes describe relationships, and other information. Schemas allow us to organize entities within a catalog and provide modularity.

Within a schema, we can define one or more entities by identifying the entity's name and the set of attributes that are associated with the entity. An entity definition is like a table in a SQL database, with columns being the attributes. We can refer to a specific entity within a schema using the notation schema.entity.

Finally, an ERMRest model defines a set of relationships. Relationships in ERMRest are typically explicitly declared using foreign_key declarations. ERMRest uses these declarations to introspect the model and streamline navigation. The endpoints of a relationship do not have to be contained in a single schema, and may refer to an entity in a different schema using the schema.table notation.
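
To make the hierarchy concrete, here is a minimal sketch of the four levels (catalog, schema, entity, relationship) as plain Python dataclasses. None of these classes are part of deriva-py or ERMRest; the class names, the `lab` and `imaging` schemas, and the attribute lists are all hypothetical and exist only to illustrate the containment described above.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity definition: a name plus its attributes (like a table and its columns)."""
    name: str
    attributes: list


@dataclass
class Schema:
    """A named group of entity definitions."""
    name: str
    entities: dict = field(default_factory=dict)

    def define(self, name, attributes):
        self.entities[name] = Entity(name, list(attributes))
        return self.entities[name]


@dataclass
class Catalog:
    """Top-level container, identified by an integer catalog number."""
    number: int
    schemas: dict = field(default_factory=dict)
    # Each relationship is stored as (referring "schema.entity", referenced "schema.entity").
    foreign_keys: list = field(default_factory=list)

    def lookup(self, qualified_name):
        """Resolve a 'schema.entity' name to its entity definition."""
        schema_name, entity_name = qualified_name.split(".", 1)
        return self.schemas[schema_name].entities[entity_name]


catalog = Catalog(number=1)
catalog.schemas["lab"] = Schema("lab")
catalog.schemas["imaging"] = Schema("imaging")
catalog.schemas["lab"].define("Experiment", ["id", "title"])
catalog.schemas["imaging"].define("Image", ["id", "experiment", "url"])

# A relationship whose endpoints live in two different schemas.
catalog.foreign_keys.append(("imaging.Image", "lab.Experiment"))

print(catalog.lookup("lab.Experiment").attributes)  # ['id', 'title']
```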

Entity Sets

Unlike SQL database queries, an ERMRest query always returns an Entity Set. There are never duplicates. In any ERMRest request, there is a set of entity attributes that is used to uniquely identify each instance of the entity. Often, these will be just the attributes used to define the entity in the schema, in which case the entity is already unique. However, in cases where you alter the attributes to be used, this can have some unexpected consequences if you are not careful:

  • If you return a subset of the attributes (i.e. project), ERMRest will use the remaining identifying attributes to create a set and will not include multiple copies of the same attribute values. So if I have A,B,1 and A,B,2 and only use the first two columns, ERMRest will return a single entity: A,B.
  • If you add an attribute to an entity as part of a query, and you have more than one value (e.g. A,B,1 and A,B,2), ERMRest will turn this into a set based on the identifying attributes, and a selection of the first value of the remaining attributes. So for this example, ERMRest may return A,B,1 OR A,B,2, unless the result is explicitly sorted.

We note that the behavior of ERMRest is the same as in SQL if every SELECT statement always included a DISTINCT ON expression.
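
The two cases above can be sketched in a few lines of plain Python. This is only a model of the set semantics described in this section, not ERMRest or deriva-py code; the `project` helper and the column names `x`, `y`, `z` are hypothetical.

```python
def project(rows, key_columns, extra_columns=()):
    """Form a set keyed on key_columns, keeping the first value seen for the extras.

    This mimics SELECT DISTINCT ON (key_columns) key_columns, extra_columns.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen[key] = key + tuple(row[c] for c in extra_columns)
    return list(seen.values())


# The A,B,1 and A,B,2 rows from the text.
rows = [
    {"x": "A", "y": "B", "z": 1},
    {"x": "A", "y": "B", "z": 2},
]

# Case 1: projecting onto the first two columns collapses the rows into one.
print(project(rows, ["x", "y"]))                 # [('A', 'B')]

# Case 2: adding a multi-valued attribute keeps whichever value is seen first,
# so the result depends on row order unless the input is explicitly sorted.
print(project(rows, ["x", "y"], ["z"]))          # [('A', 'B', 1)]
print(project(rows[::-1], ["x", "y"], ["z"]))    # [('A', 'B', 2)]
```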

ERMRest Paths

The primary method for specifying an entity set in ERMRest is to specify a path through the ER model, starting with an initial entity, and then specifying a subsequent sequence of entities that can be reached from the starting point by traversing relationships in the model. In many cases, ERMRest can figure out the relationship by model introspection, and no additional information is required beyond the sequence of entities. ERMRest also lets you steer the navigation through the ER model, by telling it which relationship to follow, or by specifying a subset of instances of a specific relationship to follow.

Aliasing

Describe aliases

Identifying Entities With Path Expressions

The simplest form of ERMRest query is one in which the final entity set in a path expression is returned. This is called an entity query.

Another common requirement is to return an entity set that contains a subset of attributes in the final entity in a path. This is called an attribute query and is indicated by a 'select' member that lists a subset of the attributes of the final entity. In this case, an entity set is formed from the specified attributes, removing any duplicates.

The final select can also include attributes from entities that appear along the path. It is important to remember that the set will be formed based on the attributes of the last entity in the path, and just one of the values of the other attributes will be used. The number of elements in the entity set is determined only by the number of unique combinations of values for the attributes of the last entity in the path. This can lead to unexpected results unless you construct your path carefully.

Describe the idea of navigating an ERMRest path here. .next()/.link()

Identifying Entities with Group Expressions

The final type of ERMRest query is called an attributegroup query. Unlike an attribute query, which always uses the attributes of the last specified entity to determine set membership, an attributegroup query allows one to specify attributes from different entities to be used to determine set membership. So if we had A->B->select(A.foo,B.bar), we would return a set with one row for each unique value of B.bar, the attribute of the last entity in the path, while A->B->group(A.foo,B.bar)->select(A.foo,B.bar) will return a set with one row for each unique combination of A.foo,B.bar.

An attributegroup expression behaves much like a FROM clause in a SQL statement combined with a DISTINCT ON in the SELECT clause.
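
The difference between the two query types can be sketched in plain Python, following the rule stated earlier that an attribute query forms its set from the attributes of the last entity in the path (B in the path A->B). This is an illustrative model only: the `distinct_on` helper and the joined rows are hypothetical, and nothing here calls ERMRest or deriva-py.

```python
def distinct_on(rows, key_columns, output_columns):
    """Keep the first row seen for each unique combination of key_columns."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        seen.setdefault(key, tuple(row[c] for c in output_columns))
    return list(seen.values())


# Hypothetical rows produced by following the path A->B.
joined = [
    {"A.foo": 1, "B.bar": "x"},
    {"A.foo": 2, "B.bar": "x"},
    {"A.foo": 2, "B.bar": "y"},
]
columns = ["A.foo", "B.bar"]

# Attribute query: membership is decided by the last entity's attribute (B.bar),
# so only one of the two A.foo values paired with "x" survives.
attribute_result = distinct_on(joined, ["B.bar"], columns)
print(attribute_result)  # [(1, 'x'), (2, 'y')]

# Attributegroup query: membership is decided by the listed group attributes.
group_result = distinct_on(joined, ["A.foo", "B.bar"], columns)
print(group_result)      # [(1, 'x'), (2, 'x'), (2, 'y')]
```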

API Functions

UPDATE THIS WHEN WE FINALIZE FUNCTIONS

We will now describe the deriva-py interface to ERMRest in detail. Before doing so, the following high-level summary will be useful:

  1. Always expect a set. In cases where the project part of the interface converts multi-valued attributes into single values, be careful, as you will get arbitrary values.
  2. We have two ways to specify what the set contains: path expressions and group expressions. In a path expression, it is always a set of the last thing in the path (or a subset of the attributes of the last thing in the path). In a group expression, the set is determined by the attributes listed.
  3. You construct path expressions using .link() operators and groups using .join(). You are not allowed to mix the two.
  4. In both cases, you can steer the relationships by using an optional ON= parameter. In both cases, you can name the entity using the optional AS= parameter.

Generating a query

# Create an alias for the entity.
.as(alias-name)

# Append an entity to the current path.
# Optional AS parameter creates an alias for the resulting entity set.
# Optional ON parameter specifies a filter that selects or restricts the relationship to be used.
.next(EntityAlias, AS=alias-name, ON=filter)

# Specify columns (project) for the path.
# If no group parameter is specified, the last entity in the path is used to create the set.
# If a group parameter with an entity alias is specified, that entity set is used to create the group.
# If the group parameter is a column list, those attributes are used to define the set.
# Any attributes specified by the group parameter are included in the entity set, along with any additional values in the column list.
.select(column-list, group=column-list|entity-alias)

Filter Expressions

When evaluating a path expression, ERMRest introspects the schema and uses built-in heuristics to navigate through the model. Let's consider an experiment that has two measurements, which we will call before and after.

If we wanted to collect up all of the measurement data for all experiments we could write

Experiment.next(measurement)

However, we might just want all the before measurements. We can get this result by using a filter expression to tell ERMRest to restrict the entity set:

Experiment.next(measurement, ON=(Experiment.before_measurement))

In addition, we might want only the before measurements that are images. For that we could write:

Experiment.next(measurement, ON=(Experiment.before_measurement and Measurement.type == 'image'))
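
The effect of these three expressions can be modeled with a small in-memory example. The sketch below is plain Python, not the deriva-py API; the sample data, the `next_measurements` helper, and its `via` and `where` parameters are hypothetical stand-ins for choosing a relationship and adding a value filter.

```python
# Hypothetical data: each experiment references a before and an after measurement.
experiments = [
    {"id": "E1", "before_measurement": "M1", "after_measurement": "M2"},
    {"id": "E2", "before_measurement": "M3", "after_measurement": "M4"},
]
measurements = {
    "M1": {"id": "M1", "type": "image"},
    "M2": {"id": "M2", "type": "image"},
    "M3": {"id": "M3", "type": "sequence"},
    "M4": {"id": "M4", "type": "image"},
}
FK_COLUMNS = ("before_measurement", "after_measurement")


def next_measurements(experiments, via=FK_COLUMNS, where=lambda m: True):
    """Follow the chosen foreign-key columns and keep measurements that pass `where`."""
    found = {}
    for experiment in experiments:
        for column in via:
            measurement = measurements[experiment[column]]
            if where(measurement):
                found[measurement["id"]] = measurement  # a set: no duplicates
    return sorted(found)


# All measurements, reached through either relationship.
all_ids = next_measurements(experiments)
print(all_ids)  # ['M1', 'M2', 'M3', 'M4']

# Only the "before" relationship.
before_ids = next_measurements(experiments, via=("before_measurement",))
print(before_ids)  # ['M1', 'M3']

# The "before" relationship, restricted to image measurements.
before_image_ids = next_measurements(
    experiments,
    via=("before_measurement",),
    where=lambda m: m["type"] == "image",
)
print(before_image_ids)  # ['M1']
```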

Connecting to an ERMRest Server

