Author: Matt Henderson
Revised: April 2015
Contributors: Shane Canon, Dan Gunter, Gavin Price, KBase team
- Sharing/granularity
  - The granularity of sharing is a workspace, not an object. This causes confusion at some levels of the system. The Narrative UI has hidden the concept of a workspace from the user entirely, which solves part of the problem.
- Versioning
  - Narratives are versioned together with biological objects
- References
  - Shock vs. workspace vs. central store: there are different types of references, each with its own behavior due to the different backends. It would be easier to have a unified set of semantics (see the sketch below).
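One way to unify the semantics would be to express every reference in a single URI-style format and route it through one resolver, regardless of backend. The sketch below is purely hypothetical; the kbase:// scheme, the backend names, and the common get() method are illustrative assumptions, not existing KBase APIs.

```python
# Hypothetical sketch: one URI-style reference format dispatched to the right
# backend, instead of separate Shock, workspace, and central-store semantics.
from urllib.parse import urlparse

class UnifiedResolver:
    def __init__(self, backends):
        # backends: dict mapping a backend name ("shock", "ws", "cds") to a
        # client exposing a common get(object_id) method (assumed interface).
        self.backends = backends

    def resolve(self, ref):
        parsed = urlparse(ref)                 # e.g. kbase://shock/abc-123
        backend = self.backends.get(parsed.netloc)
        if backend is None:
            raise ValueError("unknown backend in reference: %s" % ref)
        return backend.get(parsed.path.lstrip("/"))

# Usage (with hypothetical clients):
#   resolver = UnifiedResolver({"shock": shock_client, "ws": ws_client})
#   data = resolver.resolve("kbase://ws/MyWorkspace/3/1")
```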
  - Permissions are "relaxed" on references. If an object is referenced by a downstream object (e.g., an annotation is used to construct a model), then that referenced object is implicitly shared whenever the downstream object is shared. So even if the upstream object is kept private, it can be accessed via the downstream object. This is intentional, because it was meant to support provenance, but it may be unclear to the user that this is the case.
  - An example of where "relaxed" permissions could be a problem: imagine User A has a reads data set that is really important to them. They assemble, annotate, and build a model from it, then share the model with a colleague (User B) to validate that the model is good. If User B does some iteration on that model and publishes, the world could potentially access the raw data set, which User A may never have intended. (A sketch of the access rule appears below.)
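To make the implicit-sharing rule concrete, here is a minimal sketch of reference-path permission resolution over a toy in-memory data model; the function names, the acl map, and the refs_of map are all hypothetical, not the actual workspace implementation.

```python
# Hypothetical sketch of "relaxed" permissions: a user may read an object if
# they have direct access, OR if the object is reachable by following
# references from any object that is shared with them.

def can_read(user, obj, acl, refs_of):
    """acl: dict mapping object -> set of users with direct read access.
    refs_of: dict mapping object -> objects it references (upstream deps)."""
    if user in acl.get(obj, set()):
        return True
    # Relaxed rule: access to a downstream object grants access to everything
    # it references, transitively (this is what supports provenance).
    return any(user in users and reachable(obj, shared, refs_of)
               for shared, users in acl.items())

def reachable(target, start, refs_of):
    """True if `start` references `target`, directly or transitively."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(refs_of.get(node, []))
    return False

# With acl = {"model": {"userB"}} and
# refs_of = {"model": ["annotation"], "annotation": ["reads"]},
# can_read("userB", "reads", acl, refs_of) is True even though the reads
# object was never shared directly.
```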
  - References are one-way, in the sense that an object does not store information about the objects that reference it. This is due to the way references are implemented. Navigating in the "reverse" of the original reference direction would sometimes be useful, and the next system should implement references in such a way that this is possible without examining the entire tree (one option is sketched below).
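One way to support reverse navigation is to maintain a reverse index that is updated whenever a reference is written; a minimal in-memory sketch, with the storage layer left abstract and all names illustrative:

```python
from collections import defaultdict

class ReferenceIndex:
    """Keeps forward and reverse reference maps in sync, so answering
    "what references this object?" is a single lookup rather than a scan."""

    def __init__(self):
        self._forward = defaultdict(set)   # obj -> objects it references
        self._reverse = defaultdict(set)   # obj -> objects referencing it

    def add_reference(self, source, target):
        self._forward[source].add(target)
        self._reverse[target].add(source)

    def referenced_by(self, obj):
        # Reverse navigation without walking the whole reference tree.
        return set(self._reverse[obj])

# idx = ReferenceIndex()
# idx.add_reference("Genome/7/1", "ContigSet/3/1")
# idx.referenced_by("ContigSet/3/1")   # -> {"Genome/7/1"}
```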
  - Copies of data from a workspace shared with other people may not be "complete" from the perspective of the recipient, because workspace objects may reference things outside the workspace, including the central store or Shock. Some examples:
    - If the reference is to a Shock node, that node is not owned by me, and the owner can in fact delete it, in which case I will never again have access to that data.
    - If the reference is a Shock reference rather than a handle reference, the Shock node is actually not shared with me and I can't even access it.
    - If the reference is an actual workspace reference and the data is dependent (say, for example, Genome and ContigSet), you will not get both a ContigSet object and a Genome object when you copy a Genome. You could make the argument that you may not need the ContigSet object, but even if you wanted it you actually can't get it. You could copy the ContigSet independently of the Genome, but those two objects (the Genome copy and the ContigSet copy) can never reference one another, even though they are byte-equivalent to the original objects, which do reference one another.
In the end, this means that the semantics of a copy are different here from what people are accustomed to. A workspace copy means that you get a new top-level object with the same contents, including all references; this could be considered a shallow copy. You could also consider a deep copy, where all dependent data objects are replicated as well. It is currently not possible to get a deep copy of a workspace object unless you do it manually (and, with Shock and other external references, that may be non-trivial). Although under the hood it makes perfect sense not to duplicate bytes, especially for large data, the next system should implement copies so that they are functionally equivalent to the originals. (A deep-copy sketch follows.)
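A deep copy could be implemented as a traversal that replicates each dependent object and rewrites references to point at the new copies. The sketch below assumes four hypothetical storage primitives (get, references, rewrite, save) and that references form a DAG, as workspace references do.

```python
def deep_copy(obj_ref, store, copies=None):
    """Recursively copy obj_ref and everything it references, rewriting the
    references so the copies point at each other instead of the originals.

    Assumed (hypothetical) primitives on `store`:
      get(ref) -> object             references(obj) -> list of refs in obj
      rewrite(obj, mapping) -> obj   save(obj) -> new ref
    """
    if copies is None:
        copies = {}                       # original ref -> copied ref
    if obj_ref in copies:
        return copies[obj_ref]            # shared dependency: copy it once
    obj = store.get(obj_ref)
    # Copy dependencies first so their new refs are known.
    mapping = {dep: deep_copy(dep, store, copies)
               for dep in store.references(obj)}
    new_ref = store.save(store.rewrite(obj, mapping))
    copies[obj_ref] = new_ref
    return new_ref

# deep_copy("Genome/7/1", store) would yield a Genome copy whose rewritten
# reference points at a fresh ContigSet copy, so the pair stays self-contained.
```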
- Types
  - ownership/namespacing
  - extensibility and evolution
  - polymorphism
- CDMI API doesn’t work well for accessing large data objects
  - Search attempted to use it to extract all the genome and feature data from the central store for indexing purposes; dumping the tables turned out to be a more reliable and practical option. Future data APIs and implementations need to work more effectively for accessing large genomes in particular (e.g., via paged access, sketched below).
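For instance, a future API could expose paged iteration over a genome's features rather than returning the whole object in one call. The endpoint name and parameters below are assumptions for illustration, not an existing API.

```python
def iter_features(api, genome_id, page_size=1000):
    """Stream the features of a large genome in fixed-size pages instead of
    materializing the whole genome in a single response.

    `api.get_features(genome_id, offset, limit)` is a hypothetical endpoint
    returning a list of at most `limit` features starting at `offset`.
    """
    offset = 0
    while True:
        page = api.get_features(genome_id, offset=offset, limit=page_size)
        if not page:
            return
        for feature in page:
            yield feature
        offset += len(page)
```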
- Reference data updates
- Search
  - Validation of raw data
  - Uniformity across data sources
- Multiple copies of the data
  - Due to various difficulties in the practical usage of components of the system, the reference data in particular is copied several times. This is obviously not optimal, so future systems should try to minimize this kind of duplication. Copies exist in:
    - central store
    - WS copy
    - WS indexable copy
    - flat files (Solr)
- Lack of subset capability
  - One always needs to get the entire object. This is particularly painful when a large object like the Genome is being used as a container for smaller pieces of data that are used exclusively by other functions. The future system should make it natural to select only the part of the object that's needed (a path-based sketch follows).
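A natural shape for this is path-based subsetting, where the caller names only the fields it needs. The sketch below does the selection client-side for clarity; a real implementation would push the selection down into storage. All names are hypothetical.

```python
def get_subset(store, ref, paths):
    """Return only the requested paths of an object, e.g.
    get_subset(store, "Genome/7/1", ["features/0/location", "scientific_name"]).

    `store.get(ref)` is assumed to return the full object as nested
    dicts/lists; the path syntax uses "/" with integer segments for lists.
    """
    obj = store.get(ref)
    result = {}
    for path in paths:
        node = obj
        for key in path.split("/"):
            node = node[int(key)] if isinstance(node, list) else node[key]
        result[path] = node
    return result
```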