Skip to content

Latest commit

 

History

History
244 lines (155 loc) · 15 KB

sep_010.md

File metadata and controls

244 lines (155 loc) · 15 KB

SEP 010 -- simplify description of sequence features and sub-parts

SEP 10
Title simplify description of sequence features and sub-parts
Authors Raik Gruenberg <raik.gruenberg at gmail com>
Editor James McLaughlin
Type Data Model
SBOL Version 3.0
Replaces
Status Accepted
Created 20-Sep-2016
Last modified 31-Aug-2019
Issue #25

Abstract

There are two very different types of 'part annotation'. (1) part composition relationships -- These always point to an existing (and presumably re-usable) sub-component. Sequence location or indeed sequence information may or may not be available. (2) Classic sequence feature annotations -- As known from the genbank format, these only apply to clearly specified sequence regions but often do not point to meaningful sub-parts.

Currently, Component alone is sufficient to describe sub-part relations without any sequence information. This however is the exception in synthetic biology practice. Both SequenceAnnotation and Component are needed for SBOL encoding of actual genetic designs with parts and sub-parts (because Component lacks a location field). Conversely, simple sequence features can be described using SequenceAnnotation alone (as of SBOL 2.0) but this possibility is not widely known and additional Component and ComponentDefinition are often created instead.

We propose to modify Component and SequenceAnnotation such that Component is solely responsible for the description of part - subpart relationships (with or without sequence) and SequenceAnnotation is solely responsible for the description of genbank-style sequence features. SequenceAnnotation should be renamed to SequenceFeature.

Table of Contents

1. Rationale

1.1 current situation

The current SequenceAnnotation class has a dual purpose:

(1) Its primary role is to specify the location of "sub-parts" within the sequence of a parent ComponentDefinition. To this end, SequenceAnnotation links one or more Location records with a Component. The Component, in turn, refers to a ComponentDefinition (via its definition field). This ComponentDefinition is the description of the actual sub-part. Actual physical composition is therefore defined like this:

ComponentDefinition -[sequenceAnnotation]-> SequenceAnnotation -[component]-> Component -[definition]-> ComponentDefinition

The directionality (which one is parent and which one is a sub-part) is frequently confused. Moreover, the parent ComponentDefinition also directly links to the sub-part Component via a component field. This is necessary so that composition can be described before any sequences (and thus sequence locations) are known. An additional chain of references is therefore needed, in parallel to the one shown above:

ComponentDefintion -[component]-> Component -[definition]-> ComponentDefinition

Current SBOL 2.1 part - subpart relations are summarized in the following figure: SBOL 2.1 part - subpart relations

Adding to this redundancy, both Component and SequenceAnnotation may have role properties that diverge from the role (functional classification) of the target ComponentDefinition. Whether a diverging role is attached to Component or SequenceAnnotation is an arbitrary choice. This invites conflicting implementations and interpretations of this field.

Evidently, SBOL makes the description of "undefined", "loose bag", part composition without any sequence information relatively easy. By contrast, the description of actual genetic designs with actual sequences is surprisingly complex and redundant. This is unfortunate because the latter is, by and far, the overwhelming use case of SBOL. It also hinders adoption by sequence-level designers and tool developers.

(2) The secondary role of SequenceAnnotation is to simply annotate regions of interest within a given sequence. Arguably, this should be its primary role (hence the name) as it is a very common use case in practice. A SequenceAnnotation without component can be created and linked to a region of, e.g., DNA. SequenceAnnotation inherits name and description fields from Identified and is therefore sufficient for the description of "flat" sequence features. In practice however, most tools mix SequenceAnnotation and Component even for simple sequence features:

SBOL 2.1 sequence feature description

1.2 Goals of the proposal

proposed simplified data model

(1) Restrict the use of SequenceAnnotation to annotations of features which do not fall into the part - subpart category. As a welcome side effect, this should make it much easier to move back and forth between SBOL and large bodies of existing genbank-formatted information and related software.

(2) Simplify the part-subpart relationship via Component so that it does not any longer require SequenceAnnotation.

(3) Create a syntactic parallel between sequence/physical and functional part-subpart relations in SBOL -- The Component class will be equivalent in syntax and meaning to the existing Participation class. For programmers, the pattern ComponentDefinition -> Component(role) -> ComponentDefinition will look and feel like the already established pattern Interaction -> Participation(role) -> ComponentDefinition.

(4) Remove ambiguity as to how things can / should be expressed at the sequence layer to aid meaningful data exchange.

2. Specification

2.1 Add location field to Component

Add the following optional field to Component:

  • [0..n] location pointing to a Location on the parent ComponentDefinition sequence; if location is missing, this indicates a part / sub-part relationship for which sequence details have not (yet) been determined.

The Location record(s) specified by a Component are subject to the same restrictions currently in place for SequenceAnnotation Location. Concretely, two Location records attached to the same Component MUST NOT overlap in their range as it would not be clear what that means. The Location of two separate Components may overlap.

2.2 Rename SequenceAnnotation to SequenceFeature

  • rename class SequenceAnnotation to SequenceFeature
  • rename sequenceAnnotation field of ComponentDefinition to sequenceFeature

2.3 Restrict SequenceFeature to sequence feature annotation

Remove the following fields from SequenceFeature (formerly SequenceAnnotation):

  • component -- SequenceAnnotation is not any longer used for part - subpart relations
  • roleIntegration -- there is no sub-part/definition that role fields may be in conflict with

Update the specification to clarify usage of existing fields:

  • [0..n] role pointing to a SequenceOntology term (optional), corresponds to genbank type field
  • [0..1] name corresponding to genbank name field (optional but now RECOMMENDED)
  • [0..1] description corresponds to genbank description field (optional)

Moreover, a validation rule is needed: SequenceFeature can only be used if an actual sequence record is specified for the parent ComponentDefinition.

2.3 Let SequenceConstraint point to SequenceFeature

SequenceConstraint.object and SequenceConstraint.subject can point to either ComponentInstance derivatives (as before) or to SequenceFeature.

This change allows to anchor constraints on sequence regions that are not actually sub-parts. Examples may be start / stop codons, transcription start sites or specific mutations.

3. Example or Use Case

Example use cases for the modified SequenceAnnotation are feature annotations such as START or STOP codons, mutations, highlighting regions referred to in a paper, sequence conflicts, etc, all mainly intended for human consumption. Over the evolution of a design, sequence features may later be formalized into re-usable subparts (i.e. 'Component's) It is therefore conceivable that a sequence editor reads in a genbank file with many sequence features and offers the user the easy conversion of some of those features into sub-parts. This, in fact, is a workflow already used and supported by the Benchling Sequence editor (http://benchling.com).

4. Backwards Compatibility

4.1 suggested transition path

Implement all changes at once in SBOL v 3.0.

4.2 Conversion of SBOL 2.x records to 3.0:

  1. remove intermediate SequenceAnnotation and move SequenceAnnotation.location to Component
  2. optionally, try to flatten SequenceAnnotation - Component - ComponentDefinition chains of trivial annotations into SequenceFeature records

4.3 Backwards conversion of SBOL 3.x to 2.x:

  • conversion of localized Component:

    (1) create SequenceAnnotation record pointing to subpart Component

    (2) move location from Component to SequenceAnnotation

  • conversion of non-localized Component:

    no change required

  • conversion of SequenceFeature

    (1) rename SequenceFeature to SequenceAnnotation

    (2) rename ComponentDefinition.sequenceFeature field to sequenceAnnotation

5. Discussion

5.1 author comments

As an added benefit, the proposed change creates a symmetry between the sequence and the functional layer of SBOL: Component is now the equivalent of Participation. The former describes a physical part- subpart relation whereas the latter describes a functional part - subpart relation. Both specify one or more role properties, both point to a (sub)ComponentDefinition. Currently, this parallel is obfuscated by the multiple direct and indirect references between parent and sub-part ComponentDefinition.

5.2 discussion at COMBINE

  • It was pointed out that SequenceAnnotation already can have its own name and description fields as it is derrived from Identified. The SEP was changed accordingly.
  • Renaming SequenceAnnotation to SequenceFeature was universally considered a good idea (for symmetry with genbank, bioinformatics practics and in order to avoid confusion with "Annotation" in SBOL and SBML).

5.3 SequenceConstraints

  • At COMBINE, it was suggested that SequenceConstraint should also be allowed to point to SequenceFeature. This would avoid construction of Component - ComponentDefinition chains for, e.g. mutations or other simple features that are not sub-parts but nevertheless restrict/orient the positioning of other Components. This change has been incorporated into the SEP.

  • Originally, this link SequenceConstraint -> SequenceFeature link was restricted to the SequenceConstraint.object field. This was meant to enforce that Components (sub-parts) can be anchored to sequence features but not the other way round. However, the types of constraints allowed assume that the directionality of a SequenceConstraint can be freely chosen. We can say that part A preceeds part B but we cannot say that part A "follows" part B. In SBOL, the latter is expressed as "part B preceeds part A" (i.e. subject and object of the SequenceConstrain are reversed). For this reason, both object and subject of the constraint need to be allowed pointing to SequenceFeature.

  • At COMBINE, it was suggested to rename SequenceConstraint into ComponentConstraint -- this should be put into a separate SEP.

  • It was suggested to put additional restrictions on the use of Component.location so that a fully specified sequence can more easily be pieced together from Component sequences. This raises an important issue with the current data model, which does not allow an easy distinction between partially defined and fully specified sequences. However, the editors consider this as an orthogonal problem which should be adressed separately.

5.4 Transition path

  • The original SEP (see github history) suggested a step-wise introduction starting with SBOL 2.2. This would have led to a hybrid data model where both usage patterns could co-exist and was eventually considered too complex. Instead, the SEP is considered as a clean backward-incompatible change for SBOL v 3.0.

5.5 Related SEPs

The following SEPs make complementary suggestions for further simplification of the SBOL data model:

  • SEP 15 (Issue) -- rename Component -> SubPart and ComponentDefinition -> Component
  • SEP 25 (Issue) -- merge Module(Definition) with Component(Definition) and remove FunctionalComponent
  1. Competing SEPs

None.

References

Copyright

CC0
To the extent possible under law, SBOL developers has waived all copyright and related or neighboring rights to SEP 010. This work is published from: United States.