Kabob reactome august #2

Open
wants to merge 554 commits into
base: master
from

@ekwhite ekwhite commented Aug 28, 2018

Add rules for parsing basic Reactome entities from BioPAX to KaBOB ICE.
Entities include:
Continuants: proteins, small molecules, physical entities, DNAs, RNAs, and complexes
Occurrents: biochemical reactions, template reactions, degradations, and pathways; controls, template reaction regulations, and pathway steps.
When these rules are working, there's a lot more ICE to generate.
Watch for rules that return triples with a missing forward slash (http:/ rather than http://) in the URIs of existing BioPAX entities. Here's an example with one in the first (subject) position:
http:/www.reactome.org/biopax/65/48887#TemplateReactionRegulation8 <http://purl.obolibrary.org/obo/IAO_0000142> <http://ccp.ucdenver.edu/kabob/ice/R_hQXlqE4kme-3HAiVOIU_ZXgx9rU> .
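
A quick way to spot such triples in rule output is to scan each line for `http:/` that is not followed by a second slash. The following is a minimal sketch, not part of the KaBOB code base; the function name and the second sample triple (including its `R_example` identifier) are made up for illustration.

```python
import re

# "http:/" or "https:/" NOT followed by a second "/" (negative lookahead).
SINGLE_SLASH = re.compile(r"https?:/(?!/)")


def lines_with_missing_slash(lines):
    """Return (line_number, line) pairs containing a URI scheme with only one slash."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if SINGLE_SLASH.search(line)
    ]


sample = [
    # The faulty triple quoted in the description above.
    "http:/www.reactome.org/biopax/65/48887#TemplateReactionRegulation8 "
    "<http://purl.obolibrary.org/obo/IAO_0000142> "
    "<http://ccp.ucdenver.edu/kabob/ice/R_hQXlqE4kme-3HAiVOIU_ZXgx9rU> .",
    # A well-formed triple (hypothetical identifiers, for contrast only).
    "<http://www.reactome.org/biopax/65/48887#Protein1> "
    "<http://purl.obolibrary.org/obo/IAO_0000142> "
    "<http://ccp.ucdenver.edu/kabob/ice/R_example> .",
]

for number, line in lines_with_missing_slash(sample):
    print(f"line {number}: {line}")
```

Only the first sample line is reported, because its subject URI starts with `http:/www` rather than `http://www`.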

bill-baumgartner and others added 30 commits June 21, 2017 19:40
… to step_b_ontology_to_bio

Because the package transfers the hierarchies from the ontologies into
bio world.
The scripts have been segmented into allegrograph-specific,
virtuoso-specific, and common-scripts folders. The
allegrograph-specific scripts are ready to try now. More work is
required for the virtuoso scripts. At some point the scripts should be
refactored, as there is a fair amount of repetition.
Although they are downloaded as OWL files, they are converted to the
N-Triples format prior to loading
Must be in the base directory so it can find project.clj
bill-baumgartner and others added 30 commits May 3, 2018 13:28
Mainly in the GGP abstraction hierarchy
To be used to model interactions in general. The INO incorporates
aspects of MI, but has a continuous hierarchy to the upper level
interaction concept. Use of MI_0000 has been replaced by INO_0000002.
Replaced usage of ccp ns with kbio for rules that create bio-entities
…ization

This is a more natural place to run these rules, also for practical
purposes missing RNAs and proteins need to be generated before the
biogrid rules run
The Protein Ontology standardized namespaces for all external
identifiers to the obo namespace. This removed the need for the step_aa
rules, which have been moved to the deprecated folder. In their place
there are some new rules that make exact match statements for pr
identifiers that don’t match what is produced by the file parser
machinery, e.g. NCBIGene_1 in pr vs NCBI_GENE_1 produced by the file
parsers.
Added pseudogene, protein-coding gene, and biological region types
based on NCBI gene gene_info data. Removed old rules that were typing
genes as RNAs. These will be replaced with rules in step_hcb:
generating missing ggp entities.
Updated code to handle :body blocks that are SPARQL strings
Removed handling for the :sparql-string block
Revised generation to use static node URIs to prevent redundant assertions
from different ontology files expressing similar knowledge
Removed deprecated rules
Replaces use of :sparql-string in rules with :body
Added ordering to taxon rules to prevent duplicate taxon restrictions
from being generated
Replaced use of to_be_integrated/ and under_review/ rule directories
with under_construction/
Updated GO MF KR to use realizes relation
Replaced usage of ccp ns with kice and kbio where appropriate
1) Implementation of a suite of static rule tests to identify common errors during rule composition. Among other things, rules are checked for variable alignment between the head, reify, and body blocks. Note: the use of the :sparql-string keyword has been discontinued. The :body keyword can now be either a SPARQL string or a list of triples using Livingston’s DSL.
To run the static rule tests:
lein test :only kabob.build.static-rule-tests
Alternatively, you can run the tests individually:
lein test :only kabob.build.static-rule-tests/test-rule-structure
lein test :only kabob.build.static-rule-tests/test-whitespace-padded-names
lein test :only kabob.build.static-rule-tests/test-rules-known-syms
lein test :only kabob.build.static-rule-tests/test-rules-have-meta
lein test :only kabob.build.static-rule-tests/test-duplicate-names
lein test :only kabob.build.static-rule-tests/test-rules-forward-safe
lein test :only kabob.build.static-rule-tests/test-rule-heads-for-expected-property-namespace
lein test :only kabob.build.static-rule-tests/test-rules-for-missing-slashes-in-variables

2) Implementation of a suite of validation rules to check for representation faults within a KaBOB instance. These rules add no new triples to the KB except for the four triples of rule metadata recorded when a rule is run. The validation rules are written such that zero hits is the expected result. If a rule returns any results, that indicates a representational issue within KaBOB that needs to be addressed.

3) Excluded redundant restrictions using a new strategy that hashes the content of blank nodes to represent them. Redundant restrictions are created by importing ontologies where duplicate information is represented using blank nodes, e.g. restrictions with identical onProperty and someValuesFrom values. Because they use blank nodes, they are imported as unique entities when they should instead be collapsed.

4) The rule directories have been renamed using _0_ and _1_ prefixes to more accurately encode the run order.

5) The GO MF representation was changed from MF-->has_participant-->Protein to MF<--realizes--[anonymous-process]--has_participant-->Protein

6) NCBI Taxonomy taxonomic rank concepts are now excluded from BioWorld

7) Handling was added to the Stardog build pipeline to allow for the use of named graphs, so each triple is placed in a graph named after its source file (which for rule output is named after the rule that was run to generate the triples)

8) The kice namespace is now used in the identifier set generation code (the ccp namespace had remained in use accidentally)
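
The named-graph handling in item 7 can be pictured as turning each N-Triples line into an N-Quad whose fourth term is a graph URI derived from the source file name. This is only an illustrative sketch, not the Stardog pipeline itself; the `GRAPH_BASE` namespace, the function names, and the example file path are assumptions.

```python
from pathlib import PurePosixPath

# Hypothetical graph namespace; the real pipeline's naming scheme is not shown here.
GRAPH_BASE = "http://example.org/graph/"


def graph_uri_for(source_path):
    """Name the graph after the source file, without directory or extension."""
    return GRAPH_BASE + PurePosixPath(source_path).stem


def to_nquads(ntriple_lines, source_path):
    """Append the graph URI to every triple, skipping blank and comment lines."""
    graph = graph_uri_for(source_path)
    quads = []
    for line in ntriple_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not stripped.endswith("."):
            raise ValueError(f"not a complete N-Triples statement: {stripped!r}")
        # Drop the terminating dot, add the graph term, then restore the dot.
        quads.append(f"{stripped[:-1].rstrip()} <{graph}> .")
    return quads


triples = ["<http://example.org/s> <http://example.org/p> <http://example.org/o> ."]
for quad in to_nquads(triples, "rule-output/example-rule.nt"):
    print(quad)
```

For rule output, the file (and therefore the graph) would be named after the rule that generated the triples, as described in item 7.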
* Fixed redundant OWL constructs when importing ontology blank nodes
* Added links from UniProt isoforms to their ‘canonical’ protein using variant_of
* Added labels to every (I think) exhaustive subclass that is created by the kabob rules

Note 1: There are still some nodes missing labels. Some are discontinued (NCBI) or withdrawn (HGNC) records that are linked by other sources. But many are genes/proteins from species other than human that are being brought in as part of PPIs, e.g. a human protein and a mouse protein are known to interact. The next build will restrict PPIs to just human-human interactions (I thought this was already the case but evidently it was not).

Note 2: There are still some redundant restriction classes. When a restriction defined in an ontology is also defined by one of the KaBOB rules, e.g. only_in_taxon restrictions, there will be two copies because of the way the URIs are currently generated. I’ll work towards collapsing these in future releases.
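
The hashing strategy from item 3, and the reason Note 2's duplicates arise when URIs are generated differently, can be sketched as follows. This is a simplified illustration and not the KaBOB implementation: the `BASE` namespace, the choice of SHA-1, and the function names are assumptions. The idea is that a restriction's URI is derived only from its content, so two blank nodes carrying the same property and filler collapse to one node.

```python
import hashlib

# Hypothetical namespace for the generated restriction nodes.
BASE = "http://example.org/restriction/"


def restriction_uri(on_property, some_values_from):
    """Derive a stable URI from the restriction's content rather than its blank-node id."""
    # A space is a safe separator because IRIs cannot contain spaces.
    key = f"{on_property} {some_values_from}"
    return BASE + "R_" + hashlib.sha1(key.encode("utf-8")).hexdigest()


def collapse(restrictions):
    """Map each blank-node id to its content-derived URI."""
    return {
        bnode: restriction_uri(prop, filler)
        for bnode, (prop, filler) in restrictions.items()
    }


OBO = "http://purl.obolibrary.org/obo/"
imported = {
    # Two ontology files each state "part_of some nucleus" with their own blank node.
    "_:b1": (OBO + "BFO_0000050", OBO + "GO_0005634"),
    "_:b2": (OBO + "BFO_0000050", OBO + "GO_0005634"),
    # A different filler (cytoplasm) must stay distinct.
    "_:b3": (OBO + "BFO_0000050", OBO + "GO_0005737"),
}

mapping = collapse(imported)
print(len(set(mapping.values())), "distinct restriction nodes")
```

Here `_:b1` and `_:b2` map to the same URI while `_:b3` gets its own. Collapsing only works if every producer of restrictions, ontology import and rules alike, derives the URI the same way.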
Changes from May and July 2018 releases
3 participants