Kabob reactome august #2

ekwhite · 2018-08-28T19:45:10Z

Add rules for parsing basic Reactome entities from BioPAX into KaBOB ICE.
The entities include:
Continuants: proteins, small molecules, physical entities, DNAs, RNAs, and complexes
Occurrents: biochemical reactions, template reactions, degradations, pathways; controls, template reaction regulations, and pathway steps.
Once these rules are working, there is a lot more ICE to generate.
Watch for rules that return triples with a missing forward slash (http:/ rather than http://) in the URIs of existing BioPAX entities. Here's an example with the malformed URI in the first position:
http:/www.reactome.org/biopax/65/48887#TemplateReactionRegulation8 <http://purl.obolibrary.org/obo/IAO_0000142> <http://ccp.ucdenver.edu/kabob/ice/R_hQXlqE4kme-3HAiVOIU_ZXgx9rU> .
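
One way to catch this fault in rule output is to scan the generated N-Triples for a scheme followed by a single slash. The following is only a minimal sketch: the function names are hypothetical, and it assumes the output is available as plain text lines with http or https URIs.

```python
import re

# Assumed fault pattern: "http:/" or "https:/" NOT followed by a second slash.
SINGLE_SLASH = re.compile(r"\bhttps?:/(?!/)")


def has_missing_slash(triple_line: str) -> bool:
    """Return True if any URI in the line uses 'http:/' instead of 'http://'."""
    return SINGLE_SLASH.search(triple_line) is not None


def find_bad_triples(lines):
    """Yield (line_number, line) for every line that contains a malformed URI."""
    for number, line in enumerate(lines, start=1):
        if has_missing_slash(line):
            yield number, line
```

Running `find_bad_triples` over a rule's output file would list the offending triples with their line numbers, which makes it easier to trace them back to the rule that produced them.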

The scripts have been split into allegrograph-specific, virtuoso-specific, and common-scripts folders. The allegrograph-specific scripts are ready to try now. The virtuoso scripts need more work. At some point the scripts should be refactored, as there is a fair amount of repetition.

The Protein Ontology standardized namespaces for all external identifiers to the obo namespace. This negated the need for the step_aa rules which have been moved to the deprecated folder. In their place there are some new rules that make exact match statements for pr identifiers that don’t match what is produced by the file parser machinery, e.g. NCBIGene_1 in pr vs NCBI_GENE_1 produced by the file parsers.
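
To illustrate the kind of exact match statement described above, here is a hedged sketch that turns a PR-style identifier into a triple linking it to the file-parser form. The predicate (skos:exactMatch), the use of the obo and kabob/ice base URIs on each side, and the function name are assumptions for illustration; only the NCBIGene / NCBI_GENE pair comes from the text.

```python
OBO = "http://purl.obolibrary.org/obo/"
ICE = "http://ccp.ucdenver.edu/kabob/ice/"  # assumed namespace for parser-produced identifiers
EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"  # assumed predicate

# PR-style prefix -> file-parser-style prefix; only the pair named in the text.
PREFIX_MAP = {"NCBIGene": "NCBI_GENE"}


def exact_match_triple(pr_local_name):
    """Build an N-Triples exact-match statement, or None if the prefix is unknown."""
    prefix, separator, accession = pr_local_name.partition("_")
    if not separator or prefix not in PREFIX_MAP:
        return None
    parser_local_name = f"{PREFIX_MAP[prefix]}_{accession}"
    return f"<{OBO}{pr_local_name}> <{EXACT_MATCH}> <{ICE}{parser_local_name}> ."
```

Identifiers whose prefix already agrees with the parser output need no such statement, which is why the sketch returns None for anything outside the map.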

Removed deprecated rules
Replaced use of :sparql-string in rules with :body
Added ordering to taxon rules to prevent duplicate taxon restrictions from being generated
Replaced use of to_be_integrated/ and under_review/ rule directories with under_construction/
Updated GO MF KR to use realizes relation
Replaced usage of ccp ns with kice and kbio where appropriate

1) Implementation of a suite of static rule tests to identify common errors during rule composition. Rules are checked for variable alignment between the head, reify, and body blocks among other things. Note, the use of the :sparql-string keyword has been discontinued. The :body keyword can now be either a SPARQL string or a list of triples using Livingston’s DSL.
To run the static rule test:
lein test :only kabob.build.static-rule-tests
Alternatively, you can run the tests individually:
lein test :only kabob.build.static-rule-tests/test-rule-structure
lein test :only kabob.build.static-rule-tests/test-whitespace-padded-names
lein test :only kabob.build.static-rule-tests/test-rules-known-syms
lein test :only kabob.build.static-rule-tests/test-rules-have-meta
lein test :only kabob.build.static-rule-tests/test-duplicate-names
lein test :only kabob.build.static-rule-tests/test-rules-forward-safe
lein test :only kabob.build.static-rule-tests/test-rule-heads-for-expected-property-namespace
lein test :only kabob.build.static-rule-tests/test-rules-for-missing-slashes-in-variables
2) Implementation of a suite of validation rules to check for representation faults within a KaBOB instance. These rules are written such that they add no new triples to the KB except for the 4 triples associated with the rule metadata when a rule is run. The validation rules are written such that zero hits is the expected result. If there are >0 results, then that is an indication of a representational issue within KaBOB that needs to be addressed.
3) Excluded redundant restrictions using a new strategy involving hashing for representing blank nodes. Redundant restrictions are created by importing ontologies where duplicate information is represented using blank nodes, e.g. restrictions with identical hasProperty and someValuesFrom fillers, but b/c they use blank nodes, they are imported as unique entities when they should instead be collapsed.
4) The rule directories have been renamed using _0_ and _1_ prefixes to more accurately encode the run order.
5) The GO MF representation was changed from MF-->has_participant-->Protein to MF<--realizes--[anonymous-process]--has_participant-->Protein
6) NCBI Taxonomy taxonomic rank concepts are now excluded from BioWorld
7) Handling was added to the Stardog build pipeline to allow for the use of named graphs, so each triple is placed in a graph named after its source file (which for rule output is named after the rule that was run to generate the triples)
8) The kice namespace is now used in the identifier set generation code (the ccp namespace had remained in use accidentally)
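
The variable-alignment check mentioned in item 1 can be pictured roughly as follows. This is a sketch of the idea only, not the kabob.build.static-rule-tests code: it assumes a rule is a dictionary with head, reify, and body entries, that variables are written as "?name", and that a body is either a SPARQL string or a list of triples. All names here are hypothetical.

```python
import re

# Assumption: variables are written as "?name"; KaBOB's rule DSL may spell them differently.
VARIABLE = re.compile(r"\?\w+")


def pattern_variables(block):
    """Collect variables from a SPARQL string or a list of (s, p, o) string triples."""
    if isinstance(block, str):
        return set(VARIABLE.findall(block))
    return {term for triple in block for term in triple if term.startswith("?")}


def unbound_head_variables(rule):
    """Return head variables that are neither bound in the body nor created by reify."""
    bound = pattern_variables(rule["body"]) | set(rule.get("reify", []))
    return pattern_variables(rule["head"]) - bound
```

A rule passes this check when `unbound_head_variables` returns an empty set. Any variable it reports would appear in the head without a value at run time, which is the kind of composition error the static tests are meant to surface before a build.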

* Fixed redundant OWL constructs when importing ontology blank nodes
* Added links from UniProt isoforms to their ‘canonical’ protein using variant_of
* Added labels to every (I think) exhaustive subclass that is created by the kabob rules

Note 1: There are still some nodes missing labels. Some are discontinued (NCBI) or withdrawn (HGNC) records that are linked by other sources. But many are genes/proteins from species other than human that are being brought in as part of PPIs, e.g. a human protein and a mouse protein are known to interact. The next build will restrict PPIs to just human-human interactions (I thought this was already the case but evidently it was not).

Note 2: There are still some redundant restriction classes. When there is a restriction that is defined in an ontology that is also defined by one of the KaBOB rules, e.g. only_in_taxon restrictions, there will be two copies b/c of the way the URIs are currently generated. I’ll work towards collapsing these in future releases.
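
The duplicate restrictions in Note 2 arise when two code paths assign different URIs to the same content. A content-derived URI avoids that, as in the minimal sketch below. The base URI, the "R_" prefix, the key layout, and the function name are all hypothetical; SHA-1 is used only because the commit list mentions sha-1 reification code, and the actual KaBOB scheme may differ.

```python
import hashlib


def restriction_uri(on_property, some_values_from, base="http://example.org/restriction/"):
    """Derive a stable URI from the restriction's content rather than a blank-node ID."""
    # Hypothetical key layout: property and filler joined by a separator
    # that does not normally occur inside a URI.
    key = f"{on_property}|{some_values_from}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return f"{base}R_{digest}"
```

Because the URI depends only on the property and filler, the same restriction imported from an ontology and generated by a rule would collapse to one node, provided both paths use the same function and the same base.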

bill-baumgartner and others added 30 commits June 21, 2017 19:40

Changed the post_identifier_merge/step_b_ontology_to_ice rule package to step_b_ontology_to_bio

487a14c

Because the package transfers the hierarchies from the ontologies into bio world.

Added curly braces around variables in process-ontologies.sh

29af09d

Fixed ccp-extension-ontology URL after making the repo public

df9a435

Disabled all but rdf-file-list-generation in build-from-scratch-ag

0d821f0

Corrected path to INIT.sh

46b29b4

Activated ontology file loading

0698ef4

Add curly brace to variable

6d5dcbe

Fixed paths to other scripts being called

f5c66ae

Corrected format for ontology files

70869c9

Although they are downloaded as OWL files, they are converted to the ntriples format prior to loading

Initial commit of scripts to run the virtuoso build

915c7e4

Activated ontology loading

8dd7ca0

Activated a single rule run for testing purposes

5569eb2

Added LEIN_ROOT variable to avoid leiningen warning during runs

b5c26b4

Added cd /kabob.git for a leiningen run calls

4ca561f

Must be in the base directory so it can find project.clj

Added getopts for parameter handling

1cf6b79

The triple store container name is now accessed via a file placed in the load-request-directory

703a331

Reverted getopts addition

4ececfa

Added missing prefixes to queries

d97f8f1

Integration of virtuoso-specific code into the rule machinery

7863b46

Revised virtuoso kb initialization

287b821

Changed virtuoso jdbc dependency to version 4.0.0

4697127

Testing the virtuoso connection url

23a1174

Changed virtuoso dependencies to sesame 2.6.0 and jdbc 3.0.0

fe5433e

Changed jdbc dependency back to 4.0.0

e41ecb9

Goa bp rule revised to use record and field classes from the ccp extension ontology

48131c3

Further revisions to the go-bp rule to account for new identifier handling

7a7272e

Removed a comment

60e452c

Merge remote-tracking branch 'UCDenver-ccp/overhaul'

fb57f42

Merge branch 'overhaul' into UCDenver-ccp/overhaul

0d5c79b

bill-baumgartner and others added 30 commits May 3, 2018 13:28

Bug fix in go cc instance rule reify clause

dad2f87

Replaced incorrect usages of rdf:type with rdfs:subClassOf

d4ac7d3

Mainly in the GGP abstraction hierarchy

Incorporated the Interaction Network Ontology

997a2f5

To be used to model interactions in general. The INO incorporates aspects of MI, but has a continuous hierarchy to the upper level interaction concept. Use of MI_0000 has been replaced by INO_0000002.

Added kbio namespace

6b73387

Replaced usage of ccp ns with kbio for rules that create bio-entities

Add biogrid files as "other downloads"

f76445f

Moved gene abstraction gen rules (step i) into step h; step h reorganization

7ad0429

This is a more natural place to run these rules; also, for practical purposes, missing RNAs and proteins need to be generated before the biogrid rules run

Updated rules to use the new ccp-kbio namespace

6f867b0

Revised typing of genes by 'gene type'

48ff0f4

Added pseudogene, protein-coding gene, and biological region types based on NCBI gene gene_info data. Removed old rules that were typing genes as RNAs. These will be replaced with rules in step_hcb: generating missing ggp entities.

Migrated rules that type entities based on identifier type

4efb992

Moved them to a new directory

Add rules to generate missing entities for proteins and RNAs

436b09a

Updated to comply with new bio namespace

44ea34a

Update to comply with changes to ggp hierarchy

4989e1e

Test validation rule

13f6775

Add rule to assert direct subclass relations for all proteins to the protein root class

9117320

Add rules to break down biogrid ice-to-bio transformation

5508097

Added kice namespace

8da69b1

Replaced usage of ccp ns in reify clauses with kice/kbio where appropriate

e71fa87

Added kice ns to rule bodies

593ce8c

Created instance based located_in rule

c5574f4

Moved sha-1 reification code to the kr project

1726be4

Updated code to handle :body blocks that are SPARQL strings
Removed handling for the :sparql-string block

Revamped unit tests of the build procedure; added static rule tests

f150c9c

Update to the bnode-to-uri code

afb539a

Revised generation to static node URIs to prevent redundant assertions from different ontology files expressing similar knowledge

Merge pull request #34 from bill-baumgartner/master

3e8eb48

Changes from May and July 2018 releases

Add schema diagram

57ce01e

Merge pull request #35 from bill-baumgartner/master

6379536

Add schema diagram

add rules to extract basic entities in Reactome from BioPAX to ICE

c989991
