Create python module to validate processed data #18

Closed
wants to merge 28 commits into from
Changes from 15 commits
8c1dd85
add modularized validation
horstf Apr 13, 2022
d181201
add shape to test
horstf Apr 13, 2022
fc37a5e
add old dodo.py code from ont_dev branch (have to update)
horstf Apr 13, 2022
68aed79
update requirements and environment
horstf Apr 13, 2022
f49e040
add __init__.py for modularization of code
horstf May 3, 2022
644a4c9
rename emodul_validation.py to validation.py to imporve python modula…
horstf May 3, 2022
f752d17
add validation to dodo.py
horstf May 3, 2022
c4dd67b
add shape as dep to dodo.py
horstf May 3, 2022
230ac90
move and rename shape
horstf May 3, 2022
458da5e
merge main
horstf Jul 26, 2022
1e3697e
move validation into own lebedigital submodule
horstf Jul 26, 2022
f35255b
validation didnt need additional submodule structure
horstf Jul 26, 2022
379c107
add comments
horstf Jul 26, 2022
2a3e61c
add comments
horstf Aug 4, 2022
b9b3175
remove superflous comments
horstf Aug 4, 2022
c150d15
remove unused file from concrete
horstf Sep 5, 2022
8b25331
add files to test validation
horstf Sep 7, 2022
da2cebf
add validation tests
horstf Sep 7, 2022
de56554
add validation tests
horstf Sep 7, 2022
1a49eab
remove empty lines
horstf Sep 7, 2022
26e0925
add validation tests
horstf Sep 7, 2022
9e92d70
undo changes in requirements.txt
horstf Sep 8, 2022
3133735
undo changes in requirements.txt
horstf Sep 8, 2022
f0d889a
Merge remote-tracking branch 'origin/main' into validation_modulariza…
horstf Oct 7, 2022
0d501bc
add some dodo code
horstf Oct 7, 2022
3cd8a72
add output code
horstf Oct 7, 2022
a3917f9
add output code
horstf Oct 7, 2022
2678023
update dodo
horstf Jan 16, 2023
95 changes: 95 additions & 0 deletions lebedigital/validation.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
from pyshacl import validate
from rdflib import Graph, URIRef, Namespace
from rdflib.util import guess_format
from rdflib.namespace import SH, RDF

SCHEMA = Namespace('http://schema.org/')


def test_graph(rdf_graph: Graph, shapes_graph: Graph) -> Graph:
    """
    Tests an RDF graph against a SHACL shapes graph.

    Parameters
    ----------
    rdf_graph
        An rdflib Graph object containing the triples to test.
    shapes_graph
        An rdflib Graph object containing the shapes to test against.

    Returns
    -------
    result_graph
        An rdflib Graph object containing the SHACL validation report (which contains no validation results if no SHACL shapes were violated).
    """
    conforms, result_graph, _ = validate(
        rdf_graph,
        shapes_graph,
        ont_graph=None,  # can use a Web URL for a graph containing extra ontological information
        inference='none',
        abort_on_first=False,
        allow_infos=False,
        allow_warnings=False,
        meta_shacl=False,
        advanced=False,
        js=False,
        debug=False)

    # only add other graphs if any violations occurred
    if not conforms:
        # also add nodes from data and shacl shapes to graph to be able to search backwards for the violated shapes
        result_graph += shapes_graph
        result_graph += rdf_graph

    return result_graph

def violates_shape(validation_report: Graph, shape: URIRef) -> bool:
    """
    Returns true if the given shape is violated in the report.

    Parameters
    ----------
    validation_report
        An rdflib Graph object containing a validation report from the test_graph function.
    shape
        A URIRef object containing the URI of a shape.

    Returns
    -------
    True, if the specified shape appears as violated in the validation report, False otherwise.
    """
    # get the class that is targeted by the specified shape
    target_class = validation_report.value(shape, SH.targetClass, None, any=False)
    if target_class is None:
        raise ValueError(f'The shapes graph does not contain a {shape} shape.')


    # get all classes that have been violated
    # check if any of the violated classes is the class that is targeted by the specified shape
    # return any((True for o in validation_report.objects(None, SH.focusNode) if target_class in validation_report.objects(o, RDF.type)))
    for o in validation_report.objects(None, SH.focusNode):
        if target_class in validation_report.objects(o, RDF.type):
            return True

    # no violated class is targeted by the specified shape, thus the shape is not violated
    return False


def read_graph_from_file(filepath: str) -> Graph:
    """
    Reads a file containing an RDF graph into an rdflib Graph object.

    Parameters
    ----------
    filepath
        The path to the file containing the graph.

    Returns
    -------
    graph
        The rdflib Graph object containing the triples from the file.
    """
    with open(filepath, 'r') as f:
        graph = Graph()
        graph.parse(file=f, format=guess_format(filepath))
        return graph
20 changes: 19 additions & 1 deletion usecases/Concrete/dodo.py
@@ -1,8 +1,10 @@
import graphlib
import os
from pathlib import Path
from knowledgeGraph.emodul import validation

baseDir = Path(__file__).resolve().parents[0]
emodulFolder = os.path.join(os.path.join(os.path.join(baseDir,'knowledgeGraph'),'emodul'),'E-modul-processed-data')
emodulFolder = os.path.join(os.path.join(baseDir,'knowledgeGraph'),'emodul')
emodulRawdataFolder = os.path.join(emodulFolder,'rawdata')
emodulProcesseddataFolder = os.path.join(emodulFolder,'processeddata')
emodulYAMLmetadataFolder = os.path.join(emodulFolder,'metadata_yaml_files')
@@ -11,8 +13,19 @@
compressionRawdataFolder = os.path.join(compressionFolder,'rawdata')
compressionProcesseddataFolder = os.path.join(compressionFolder,'processeddata')

graph_path = os.path.join(emodulProcesseddataFolder, 'EM_Graph.ttl')
shapes_path = os.path.join(emodulFolder, 'shape_ym.ttl')

DOIT_CONFIG = {'verbosity': 2}

def validate_graph(graph_path, shapes_path):
    g = validation.read_graph_from_file(graph_path)
    s = validation.read_graph_from_file(shapes_path)
    r = validation.test_graph(g, s)
    assert validation.violates_shape(r, validation.SCHEMA.InformationBearingEntityShape)
    assert not validation.violates_shape(r, validation.SCHEMA.SpecimenDiameterShape)
    assert not validation.violates_shape(r, validation.SCHEMA.SpecimenShape)

def task_installation():
    yield {
        'basename': 'install python packages',
@@ -71,6 +84,11 @@ def task_emodul():
    # 'basename': 'validate rdf files against shacl shape',
    # 'actions': ['python knowledgeGraph/emodul/emodul_validation.py']
    # }
    yield {
        'basename': 'validate rdf files against shacl shape',
        'actions': [(validate_graph, [graph_path, shapes_path])],
        'file_dep': [graph_path, shapes_path]
    }
    yield {
        'basename': 'run emodul query script',
        'actions': ['python knowledgeGraph/emodul/emodul_query.py'],
Empty file.
78 changes: 78 additions & 0 deletions usecases/Concrete/knowledgeGraph/emodul/validation.py
@@ -0,0 +1,78 @@
from pyshacl import validate
Member

Not sure what that file does, but all files should either be in lebedigital (the actual functions), in tests (the tests for these functions) or in minimumWorkingExample (the pydoit workflow that processes the data for the minimum working example). The folder usecases/Concrete is deprecated.

Collaborator Author

That file is not used anymore; I think that was an accidental commit.

Member

This file is still in Concrete, please remove it?

from rdflib import Graph, URIRef, Namespace
from rdflib.util import guess_format
from rdflib.namespace import SH, RDF

"""
baseDir0 = Path(__file__).resolve().parents[0]
baseDir1 = Path(__file__).resolve().parents[1]
baseDir2 = Path(__file__).resolve().parents[2]
ontologyPath = os.path.join(baseDir2,'ConcreteOntology')
metadataPath = os.path.join(baseDir0,'E-modul-processed-data/emodul_metadata.csv')
graphPath = os.path.join(baseDir0,'E-modul-processed-data/EM_Graph.ttl')
processedDataPath = os.path.join(baseDir0,'E-modul-processed-data')
"""

SCHEMA = Namespace('http://schema.org/')

"""
Given a path to a shacl shape and a path to an rdf file, this function tests the rdf data against the specified shacl shapes.
The result is an rdflib graph containing the validation report, if it is empty the validation was successful.
"""
def test_graph(rdf_graph: Graph, shapes_graph: Graph) -> Graph:

    conforms, result_graph, _ = validate(
        rdf_graph,
        shapes_graph,
        ont_graph=None,  # can use a Web URL for a graph containing extra ontological information
        inference='none',
        abort_on_first=False,
        allow_infos=False,
        allow_warnings=False,
        meta_shacl=False,
        advanced=False,
        js=False,
        debug=False)

    # only add other graphs if any violations occurred
    if not conforms:
        # also add nodes from data and shacl shapes to graph to be able to search backwards for the violated shapes
        result_graph += shapes_graph
        result_graph += rdf_graph

    return result_graph

"""
Returns true if the given shape is violated in the report.
"""
def violates_shape(validation_report: Graph, shape: URIRef) -> bool:

    # get the class that is targeted by the specified shape
    target_class = validation_report.value(shape, SH.targetClass, None, any=False)
    if target_class is None:
        raise ValueError(f'The shapes graph does not contain a {shape} shape.')


    # get all classes that have been violated
    # check if any of the violated classes is the class that is targeted by the specified shape
    for o in validation_report.objects(None, SH.focusNode):
        if target_class in validation_report.objects(o, RDF.type):
            return True

    # no violated class is targeted by the specified shape, thus the shape is not violated
    return False

"""
Reads a graph from a file into a Graph object.
"""
def read_graph_from_file(filepath: str) -> Graph:
    with open(filepath, 'r') as f:
        graph = Graph()
        graph.parse(file=f, format=guess_format(filepath))
        return graph


# assert that certain violations occurred / did not occur:
# assert violates_shape(g, SCHEMA.InformationBearingEntityShape)
# assert not violates_shape(g, SCHEMA.InformationBearingEntityShape)

3 changes: 2 additions & 1 deletion usecases/Concrete/knowledgeGraph/requirements.txt
@@ -8,5 +8,6 @@ SPARQLWrapper==1.8.5
requests==2.22.0
Member

This file is also outdated; the conda environment at the top level is the only one to be used.

Collaborator Author

We can ignore these commits then; all the libraries I need are already part of the environment file.

Collaborator Author

I've also deleted the validation.py from Concrete.

Collaborator Author

horstf Sep 5, 2022

@joergfunger For the tests, is there a graph that I can use to test my code? The usecases/MinimumWorkingExample/emodul/processed_data folder is empty, but I would need an EM_Graph.ttl file or similar to test the validation. Otherwise, I can also create my own from the data.

Member

Could you talk to @PoNeYvIf about that? For the tests, it would also be possible (and maybe even better) to create your own very simple test cases (they do not have to be related to a very specific ontology, but rather to a very general one such as foaf or prov). In particular, make sure that the test data contains errors so that these can be returned and processed. We could also have a short phone call about that.

Collaborator Author

For example, if we know that we test for machine IDs via SHACL, we could fetch from the global KG the machine IDs that are referred to in the local KG. Of course, this would mean that every time we add a rule, we have to think about the SPARQL query and about the information that might exist globally but not locally. I don't know if this is feasible.

Collaborator

Yes, let's try to create the queries. Let me know when it's okay for you.

Member

I'm not sure I would always download the complete global graph. Rather, it would be necessary to check whether particular instances that are referred to in the local graph exist (e.g. the testing machine that has already been created, or the mix_design we refer to when adding a new Young's modulus test). So for the specific tests, we would first generate the mix design and upload it to the global KG. Then we would create a Young's modulus test that references an instance of the mix design (without creating the same instance again). Uploading that to the KG then automatically connects the two data sets (mix design and Young's modulus test). And before uploading the Young's modulus KG, we would have to check whether this instance of the mix design already exists in the global KG (and potentially further properties apart from the id, but that is not necessary right now).
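The pre-upload check described above can be sketched as a simple set difference. This is only an illustration: `find_missing_references` is a hypothetical helper, the ids are made up, and plain Python lists stand in for ids that would in practice be extracted from the local graph and queried from the global KG.

```python
from typing import Iterable, Set


def find_missing_references(referenced_ids: Iterable[str],
                            global_ids: Iterable[str]) -> Set[str]:
    """Return the referenced instance ids that the global KG does not contain.

    Hypothetical helper, not part of lebedigital.
    """
    return set(referenced_ids) - set(global_ids)


# Made-up ids for illustration only.
referenced = ['mix_design_1', 'testing_machine_7']  # referred to in the local KG
existing = ['mix_design_1', 'mix_design_2']         # known to the global KG

missing = find_missing_references(referenced, existing)
if missing:
    print(f'Do not upload yet, missing in global KG: {sorted(missing)}')
```

An empty result would mean that every referenced instance already exists globally, so the local KG could be uploaded without creating duplicates.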

Member

joergfunger Oct 11, 2022

As for the SPARQL query, we already know during local KG generation which classes we expect. So the SPARQL query should return the ids of all instances of a given class. That should be quite general and would not require rewriting the SPARQL query each time we have a new test.
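One way to keep the query general, as suggested above, is to parameterize it by the class IRI only. The following is a minimal sketch under stated assumptions: `build_instances_query` and the example class IRI are hypothetical and not part of this PR, and the function only builds the query string, leaving execution to whichever SPARQL client the project uses.

```python
import re

# Characters that must not appear inside a SPARQL IRI reference (<...>).
_INVALID_IRI_CHARS = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def build_instances_query(class_iri: str) -> str:
    """Build a SPARQL query that selects all instances of the given class.

    Hypothetical helper, not part of lebedigital.
    """
    # reject empty or malformed IRIs so nothing else can be injected into the query
    if not class_iri or _INVALID_IRI_CHARS.search(class_iri):
        raise ValueError(f'Not a valid IRI: {class_iri!r}')
    return (
        'PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n'
        'SELECT DISTINCT ?instance WHERE {\n'
        f'  ?instance rdf:type <{class_iri}> .\n'
        '}'
    )


# The class IRI below is made up for illustration.
query = build_instances_query('http://example.org/MixDesign')
print(query)
```

With this approach a new test only needs a different class IRI; the query template itself stays unchanged.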

Collaborator Author

Maybe we should have a talk together with @firmao to discuss the specifics?

GitPython==3.1.24
probeye==1.0.6
pyshacl==0.9.5
pyaml==21.10.1
doit==0.33.1
doit==0.33.1