Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix explicit dataset (FROM and FROM NAMED clauses) #2794

Merged
merged 14 commits into from
Jul 31, 2024

Conversation

apicouSP
Copy link
Contributor

Summary of changes

WHAT

FROM or FROM NAMED clause: redefine entirely the query's RDF dataset

When a query (SELECT, CONSTRUCT, ASK, DESCRIBE) is using a FROM or FROM NAMED clause, we redefine entirely the query's RDF dataset
Include only the graphs in FROM clauses in the query's default graph
Include only the graphs in the FROM NAMED clauses in the query's named graphs

As a consequence, when a user defines a FROM clause in his query but does not define FROM NAMED, then the named graphs are considered empty set.
And vice versa: if a user defines a FROM NAMED clause but does not define a FROM clause, then the default graph is considered empty.
I added tests specifically for that
That's my interpretation of this

Also since the RDF Dataset is entirely redefined when using FROM or FROM NAMED clauses then for those queries the parameter SPARQL_DEFAULT_GRAPH_UNION will be ignored

Load external graphs only if they don't already exist

Try to load external graphs only if they don't already exist in the given ConjunctiveGraph

As a consequence, you don't have to set SPARQL_DEFAULT_GRAPH_UNION to False if all the graphs mentioned in FROM and FROM NAMED clauses already exist in the given ConjunctiveGraph

WHAT it is not

  • I didn't change the behavior of triples duplicated in different named graphs when those named graphs are merged into the query's default graph. Would need further discussion
  • I didn't change the behavior or the updates (DELETE and INSERT). Since code is independent, I will probably make another PR for that

WHY:

So queries behave closer to W3C spec

Issues:

This PR fixes this issue (confirmed)
And also that issue as far as I understand
This also solves partially this discussion

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Checked that all tests passes .
  • [] type checking: black and flake8 do not agree on how to format some lines, I chose to take black's suggestions
  • If the change has a potential impact on users of this project:
    • Added or updated tests that fail without the change.
    • Updated relevant documentation to avoid inaccuracies. Just following W3C spec, which link is already in the docs.
    • Considered adding additional documentation.
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

apicouSP and others added 7 commits May 28, 2024 16:40
When using a FROM or FROM NAMED clause: redefine entirely the query's RDF dataset.
Include only the graphs in FROM clause in the query's default graph
Include only the graphs in the FROM NAMED clause in the query's named graphs
Try to load external graphs only if they don't already exist in the given ConjunctiveGraph
… test_dataset_inclusive

Since ConjunctiveGraph has been deprecated.
Also define if dataset is inclusive or exclusive at Dataset init instead of using global param SPARQL_DEFAULT_GRAPH_UNION
@apicouSP
Copy link
Contributor Author

I now use Dataset instead of ConjunctiveGraph as this one is now deprecated.
I also fixed style and type errors in new tests. Now black, ruff and mypy should pass.
Let me know if you need something to merge this!
Sorry I made too much commits.

@nicholascar nicholascar self-requested a review June 21, 2024 09:04
@coveralls
Copy link

Coverage Status

coverage: 91.058% (+0.03%) from 91.03%
when pulling 8ed095c on apicouSP:fix-explicit-dataset
into 0ecc400 on RDFLib:main.

@coveralls
Copy link

coveralls commented Jul 24, 2024

Coverage Status

coverage: 90.646% (+0.02%) from 90.627%
when pulling 3c034c9 on apicouSP:fix-explicit-dataset
into d7b2d25 on RDFLib:main.

@ashleysommer
Copy link
Contributor

Thank you @apicouSP for this Pull Request.
This looks like a good change to bring RDFLib SPARQL executor more in line with the SPARQL 1.1 specs.
Note, I've fixed all the issues that were preventing our CI pipelines from working properly, so I can see now your tests are passing.

I'm not super familiar with the RDFLib SPARQL executor (I don't know if any current maintainers are) so this will take a bit longer than normal to review this, but we'll try to get it merged before our upcoming RDFLib v7.1 release.

@nicholascar
Copy link
Member

@recalcitrantsupplant can you please review this?

@ashleysommer
Copy link
Contributor

I'm happy to merge this as-is now, and remove references to ConjunctiveGraph across the whole codebase in a separate PR.

rdflib/plugins/sparql/sparql.py Outdated Show resolved Hide resolved
self.graph = Graph()
for d in datasetClause:
if d.default:
self.graph += graph.get_context(d.default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like its pretty costly to copy every triple from each FROM graph into a new Graph. Could it be done by adding the existing graph to a new Dataset (or ConjunctiveGraph), and setting default_union on that store?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand well your suggestion, this was previously done in the load function:
self.graph += self.dataset.get_context(source) # type: ignore[operator]
Exactly the same way.
So that's just something that I didn't change.
I will see if I can optimize that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I see now this of course isn't something you added, but it was just to this codepath.
There is no need to include an optimization for that in this PR, because it could introduce an unintended change in behaviour to SPARQL processing.

I am planning on doing a large sprint of work on optimising the SPARQL processor after RDFLib v7.1.0 is released.
I suspect efficiency problems (exactly like this one) are behind the odd extremely slow execution users are seeing in issues like:
#2147
and
#1775
and
#787

@ashleysommer
Copy link
Contributor

ashleysommer commented Jul 31, 2024

@apicouSP
Currently tests failing after the latest update:

py39-docs: commands[3]> poetry run python -m mypy --show-error-context --show-error-codes --junit-xml=test_reports/3.9-ubuntu-latest-mypy-junit.xml
rdflib/plugins/sparql/sparql.py: note: In member "__init__" of class "QueryContext":
rdflib/plugins/sparql/sparql.py:286: error: Incompatible types in assignment (expression has type "ConjunctiveGraph", variable has type "Optional[Dataset]") [assignment]
Found 1 error in 1 file (checked 398 source files)

rdflib/plugins/sparql/sparql.py Outdated Show resolved Hide resolved
@ashleysommer ashleysommer merged commit 5876266 into RDFLib:main Jul 31, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug when using FROM statements
4 participants