Hard skip empty datasets nodes in dataservices #3285
Thank you for taking a look at this! 🙏
udata/core/dataset/rdf.py
Outdated
@@ -751,6 +752,14 @@ def dataset_from_rdf(graph: Graph, dataset=None, node=None, remote_url_prefix: s

    dataset.title = rdf_value(d, DCT.title)
    if not dataset.title:
        external_rdf = rdf_value(d, FOAF.isPrimaryTopicOf)
I'm not sure this is a valid way of identifying datasets that are only referenced.
I think updating the "missing title on dataset" message with these considerations of why the title is missing may be better?
It may not be the only way, but it is a way. The only problem is if there is an isPrimaryTopicOf but that's not why there is no title? Otherwise we may show "missing title" for a referenced dataset in some cases, but that's not a big deal? (We can add more conditions later if we encounter them?)
udata/harvest/backends/dcat.py
Outdated
for node in page.subjects(RDF.type, DCAT.DataService):
    dataservice = page.resource(node)
    for dataset in dataservice.objects(DCAT.servesDataset):
        if dataset_node == dataset.identifier:
Are we sure it works if a dataset is properly described in the catalog but is also served by a dataservice?
No, it doesn't work in some cases (I was waiting for the tests and they fail…). I need to find a better way to exclude them…
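To make the failing case concrete, here is a minimal sketch on plain Python tuples (deliberately not rdflib and not the udata code, so it stays self-contained). The triple layout, the `ex:` node names and the helpers `served_by_any_dataservice` and `has_title` are assumptions for illustration only. It shows that a check based solely on servesDataset also flags a dataset that is fully described in the catalog.

```python
# Triples modelled as (subject, predicate, object) tuples; all names are hypothetical.
TRIPLES = [
    ("ex:service", "rdf:type", "dcat:DataService"),
    ("ex:service", "dcat:servesDataset", "ex:described"),
    ("ex:service", "dcat:servesDataset", "ex:referenced"),
    ("ex:described", "rdf:type", "dcat:Dataset"),
    ("ex:described", "dct:title", "A fully described dataset"),
    ("ex:referenced", "rdf:type", "dcat:Dataset"),  # no title: only referenced
]


def served_by_any_dataservice(triples, dataset_node):
    """Naive check: is the node the object of a servesDataset triple of a DataService?"""
    services = {
        s for s, p, o in triples if p == "rdf:type" and o == "dcat:DataService"
    }
    return any(
        s in services and p == "dcat:servesDataset" and o == dataset_node
        for s, p, o in triples
    )


def has_title(triples, node):
    """True when the node carries its own dct:title."""
    return any(s == node and p == "dct:title" for s, p, o in triples)


# Both datasets are served, so the naive check alone cannot tell them apart.
naive = {
    node: served_by_any_dataservice(TRIPLES, node)
    for node in ("ex:described", "ex:referenced")
}
```

In this toy graph the naive check is true for both nodes, so skipping on it would also drop `ex:described`; some extra condition is needed to separate the two.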
udata/harvest/backends/dcat.py
Outdated
resource = page.resource(node)

title = rdf_value(resource, DCT.title)
primary_topic = rdf_value(resource, FOAF.isPrimaryTopicOf)
I don't trust the isPrimaryTopicOf heuristic approach. I think all datasets described by GeoNetwork and transformed with the XSLT end up with this property, not only those in the servesDataset context.
See for example this random dataset DCAT output.
Ideally, we would like to explicitly exclude datasets that only appear as objects of servesDataset and hasPart?
Indeed, they are not directly bound to the harvested catalogue at this point, so we wouldn't want to harvest them if they only appear as references?
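One possible reading of this suggestion, sketched on plain tuples rather than rdflib: treat a node as reference-only when it is the object of servesDataset or hasPart but is not listed by the harvested catalogue. The assumption that the catalogue lists its datasets through `dcat:dataset`, the `dct:` prefix on hasPart, the `ex:` names and the helper `reference_only_nodes` are all hypothetical and not taken from the PR.

```python
# Predicates that merely point at a dataset without describing it (assumed prefixes).
REFERENCE_PREDICATES = {"dcat:servesDataset", "dct:hasPart"}

# Hypothetical page: the catalogue lists ex:a only; ex:b and ex:c are just referenced.
TRIPLES = [
    ("ex:catalog", "dcat:dataset", "ex:a"),
    ("ex:service", "dcat:servesDataset", "ex:a"),
    ("ex:service", "dcat:servesDataset", "ex:b"),
    ("ex:a", "dct:hasPart", "ex:c"),
]


def reference_only_nodes(triples, catalog):
    """Nodes that are referenced but not directly bound to the catalogue."""
    listed = {o for s, p, o in triples if s == catalog and p == "dcat:dataset"}
    referenced = {o for s, p, o in triples if p in REFERENCE_PREDICATES}
    return referenced - listed


skipped = reference_only_nodes(TRIPLES, "ex:catalog")
```

With this rule `ex:a` is kept even though a dataservice serves it, because the catalogue lists it directly. Whether real GeoNetwork output always links described datasets from the catalogue this way would need checking against the actual DCAT pages.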
I didn't find another way to filter these nodes. But I don't exclude all the nodes with isPrimaryTopicOf, only nodes without a title (so they would be skipped anyway). I don't think there are many real datasets that have isPrimaryTopicOf but no title?
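The condition described in this reply can be sketched as a small decision function. This is an illustration, not the code in the PR: the function name `classify_node` and the three outcome labels are hypothetical, and the inputs stand for the values read from DCT.title and FOAF.isPrimaryTopicOf.

```python
def classify_node(title, primary_topic):
    """Decide what to do with a dataset node (hypothetical labels).

    - "harvest": the node has a title, whatever its isPrimaryTopicOf value.
    - "hard_skip": no title but an isPrimaryTopicOf, assumed to be a bare reference.
    - "ignored": no title and no isPrimaryTopicOf, reported as missing title.
    """
    if title:
        return "harvest"
    if primary_topic:
        return "hard_skip"
    return "ignored"


results = [
    classify_node("Some title", "https://example.org/record"),
    classify_node(None, "https://example.org/record"),
    classify_node(None, None),
]
```

A titled GeoNetwork dataset that carries isPrimaryTopicOf is still harvested under this rule; the property only matters once the title is already missing.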
Skips the numerous datasets marked as ignored due to missing title when they are just referenced and not described.
Example harvest source with ~1977 ignored items, reaching the 2000 elements max on demo.
See the ISO output and DCAT result after XSLT of a problematic page.