Hard skip empty datasets nodes in dataservices #3285

Open
wants to merge 11 commits into
base: master

Contributor

@ThibaudDauce ThibaudDauce commented Mar 13, 2025

Skips the many dataset nodes that are marked as ignored because of a missing title when they are only referenced, not described.

Example harvest source with ~1977 ignored items, which reaches the 2000-element maximum on demo.
See the ISO output and DCAT result after XSLT of a problematic page.

@ThibaudDauce ThibaudDauce requested a review from maudetes March 13, 2025 10:50
Contributor

@maudetes maudetes left a comment

Thank you for taking a look at this! 🙏

@@ -751,6 +752,14 @@ def dataset_from_rdf(graph: Graph, dataset=None, node=None, remote_url_prefix: s

    dataset.title = rdf_value(d, DCT.title)
    if not dataset.title:
        external_rdf = rdf_value(d, FOAF.isPrimaryTopicOf)
Contributor

I am not sure this is a valid way of identifying datasets that are only referenced.
It may be better to update the missing-title handling on datasets so that it takes into account why the title is missing?

Contributor Author

It may not be the only way, but it is a way. The only problem is a node that has an isPrimaryTopicOf where that is not the reason the title is missing. Conversely, we may still show "missing title" for a referenced dataset in some cases, but that's not a big deal? (We can add more conditions later if we encounter them.)
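
A minimal sketch of the heuristic under discussion, to make the edge case concrete. It is not the PR's code: the graph is modeled as a plain list of `(subject, predicate, object)` tuples instead of an rdflib graph, and `value` and `classify` are hypothetical helpers introduced only for illustration.

```python
# Plain-string stand-ins for the RDF terms used in the diff (assumption:
# the real code uses rdflib namespaces such as DCT.title).
DCT_TITLE = "http://purl.org/dc/terms/title"
FOAF_IS_PRIMARY_TOPIC_OF = "http://xmlns.com/foaf/0.1/isPrimaryTopicOf"


def value(triples, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def classify(triples, node):
    """Classify a dataset node as 'described', 'referenced' or 'missing-title'."""
    if value(triples, node, DCT_TITLE):
        return "described"
    if value(triples, node, FOAF_IS_PRIMARY_TOPIC_OF):
        # Heuristic being debated: no title + external record = reference only.
        return "referenced"
    return "missing-title"


triples = [
    ("urn:ds:1", DCT_TITLE, "Rivers"),
    ("urn:ds:1", FOAF_IS_PRIMARY_TOPIC_OF, "urn:record:1"),
    ("urn:ds:2", FOAF_IS_PRIMARY_TOPIC_OF, "urn:record:2"),
]
```

Here `urn:ds:2` would be hard-skipped as a reference, while a node with neither property would still be reported as missing its title. The heuristic cannot tell a reference apart from a genuinely broken record that happens to carry isPrimaryTopicOf.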

for node in page.subjects(RDF.type, DCAT.DataService):
    dataservice = page.resource(node)
    for dataset in dataservice.objects(DCAT.servesDataset):
        if dataset_node == dataset.identifier:
Contributor

Are we sure it works if a dataset is properly described in the catalog but is also served by a dataservice?

Contributor Author

No, it doesn't work in some cases (I was waiting for the tests, and they fail…). I need to find a better way to exclude them…
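
The failing case can be illustrated with a small sketch. This is an assumption-laden illustration, not the PR's implementation: triples are plain tuples rather than an rdflib graph, and `served_datasets`, `naive_skip` and `guarded_skip` are hypothetical names. The guard shown (keep any node that has a title) is only one possible option, not necessarily the fix chosen here.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATA_SERVICE = "http://www.w3.org/ns/dcat#DataService"
DCAT_SERVES_DATASET = "http://www.w3.org/ns/dcat#servesDataset"
DCT_TITLE = "http://purl.org/dc/terms/title"


def served_datasets(triples):
    """Collect every node that a DataService points to via servesDataset."""
    services = {s for s, p, o in triples if p == RDF_TYPE and o == DCAT_DATA_SERVICE}
    return {o for s, p, o in triples if s in services and p == DCAT_SERVES_DATASET}


def naive_skip(triples, node):
    """Skip any node served by a dataservice (also drops described datasets)."""
    return node in served_datasets(triples)


def guarded_skip(triples, node):
    """Skip a served node only when it carries no title of its own."""
    has_title = any(s == node and p == DCT_TITLE and o for s, p, o in triples)
    return node in served_datasets(triples) and not has_title


triples = [
    ("urn:svc:1", RDF_TYPE, DCAT_DATA_SERVICE),
    ("urn:svc:1", DCAT_SERVES_DATASET, "urn:ds:described"),
    ("urn:svc:1", DCAT_SERVES_DATASET, "urn:ds:reference"),
    ("urn:ds:described", DCT_TITLE, "Rivers"),
]
```

With this data, `naive_skip` drops `urn:ds:described` even though it is fully described in the page, which is the situation raised in the review question above.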

@maudetes maudetes self-requested a review March 18, 2025 13:14

resource = page.resource(node)

title = rdf_value(resource, DCT.title)
primary_topic = rdf_value(resource, FOAF.isPrimaryTopicOf)
Contributor

I don't trust the isPrimaryTopicOf heuristic. I think all datasets described by GeoNetwork and transformed with the XSLT end up with this property, not only those in the servesDataset context.
See for example this random dataset DCAT output.

Ideally, we would explicitly exclude datasets that appear only as the object of servesDataset or hasPart?
They are not directly bound to the harvested catalogue at this point, so we wouldn't want to harvest them if they only appear as references?
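
One way to picture this suggestion is the sketch below. It rests on an assumption that may not hold for every source: that described datasets are attached to the catalog through `dcat:dataset`. Triples are plain tuples rather than an rdflib graph, and `reference_only_nodes` is a hypothetical helper, not an existing function.

```python
DCAT_DATASET = "http://www.w3.org/ns/dcat#dataset"
DCAT_SERVES_DATASET = "http://www.w3.org/ns/dcat#servesDataset"
DCT_HAS_PART = "http://purl.org/dc/terms/hasPart"

# Predicates whose objects are treated as mere references (assumption).
REFERENCE_PREDICATES = {DCAT_SERVES_DATASET, DCT_HAS_PART}


def reference_only_nodes(triples):
    """Return nodes that appear only as objects of the reference predicates.

    A node is kept for harvesting as soon as the catalog links to it through
    dcat:dataset, even if a dataservice also serves it.
    """
    referenced = {o for _, p, o in triples if p in REFERENCE_PREDICATES}
    in_catalog = {o for _, p, o in triples if p == DCAT_DATASET}
    return referenced - in_catalog


triples = [
    ("urn:catalog", DCAT_DATASET, "urn:ds:described"),
    ("urn:svc:1", DCAT_SERVES_DATASET, "urn:ds:described"),
    ("urn:svc:1", DCAT_SERVES_DATASET, "urn:ds:reference"),
    ("urn:ds:described", DCT_HAS_PART, "urn:ds:part"),
]
```

Unlike the title-based check, this relies only on how a node is linked in the page, so a dataset that is both catalogued and served would not be skipped.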

Contributor Author

I didn't find another way to filter these nodes. But I don't exclude all the nodes with isPrimaryTopicOf, only those without a title (which would be skipped anyway). I don't think there are many real datasets that have an isPrimaryTopicOf but no title?


2 participants