Hard skip empty datasets nodes in dataservices #3285

ThibaudDauce · 2025-03-13T10:50:43Z

Skips the many dataset nodes that were marked as ignored because of a missing title when they are only referenced, not described.

Example: a harvest source with ~1977 ignored items, which reaches the 2000-item maximum on demo.
See the ISO output and the DCAT result after XSLT for a problematic page.

maudetes

Thank you for taking a look at this! 🙏

maudetes · 2025-03-13T11:03:39Z

udata/core/dataset/rdf.py

@@ -751,6 +752,14 @@ def dataset_from_rdf(graph: Graph, dataset=None, node=None, remote_url_prefix: s

    dataset.title = rdf_value(d, DCT.title)
    if not dataset.title:
+        external_rdf = rdf_value(d, FOAF.isPrimaryTopicOf)


I am not sure this is a valid way of identifying datasets that are only referenced.
I think it may be better to update the "missing title" handling on the dataset so that it takes into account why the title is missing?

ThibaudDauce · Mar 13, 2025

It may not be the only way, but it is one way. The only problem would be a node that has isPrimaryTopicOf but is missing its title for some other reason. Otherwise, we may still show "missing title" for a referenced dataset in some cases, but that is not a big deal? (We can add more conditions later if we encounter them.)
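To make the heuristic under discussion concrete, here is a minimal sketch; it is not the code from this PR. It models the harvested graph as plain (subject, predicate, object) string tuples instead of an rdflib Graph so that it runs with the standard library alone. The helpers `value` and `is_reference_stub` and all `example.org` URIs are hypothetical.

```python
# Illustrative only: triples are plain tuples, not an rdflib Graph.
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_DESCRIPTION = "http://purl.org/dc/terms/description"
FOAF_IS_PRIMARY_TOPIC_OF = "http://xmlns.com/foaf/0.1/isPrimaryTopicOf"


def value(triples, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def is_reference_stub(triples, node):
    """Heuristic from the thread: no title, but a foaf:isPrimaryTopicOf link."""
    has_title = bool(value(triples, node, DCT_TITLE))
    has_external_record = value(triples, node, FOAF_IS_PRIMARY_TOPIC_OF) is not None
    return not has_title and has_external_record


STUB = "https://example.org/dataset/stub"
FULL = "https://example.org/dataset/full"
UNTITLED = "https://example.org/dataset/untitled"

triples = [
    # Only referenced: no title, just a pointer to an external record.
    (STUB, FOAF_IS_PRIMARY_TOPIC_OF, "https://example.org/record/stub"),
    # Fully described dataset that also carries isPrimaryTopicOf.
    (FULL, DCT_TITLE, "Air quality measurements"),
    (FULL, FOAF_IS_PRIMARY_TOPIC_OF, "https://example.org/record/full"),
    # No title and no isPrimaryTopicOf: not covered by the heuristic.
    (UNTITLED, DCT_DESCRIPTION, "A description without a title"),
]

skipped = [n for n in (STUB, FULL, UNTITLED) if is_reference_stub(triples, n)]
```

Under these assumptions only the stub node is hard-skipped. The untitled node without isPrimaryTopicOf would still go through the regular "missing title" path.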

maudetes · 2025-03-13T11:09:13Z

udata/harvest/backends/dcat.py

+        for node in page.subjects(RDF.type, DCAT.DataService):
+            dataservice = page.resource(node)
+            for dataset in dataservice.objects(DCAT.servesDataset):
+                if dataset_node == dataset.identifier:


Are we sure it works if a dataset is properly described in the catalog but is also served by a dataservice?

ThibaudDauce · Mar 13, 2025

No, it doesn't work in some cases (I was waiting for the tests, and they fail…). I need to find a better way to exclude them…
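The failure mode raised in the question above can be reproduced with a small sketch. It again uses plain (subject, predicate, object) tuples rather than an rdflib Graph, and `is_served_by_dataservice` and the `example.org` URIs are hypothetical names, not code from this PR.

```python
# Illustrative only: a check based solely on dcat:servesDataset membership.
DCT_TITLE = "http://purl.org/dc/terms/title"
DCAT_SERVES_DATASET = "http://www.w3.org/ns/dcat#servesDataset"


def is_served_by_dataservice(triples, dataset_node):
    """Naive check: is the node the object of any dcat:servesDataset triple?"""
    return any(
        p == DCAT_SERVES_DATASET and o == dataset_node for _, p, o in triples
    )


SERVICE = "https://example.org/service/wms"
FULL = "https://example.org/dataset/full"
STUB = "https://example.org/dataset/stub"

triples = [
    (SERVICE, DCAT_SERVES_DATASET, FULL),
    (SERVICE, DCAT_SERVES_DATASET, STUB),
    # FULL is properly described in the catalog, STUB is not.
    (FULL, DCT_TITLE, "Air quality measurements"),
]

flagged = [n for n in (FULL, STUB) if is_served_by_dataservice(triples, n)]
```

Both nodes end up in `flagged`, including the fully described one, so skipping on this check alone would drop a dataset that should be harvested.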

maudetes · 2025-03-18T17:05:04Z

udata/harvest/backends/dcat.py

+        resource = page.resource(node)
+
+        title = rdf_value(resource, DCT.title)
+        primary_topic = rdf_value(resource, FOAF.isPrimaryTopicOf)


I don't trust the isPrimaryTopicOf heuristic approach. I think all datasets described by GeoNetwork and transformed with the XSLT end up with this property. It's not only the case in the servesDataset context.
See for example this random dataset DCAT output.

Ideally, we would explicitly exclude datasets that only appear as the object of servesDataset or hasPart?
Indeed, they are not directly bound to the harvested catalogue at this point, so we wouldn't want to harvest them if they only appear as references?

ThibaudDauce · Mar 19, 2025

I didn't find another way to filter these nodes. But I don't exclude every node with isPrimaryTopicOf, only nodes that also have no title (so they would be skipped anyway). I don't think there are many real datasets that have isPrimaryTopicOf but no title?
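For comparison, the explicit alternative suggested in the review (excluding nodes that only show up as the object of servesDataset or hasPart) could look roughly like the sketch below. This is an assumption about how such a filter might be written, not what the PR does: the graph is modelled as plain tuples, `is_only_referenced` is a hypothetical helper, and "has no title" is used as a stand-in for "is not described".

```python
# Illustrative only: does not rely on foaf:isPrimaryTopicOf at all.
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_DESCRIPTION = "http://purl.org/dc/terms/description"
DCT_HAS_PART = "http://purl.org/dc/terms/hasPart"
DCAT_SERVES_DATASET = "http://www.w3.org/ns/dcat#servesDataset"

REFERENCE_PREDICATES = frozenset({DCAT_SERVES_DATASET, DCT_HAS_PART})


def is_only_referenced(triples, node):
    """True if the node is pointed at by a reference predicate but has no title."""
    referenced = any(
        p in REFERENCE_PREDICATES and o == node for _, p, o in triples
    )
    has_title = any(s == node and p == DCT_TITLE and o for s, p, o in triples)
    return referenced and not has_title


SERVICE = "https://example.org/service/wms"
SERIES = "https://example.org/dataset/series"
SERVED_STUB = "https://example.org/dataset/served-stub"
PART_STUB = "https://example.org/dataset/part-stub"
DESCRIBED = "https://example.org/dataset/described"
UNTITLED = "https://example.org/dataset/untitled"

triples = [
    (SERVICE, DCAT_SERVES_DATASET, SERVED_STUB),
    (SERVICE, DCAT_SERVES_DATASET, DESCRIBED),
    (SERIES, DCT_HAS_PART, PART_STUB),
    (SERIES, DCT_TITLE, "Yearly air quality series"),
    (DESCRIBED, DCT_TITLE, "Air quality measurements"),
    # Untitled but never referenced: stays on the "missing title" path.
    (UNTITLED, DCT_DESCRIPTION, "A description without a title"),
]

candidates = (SERVED_STUB, PART_STUB, DESCRIBED, UNTITLED, SERIES)
hard_skipped = [n for n in candidates if is_only_referenced(triples, n)]
```

With this data only the two stub nodes are hard-skipped. A dataset that is both served by a dataservice and described in the catalog is kept, and an untitled dataset that nothing references is still reported as missing its title.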

Hard skip empty datasets nodes in dataservices

399ac14

ThibaudDauce requested a review from maudetes March 13, 2025 10:50

maudetes reviewed Mar 13, 2025

ThibaudDauce and others added 7 commits March 13, 2025 14:12

CHange the way to detect empty dataset node

865d9d8

Merge branch 'master' into hard_skip_empty_datasets_nodes_in_dataservices

b0e4173

Update changelog

49298fe

Warn about reaching max items

a4294d7

Do not show error on preview, there is already a warn message

de661d5

Merge branch 'master' into hard_skip_empty_datasets_nodes_in_dataservices

ccb2318

Merge branch 'master' into hard_skip_empty_datasets_nodes_in_dataservices

0974534

maudetes self-requested a review March 18, 2025 13:14

maudetes added 2 commits March 18, 2025 16:59

Merge branch 'master' into hard_skip_empty_datasets_nodes_in_dataservices

2dc5c1b

move changelog at the right section

a55e203

maudetes reviewed Mar 18, 2025

Try alternative approach to filter out empty datasets

ca1bc59
