Dataset Series serialization #334

Draft
wants to merge 2 commits into
base: master

Conversation

amercader
Member

Refs #298 , fixes #332

This adds preliminary support for exposing Dataset Series and their members (managed by ckanext-dataset-series).

Datasets of type dataset_series (TODO: support custom series types) are serialized as dcat:DatasetSeries, and member Datasets include the dcat:inSeries property. If the series is ordered, navigation is included for both entities (dcat:first / dcat:last and dcat:previous / dcat:next respectively):

Example Dataset Series (http://localhost:5000/dataset_series/test-dataset-series.ttl)

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://localhost:5017/dataset/20f41df2-0b50-4b6b-9a75-44eb39411dca> a dcat:DatasetSeries ;
    dct:description "Testing" ;
    dct:identifier "20f41df2-0b50-4b6b-9a75-44eb39411dca" ;
    dct:issued "2025-01-22T13:43:38.208410"^^xsd:dateTime ;
    dct:modified "2025-01-28T13:53:03.900418"^^xsd:dateTime ;
    dct:publisher <http://localhost:5017/organization/a27490ed-4abf-46bd-a80a-d6e19d7fff18> ;
    dct:title "Test Dataset series" ;
    dcat:distribution <http://localhost:5017/dataset/20f41df2-0b50-4b6b-9a75-44eb39411dca/resource/0a526400-7a45-4c2c-a1db-7058acb270b0> ;
    dcat:first <http://localhost:5017/dataset/826bd499-40e5-4d92-bfa1-f777775f0d76> ;
    dcat:last <http://localhost:5017/dataset/ce8fb09a-f285-4ba8-952e-46dbde08c509> .

<http://localhost:5017/dataset/20f41df2-0b50-4b6b-9a75-44eb39411dca/resource/0a526400-7a45-4c2c-a1db-7058acb270b0> a dcat:Distribution ;
    dct:issued "2025-01-22T13:43:49.560508"^^xsd:dateTime ;
    dct:modified "2025-01-22T13:43:49.555378"^^xsd:dateTime ;
    dct:title "need to drop this" .

<http://localhost:5017/organization/a27490ed-4abf-46bd-a80a-d6e19d7fff18> a foaf:Agent ;
    foaf:name "Test org 1" .

Example member Dataset (http://localhost:5000/dataset/test-series-member-2.ttl)

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://localhost:5017/dataset/de9cb401-5fc7-47cd-83ac-f7fd154b2cee> a dcat:Dataset ;
    dct:description "sdas" ;
    dct:identifier "de9cb401-5fc7-47cd-83ac-f7fd154b2cee" ;
    dct:issued "2025-01-22T13:57:13.319491"^^xsd:dateTime ;
    dct:modified "2025-01-24T10:42:00.788016"^^xsd:dateTime ;
    dct:publisher <http://localhost:5017/organization/a27490ed-4abf-46bd-a80a-d6e19d7fff18> ;
    dct:title "Test series member 2" ;
    dcat:distribution <http://localhost:5017/dataset/de9cb401-5fc7-47cd-83ac-f7fd154b2cee/resource/aab3cabd-69b9-40e9-b922-1b0548de6cfc> ;
    dcat:inSeries <http://localhost:5017/dataset/20f41df2-0b50-4b6b-9a75-44eb39411dca> ;
    dcat:next <http://localhost:5017/dataset/ce8fb09a-f285-4ba8-952e-46dbde08c509> ;
    dcat:previous <http://localhost:5017/dataset/826bd499-40e5-4d92-bfa1-f777775f0d76> .

<http://localhost:5017/dataset/de9cb401-5fc7-47cd-83ac-f7fd154b2cee/resource/aab3cabd-69b9-40e9-b922-1b0548de6cfc> a dcat:Distribution ;
    dct:issued "2025-01-22T13:57:18.992071"^^xsd:dateTime ;
    dct:modified "2025-01-22T13:57:18.990029"^^xsd:dateTime ;
    dcat:accessURL <https://data.gov.ie> .

<http://localhost:5017/organization/a27490ed-4abf-46bd-a80a-d6e19d7fff18> a foaf:Agent ;
    foaf:name "Test org 1" .

When requesting the catalog endpoint (e.g. http://localhost:5000/catalog.ttl), Dataset Series are typed as dcat:DatasetSeries and member datasets contain the dcat:inSeries property, but the navigation properties are omitted for performance reasons. I think this is a good compromise for now, as the full set of properties can be accessed on each dataset's own serialization.
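To make the difference between the two endpoints concrete, here is a minimal sketch in plain Python (no rdflib) of which triples would be emitted for a series in each case. The function name `series_triples`, the `include_navigation` flag and the `first` / `last` keys of the input dict are assumptions made for illustration; they are not the actual ckanext-dcat or ckanext-dataset-series API.

```python
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def series_triples(series, base_url, include_navigation=True):
    """Return (subject, predicate, object) tuples for a Dataset Series.

    `series` is a hypothetical dict with an "id" and optional "first" /
    "last" member ids (only present when the series is ordered).
    """
    def uri(dataset_id):
        return f"{base_url}/dataset/{dataset_id}"

    subject = uri(series["id"])
    triples = [(subject, RDF_TYPE, DCAT + "DatasetSeries")]

    if include_navigation:
        # dcat:first / dcat:last are only added on the per-dataset endpoint
        for key in ("first", "last"):
            member_id = series.get(key)
            if member_id:
                triples.append((subject, DCAT + key, uri(member_id)))
    return triples


series = {
    "id": "20f41df2-0b50-4b6b-9a75-44eb39411dca",
    "first": "826bd499-40e5-4d92-bfa1-f777775f0d76",
    "last": "ce8fb09a-f285-4ba8-952e-46dbde08c509",
}
base = "http://localhost:5017"

full = series_triples(series, base)                             # dataset endpoint
light = series_triples(series, base, include_navigation=False)  # catalog endpoint
```

With these inputs, `full` carries the type triple plus dcat:first and dcat:last, while `light` carries only the type triple.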

A note on URIs

At first I thought about constructing the Dataset URIs using /dataset_series/ for consistency:

<http://localhost:5017/dataset_series/20f41df2-0b50-4b6b-9a75-44eb39411dca> a dcat:DatasetSeries ;

But that brings extra considerations. If we want to support custom series dataset types (i.e. stuff like /projects/ or /collections/), those should also have the same URI pattern, probably using /dataset_series/ and not the custom type. This would involve making dataset_uri() aware of the preferred dataset type, probably via a param.
We definitely don't want to change the URIs for any arbitrary dataset type (as this might break existing URIs in existing sites with custom dataset types), but for those types that describe Dataset Series perhaps it's worth the added complexity (and other entities could also have different URI patterns in the future if they are implemented with dataset types, like Data Services).
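As a rough illustration of the "aware of the preferred dataset type" idea, the sketch below uses a hypothetical function, `dataset_uri_sketch`, with an assumed `series_types` parameter. It is not the existing dataset_uri() helper and ignores any other sources of a URI that the real helper may consider; it only shows the path-segment decision being discussed.

```python
def dataset_uri_sketch(dataset_dict, base_url, series_types=("dataset_series",)):
    """Hypothetical variant of dataset_uri() that is aware of series types.

    Types listed in `series_types` all share the /dataset_series/ segment;
    every other dataset type keeps the existing /dataset/ segment, so URIs
    on sites with unrelated custom dataset types do not change.
    """
    dataset_type = dataset_dict.get("type", "dataset")
    segment = "dataset_series" if dataset_type in series_types else "dataset"
    return f"{base_url.rstrip('/')}/{segment}/{dataset_dict['id']}"


base = "http://localhost:5017"

# Default series type
series_uri = dataset_uri_sketch({"id": "abc", "type": "dataset_series"}, base)

# Custom type declared as a series type: same URI pattern as above
project_uri = dataset_uri_sketch(
    {"id": "def", "type": "project"}, base,
    series_types=("dataset_series", "project"),
)

# Custom type that is NOT a series: URI pattern is left untouched
other_uri = dataset_uri_sketch({"id": "ghi", "type": "publication"}, base)
```

The design choice illustrated here is that the custom type name never appears in the URI: series-like types collapse to one segment, and everything else is left alone.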

Any thoughts @seitenbau-govdata @hcvdwerf ?

TODO:

  • Remove distributions from Dataset Series
  • Support arbitrary series dataset types
  • Decide on Dataset Series URIs
  • Tests
  • Documentation

@amercader amercader marked this pull request as draft January 29, 2025 14:18
@hcvdwerf
Contributor

@amercader
Great work on implementing dataset series support! This is a valuable addition, and I really appreciate the effort in aligning it with DCAT-AP 3.

  • URI dataset_series sounds fine....

  • I would really like indeed to have the flexibility to add more entities in this way, such as study, project, collections, etc.

  • Why is Dataseries extension not part of the DCAT extension itself? Since this is a requirement of DCAT-AP 3, wouldn’t it make sense to include dataset_series directly within the DCAT extension?

  • Should first, last, next, and previous also be included in the model definitions (YAMLs)?

  • How does next work when the dataset is the last one in the series? Is it simply left empty?

  • What do you mean by distributions? Can’t you just leave out the resource fields in your model(YAMLs)?

  • How do next and previous work when a dataset is part of multiple dataset series? I’ve heard this is a possibility. According to DCAT-AP 3, inSeries can have 0..n values. Do you have any idea how this would work in combination with the previous and next properties?

@amercader
Member Author

URI dataset_series sounds fine....

great

I would really like indeed to have the flexibility to add more entities in this way, such as study, project, collections, etc.

Yes, I'm now convinced that this is a requirement (ckan/ckanext-dataset-series#6)

Why is Dataseries extension not part of the DCAT extension itself? Since this is a requirement of DCAT-AP 3, wouldn’t it make sense to include dataset_series directly within the DCAT extension?

The functionality provided by ckanext-dataset-series is useful for many use cases besides DCAT: it's just a way to organize datasets, optionally ordered. As I mentioned on the original issue, I think this will allow series to evolve and reduce the complexity in ckanext-dcat itself, which is starting to get really big. Everything specifically related to DCAT support will live in ckanext-dcat though (i.e. profiles support, serialization, etc.).

Should first, last, next, and previous also be included in the model definitions (YAMLs)?

I don't think so, as these are computed fields which are not meant to be manually updated (or via the API). They are computed at view time based on the actual ordering of the items in the series, so we don't have to store their values

How does next work when the dataset is the last one in the series? Is it simply left empty?

Yes, it equals None when there are no further datasets
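For readers wondering what "computed at view time" could look like, here is a minimal sketch that derives previous / next from the ordering of the members, returning None at either end of the series. The function name `navigation` and the flat list of ids are assumptions for illustration; the actual ckanext-dataset-series implementation may compute this differently.

```python
def navigation(ordered_ids, member_id):
    """Return the previous/next ids around `member_id`, or None at the ends."""
    index = ordered_ids.index(member_id)  # raises ValueError if not a member
    previous_id = ordered_ids[index - 1] if index > 0 else None
    next_id = ordered_ids[index + 1] if index < len(ordered_ids) - 1 else None
    return {"previous": previous_id, "next": next_id}


members = ["member-1", "member-2", "member-3"]
middle = navigation(members, "member-2")
last = navigation(members, "member-3")
```

Because nothing is stored, reordering the series or adding a member changes the result of the next call without any update to the member datasets themselves.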

What do you mean by distributions? Can’t you just leave out the resource fields in your model(YAMLs)?

Not sure what you mean here.

How do next and previous work when a dataset is part of multiple dataset series? I’ve heard this is a possibility. According to DCAT-AP 3, inSeries can have 0..n values. Do you have any idea how this would work in combination with the previous and next properties?

The way this is handled at the CKAN API level is to show an item for each series the dataset belongs to (series_navigation is a list):

{
    "name": "test-member-in-two-series",
    "type": "dataset",
    "series_navigation": [
        {
            "id": "series-1",
            "name": "test-dataset-series-1",
            "title": "Test Dataset series 1",
            "next": {
                "id": "ce8fb09a-f285-4ba8-952e-46dbde08c509",
                "name": "test-series1-member-3",
                "title": "Test series 1 member 3"
            },
            "previous": {
                "id": "826bd499-40e5-4d92-bfa1-f777775f0d76",
                "name": "test-series1-member-1",
                "title": "Test series 1 member 1"
            }
        },
        {
            "id": "series-2",
            "name": "test-dataset-series-2",
            "title": "Test Dataset series 2",
            "next": {
                "id": "5e5f18d5-c762-44be-8fb1-588dacf000b1",
                "name": "test-series2-member-6",
                "title": "Test series 2 member 6"
            },
            "previous": {
                "id": "22ed0d79-ec09-45a4-b468-65e29d5dda4d",
                "name": "test-series2-member-4",
                "title": "Test series 2 member 4"
            }
        }
    ]
}

At the DCAT serialization level I don't think there's an elegant way to handle this: a dataset will get a set of dcat:inSeries, dcat:next and dcat:previous for each series it belongs to. The DCAT-AP 3 guidelines don't provide any tips besides "should be handled with care", so if you have a better approach I'm happy to consider it. In any case this could be documented and we could recommend not using multiple series per dataset.
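The ambiguity is easy to see once the `series_navigation` list is flattened into triples. The sketch below is plain Python with hypothetical helper names (`member_triples`, `uri_for`) and an assumed URI pattern; it is not the ckanext-dcat profile code, only an illustration of the data shape.

```python
DCAT = "http://www.w3.org/ns/dcat#"


def uri_for(dataset_id):
    # Hypothetical URI pattern, mirroring the examples in this thread
    return f"http://localhost:5017/dataset/{dataset_id}"


def member_triples(subject, series_navigation):
    """Flatten a `series_navigation` list into (s, p, o) tuples."""
    triples = []
    for item in series_navigation:
        triples.append((subject, DCAT + "inSeries", uri_for(item["id"])))
        for key in ("previous", "next"):
            neighbour = item.get(key)
            if neighbour:  # None at either end of a series
                triples.append((subject, DCAT + key, uri_for(neighbour["id"])))
    return triples


series_navigation = [
    {
        "id": "series-1",
        "next": {"id": "ce8fb09a-f285-4ba8-952e-46dbde08c509"},
        "previous": {"id": "826bd499-40e5-4d92-bfa1-f777775f0d76"},
    },
    {
        "id": "series-2",
        "next": {"id": "5e5f18d5-c762-44be-8fb1-588dacf000b1"},
        "previous": {"id": "22ed0d79-ec09-45a4-b468-65e29d5dda4d"},
    },
]

triples = member_triples(uri_for("test-member-in-two-series"), series_navigation)
next_values = [o for (_, p, o) in triples if p == DCAT + "next"]
```

`next_values` ends up with two objects hanging off the same subject, and nothing in the flat triples records which dcat:next goes with which dcat:inSeries. The grouping that the API response keeps per list item is lost.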

@hcvdwerf
Contributor

  • What is your definition of a "Project"? At Health-RI, we also have a definition of a project, but I don't think it's the same one you're referring to. (See: Health-RI Metadata Model)

  • In this schema, I see a resource field. I’m not sure about its purpose—could you clarify?

  • For now, datasets that are part of multiple dataset series are not relevant to us, but it is possible according to the specifications. So, we should include a strong disclaimer about using this. Since DCAT does not fully support it, we have to be very careful, as we cannot reliably determine which previous and next properties belong to which dataset series.

@amercader
Member Author

amercader commented Feb 14, 2025

  • What is your definition of a "Project"? At Health-RI, we also have a definition of a project, but I don't think it's the same one you're referring to. (See: Health-RI Metadata Model)

I don't have any particular implementation in mind for projects, studies, collections etc. Basically any kind of dataset aggregation by an entity with its own properties. In your model, IIUC, it would be Studies that would implement Dataset Series, and Projects would aggregate Studies.

  • In this schema, I see a resource field. I’m not sure about its purpose—could you clarify?

That's what I meant by scheming not liking schemas without resources. The current version will throw an exception if you drop all entries in resource_fields or the resource_fields property entirely. We will need to patch ckanext-scheming to support this or come up with a workaround, but for now I left a single field so it didn't fail.

  • For now, datasets that are part of multiple dataset series are not relevant to us, but it is possible according to the specifications. So, we should include a strong disclaimer about using this. Since DCAT does not fully support it, we have to be very careful, as we cannot reliably determine which previous and next properties belong to which dataset series.

+1

Successfully merging this pull request may close these issues.

Support Dataset Series serialization
2 participants