Is there a distinction between data sets used in the project and data sets produced #14

rwwh · 2019-11-20T11:07:10Z

The concept of a data set is slightly overloaded. I recognize data sets:

- that are collected in the project, which are of particular interest to "quality assurance",
- that are intermediate steps in the analysis,
- that are products of the project and will be distributed.

Is the model supposed to be able to document each of these? Are they mostly differentiated based on the fact that the "product" data sets contain "distribution"s? It may be good to add something about this to the documentation.

TomMiksa · 2019-11-20T13:30:22Z

Currently, there is no field that explicitly states whether a dataset is reused or produced. The standard differentiates between existing and non-existing datasets by setting dates appropriately (see FAQ). For the time being, we are able to express what data is being "used" in the project without differentiating what existed before.

If you need to make explicit which data is reused, I see the following options:

Set the Dataset type to "Reused data" to cover all reused data.
OR
Use the description field of the Dataset or Distribution to state that the data is reused.
OR
(implicit way) If the reused dataset comes from a data repository, one could read from the dataset's metadata when it was published and compare that date to the creation date of the DMP or the start date of the project (depending on the setting) to find out whether the dataset existed before the project started or the DMP was written.
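The implicit option can be sketched roughly as follows. This is only an illustration: the helper `existed_before` is hypothetical, and the keys `issued`, `created` and `start` are assumed to match how a DMP is serialised in your setting.

```python
from datetime import date


def _to_date(value: str) -> date:
    # Accept a plain date or a full timestamp; only the leading
    # YYYY-MM-DD part is used for the comparison.
    return date.fromisoformat(value[:10])


def existed_before(issued: str, reference: str) -> bool:
    """Return True if the publication date is earlier than the reference date."""
    return _to_date(issued) < _to_date(reference)


# Illustrative values only (assumed field names, made-up dates).
dmp = {
    "created": "2019-11-20T11:07:10Z",
    "project": [{"start": "2019-01-01"}],
}
dataset = {"title": "Survey data", "issued": "2017-05-02"}

# Compare against the project start or the DMP creation date,
# depending on which reference point applies.
reused_by_project_start = existed_before(dataset["issued"], dmp["project"][0]["start"])
reused_by_dmp_creation = existed_before(dataset["issued"], dmp["created"])
```

Which reference date to use is a policy choice; the comparison itself is the same in both cases.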

I think we should add a point on this to the FAQ.

What you describe about having multiple distributions for one dataset to indicate different things is correct. For example, we can have a dataset with "survey data" that will have two distributions: "raw data" and "anonymised data". The first one is being used in processing during the project and will be deleted afterwards. The second one will be published at the end of the project.
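As a rough sketch of the survey example, the dataset could be written down like this. The key names (`distribution`, `data_access`, `available_until`), the access values and the date are assumptions made for illustration, and `published_titles` is a hypothetical helper.

```python
# One dataset with two distributions (illustrative structure and values).
survey = {
    "title": "survey data",
    "description": "Survey responses collected in the project.",
    "distribution": [
        {
            "title": "raw data",
            "description": "Used in processing during the project, deleted afterwards.",
            "data_access": "closed",          # assumed value: not published
            "available_until": "2021-12-31",  # made-up deletion date
        },
        {
            "title": "anonymised data",
            "description": "Published at the end of the project.",
            "data_access": "open",            # assumed value: published
        },
    ],
}


def published_titles(dataset: dict) -> list:
    """Return the titles of the distributions marked as openly accessible."""
    return [
        dist["title"]
        for dist in dataset.get("distribution", [])
        if dist.get("data_access") == "open"
    ]


titles = published_titles(survey)
```

Here the distinction between "used" and "produced" lives entirely in the distributions, so no extra field on the dataset is needed.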

paulwalk · 2019-11-20T13:38:35Z

I think any attempt to create an ontology around the possible uses/purposes of data must be firmly out of scope for this project. I recommend that we do not try to address this at all - except in an FAQ suggesting that "intended use" might be described in the description for a dataset or distribution.

TomMiksa assigned peterneish, paulwalk and TomMiksa Aug 28, 2020

TomMiksa added the "decision" label (Decision to be taken that aligns the approach) Aug 28, 2020
