Is there a distinction between data sets used in the project and data sets produced #14

Open
rwwh opened this issue Nov 20, 2019 · 2 comments
Labels: decision (Decision to be taken that aligns the approach)

Comments

@rwwh

rwwh commented Nov 20, 2019

The concept of a data set is slightly overloaded. I recognize data sets

  • that are collected in the project, and are of particular interest for "quality assurance",
  • that are intermediary steps in the analysis,
  • that are products of the project, and will be distributed.

Is the model supposed to be able to document each of these? Are they differentiated mainly by the fact that the "product" data sets contain "distribution"s? It may be good to add something about this to the documentation.

@TomMiksa
Contributor

Currently, there is no field that explicitly states whether a dataset is reused or produced. The standard differentiates between existing and non-existing datasets by setting dates appropriately (see FAQ). For the time being, we are able to express what data is being "used" in the project without differentiating what existed before.

If you need to make explicit which data is reused, I see the following options:

  1. Set the Dataset type to "Reused data" to encapsulate all reused data.
    OR
  2. Use the description field of the Dataset or Distribution to state that the data is reused.
    OR
  3. (implicit way) If the reused dataset comes from a data repository, one could read from the dataset's metadata when it was published and compare that with the creation date of the DMP or the start date of the project (depending on the setting) to find out whether the dataset existed before the project started or the DMP was written.
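
As a rough illustration of option 3, the date comparison could look like the sketch below. It assumes the dataset's publication date and the reference date (DMP creation or project start) are available as ISO 8601 date strings. The function name `existed_before` and its parameters are hypothetical and not part of the standard.

```python
from datetime import date


def existed_before(dataset_published: str, reference_date: str) -> bool:
    """Return True if the dataset was published before the reference date.

    Both arguments are ISO 8601 dates (YYYY-MM-DD). The reference date is
    either the DMP creation date or the project start date, depending on
    the setting. All names here are illustrative, not part of the standard.
    """
    published = date.fromisoformat(dataset_published)
    reference = date.fromisoformat(reference_date)
    return published < reference


# Hypothetical dates: one dataset published before the DMP, one after.
print(existed_before("2017-05-01", "2019-11-20"))  # True
print(existed_before("2020-03-15", "2019-11-20"))  # False
```

A dataset published on the reference date itself is treated as not pre-existing here; that boundary choice is an assumption.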

I think we should add a point on this to the FAQ.

What you describe about having multiple distributions for one dataset to indicate different things is correct. For example, we can have a dataset with "survey data" that has two distributions: "raw data" and "anonymised data". The first one is used in processing during the project and will be deleted afterwards. The second one will be published at the end of the project.
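
A minimal sketch of that example as a Python dictionary, with the intended use of each distribution recorded as free text in its description. The field names (`title`, `description`, `distribution`) and the helper `distribution_summary` are assumptions for illustration and should be checked against the standard before being relied on.

```python
# Hypothetical record: one dataset with two distributions.
survey_dataset = {
    "title": "Survey data",
    "distribution": [
        {
            "title": "Raw data",
            "description": "Used in processing during the project; deleted afterwards.",
        },
        {
            "title": "Anonymised data",
            "description": "Published at the end of the project.",
        },
    ],
}


def distribution_summary(dataset: dict) -> list:
    """Return one 'title: description' line per distribution."""
    return [f"{d['title']}: {d['description']}" for d in dataset["distribution"]]


for line in distribution_summary(survey_dataset):
    print(line)
```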

@paulwalk
Contributor

I think any attempt to create an ontology around the possible uses/purposes of data must be firmly out of scope for this project. I recommend that we do not try to address this at all - except in an FAQ suggesting that "intended use" might be described in the description for a dataset or distribution.

@TomMiksa added the "decision" label (Decision to be taken that aligns the approach) on Aug 28, 2020