software versions and identifiers #157
Comments
At GigaScience we've found the RelatedIdentifier field in DataCite metadata very useful and intuitive for describing software, input and output data, updates, and even documentation. For example, our published SOAPdenovo2 genome assembly software (DOI:10.5524/100044) took an old input dataset (DOI:10.5524/100015) and created a new improved output (DOI:10.5524/100038). We describe all of this in the DataCite metadata: we use IsPreviousVersionOf and IsNewVersionOf to describe which is the old and which is the improved data, and Compiles and IsCompiledBy to describe which is the input data and which is the software compiling it. We issued a correction to SOAPdenovo2 (DOI:10.5524/100148), so used IsNewVersionOf to describe that. There are also lots of other related tools and data that we link to this with IsSupplementedBy/IsSupplementTo tags. The metadata has lots of other flexibility too, as you can link documentation with IsDocumentedBy, and describe relationships between modules or forks with IsPartOf, IsVariantFormOf, etc. We lobbied DataCite to get “workflow” as resourceTypeGeneral in their v3 metadata schema, so this schema can also describe the components of computational pipelines. If you are looking for examples to use for this we have plenty we can contribute. |
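As a rough illustration of the kind of record described above, the relationships can be written out as plain data. This is only a sketch: the field names follow DataCite's `relatedIdentifier` / `relationType` pattern and the DOIs are the ones quoted in the comment, but which record carries which link, and the direction of each link, are assumptions made for illustration and not a copy of the actual GigaScience metadata. The helpers `related` and `links_of_type` are hypothetical.

```python
# Hypothetical sketch of relatedIdentifier entries; link directions are assumed.

def related(doi, relation):
    """Build one relatedIdentifier entry pointing at another DOI."""
    return {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation,
    }

# Keys are the DOIs being described; values are their outgoing links.
records = {
    # Improved output dataset: newer than the old input, compiled by the software.
    "10.5524/100038": [
        related("10.5524/100015", "IsNewVersionOf"),
        related("10.5524/100044", "IsCompiledBy"),
    ],
    # SOAPdenovo2 software: the inverse link back to the output it produced.
    "10.5524/100044": [
        related("10.5524/100038", "Compiles"),
    ],
}

def links_of_type(doi, relation):
    """Return the DOIs that `doi` points to with the given relation type."""
    return [
        entry["relatedIdentifier"]
        for entry in records.get(doi, [])
        if entry["relationType"] == relation
    ]
```

Calling `links_of_type("10.5524/100038", "IsNewVersionOf")` would then return the older dataset that the improved output replaces.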
how would the parent and the version 1 metadata differ (besides the relationtype)? |
they wouldn't otherwise differ, I don't think |
I think the use of the related identifiers field to denote different software versions sounds very promising, but I'm not quite sure that the idea of issuing two identifiers (one for the package as a whole and one for the version) when, say, a Zenodo archival record is first created is ideal. What would such a record actually look like? Which of the two would be the identifier for the DataCite entry created? And what would the other point to? What DataCite relationship category would be used? I think it makes more sense to assign a single identifier when the record is created, which is unique to the version actually archived at the time. When the record is updated (e.g. by a new GitHub release in the automatic Zenodo model), then a new identifier is issued, creating a new DataCite entry which now includes the field "relatedIdentifier" pointing to the previous version and using the relationship Sure it would be nice to have a way to cite / refer to the package as a whole vs the individual versions, but I think that is not particularly practical. Given good DataCite records with I recall we've also discussed this issue in setting up CodeMeta fields, so want to make sure we coordinate those recommendations with these; @mbjones might better remember our discussions and say if I'm off base here. |
@danielskatz described really well what we discussed earlier today. I think a different way to phrase this is to say that we want a persistent identifier for the specific release/version, but also a persistent identifier for the repo. In the GitHub world these two are clearly distinguishable, and, as Dan and Scott said, can be described in DataCite metadata. There are a number of use cases for a persistent identifier for a software repo (as opposed to a specific release); one important reason is to aggregate all citations to specific versions for credit and attribution (principle #2). How else would you aggregate all citations for a piece of software, to the latest version, to all versions? The current implementation of issuing an identifier for principle #6 (specificity) can't also handle principle #2, unless there is only a single version of the software. |
You find the same idea also in the JISC recommendations for software citation: http://rrr.cs.st-andrews.ac.uk/wp-content/uploads/2015/10/guidelines-software-identification.pdf. They talk about a model of software entities and the "product" level vs. the "version" level, and they describe the usefulness of identifiers for the "product" level:
|
@mfenner Thanks Martin! This clarifies a lot, and I'm all for giving a permanent identifier to the repo. For instance, I think this means that the version-specific identifier would still be the one that corresponds to the Zenodo record that gets created, and that record could simply refer to the source repository using the repo id instead of just the repo URL. It's not clear what the relationship property would be from the existing DataCite relation terms ( I assume the version-specific DOI would then resolve to the Zenodo record, and the package DOI would resolve to what? the GitHub repo? Or does it need to resolve to something more permanently archived? (If the latter, how would you archive something without archiving a particular version / snapshot?) I'm also not sure that the notions of 'source code repository' and 'software product' are really 100% synonymous. I also didn't follow why we need two such identifiers to aggregate citations. Does this just assume people cite both identifiers, so that the citation count of the package ID can be used as the aggregate? (Do you think the sum of citations over versions will always equal that of citations to the package ID?) It seems to me that the only way to do aggregate citations is to define what collection of identifiers are being aggregated (i.e. the most recent identifier and all other identifiers produced in walking the chain of Don't mean to be contrary here; I'm all for having identifiers for things and defining the relationships to them. It just seems to me that it's not essential to have an identifier for the 'product' level to accomplish the goals here, and I am unclear about the practical side of how it would be implemented and used (as with the above question about where would that DOI even resolve to?). |
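The alternative raised in the comment above, aggregating citations by starting from the most recent identifier and walking the chain of previous-version links, can be sketched roughly as follows. Everything here is hypothetical: the DOIs, the citation counts, and the `previous_version` lookup stand in for whatever a real DataCite query would return.

```python
# Hypothetical data: each version DOI maps to the DOI it supersedes (or None).
previous_version = {
    "10.1234/example.v3": "10.1234/example.v2",
    "10.1234/example.v2": "10.1234/example.v1",
    "10.1234/example.v1": None,
}

# Hypothetical per-version citation counts.
citations = {
    "10.1234/example.v1": 4,
    "10.1234/example.v2": 7,
    "10.1234/example.v3": 2,
}

def version_chain(latest):
    """Walk previous-version links from the latest DOI back to the first."""
    chain = []
    seen = set()  # guards against accidental cycles in the metadata
    current = latest
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = previous_version.get(current)
    return chain

def aggregate_citations(latest):
    """Sum citation counts over every version reachable from `latest`."""
    return sum(citations.get(doi, 0) for doi in version_chain(latest))
```

Note that this only works if the caller already knows which identifier is the latest, and if the chain is unbroken, which is part of what the thread is debating.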
tl;dr: I think there's a difference between the concepts of linking specific versions of software, giving an identifier for a software repo, and giving an identifier which provides a way of referencing the abstract entity that is a piece of software. We looked at this a bit with JORS and SSI related work. It seemed to come down to there being two different reasons for wanting to cite software:
Which map nicely to the use cases listed in 10.7717/peerj.2394/table-2 which either do or don't require software versions. We questioned whether it was easier to do the second of these aggregate use cases by:
In the end we went with 1) because it looked like the other options would have taken too much effort at the time to implement, and required the community as a whole to adopt one way or the other. |
I think the case/reason that you (@npch) are leaving out is citation and credit, where the authors both want to get credit for a specific version, but also want to be able to roll up that credit into a credit for all versions of the software. But I agree that at the time, using your 1 was a good choice. Now that we have a chance to influence the larger community when software citations standards move into an implementation phase, I think it might be time to consider other options. |
This is a very interesting discussion, and I agree with @danielskatz that the timing is right, as we are moving from principles to implementation. The idea of an identifier pointing to the latest version of something is very popular, and is obviously how we navigate the web (only that for the average webpage it is very hard to go to a previous version). What I am advocating for is that this is not the best way to use identifiers for scholarly resources, as it doesn't properly address specificity and attribution. The main problem is that the thing the identifier is pointing to is changing with every version. IMHO the much better implementation is to have an identifier that points to an abstract, versionless concept, rather than to the latest version. This versionless concept then links to specific versions. This helps with a number of use cases described in the software citation paper. This is also how software package repositories often work, see for example (I mainly use JavaScript and Ruby) https://www.npmjs.com/package/bower or https://rubygems.org/gems/factory_girl. The implementation using GitHub, Zenodo and DOIs is not quite following this pattern, and I guess that @danielskatz and I are suggesting that we should do so. The needed changes are probably the following:
I understand that a code repository doesn't equal software, but it is a good proxy for a lot of open source software. And for the other cases I think we still need these two identifiers, just pointing to something else. We should also not forget that collecting software citations is really hard, and we need all the help we can get. Having a parent identifier for all versions that links to all the citations found is extremely powerful, as we don't want everyone to aggregate the citations to different versions themselves, in the worst case with different results. |
The approach we are planning on pursuing for Zenodo is the one described by @mfenner. One "container"-DOI, plus a "version"-DOI per release. Some users prefer having the container-DOI cited, whereas others prefer having the version-DOI cited.
Wouldn't it be possible to simply use One complexity that we have to deal with is that IsPreviousVersionOf/IsNewVersionOf does not model semantic versioning very well, especially in the cases where releases happen out of order (e.g. 1.1, 1.2, 1.1.1, ...) |
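To make the out-of-order problem concrete, here is a small sketch comparing the previous-version links you get from publication order with the ones you get from semantic order, using the 1.1, 1.2, 1.1.1 example above. It assumes purely numeric dotted version strings (no pre-release or build tags), and `version_key` and `previous_links` are hypothetical helper names.

```python
# Releases in the order they were published (example from the comment above).
publication_order = ["1.1", "1.2", "1.1.1"]

def version_key(version):
    """Turn '1.1.1' into (1, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def previous_links(ordered):
    """Map each release to its predecessor in the given ordering."""
    return {current: previous for previous, current in zip(ordered, ordered[1:])}

# Same releases, sorted by semantic version instead of release date.
semantic_order = sorted(publication_order, key=version_key)

links_by_date = previous_links(publication_order)
links_by_semver = previous_links(semantic_order)
```

Under publication order, 1.1.1 would be recorded as a new version of 1.2; under semantic order it is a new version of 1.1, and 1.2 follows 1.1.1. A single linear chain of version links can only express one of these two views.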
@lnielsen I am happy to hear that Zenodo plans to issue "container"-DOIs in addition to "version"-DOIs. |
I know the first version of the software citation principles has just been finalized, but I want to suggest that we consider adding a bit more to the discussion section about software versions.
I was just talking with @mfenner and learned that DOIs have fields that could be used to record relationships. This suggests that there is at least one way that different versions of software could have identifiers but metrics could be collected on the whole family of versions of a software package.
I think @mfenner suggested that the GitHub/Zenodo link be modified so that the first time a package is released through Zenodo, two DOIs would be created: one for the package and one for the version. The metadata for the version would indicate the DOI for the parent as well. Then for future versions, the same parent could be identified in the version's DOI's metadata.
@mfenner, please correct anything I got wrong.
I don't know how this would be done in other services such as figshare, but I'm sure it could be worked out.
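The proposal above (two DOIs on first release, then one new version DOI per later release, each pointing at the same parent) could look roughly like the toy model below. This is a sketch only: `Registry` is an in-memory stand-in and not the Zenodo, figshare, or DataCite API, the `10.9999` prefix and DOI suffixes are made up, and the plain `parent` / `versions` keys deliberately avoid committing to a particular DataCite relation type.

```python
from itertools import count

class Registry:
    """Toy in-memory stand-in for a DOI-minting service (hypothetical)."""

    def __init__(self, prefix="10.9999"):
        self.prefix = prefix
        self._ids = count(1)
        self.parents = {}   # package name -> parent (package-level) DOI
        self.metadata = {}  # DOI -> metadata dict

    def _mint(self):
        """Return a fresh, made-up DOI string."""
        return f"{self.prefix}/example.{next(self._ids)}"

    def release(self, package, version):
        """Register a release; mint the parent DOI only on first release."""
        if package not in self.parents:
            parent = self._mint()
            self.parents[package] = parent
            self.metadata[parent] = {"title": package, "versions": []}
        parent = self.parents[package]

        doi = self._mint()
        # The version record names its parent; the parent lists its versions.
        self.metadata[doi] = {"title": package, "version": version, "parent": parent}
        self.metadata[parent]["versions"].append(doi)
        return doi
```

With this shape, metrics for the whole family of versions can be collected by looking up the parent record and iterating over its `versions` list, without needing to know which version is the latest.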