Add support for UUID #30
Comments
Thanks @empavia,

I think the key question that would be helpful to answer is how we identify a dataset. Suppose we have a dataset "A", which is the first version released, and it has a cryptographic hash. Now suppose an updated version of "A" is released. If they are different datasets, then the cryptographic hash is perfect: it's already a unique identifier. If they should be identified as the same dataset, just different editions, then it would be helpful to have some sort of uniquely identifying abstraction to refer to the overarching data product.
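For concreteness, computing such a cryptographic hash is a short streaming operation. A minimal sketch in Python (the function name and chunk size are illustrative, not from this project):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 of a file, reading it in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```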
Yeah, good points, James. We have an
Right, we have the
I could definitely see cases where we'd want to keep track of different versions of a dataset, but I'm not sure about the need to have all of them cataloged on CKAN. @empavia do you happen to have any such cases in mind?

Just thinking out loud here, all of the cases I can think of right now are ones where new versions of a dataset are released as though they were completely independent datasets. A few examples are SRTMv3 (in contrast with v1 and v2) and SoilGrids2017.

Some degree of versioning does seem important for reproducibility, though. Imagine we made an update to an AWC layer: how would we want to identify that the layer has changed? Would we just include it as a new layer on the catalog? Thinking ahead, I'm not sure how a DOI would change if we updated the underlying data with editions, but the behavior is clear if we treat the update as a new dataset.

Anyways, I guess I'm feeling like the simplicity of a SHA might be really nice, but @empavia I would be very interested to hear your thoughts on this.
I agree that I don't foresee new versions of data needing to be new 'editions', and I was thinking of the same use cases. New versions of data, or new year releases, will likely just be different datasets, since it will be important to offer both options (such as having SRTM v3 and v4, for example).

The reproducibility issue is important to discuss and think about. I imagined that an updated AWC layer (if we've rerun it with a change in a formula, for example) would replace the original version and may just need updated metadata and a new 'update date' on CKAN and in the metadata. I am not sure how that would work with the SHA or with a DOI, as it seems we may get a new one with a rerun dataset. However, is it possible to just keep the original SHA and DOI and only update the data itself, plus whatever pieces of the metadata are relevant? For example, for AWC again, we would completely overwrite the old version with the new version (assuming it is better quality or fixed something, etc.) and maintain the original identifiers?

Curious to hear your thoughts on this, @phargogh. Seems like we are generally on the same page, though.
I think I'm leaning towards having the SHA256 (or another checksum) serve as the unique identifier.

As I understand the intent of a DOI, the DOI is supposed to uniquely identify that specific layer, and the SHA256 (or whichever checksum we use) should also more or less uniquely identify the dataset. So if we update the AWC equation and produce a new layer, I would think that we should have a new DOI and a new SHA256.

One interesting thing that appears to be true (though not fully implemented in the CKAN UI) is that datasets can apparently be versioned ... but I am not finding anything in the

Having said that, we could also see about trying to fix these revision issues in CKAN and in
Although we could leave the SHA256 unchanged in the metadata, doing so would undoubtedly create confusion down the road: folks would contact us asking why the computed SHA256 of the dataset they downloaded doesn't match the SHA256 stated in the metadata. So for the sake of future us, I think we should just stick with having the DOI and the checksum match the dataset. Conceptually, I kind of like the simplicity, too ... change the dataset, treat it like a new dataset.
@phargogh Thanks, James! I am in agreement that we should treat a changed dataset as a new dataset. You are right that we would want a matching SHA256. Overall it makes sense, and seems to be standard practice (if I understand your previous note correctly), to generate a new SHA256 and DOI for each new layer. I am interested in us exploring the versioning for DOIs, but that seems like a separate task from this one. All in all, with the context you've given, I am happy to move forward on using the SHA256 as the ID!
Related to the current approach for generating the sha256: we are currently hashing the full contents of each file, which means remote datasets have to be downloaded in full before they can be hashed.

I'm thinking of updating the unique identifier to be a hash of cheaper file attributes, such as the size and last-modified time, so that we can derive an ID without reading the whole file.

@phargogh Are there any other "fast-hash" approaches we should consider for a unique identifier?
Ooo yeah, that's an interesting design problem when we have remote datasets! The main challenge I see is that by changing how the checksum is defined, we're also making it harder to verify the file's integrity. Of course, this is totally fine for a unique ID! But it's problematic for file verification.

At least for the use case of integrity checking, I don't see a way to avoid computing some kind of checksum. For cloud-based files, though, it would be a pretty straightforward operation to compute the checksum in the cloud environment and so avoid downloading the whole file to the local computer. For example (from SO), one could execute this on a GCS VM:
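As a sketch of that kind of cloud-side hashing (an illustration of the idea, not the original SO snippet; the `google-cloud-storage` client is assumed, and the bucket and object names are hypothetical):

```python
import hashlib

from google.cloud import storage


def cloud_sha256(bucket_name: str, blob_name: str, chunk_size: int = 1 << 20) -> str:
    """Stream a GCS object and hash it without writing it to disk.

    Run this on a VM in the same region as the bucket so the bytes
    never leave the cloud network.
    """
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    digest = hashlib.sha256()
    with blob.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical names, for illustration:
# print(cloud_sha256("my-bucket", "datasets/awc.tif"))
```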
To directly answer your question, filesize and mtime are the kinds of cheap attributes I would consider for a fast identifier.
Cool, thanks for your thoughts! I think file integrity questions are beyond the scope of this project for now. And I think the unique identifier should be derived from attributes of the data, like size and mtime, so that it remains the same unless the data has changed.
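As a sketch of that idea (the attribute choice and token format are illustrative, not a settled design):

```python
import hashlib
import os


def fast_identifier(path: str) -> str:
    """Derive a stable ID from file attributes instead of file contents.

    Hashing only the size and mtime avoids reading (or downloading)
    the whole file, and the ID still changes whenever the data does.
    Note: this is a unique identifier, not an integrity check.
    """
    stat = os.stat(path)
    token = f"{os.path.basename(path)}:{stat.st_size}:{stat.st_mtime_ns}"
    return hashlib.sha256(token.encode("utf-8")).hexdigest()
```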
The catalog requires a unique code in order to add datasets from their metadata. While using the MCF, we implemented this with a UUID.

We would like to request that a UUID field be added to the frictionless schema for all data types.
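For illustration, a minimal sketch of what that might look like, assuming the UUID is carried as a custom `id` field in a frictionless-style descriptor (the field name and layout are hypothetical, not a confirmed part of the frictionless schema):

```python
import uuid

# Hypothetical frictionless-style descriptor; "id" is the proposed UUID field.
descriptor = {
    "name": "example-dataset",
    "id": str(uuid.uuid4()),  # illustrative placement of the requested UUID
    "title": "Example Dataset",
}
```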