Metadata Specification Proposal #17

ZacharyWills · 2023-07-18T19:39:56Z

Adding spec, notes and some examples for semi-reproducible cataloging of a RUN. This is a relatively flexible item, since no automation has been produced yet. Eventually existing tooling will make this spec more difficult to change.

Open to comments and discussion. As per CIROH Meeting on July 18th 2023.

Adding spec, notes and some examples for semi-reproducible cataloging of a RUN. This is a relatively flexible item, since no automation has been produced yet. Eventually existing tooling will make this spec more difficult to change. Open to comments and discussion. As per CIROH Meeting on July 18th 2023.

ZacharyWills · 2023-07-18T19:44:57Z

@jameshalgren Can we add as many people as we can to consider how to tackle this? It's also related to Andy Wood & the protocols as model run diffs would be good for the proposal stage before they were run or to reproduce.

ZacharyWills · 2023-07-24T17:37:52Z

A good bit of feedback I got that I want to record here: There should be a way to mark a specific hash and file as a "benchmark" or part of Operational workflows.

arpita0911patel

Zach, possible to update README.md so that the existing content is still there along with the new?

ZacharyWills · 2024-01-02T23:10:40Z

Quick update to this:

In pursuit of the goal of having each run output a metadata file, Austin has made us a nice repo for generating the Merkel Tree: https://github.com/aaraney/ht

Adding that after validation, then we can add comparison once the data stream is up and running.

I'll add the spec for the configs, the configs, and their locations later on once the stream is up and running.

Thanks,
Zach

aaraney

Thanks for writing this up, @ZacharyWills! I left a fair number of comments that hopefully with further the conversation in a positive direction! Look forward to hearing back from you!

README.md

aaraney · 2024-01-10T14:42:54Z

README.md

-and copy the path
+###Reproducing
+
+Get the files from the appropriate bucket & RUN_CONFIG


I likely just dont know the context, so this might be a moot point, but what is a RUN_CONFIG? Can we just remove it? If not, can we provide some supporting text to say what a RUN_CONFIG is?

aaraney · 2024-01-10T14:48:51Z

README.md

+
+Get the files from the appropriate bucket & RUN_CONFIG
+```
+wget --no-parent https://awi-ciroh-ngen-data.s3.us-east-2.amazonaws.com/AWI_001/AWI_03W_113060_001.tar.gz .


I would suggest using curl instead of wget. curl is generally available by default.

aaraney · 2024-01-10T15:02:34Z

notes.md

@@ -0,0 +1,23 @@
+# Notes on a "NextGen Universal Package"
+
+There's been some discussion in the community about the combinatorial effects of opening up the possibility to build ever-more-complicated systems while making it more difficult to resproduce results in the community, or to share results for integration into our collective understanding of the world. 


small typo! resproduce -> reproduce

aaraney · 2024-01-10T15:06:11Z

notes.md

Not sure if this file was intended to make it into the proposal? It seems like some of this content could be reshaped a bit and make its way into the README instead? Maybe we do want a living notes document though. Although, a github discussion thread might be more appropriate and accomplish the same goal. Thoughts?

aaraney · 2024-01-10T15:53:11Z

Specification.md

General comments:

probably want to add some kind of spec version identifier. or, more desirably, when we are happy with the spec, lock it down and guarantee that we will never remove a field.

I really like that you havent said anything about it being in JSON or specifying the format. I think we should harp on that being a design goal (if it is one). So kind of like schema.org, where you can choose to represent this information in a variety of formats. While that might cause pain points early on when, for example, format translation tooling isnt great, I think in the long run less format coupling is desirable.

Consider adding another top level metadata key (or record, whatever you want to call it) and add hash as a key with its record containing algo or type. So, in json, something like:

{ "meta": { "hash": { "algo": "sha256" } } }

This will provide a way to add new metadata fields in the future (if desired) and make it clear what hash algorithm is being used. I can see in the future needing to move a different hashing algorithm.

As for general formatting, what are your thoughts on switching to a table format similar to schema.org? A schema.org Thing is a good enough example. In our context, that could look like:

Model

Property Expected Type Description

realization Realization NextGen framework configuration file that specifies the unit(s) of work for a model engine run.

configuration Configuration BMI model formulation configuration file

I figure Specification.md will undergo a fair amount of change as we iron out the details for records themselves. So, I am not going to go overboard commenting about it.

aaraney · 2024-01-10T15:54:47Z

Specification.md

@@ -0,0 +1,26 @@
+--- CONFIG START ---


I think we can remove this. It kind of implies that the listed fields need to be ordered (although, I think that is a stretch).

aaraney · 2024-01-10T17:24:15Z

README.md

+	"RUN_CONFIG": {
+		"description": "Run Config for AWI_03W_113060_001",
+		"Model": {
+			"realization": {


Thanks for getting this going, @ZacharyWills! Having now stared at this for 30 minutes and gone back and forth about what comments I want to make, it seems clear to me that we could solve this problem in a lot of different ways. But, as will all solutions, they all have pros and cons. My biggest take away so far is that this initial proposal should set up future proposals for success. I think the way we do that is by going with the simplest data model that does not limit future additions in places where we want expansion.

To get to that place, I think we first need some guard rails on what the spec is not. I figure we wont be able to determine all the things it is to start, without using it, so it is likely easier to say what it isn't. So kind of like goals, but anti-goals if you will haha. The purpose of saying what the spec isn't is to create consensus and avoid scope creep and have something to point to when suggestions are brought to the table. Please disagree and challenge me on this. It may just be an idea in my head and not a good one!

My general thoughts so far are (numbered, so it is easier to respond. Not ordered):

I think we probably want to create a key within Model that is either files, assets, configuration (or something) that is a list of realization and configuration record types. This will allow specifying the realization file and all other BMI init config / other supporting files and their details.

I like the distinction of configuration, forcing, and hydrofabric. However, im not sold on realization needing its own category. To me, its just another configuration file. I think we could instead just have a configuration record type that has Name: str, Store: "fs" | "os" (this could be confusing, "filesystem" or "bucket" is probably fine), Path: str, Hash: str, Variant: "unknown" | "realization" | "cfe" | .... This way we can treat all configuration files the same.

Im not sold on Hydrofabric living within Forcing. As you mention in notes.md:

Firstly there must be a heirarchical underestanding of how we represent data and configurations...

I don't see it the same. I think you can make the case that you can have forcing over a given hydrofabric domain, so Forcing should live within Hydrofabric. I am just not sold that the relationship between the two needs hierarchy. Why not just have Hydrofabric and Forcing as top level keys? My same comment about having files or assets as a subkey of Forcings and Hydrofabric applies here too.
4. Where do files like stream flow nudging or restart fall into this?
5. Pedantic: I am not a huge fan of RUN_CONFIG. We don't use upper case snake case anywhere else, so I think RunConfig or run_config are preferred. Even just Config or another single word key would be "nicer." Thoughts?
6. Pedantic: use the same case format across keys (i.e. model and realization, not Model and realization). My personal preference is lowercase snake case (i.e. run_config), but I definitely have a bias from my python background lol. Point being, we should pick something and stick with it.
7. I mention this in a later comment in notes.md, but we probaly want to introduce a top level Metadata or Meta as a top level key and add a hash key with its record containing Algo or Type. So, in json, something like:

{ "Meta": { "Hash": { "Algo": "sha256" } } }

So the hierarchy is based off of the assumption that things within the workflow are somewhat dependent (generating forcings for only your selected hydrofabric is more efficient than always generating them everywhere) but there's a lot of discussion about that going on now here: lynker-spatial/hfsubsetCLI#28

I totally agree on the syntax changes and I'll put some work in to fix the proposal to reflect those insights. For the most part I just wanted something to get the ball rolling on how AWI/CIROH-members/OWP//Ops will work with these run artifacts and the relationship between their configs and the desired outputs. That's ultimately the goal of the metadata is to facilitate that discussion at first, then later on when we're more practiced we can start to talk about how to transition things.

Thanks again for taking a look at this and any input is so valuable at this point to get us going from 0 -> something.

aaraney · 2024-01-10T17:41:55Z

notes.md

+
+The first step to building that collective understanding is based on having the same interpretation and categorization of the data we have and will produce. 
+
+Firstly there must be a heirarchical underestanding of how we represent data and configurations. Note that these aren't 1:1 with the data and configurations, these are for the heirarchical framing of how those are organized. 


small typo! underestanding -> understanding

Co-authored-by: Austin Raney <[email protected]>

arpita0911patel · 2024-10-11T19:33:59Z

Uploading the discussion from working group 2 slack channel regarding model_type_name in realization file.

ZacharyWills requested a review from jameshalgren July 18, 2023 19:40

ZacharyWills mentioned this pull request Jul 18, 2023

Add a 'RUN' wrapper to capture all inputs a level above 'realization' #16

Open

arpita0911patel reviewed Aug 7, 2023

View reviewed changes

JordanLaserGit mentioned this pull request Oct 30, 2023

Forcing path override #39

Closed

aaraney suggested changes Jan 10, 2024

View reviewed changes

Update README.md

5277014

Co-authored-by: Austin Raney <[email protected]>

arpita0911patel requested a review from JoshCu February 20, 2024 16:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metadata Specification Proposal #17

Metadata Specification Proposal #17

ZacharyWills commented Jul 18, 2023

ZacharyWills commented Jul 18, 2023

ZacharyWills commented Jul 24, 2023

arpita0911patel left a comment •

edited

Loading

ZacharyWills commented Jan 2, 2024

aaraney left a comment

aaraney Jan 10, 2024

aaraney Jan 10, 2024

aaraney Jan 10, 2024

aaraney Jan 10, 2024

aaraney Jan 10, 2024

aaraney Jan 10, 2024

aaraney Jan 10, 2024

aaraney Jan 10, 2024

aaraney Jan 10, 2024

ZacharyWills Jan 11, 2024

aaraney Jan 10, 2024

arpita0911patel commented Oct 11, 2024 •

edited

Loading

		@@ -0,0 +1,23 @@
		# Notes on a "NextGen Universal Package"

		There's been some discussion in the community about the combinatorial effects of opening up the possibility to build ever-more-complicated systems while making it more difficult to resproduce results in the community, or to share results for integration into our collective understanding of the world.

Property	Expected Type	Description
`realization`	`Realization`	NextGen framework configuration file that specifies the unit(s) of work for a model engine run.
`configuration`	`Configuration`	BMI model formulation configuration file


		The first step to building that collective understanding is based on having the same interpretation and categorization of the data we have and will produce.

		Firstly there must be a heirarchical underestanding of how we represent data and configurations. Note that these aren't 1:1 with the data and configurations, these are for the heirarchical framing of how those are organized.

Metadata Specification Proposal #17

Are you sure you want to change the base?

Metadata Specification Proposal #17

Conversation

ZacharyWills commented Jul 18, 2023

ZacharyWills commented Jul 18, 2023

ZacharyWills commented Jul 24, 2023

arpita0911patel left a comment • edited Loading

Choose a reason for hiding this comment

ZacharyWills commented Jan 2, 2024

aaraney left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Model

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arpita0911patel commented Oct 11, 2024 • edited Loading

arpita0911patel left a comment •

edited

Loading

arpita0911patel commented Oct 11, 2024 •

edited

Loading