-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reviewing StreamResource and StreamDatum from the storage, access perspective #296
Comments
I'm agnostic to the key name changes. Adding the I'm hesitant to give up all information about which parts of the uri are mounting / hosting details and which are "real". Although this came from a time when we had the same filesystems mounted under very different paths on different machines with much higher regularity, having the ability to "make the path relative" (for lack of a better term) seems generically useful (maybe we should have gone with a single integer rather than two parts of the path). However, given the growing usage of containers and the trivial ability to be re-mount any level of path on the host to any level of path in the container it may make sense to lift all of this up to be a client configuration problem (which opens the door to much more complex re-writing rules). "it has not been well used" is not a compelling argument to remove it (it is a compelling argument for better docs!), but "it does not actually solve the problem" is. |
I'm also agnostic to the key name changes, maybe leaning slightly in the direction of pro, especially for I'll hold my hands up and say I hadn't noticed that I agree with @tacaswell that the advent of containers is likely to make different path mounting more well-used. In fact I strongly encourage it because baked in assumptions of multiple machines using the same mount point has bitten me many times! Also happy with the {
"protocol": "file",
"location": "/data/foo/bar"
} But if |
Exactly. The existing names aren't bad, but if we can use widely-known names and standards, developers don't even have to notice them. Let's talk about the location of I like structure things, but if we care about covering {'scheme': 'file',
'netloc': 'localhost',
'path': '/a/b/c',
'params': '',
'query': '',
'fragment': ''} and I think the string form is probably the easiest thing to pass around in documents. Not a strongly-held opinion, though. |
I'm not too attached to structure either but I don't think the full URL spec is a terrible thing to pass around. I think it might be best to see if anyone does hold strong opinions either way. |
That is fair…and making me wonder whether Tiled should store the URI in a structured form at rest. That could enable indexing and more efficient filtering by scheme or netloc, which could come up… |
From discussion:
|
If we move the descriptor to |
Even so, think it is the lesser evil. We can discuss. There is a tension here between what is easy for the producer and what is simpler (and performant) for the consumer. |
I must confess to getting a bit lost among text descriptions of relational hierarchies, so I drew it out as a sanity check, does this look right? Top is current state, bottom is proposed change: Also tagging @coretl |
@tacaswell we already reduced the scope of I don't think I have many opinions either way here, both options are the same complexity to produce in ophyd-async, and not much different in the run bundler, so happy with either. |
Also tagging @DiamondJoseph for comment |
Filling in the motivation for my terse notes above: The Tiled adapter that loads Bluesky documents has to chase a lot of references, illustrated in Callum’s diagram, to answer the simple question, “What data do you have?” The overhead is especially unfortunate when it is asked to return pages of search results describing ~300 datasets at a time. Placing |
Also also tagging @garryod |
Discussed with TC just before lunch, we have assumptions about Descriptors which motivate keeping Descriptor reference on StreamDatum.
If a new Descriptor being emitted also emitted a new Resource: requires that the Device know when the Descriptor changes (for the DatumFactory), or that the Datums are modified to point to the correct Resource (although I suppose currently they have their Descriptor injected?) |
Do you mean keeping the Descriptor reference on the StreamDatum, or moving it to the StreamResource? I think you mean the former, just checking. |
Fair, so in a given run we are going to have multiple To be clear:
I think the following set of documents is valid (assuming only
That is we have two streams, one with A and B, one with C. In the primary stream we have two descriptors (because we changed config or something) and with in the first descriptor we had to generate a second StreamResource. I added levels of nesting for "stream" and "key" which are emergent / synthetic concepts....
TC is not unambiguous 😜 |
Maybe each TC should pick a UUID4 we can use. |
Yep, just completely said the opposite of what I meant. Edited the comment for posterity.
Would you believe we had 3 Joes in the group before I joined Diamond? I've been explicitly Joseph for years, but it's become almost a necessity. Too used to people signing comments with just initials -JW. |
[After reading through the schemas and the above discussion, most of my confusion has lifted. Here is what remains...] @tacaswell The document snippet in your example is really helpful for visualizing a possible outcome. In |
It is enforced by the run bundler that each key must have the same number of events and they have matching sequence numbers. It is also currently enforced by the run bundler that each StreamDatum for each key must be the same length as those for the other keys in the same descriptor. This is an implementation detail, but could be made a guarantee if helpful... |
Thanks, @coretl ! I think having it as an expectation (rather than a guarantee) is sufficient, and handled by the RunBundler makes sense. So, I would then say that the details under |
A proposal developed with @callumforrester today: Instead of specifying a file {
"root": "/a/b/"
"resource_path": "x/y/z"
} how about {
"uri": {
"scheme": "file",
"path": ["a", "b", "x", "y", "z"],
"netloc": "localhost"
"query_params": [],
}
"facets": [
{"segments": {"start": 0, "end": 2}, "feature": "root"}
]
} That is, the URI is given as an object, with a side band ("facets") explaining the semantics of certain segments of the path. This enables:
I learned about a "facets" comment from a UI dev working for the other Bluesky, explaining why Bluesky posts don't just use Markdown. The requirements resonate with ours. A developer can freely ignore the facets if they don't (and don't want to) understand them, but they are can be used to give extra meaning to the path. |
For completeness I should mention that @tacaswell originally considered something like:
which, similarly uses a side-band. The "facets" proposal above is more flexible in that it can describe any segment and more than one. |
Coming late to the party here, but I've got a few thoughts from a consumer's perspective... Using URIs to specify resource locations is ideal but they should be encoded as described in IETF RFC 3986 (i.e. a string) - encoding them as any kind of struct will require users to write custom deserialization code when Absolute file paths are impossible where shared file systems are concerned as the file system can - and will be - be mounted at an arbitrary path. As such it only makes sense to use relative paths when describing the location of files and have the base directory agreed out of band - or perhaps with an additional field dedicated to this purpose though this has us defining a higher level protocol for resolving file systems. |
I like that argument for URI as string. |
I'm really glad you chimed in @garryod. Musing on:
...that is true at NSLS-II as well (and has been a source of many headaches). From RFC 8089, the "file" URI scheme:
with examples including I wonder if we can go with URIs like:
For testing, dev, and simple deployments with local storage, we can just use
The out of band mechanism could, for example, be local configuration mapping the domain in the URI to the local mount point, e.g. mounts:
- domain: data1.nsls2.bnl.gov
root: /nsls2/data1 I like that this would use existing standards, a single string URI, and no additional fields in the document that would require explanation/documentation. |
Hadn't come across this before, but in the absence of any better suggestion it seems like the way to go even if it's only currently a Proposed Standard.
Feels like there should be an of the shelve way of doing this - like the |
I can't find anything either. Since this aspect is read-time configuration, not part of the document stream or storage at rest, it would be comparatively easy to change it later. The file URI aspect is the part we really need to get right from the start. |
This is exactly the reason why we have
I hope we can do that re-writing at the "I'm consuming documents and pushing them into tiled" stage and not at the "I generated these in ophyd" stage as I do not think we want to inject the details of mounting into the ophyd devices (as reflecting on how root/root_map has worked at NSLS-II the main problem is we lost control of the root being set correctly (because it was done at the ophyd layer on each device in each profile)). |
The Using the hostname of the storage array is unambiguous. When possible, I think it would be best to set the URI from the start in ophyd, so that streaming consumers that happen to be on different hosts (and may have a different view of the storage) can work with it. If the engineer configuring ophyd just hands us an absolute path, we can emit |
Offline, @tacaswell argue that if ophyd puts the storage host in the URI, and then storage changes, we have to update all the "clients" (ophyd config). I think he's convinced me that ophyd should just emit But there is room for variability here: this is just an argument at the level of what to recommend. |
I may be missing something here, but how will the downstream consumer (tiled?) know what to rewrite them to without knowledge of how filesystems are mounted on each and every machine a ophyd device runs on? Or is the thought that this is less ownerous than having each ophyd device configured with this? |
I think that's the idea, yeah. I'm not 100% sold either way on this. |
I'd like to try to converge on a proposal:
Example: {
"name": "stream_resource",
"doc": {
"uid": "158167d9-796b-4e12-afff-5b3dd903815c",
"data_key": "waxs-sum",
"mimetype": "application/x-hdf5",
"uri": "file://gpfs.diamond.ac.uk/dls/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5",
"parameters": {
"path": "/entry/instrument/NDAttributes/StatsTotal",
"swmr": true,
},
"run_start": "617184e5-5e24-40b1-80f7-0056d51e44f8"
}
} In the end, I believe no one is advocating for changed to stream datum. That would remain as: {
"name": "stream_datum",
"doc": {
"stream_resource": "158167d9-796b-4e12-afff-5b3dd903815c",
"uid": "158167d9-796b-4e12-afff-5b3dd903815c/2",
"seq_nums": {
"start": 6,
"stop": 7
},
"indices": {
"start": 5,
"stop": 6
},
"descriptor": "8ff32942-9b21-4542-aee1-16c04291950d"
}
} Out of scope for this stream is whether to rename this document (as @coretl proposed). That would affect the |
I'm enthusiastic about removing path_semantics, renaming spec and resource_kwargs. I'm on board with renaming root (our devices should always be mounting either I'm opposed to renaming stream_resource and stream_datum: I think it's clear (to me) what they mean and how they relate to existing documents, without being easily confused with them. |
Help me to avoid losing focus here. I'm thinking of area detectors with their own filesystem that are not mounted on the host where the event-model process is running. |
I think the actual problem with the current scheme is that we left it up to the ohpyd classes where to do the root / resource path split (and to do the windows <-> posix remapping) and in general it was in the wrong place so if we move to using a uri but don't change where it is configured I do not think we will do anything other than change the spelling (if you use If we continue let ophyd/RE hold the mapping (but spell it with in-band encoding) then the following story will happen:
Not exactly the same problems we have now, but I think they will be terrible in about the same ways. Hence, I think the mapping from We need to make sure that where ever we land on this we don't make the low-infrastructure use case (dev, testing, small deployments) does not get needlessly complicated. |
@prjemian Good point of clarification! I think we can stipulate that if ophyd hands us a It's yet to be settled in the discussion where/how we should recommend that this gets mapped into a portable representation ( |
@tacaswell This is exactly why we can never buy data2 in our current scheme (GDA et al.) |
How about ophyd produces URIs like |
My understanding of |
All our IOCs are in DNS, but the vxWorks ones don't know anything about it... So we can get IP address out of cainfo, reverse lookup the name in DNS, then put that in the URI. Probably not the end of the world if the entry is not in DNS and IP addresses get into the URI though? |
Yes, an IP address in the URI is fine. |
During the discussions that developed StreamResource and StreamDatum documents, the "producer" side was being built, but much of the "consumer" side remained to be developed. I think we agreed to stay open to adjustments when the time came to build out the consumer side. I hope that window is still open.
Here's a StreamResource from DLS I22, courtesy of @DiamondJoseph:
I can see nothing to change about these:
uid
, unique key for the documentdata_key
, tells us which field (column) in the Event Stream this maps torun_start
, tells us which BlueskyRun this belongs toI might suggest re-evaluating that names of:
spec
, a key in a registry that tells us which reader ("handler") to useresource_kwargs
, additional configuration for the readerIn Tiled, we use MIME types, including standard ones such as
image/tiff
and custom ones named likeapplication/x-hdf5-smwr-slice
, in the role ofspec
. Maybe it's worth considering whether we want to usemimtype
here, as it is a recognized standard way to spell, "This is a format, which tells you which I/O code to read it."Taking a broader-than Python look at the world,
resource_kwargs
(emphasis on "kwargs") is a bit of a Python-ism. These are JSON-serializable parameters that are needed by the code that reads the data. This code does not necessary have to be in Python. In Tiled, the analogous entity is simply calledparameters
.The existing names aren't doing any harm, so we have to weigh the pain of change against the marginal benefit of maybe-better names. If we leave them as is, I think that would be fine. Just worth considering.
The way we spell the location of the data seems more problematic:
root
, an absolute filepathresource_path
, a relative filepath (sometimes filename)path_semantics
, a two-member enum that iswin
orposix
I think we should consider replacing all of this with a URI, like
file://localhost/dls/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5"
.file:
, such ass3:
.root
andresource_path
is a guess about what parts of the path will be stable over time and what parts may change as the data moves. Different people make different guesses, and at NSLS-II this has been a mess.path_semantics
is completely vestigial, based on a misunderstanding that Windows required backslashes in paths. In fact, anything newer than Windows 95 is fine with forward slashes.In Tiled, the same logical resource may be available in multiple places. This is expressed as a one-to-many relation between logical datasets and URIs (effectively). If the StreamResource just had a URI, this would be simpler and more future-proof.
Here is a StreamDatum:
The
descriptor
is on StreamDatum instead of on StreamResource withdata_key
. I think it's worth revisiting whether we really really need that, because life would be simplify if thedescriptor
anddata_key
were together on the same document.The text was updated successfully, but these errors were encountered: