-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Have maximum length on workspace_id #553
Have maximum length on workspace_id #553
Conversation
This helps applications to not make assumptions that are false. 64 characters is a common limit that is not too restrictive for users/use-cases. In most cases special characters are not allowed at the start of identifiers. By having these constraints it is easier to map workspace_id's to other resources that might impose similar constraints.
This is just the default for IDs in openEO (except for Collections afaik). As this is an API, the allowed characters are inspired by the allowed characters that don't get encoded in URL path parts (see https://datatracker.ietf.org/doc/html/rfc3986#section-2.3). Did you do a survey of common cloud providers and what their limits and restrictions are? I'd prefer to have some evidence rather than making decisions here on personal preferences. Note: A change would also need to be applied in https://github.com/Open-EO/openeo-processes/blob/draft/proposals/export_workspace.json PS: Although the API allows it, back-ends can restrict it further if needed. If for example someone wants to create a workspace on e.g. Amazon and let's say Amazons max length is 32 chars, you can surely reject longer values when a user submits them although not restricted by the API. |
Also avoid additional dots as it would give problems if virtual hosting is ever needed.
f8351f1
to
ff7d3b4
Compare
Thank you for the quick feedback. So it is not just personal preference as it is driven by experience. For the 3 biggest Cloud Providers:
I appreciate that backends can be more restrictive than what is in the API spec but it takes away a lot of the benefits of using an open API. In particular:
A careful look at Object storage containers for the aforementioned hyperscalers will show that they actually following DNS naming or a subset of it. As it does give them the advantage to avoid hot points in their APIs. Suppose you are an implementer of an OpenEO platform and you provide workspaces of which some are extremely heavy usage. Your workspace API interactions can scale indepently if the workspace can be part of the hostname. Which is not possible as long as a tilde and underscore are allowed in the name because you will always need to go via a L7-aware routing component. I pushed a commit to illustrate what it then look like as it would require dropping underscore and uppercase characters. It would also be best to drop dots '.' as it also messes with virtual hosting. Dropping upper case characters wouldn´t be strictly needed if the backend makes sure a workspace_id can only occur once while ignoring cases as DNS names do resolve regardless of casing. While it does seem restrictive it is what is used for bucket/container names for object stores by these hyperscalers where these names must be unique (for Azure and GCP I don't know by heart where the uniqueness ends but AWS is by far the biggest object storage provider and for them these bucket names need to be globally unique) PS: I'd be happy to create a PR to change https://github.com/Open-EO/openeo-processes/blob/draft/proposals/export_workspace.json#L24 if you find the above data convincing and do not have additional concerns |
Thanks for the overview. I didn't realize in my first post that the workspace_id can't be chosen by users, but is generated by the backend. As such, there's not really an API issue here because the backends can just return IDs that comply to their requirements. If you return workspace IDs that always comply to Also, I see that you directly map workspace id to the bucket name. That's not a given though. The workspace identifier may actually not be a direct mapping to the bucket name. If a backend implements multiple providers, you'd probably want to generate identifiers to avoid conflicts. For future-proofness, you should consider avoiding the direct mapping. This would also avoid many of the issues that you mentioned. It feels the PR ties too closely to some bigger providers here (and your internal requirements). We had something similar in STAC and then got taught that there are other providers evolving and had to redo the whole storage extension because we tied it too closely to AWS/GCS/Azure. Given that backends can decide how to produce the IDs, I tend towards closing this PR. |
It is actually now that you point that out that I realize that as well workspace title <> workspace id (In my head they were the same but seems that is not up to spec). On workspace title there is no restriction but my above remarks would actually mostly apply to user input. I used the buckets mostly to showcase API design in a similar problem space where these constraints are imposed. We will actually be supporting multiple providers and in our POC code we do track bucket name separate from workspace id but as we cannot make assumptions on bucket names it would be nice to be able to make assumptions on the workspace id. Now that it differs from title that is indeed the case so then there is not really a concern for workspace id as we will anyway need to keep extra state to map workspace id and workspace title.
I disagree that this type of change would bind to a provider. The inputs for this API are to serve this API and that is where it ends how interaction is done with storage/cloud providers that is up to the implementer of the backend. In this particular API the workspace provider parameters should be used for inputs that are special due to backend implementation. Things like workspace name and description that belong to the API should imho be better of with clearer constraints on size and allowed characters. I agree that this PR should be closed as imposing constraints should be done holistically. PS: Every API implementation must implement a maximum size on user provided input otherwise it will have a bug because no API can work with unbound inputs and if no special care is taken it gives birth to security issues as well. So If an API can come up with constraints that impose little to no limitations to usage of the service that is a big plus. |
This helps applications to not make assumptions that are false. 63 characters is a common limit for naming (e.g. DNS parts) that is not very restrictive for users/use-cases. In most cases special characters are not allowed at the start of identifiers.
By having these constraints implementations can make assumptions and do bounds checking and it is easier to map workspace_id's to other resources that might impose similar constraints.
@m-mohr I am not sure if there is a specific reason for including a tilde-symbol if not I'd have a slight preference to not include that one either given that it is not common for naming and could also not be available for other resources that one might want to create as part of their workspace implementation.