Have maximum length on workspace_id #553

pvbouwel · 2024-12-18T08:37:45Z

This helps applications to not make assumptions that are false. 63 characters is a common limit for naming (e.g. DNS parts) that is not very restrictive for users/use-cases. In most cases special characters are not allowed at the start of identifiers.

By having these constraints implementations can make assumptions and do bounds checking and it is easier to map workspace_id's to other resources that might impose similar constraints.

@m-mohr I am not sure if there is a specific reason for including a tilde-symbol if not I'd have a slight preference to not include that one either given that it is not common for naming and could also not be available for other resources that one might want to create as part of their workspace implementation.

This helps applications to not make assumptions that are false. 64 characters is a common limit that is not too restrictive for users/use-cases. In most cases special characters are not allowed at the start of identifiers. By having these constraints it is easier to map workspace_id's to other resources that might impose similar constraints.

m-mohr · 2024-12-18T11:06:37Z

This is just the default for IDs in openEO (except for Collections afaik). As this is an API, the allowed characters are inspired by the allowed characters that don't get encoded in URL path parts (see https://datatracker.ietf.org/doc/html/rfc3986#section-2.3).

Did you do a survey of common cloud providers and what their limits and restrictions are? I'd prefer to have some evidence rather than making decisions here on personal preferences.

Note: A change would also need to be applied in https://github.com/Open-EO/openeo-processes/blob/draft/proposals/export_workspace.json

PS: Although the API allows it, back-ends can restrict it further if needed. If for example someone wants to create a workspace on e.g. Amazon and let's say Amazons max length is 32 chars, you can surely reject longer values when a user submits them although not restricted by the API.

Also avoid additional dots as it would give problems if virtual hosting is ever needed.

pvbouwel · 2024-12-18T13:05:10Z

Thank you for the quick feedback. So it is not just personal preference as it is driven by experience. For the 3 biggest Cloud Providers:

AWS
- Buckets have a limit of 63 characters and do not allow tilde in their name.
- There is no nice overview page for AWS but from experience I know most resources do not allow tilde (I actually am not aware of a resource that does allow it). The most common way of identifying resources is an ARN and for IAM paths used in ARNs you cannot use tilde. The length of resource names varies depending on the resource (roles can be 64 characters for example)
Azure:
- object storage container name is limited to 63 characters and disallows tilde
- General naming does not allow tilde
GCP:
- bucket names are 63 characters max if no . when they have dots it can be up to 222 characters but each part between dots must not exceed 63 characters
- Resource names should only use characters available in DNS names

I appreciate that backends can be more restrictive than what is in the API spec but it takes away a lot of the benefits of using an open API. In particular:

API is no longer uniform unless you check backend provider documentation on their implementation you might need to issue a few requests and action the return errors before being able to successfully use the API
Users could get frustrating issues when moving between platforms (code against the OpenAPI is not lift & shift compatible)
If tooling is built by implementers of a backend it is easy to get assumptions of the backend in the tooling and having it break against other backends. While arguing that it is a bug in an implementation a clear API spec would help protect against that and potentially helping users with a more exotic backend provider.

A careful look at Object storage containers for the aforementioned hyperscalers will show that they actually following DNS naming or a subset of it. As it does give them the advantage to avoid hot points in their APIs. Suppose you are an implementer of an OpenEO platform and you provide workspaces of which some are extremely heavy usage. Your workspace API interactions can scale indepently if the workspace can be part of the hostname. Which is not possible as long as a tilde and underscore are allowed in the name because you will always need to go via a L7-aware routing component.

I pushed a commit to illustrate what it then look like as it would require dropping underscore and uppercase characters. It would also be best to drop dots '.' as it also messes with virtual hosting.

Dropping upper case characters wouldn´t be strictly needed if the backend makes sure a workspace_id can only occur once while ignoring cases as DNS names do resolve regardless of casing.

While it does seem restrictive it is what is used for bucket/container names for object stores by these hyperscalers where these names must be unique (for Azure and GCP I don't know by heart where the uniqueness ends but AWS is by far the biggest object storage provider and for them these bucket names need to be globally unique)

PS: I'd be happy to create a PR to change https://github.com/Open-EO/openeo-processes/blob/draft/proposals/export_workspace.json#L24 if you find the above data convincing and do not have additional concerns

m-mohr · 2024-12-18T14:17:27Z

Thanks for the overview. I didn't realize in my first post that the workspace_id can't be chosen by users, but is generated by the backend. As such, there's not really an API issue here because the backends can just return IDs that comply to their requirements. If you return workspace IDs that always comply to ^[a-z0-9][a-z0-9\-]{0,62}$, you'll never actually face other IDs anyway (unless users provide invalid IDs, but those return a 404 anyway).

Also, I see that you directly map workspace id to the bucket name. That's not a given though. The workspace identifier may actually not be a direct mapping to the bucket name. If a backend implements multiple providers, you'd probably want to generate identifiers to avoid conflicts. For future-proofness, you should consider avoiding the direct mapping. This would also avoid many of the issues that you mentioned.

It feels the PR ties too closely to some bigger providers here (and your internal requirements). We had something similar in STAC and then got taught that there are other providers evolving and had to redo the whole storage extension because we tied it too closely to AWS/GCS/Azure. Given that backends can decide how to produce the IDs, I tend towards closing this PR.

pvbouwel · 2024-12-18T16:04:39Z

It is actually now that you point that out that I realize that as well workspace title <> workspace id (In my head they were the same but seems that is not up to spec). On workspace title there is no restriction but my above remarks would actually mostly apply to user input.

I used the buckets mostly to showcase API design in a similar problem space where these constraints are imposed. We will actually be supporting multiple providers and in our POC code we do track bucket name separate from workspace id but as we cannot make assumptions on bucket names it would be nice to be able to make assumptions on the workspace id. Now that it differs from title that is indeed the case so then there is not really a concern for workspace id as we will anyway need to keep extra state to map workspace id and workspace title.

It feels the PR ties too closely to some bigger providers here (and your internal requirements). We had something similar in
STAC and then got taught that there are other providers evolving and had to redo the whole storage extension because we tied it too closely to AWS/GCS/Azure. Given that backends can decide how to produce the IDs, I tend towards closing this PR.

I disagree that this type of change would bind to a provider. The inputs for this API are to serve this API and that is where it ends how interaction is done with storage/cloud providers that is up to the implementer of the backend. In this particular API the workspace provider parameters should be used for inputs that are special due to backend implementation. Things like workspace name and description that belong to the API should imho be better of with clearer constraints on size and allowed characters.

I agree that this PR should be closed as imposing constraints should be done holistically.

PS: Every API implementation must implement a maximum size on user provided input otherwise it will have a bug because no API can work with unbound inputs and if no special care is taken it gives birth to security issues as well. So If an API can come up with constraints that impose little to no limitations to usage of the service that is a big plus.

make sure workspace id is DNS compliant

ff7d3b4

Also avoid additional dots as it would give problems if virtual hosting is ever needed.

pvbouwel force-pushed the suggestion1/enforce_maximum_length_on_workspace_id branch from f8351f1 to ff7d3b4 Compare December 18, 2024 12:55

m-mohr requested review from GeraldIr and m-mohr December 18, 2024 14:19

pvbouwel closed this Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Have maximum length on workspace_id #553

Have maximum length on workspace_id #553

pvbouwel commented Dec 18, 2024

m-mohr commented Dec 18, 2024 •

edited

Loading

pvbouwel commented Dec 18, 2024

m-mohr commented Dec 18, 2024 •

edited

Loading

pvbouwel commented Dec 18, 2024

Have maximum length on workspace_id #553

Have maximum length on workspace_id #553

Conversation

pvbouwel commented Dec 18, 2024

m-mohr commented Dec 18, 2024 • edited Loading

pvbouwel commented Dec 18, 2024

m-mohr commented Dec 18, 2024 • edited Loading

pvbouwel commented Dec 18, 2024

m-mohr commented Dec 18, 2024 •

edited

Loading

m-mohr commented Dec 18, 2024 •

edited

Loading