Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have maximum length on workspace_id #553

Conversation

pvbouwel
Copy link

This helps applications to not make assumptions that are false. 63 characters is a common limit for naming (e.g. DNS parts) that is not very restrictive for users/use-cases. In most cases special characters are not allowed at the start of identifiers.

By having these constraints implementations can make assumptions and do bounds checking and it is easier to map workspace_id's to other resources that might impose similar constraints.

@m-mohr I am not sure if there is a specific reason for including a tilde-symbol if not I'd have a slight preference to not include that one either given that it is not common for naming and could also not be available for other resources that one might want to create as part of their workspace implementation.

This helps applications to not make assumptions that are false. 64 characters is a common limit that is not too restrictive for users/use-cases. In most cases special characters are not allowed at the start of identifiers.

By having these constraints it is easier to map workspace_id's to other resources that might impose similar constraints.
@m-mohr
Copy link
Member

m-mohr commented Dec 18, 2024

This is just the default for IDs in openEO (except for Collections afaik). As this is an API, the allowed characters are inspired by the allowed characters that don't get encoded in URL path parts (see https://datatracker.ietf.org/doc/html/rfc3986#section-2.3).

Did you do a survey of common cloud providers and what their limits and restrictions are? I'd prefer to have some evidence rather than making decisions here on personal preferences.

Note: A change would also need to be applied in https://github.com/Open-EO/openeo-processes/blob/draft/proposals/export_workspace.json

PS: Although the API allows it, back-ends can restrict it further if needed. If for example someone wants to create a workspace on e.g. Amazon and let's say Amazons max length is 32 chars, you can surely reject longer values when a user submits them although not restricted by the API.

Also avoid additional dots as it would give problems if virtual hosting
is ever needed.
@pvbouwel pvbouwel force-pushed the suggestion1/enforce_maximum_length_on_workspace_id branch from f8351f1 to ff7d3b4 Compare December 18, 2024 12:55
@pvbouwel
Copy link
Author

Thank you for the quick feedback. So it is not just personal preference as it is driven by experience. For the 3 biggest Cloud Providers:

I appreciate that backends can be more restrictive than what is in the API spec but it takes away a lot of the benefits of using an open API. In particular:

  • API is no longer uniform unless you check backend provider documentation on their implementation you might need to issue a few requests and action the return errors before being able to successfully use the API
  • Users could get frustrating issues when moving between platforms (code against the OpenAPI is not lift & shift compatible)
  • If tooling is built by implementers of a backend it is easy to get assumptions of the backend in the tooling and having it break against other backends. While arguing that it is a bug in an implementation a clear API spec would help protect against that and potentially helping users with a more exotic backend provider.

A careful look at Object storage containers for the aforementioned hyperscalers will show that they actually following DNS naming or a subset of it. As it does give them the advantage to avoid hot points in their APIs. Suppose you are an implementer of an OpenEO platform and you provide workspaces of which some are extremely heavy usage. Your workspace API interactions can scale indepently if the workspace can be part of the hostname. Which is not possible as long as a tilde and underscore are allowed in the name because you will always need to go via a L7-aware routing component.

I pushed a commit to illustrate what it then look like as it would require dropping underscore and uppercase characters. It would also be best to drop dots '.' as it also messes with virtual hosting.

Dropping upper case characters wouldn´t be strictly needed if the backend makes sure a workspace_id can only occur once while ignoring cases as DNS names do resolve regardless of casing.

While it does seem restrictive it is what is used for bucket/container names for object stores by these hyperscalers where these names must be unique (for Azure and GCP I don't know by heart where the uniqueness ends but AWS is by far the biggest object storage provider and for them these bucket names need to be globally unique)

PS: I'd be happy to create a PR to change https://github.com/Open-EO/openeo-processes/blob/draft/proposals/export_workspace.json#L24 if you find the above data convincing and do not have additional concerns

@m-mohr
Copy link
Member

m-mohr commented Dec 18, 2024

Thanks for the overview. I didn't realize in my first post that the workspace_id can't be chosen by users, but is generated by the backend. As such, there's not really an API issue here because the backends can just return IDs that comply to their requirements. If you return workspace IDs that always comply to ^[a-z0-9][a-z0-9\-]{0,62}$, you'll never actually face other IDs anyway (unless users provide invalid IDs, but those return a 404 anyway).

Also, I see that you directly map workspace id to the bucket name. That's not a given though. The workspace identifier may actually not be a direct mapping to the bucket name. If a backend implements multiple providers, you'd probably want to generate identifiers to avoid conflicts. For future-proofness, you should consider avoiding the direct mapping. This would also avoid many of the issues that you mentioned.

It feels the PR ties too closely to some bigger providers here (and your internal requirements). We had something similar in STAC and then got taught that there are other providers evolving and had to redo the whole storage extension because we tied it too closely to AWS/GCS/Azure. Given that backends can decide how to produce the IDs, I tend towards closing this PR.

@m-mohr m-mohr requested review from GeraldIr and m-mohr December 18, 2024 14:19
@pvbouwel
Copy link
Author

It is actually now that you point that out that I realize that as well workspace title <> workspace id (In my head they were the same but seems that is not up to spec). On workspace title there is no restriction but my above remarks would actually mostly apply to user input.

I used the buckets mostly to showcase API design in a similar problem space where these constraints are imposed. We will actually be supporting multiple providers and in our POC code we do track bucket name separate from workspace id but as we cannot make assumptions on bucket names it would be nice to be able to make assumptions on the workspace id. Now that it differs from title that is indeed the case so then there is not really a concern for workspace id as we will anyway need to keep extra state to map workspace id and workspace title.

It feels the PR ties too closely to some bigger providers here (and your internal requirements). We had something similar in
STAC and then got taught that there are other providers evolving and had to redo the whole storage extension because we tied it too closely to AWS/GCS/Azure. Given that backends can decide how to produce the IDs, I tend towards closing this PR.

I disagree that this type of change would bind to a provider. The inputs for this API are to serve this API and that is where it ends how interaction is done with storage/cloud providers that is up to the implementer of the backend. In this particular API the workspace provider parameters should be used for inputs that are special due to backend implementation. Things like workspace name and description that belong to the API should imho be better of with clearer constraints on size and allowed characters.

I agree that this PR should be closed as imposing constraints should be done holistically.

PS: Every API implementation must implement a maximum size on user provided input otherwise it will have a bug because no API can work with unbound inputs and if no special care is taken it gives birth to security issues as well. So If an API can come up with constraints that impose little to no limitations to usage of the service that is a big plus.

@pvbouwel pvbouwel closed this Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants