Registry design (blueprint) #1

Open
larsbutler opened this issue Jun 30, 2014 · 26 comments

Comments

@larsbutler
Member

I've created this issue to serve as a blueprint for the implementation of the "zapp registry".

For ease of editing and to show history, I've moved the spec to this gist: https://gist.github.com/larsbutler/10a6355169f2404d8959




I've updated and cleaned up the spec per the discussion on this page.

Name

I propose ZPA (ZeroVM Package Archive).

Platform

The assumption so far is that we will build this on Swift+ZeroCloud, and many of the functions of the ZPA will be written as zapps. Dogfooding is one of the reasons for this. Another reason is that Swift provides a horizontally scalable storage system that can store millions of files. Since the ZPA is intended to be the central repository for developers to publish their zapps, it must be capable of operating at this kind of scale.

If one were to build the ZPA from scratch, probably ~80% of the work would be focused purely on storage. With a platform like Swift, a lot of that is solved for us.

To make this work, we will need to write some custom zapps to handle various client requests and backend processing tasks. Some changes to ZeroCloud middleware may also be required.

General requirements

  • All published zapps must be publicly and anonymously accessible
  • ZPA should be able to store multiple versions of a given zapp/package
    • Also similar to PyPI
  • To publish a zapp, users must have an account and must authenticate
  • Each user shall have their own package namespace to publish zapps
    • This should be implemented as a Swift container, owned by the user
    • With a separate container/namespace, this sidesteps any potential issues with multiple users publishing packages with the same name
  • All published packages should be viewable/downloadable with a web browser or other basic HTTP client (cURL, etc.)

Publishing a zapp

  • Each user must have a well-known location to make publish requests to
  • A user must authenticate (providing an auth token or similar credentials) when performing zapp upload/publish action
  • To upload a zapp, the user must send a POST request to their personal ZPA URL (http://example.com/larsbutler/zpa)
    • The contents of the POST should be the zapp file (a tar.gz archive)
    • The filename doesn't matter
    • The file must contain a zapp.yaml file.
  • Setting some special headers in the request may be necessary (TBD)
  • When the POST is submitted, ZeroCloud should execute a special zapp to process the request.
    • The publish zapp shall do the following:
      • extract and read the zapp.yaml file
      • save the uploaded file to a location within the user's personal ZPA container
        • example: http://example.com/larsbutler/zpa/<zapp-name>/<zapp-name>-<version>-<timestamp>.zapp, where <zapp-name> and <version> are extracted from the zapp.yaml meta section, and <timestamp> is automatically computed by the publishing zapp, in the format YYYYmmddHHMMSS (20140630185610, for example)
        • instead of timestamp, we could simply use an auto-incrementing revision number; the important thing is that this number always increases and never decreases
      • if this is a new upload for the same <version>, the old one will be marked for deletion (which is a change which will take time to propagate through the system, because of the eventual consistency of Swift)
      • an existing zapp file should never be overwritten; we should create new files and delete old ones

Example:

  1. User uploads http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140630123456.zapp
  2. User uploads http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140707001122.zapp
  3. User uploads http://example.com/larsbutler/zpa/geomet/geomet-0.2.0-20140801120000.zapp

The result will be:

http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140707001122.zapp (latest 0.1.9 package)
http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140630123456.zapp (marked for deletion)
http://example.com/larsbutler/zpa/geomet/geomet-0.2.0-20140801120000.zapp (latest 0.2.0 package)
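
The naming convention above can be sketched as a pair of helpers. This is only an illustration, not part of the spec: the function names are hypothetical, the returned name is relative to the user's zpa container, and the parser assumes that <version> itself contains no "-".

```python
from datetime import datetime, timezone


def zapp_object_name(name, version, when=None):
    """Build '<zapp-name>/<zapp-name>-<version>-<timestamp>.zapp'.

    `name` and `version` would come from the zapp.yaml meta section;
    the timestamp uses the YYYYmmddHHMMSS format from the spec.
    """
    when = when or datetime.now(timezone.utc)
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{name}/{name}-{version}-{timestamp}.zapp"


def parse_zapp_object_name(object_name):
    """Split an object name back into (name, version, timestamp)."""
    filename = object_name.rsplit("/", 1)[-1]
    if not filename.endswith(".zapp"):
        raise ValueError(f"not a zapp object name: {object_name!r}")
    stem = filename[: -len(".zapp")]
    # Split from the right, because the zapp name may itself contain '-'.
    rest, timestamp = stem.rsplit("-", 1)
    name, version = rest.rsplit("-", 1)
    return name, version, timestamp
```

Because the timestamp has a fixed width, plain string comparison of two timestamps (or of two full object names for the same version) orders them chronologically.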

This opinionated file naming convention was inspired by headaches I've experienced in publishing packages to Launchpad PPAs. For example:

  • If the uploaded files are not perfectly named (including the package name/version), publishing fails
  • Package text files (changelog, etc.) need to be updated every single time a new package version is uploaded. This can be really annoying if you're making small tweaks and testing changes.
  • Pushing this complexity to the ZPA means that clients can be simpler and users/developers don't have to jump through quite so many hoops to publish a zapp.
    • The tradeoff is that we enforce a very opinionated way of doing things

Searching and listing packages

  • ZPA shall be searchable
  • There needs to be a special zapp which can aggregate all of the various packages published by all users into a single listing
    • This can be written as a MapReduce zapp
    • Searching/listing operations should be able to return all of the available versions for a given zapp
      • Because deletions of old revisions are not immediate, the zapp file name convention allows for easy sorting, which enables us to return only the latest package file for a given <version>.
    • Some changes to ZeroCloud may be required to support this
    • We should only return one result per user per package name per version
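
A minimal sketch of the "one result per user per package name per version" rule, written in a map/reduce style. It runs in plain Python rather than as a real ZeroCloud MapReduce zapp, and the function names and the '<user>/zpa/<name>/<file>' path layout are assumptions taken from the example URLs in this spec.

```python
from itertools import groupby


def map_entry(path):
    """Map '<user>/zpa/<name>/<name>-<version>-<timestamp>.zapp' to (key, path)."""
    user, _container, name, filename = path.split("/")
    stem = filename[: -len(".zapp")]
    rest, _timestamp = stem.rsplit("-", 1)
    version = rest[len(name) + 1:]  # strip the '<name>-' prefix
    return (user, name, version), path


def reduce_latest(paths):
    """Keep one path per (user, name, version): the one that sorts last.

    Paths sharing a key differ only in their fixed-width timestamp, so
    sorting them as strings puts the newest revision at the end.
    """
    mapped = sorted(map_entry(p) for p in paths)
    return [list(group)[-1][1] for _key, group in groupby(mapped, key=lambda kv: kv[0])]
```

Revisions that are marked for deletion but still visible in Swift are simply shadowed by the newer file, so the listing stays correct while the deletion propagates.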

Backend logic

  • Ideally, we would restrict user accounts to 1 container per deployment: the zpa container. The reason is so that users don't just put random files into Swift. As far as I know, this cannot be done in Swift itself, so this will probably require some custom middleware to implement.
  • POSTs to the zpa must trigger the "publishing zapp" (mentioned above). If the file uploaded is not a valid zapp (an archive containing a zapp.yaml), an error should be returned (probably a 4xx).
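
The validity check could look roughly like the following sketch. The function names are hypothetical, it assumes zapp.yaml sits at the root of the archive, and the concrete status codes (201 and 400) are placeholders, since the spec only says "probably a 4xx".

```python
import io
import tarfile


def is_valid_zapp(data):
    """Return True if `data` is a gzipped tar archive with zapp.yaml at its root."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
            names = archive.getnames()
    except (tarfile.TarError, OSError, EOFError):
        # Not a gzip stream, not a tar archive, or truncated input.
        return False
    return any(name in ("zapp.yaml", "./zapp.yaml") for name in names)


def publish_status(data):
    """Pick an HTTP status for a publish request (codes are assumptions)."""
    return 201 if is_valid_zapp(data) else 400
```

A fuller check would also parse zapp.yaml and verify that the meta section carries a name and a version.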

Some of this could be implemented by extending ZeroCloud, but it may be more appropriate to write this into a separate middleware application.

User interface

  • Most interaction, initially, should be done via HTTP
  • We should build/extend a client (such as zpm) which is capable of interacting with the ZPA
  • Eventually it would be good to have an HTML/JavaScript UI, but the priority for the moment is to build the registry and make it work with command-line tools/client first.

Client use cases

A client (a command-line client to start with) needs to be capable of the following actions:

  • publish a zapp
  • search for zapps by name
  • get all available versions for a given zapp
  • get metadata for a given zapp
@mgeisler
Contributor

mgeisler commented Jul 1, 2014

One thought is that I think the registry should be app centric, not developer centric. That is, I prefer the way Ruby, Debian, Haskell, and Python do it:

Their namespaces are centered around the artifacts produced in the ecosystem. Each entry has a number of owners or admins who can update it.

Doing it like this avoids the situation where a project like ZeroVM creates a zerovm user (team) account only to have this user publish a zerovm artifact. This is what you see on GitHub, Bitbucket, and other source hosting sites.

@mgeisler
Contributor

mgeisler commented Jul 1, 2014

About uploading more than one zapp with a given version number: I agree with @pkit when he said on IRC that we should be less permissive and not allow uploading more than one zapp with a given version number. PyPI will not allow you to do this and I believe developers in general will consider it a mistake if we allow it.

There are very strong traditions in Debian and other distributions for requiring a new version number with every change to the published source. One area where this is important is security updates: if package content can be updated without changing the version number, well then it becomes a nightmare to figure out if you have vulnerable software on your system.

@larsbutler
Member Author

About uploading more than one zapp with a given version number: I agree with @pkit when he said on IRC that we should be less permissive and not allow uploading more than one zapp with a given version number. PyPI will not allow you to do this and I believe developers in general will consider it a mistake if we allow it.

What @pkit suggested was incrementing the revision number (1.0.0-4, where 4 is the revision). In principle, I'm fine with that, and it would be a reasonable alternative to the timestamp. However, the timestamp might be better for managing revisions: with a timestamp, there is no need to figure out what the previous revision was and ensure that newer revisions have a higher number. The timestamp automatically sorts this out.

There are very strong traditions in Debian and other distributions for requiring a new version number with every change to the published source.

That's why I suggested for the ZPA to do the revision/timestamp incrementing automatically. If the developer wants to make minor changes without rolling a whole new version number, the new revision needs an updated number. We could make them do it manually, but why not automate it? Personally, I find this to be one of the more annoying aspects of Debian packaging, one which I automate as much as possible.

One area where this is important is security updates: if package content can be updated without changing the version number, well then it becomes a nightmare to figure out if you have vulnerable software on your system.

Agreed. In fact, I never suggested that we allow package changes without a new version/revision number.

@larsbutler
Member Author

@mgeisler

One thought is that I think the registry should be app centric, not developer centric.

That's a fair point, although I think that making that work on Swift/ZeroCloud would be more difficult; having separate namespaces for each user solves a lot of quota and permissions issues for us, almost completely out of the box. I'm open to suggestions on how the app-centric approach could be implemented, though.

I'd like to hear from some others. If there is an overwhelming consensus to do it this way, we can figure out how to make it work.

@pkit
Member

pkit commented Jul 1, 2014

@larsbutler it's very important that the revision number (be it a timestamp or a running integer) is assigned by the client and not by the server.
The server cannot know which revision is the current one, because it never checkpoints itself to any consistent state; it's the job of the client (or user) to produce a "cause-effect" relationship out of the server data.
It is also important that the dependencies are calculated including the revision number.

I'm open to suggestions on how the app-centric approach could be implemented, though.

It can be implemented by giving each account write permission to a specific app container.
I.e. it can work like this:

  1. We create a special account for the "registry".
  2. We publish a zapp on this account with execution rights for authenticated users only.
  3. The zapp can have some simple interface, like: https://.../auth.nexe?project=geomet&action=add or https://.../auth.nexe?project=geomet&action=remove
  4. This will add the user to, or remove the user from, the write ACL on the container (named geomet) in the "registry" account.
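
The ACL update in step 4 could be sketched as a pure function over the ACL string. Swift container write ACLs are comma-separated lists, but the function name, the "registry:larsbutler" style of user entry, and the mapping of the action parameter to this function are all assumptions for illustration.

```python
def update_write_acl(acl, user, action):
    """Return a new comma-separated write ACL with `user` added or removed.

    `acl` is the current ACL string (possibly empty), and `action` is
    'add' or 'remove', mirroring the hypothetical auth.nexe parameter.
    """
    entries = [entry.strip() for entry in acl.split(",") if entry.strip()]
    if action == "add":
        if user not in entries:  # keep the ACL free of duplicates
            entries.append(user)
    elif action == "remove":
        entries = [entry for entry in entries if entry != user]
    else:
        raise ValueError(f"unknown action: {action!r}")
    return ",".join(entries)
```

The zapp would then write the returned string back as the write ACL of the geomet container in the "registry" account.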

@pkit
Member

pkit commented Jul 1, 2014

Hmm... on second thought, it was kind of stupid to just allow users to write any stuff to a container.
What we want is to enforce some rules:

  1. Never overwrite.
  2. Never write something that's not a zapp.
  3. Probably check the zapp.yaml format.

Therefore we need some sort of service.
We can do it with "open", but implementing "open-with" will probably be even better.
So, we still do steps 1 and 2 from my previous comment.
But then:

  3. The user uploads a zapp to his/her own account, and then does "open-with" our app, located in the "registry" account.
  4. The app does all the checks and, if they pass, writes the zapp to the specific container.

This way we also don't need to worry about permissions: if the user could upload stuff to their own account, then it is an authenticated user and we are good to go.

@mgeisler
Contributor

mgeisler commented Jul 1, 2014

Constantine Peresypkin writes:

Hmm...on the second thought it was kind of stupid to just allow users
to write any stuff to a container.
What we want is to enforce some rules:

  1. Never overwrite.
  2. Never write something that's not a zapp
  3. Probably check zapp.yaml format
    Therefore we need some sort of service.

Agreed. I never envisioned giving users Swift credentials and letting
them upload/download stuff themselves. Uploads would preferably go to a
zapp which puts its stdin into a Swift object after checking that it
has the right format.

We can do it with "open" but probably implementing "open-with" will be
even better. So, we still do steps 1. and 2. from above.

But then:
3. User uploads a zapp to his/her account, and then does "open-with"
our app, located in the "registry" account.
4. App does all the checks and if they pass: writes the zapp to the
specific container. This way we don't need to worry about permissions
also, because if user could upload stuff to own account, and it is an
authenticated user - we are good to go.

I'm unsure what you're proposing here? The way I imagine the system
there would only be one account on Swift -- the internal one used by the
registry. Users will invoke zapps and these zapps will manipulate Swift
in the background, possibly using credentials for the registry user.
These credentials can be stored in the zapp.

What is lacking for this to work is mostly the ability to invoke a zapp
anonymously, right? Swift can already be configured to allow anonymous
container listings -- I've used that before. @rpedde talked about
extending the Swift ACLs with a proper permission bit (the "x" bit) and
I think that sounds like just the thing we would need.

@mgeisler
Contributor

mgeisler commented Jul 1, 2014

@larsbutler You write

[...] having separate namespaces for each user solves a lot of quota and permissions issues for us, almost completely out of the box

It is my impression that the problems you see solved by Swift aren't the difficult or important problems. Implementing quotas can be done in many ways and I don't think that's the core problem we're solving here.

@larsbutler
Member Author

@pkit

...it's very important that the revision number (be it a timestamp or a running integer) is assigned by the client and not by the server.
The server cannot know which revision is the current one, because it never checkpoints itself to any consistent state; it's the job of the client (or user) to produce a "cause-effect" relationship out of the server data.

Fair enough. So in that case, how do we enforce the rule that new versions bear a newer version/revision number? Can we enforce that (in the way that Launchpad does, for example)?

It is also important that the dependencies are calculated including the revision number.

Dependencies between what exactly? Dependencies between zapps?

@pkit
Member

pkit commented Jul 1, 2014

Uploads would preferably go to a zapp which puts its stdin into a Swift object after checking that it has the right format.

Right now you cannot "upload through zapp" anything. But it can be arranged either on job-description level or on the "helper middleware" level.

The way I imagine the system there would only be one account on Swift -- the internal one used by the registry.

You still need users to authenticate themselves. And the trivial approach: users must have a Zebra account.

What is lacking for this to work is mostly the ability to invoke a zapp anonymously, right?

We have the ability to invoke zapp anonymously. We just need to sort out the correct permission level for that.

I think that sounds like just the thing we would need.

I think current Swift permissions are too limited and have too much legacy baggage, like "referrer" or "rlisting". On the other hand, we may want to have backwards compatibility here. On yet another hand, the whole auth is external to Swift, and "Swift ACL" is just a recommendation, as even keystone and tempauth slightly differ already.

@mgeisler
Contributor

mgeisler commented Jul 1, 2014

Lars Butler writes:

There are very strong traditions in Debian and other distributions
for requiring a new version number with every change to the published
source.

That's why I suggested for the ZPA to do the revision/timestamp
incrementing automatically. If the developer wants to make minor
changes without rolling a whole new version number, the new revision
needs an updated number. We could make them do it manually, but why
not automate it?

The problem is simply that you cannot change the version number of a
piece of software by external means only. You need to edit some files
inside the package too: in our case at least the zapp.yaml file which
holds the version number. You will typically have to update other files
too, such as a changelog.

If there is a problem here, then I feel that solving it is outside the
scope of what a package registry should do.

One area where this is important is security updates: if package
content can be updated without changing the version number, well then
it becomes a nightmare to figure out if you have vulnerable software
on your system.

Agreed. In fact, I never suggested that we allow package changes
without a new version/revision number.

Then I misunderstood the example where you had geomet version 0.1.9
uploaded twice. You also mentioned "latest 0.1.9 package", which I took
to indicate that you operate with the idea that there can be more than
one package at a given version.

@larsbutler
Member Author

@mgeisler

It is my impression that the problems you see solved by Swift aren't the difficult or important problems.

I never said whether they were difficult or not, but I think they do need to be solved, and solving them in this way (with out-of-the-box functionality) reduces the amount of work we have to do. It is my impression that you underestimate the amount of work it will take to build something like that from scratch.

Implementing quotas can be done in many ways and I don't think that's the core problem we're solving here.

First of all, I never said it was the "core problem"; it's "a problem", one of the many which needs to be solved in order to build this thing.

What would you perceive to be the core problem, then? (Just saying "that's not the core problem" is not helpful or constructive, without offering your ideas about the core problem.)

@pkit
Member

pkit commented Jul 1, 2014

@larsbutler

So in that case, how do we enforce the rule that new versions bear a newer version/revision number? Can we enforce that (in the way that Launchpad does, for example)?

If we store stuff in registry by invoking a zapp it can enforce "no overwrite" rule by using specific headers (If-None-Match for example).

Dependencies between what exactly? Dependencies between zapps?

Yep

@larsbutler
Member Author

@mgeisler

Then I misunderstood the example where you had geomet version 0.1.9
uploaded twice. You also mentioned "latest 0.1.9 package", which I took
to indicate that you operate with the idea that there can be more than
one package at a given version.

Right, there's only 1 official package per version (version being 0.1.9, for example). If I have a 0.1.9-1 and I upload a 0.1.9-2, 0.1.9-2 should be the new canonical package for the 0.1.9 version (it might include some security patches). When 0.1.9-2 is uploaded, it should effectively replace 0.1.9-1, BUT due to the lack of consistent state of Swift, 0.1.9-1 will be deleted at some point in time later with no guarantee about when that happens. So technically, multiple revisions of 0.1.9 can exist in the storage system at a given time, but for all intents and purposes, there is only one: the latest one. See the point about deleting old revisions in the section "Searching and listing packages".

@pkit
Member

pkit commented Jul 1, 2014

@larsbutler

0.1.9-1 will be deleted at some point in time later with no guarantee about when that happens

There is no need to delete the old one in a generic case. We can just make sure that action "download package 0.1.9" will choose the latest one.

@larsbutler
Member Author

@pkit

There is no need to delete the old one in a generic case. We can just make sure that action "download package 0.1.9" will choose the latest one.

Yeah, that's technically true. I was thinking of doing that more as a housecleaning; if the download action will never grab an old version, why keep it around?

@larsbutler
Member Author

@mgeisler

The problem is simply that you cannot change the version number of a
piece of software by external means only. You need to edit some files
inside the package too: in our case at least the zapp.yaml file which
holds the version number. You will typically have to update other files
too, such as a changelog.

You don't change the entire version number only through external means; what I'm proposing is that the developer still chooses when to increment the version number (x.x.x), just not the revision. In this case, the revision would be more of an internal artifact to keep track of what is newer and what is older.

If there is a problem here, then I feel that solving it is outside the
scope of what a package registry should do.

A fair point. We can make developers do it themselves. As long as we have a clear rule about version increments and a way to enforce it, I'm fine with this.

@larsbutler
Member Author

@pkit

If we store stuff in registry by invoking a zapp it can enforce "no overwrite" rule by using specific headers (If-None-Match for example).

Can you please elaborate on that? If you can provide some more details, I'll edit the spec and put it in.

@mgeisler
Contributor

mgeisler commented Jul 1, 2014

@larsbutler

Implementing quotas can be done in many ways and I don't think that's the core problem we're solving here.

First of all, I never said it was the "core problem"; it's "a problem", one of the many which needs to be solved in order to build this thing.

What would you perceive to be the core problem, then? (Just saying "that's not the core problem" is not helpful or constructive, without offering your ideas about the core problem.)

The biggest unknown I see is how to let the server-side code accept files, check them, and put them into Swift. The output objects from a given job are fixed before the job starts today, but the way I think about it, the ZeroVM job that inspects the tarball would need to decide on output objects the tarball is to be stored in. AFAIK, that isn't supported today, so I'm unsure how we would do this.

One option might be to use the tempurl feature: that way the registry zapp can give clients a token that allows them to make a specific PUT request. That could enforce that uploads go where we want them. We would have no idea what the user uploads, though. So something like my original scheme might be needed:

  • the client requests a tempurl token for a scratch area
  • the client uploads the zapp
  • the client invokes a registry zapp with the Swift path to the just uploaded zapp
  • if the registry zapp accepts the zapp, it writes a job description with the correct output object name
  • the client invokes this job description, which then installs the zapp in the right place
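
For the first step, a tempurl token can be computed the way Swift's TempURL middleware documents it: an HMAC-SHA1 over the method, the expiry time, and the object path. The sketch below only builds the signed path; the function name, the example scratch path, and the key are placeholders, and the key would have to match the one configured on the Swift account.

```python
import hmac
from hashlib import sha1


def temp_url(path, key, expires, method="PUT"):
    """Return `path` with TempURL query parameters for one method.

    `path` is the object path (e.g. '/v1/AUTH_registry/scratch/upload-1'),
    `key` is the account's TempURL secret, and `expires` is a Unix timestamp.
    """
    hmac_body = f"{method}\n{expires}\n{path}"
    signature = hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()
    return f"{path}?temp_url_sig={signature}&temp_url_expires={expires}"
```

The signature is bound to the method and path, so a token issued for a PUT to one scratch object cannot be reused to write elsewhere or to read other objects.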

There are obvious pitfalls here: someone needs to clean up at various stages if the client goes away before finishing all steps. I'm also not sure if we can allow anonymous people to invoke zapps and still restrict them to only invoke zapps using pre-defined job descriptions.

So there are still some unknowns here: hence me thinking that this is where you'll end up with most of the effort.

@pkit
Member

pkit commented Jul 1, 2014

@larsbutler

if the download action will never grab an old version, why keep it around?

If we will have dependencies it will matter. If we won't - why do we need a registry? :)

Can you please elaborate on that?

If you do a PUT with an If-None-Match: * request header, the PUT will succeed only if an object with that name does not already exist. We cannot enforce the header, because it's a request header, but we can allow storing files in the registry only by invoking a zapp with specific parameters, and the zapp can then use the proper request headers.
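
As a rough sketch of what the publishing zapp's request might look like, the helper below builds (but does not send) such a PUT with Python's standard library. The function name, the URL, and the use of an X-Auth-Token header are assumptions for illustration.

```python
import urllib.request


def build_no_overwrite_put(url, token, body):
    """Build a PUT request that the server should reject if the object exists.

    With 'If-None-Match: *' the server is expected to answer
    412 Precondition Failed when an object with that name is already stored.
    """
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("X-Auth-Token", token)
    request.add_header("If-None-Match", "*")
    return request
```

Sending it with urllib.request.urlopen would raise urllib.error.HTTPError with code 412 on a conflict, which the zapp could translate into a "version already published" error for the client.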

@larsbutler
Member Author

@pkit

If we will have dependencies it will matter. If we won't - why do we need a registry? :)

It depends on how granular you want your dependency specification to be. If I want 0.1.9 as a dependency, would I be allowed to specify 0.1.9-2? I was thinking that we wouldn't do this; instead, one would specify just 0.1.9 and will get the latest revision of 0.1.9, whatever happens to be available.

@pkit
Member

pkit commented Jul 1, 2014

@larsbutler

would I be allowed to specify 0.1.9-2

Yes, probably it's a good idea. And also any other variant.
Something like Debian: package >=0.1.9, package < 0.1.10 or package >= 0.1 or package = 0.1.9-5
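
A minimal sketch of how such constraints could be evaluated. It is deliberately much simpler than Debian's real version comparison: versions are assumed to be dotted integers with an optional "-<revision>" suffix, a constraint without a revision is compared on the x.y.z part only, and all names are hypothetical.

```python
import operator

OPERATORS = {">=": operator.ge, "<=": operator.le, "=": operator.eq,
             "<": operator.lt, ">": operator.gt}


def parse_version(text):
    """'0.1.9-5' -> ((0, 1, 9), 5); the revision is None when absent."""
    base, _, revision = text.strip().partition("-")
    numbers = tuple(int(part) for part in base.split("."))
    return numbers, (int(revision) if revision else None)


def satisfies(candidate, constraint):
    """Check a single constraint, e.g. satisfies('0.1.9-5', '>= 0.1.9')."""
    for symbol in (">=", "<=", "=", "<", ">"):  # two-character operators first
        if constraint.startswith(symbol):
            break
    else:
        raise ValueError(f"no operator in constraint: {constraint!r}")
    wanted_numbers, wanted_revision = parse_version(constraint[len(symbol):])
    numbers, revision = parse_version(candidate)
    compare = OPERATORS[symbol]
    if wanted_revision is None:
        # No revision requested: any revision of a matching x.y.z qualifies.
        return compare(numbers, wanted_numbers)
    return compare((numbers, revision or 0), (wanted_numbers, wanted_revision))
```

A range such as ">= 0.1.9, < 0.1.10" would then be the conjunction of two satisfies() calls.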

@larsbutler
Member Author

@pkit

Yes, probably it's a good idea. And also any other variant.
Something like Debian: package >=0.1.9, package < 0.1.10 or package >= 0.1 or package = 0.1.9-5

Okay, agreed.

@mgeisler
Contributor

mgeisler commented Jul 1, 2014

@larsbutler

If I want 0.1.9 as a dependency, would I be allowed to specify 0.1.9-2? I was thinking that we wouldn't do this; instead, one would specify just 0.1.9 and will get the latest revision of 0.1.9, whatever happens to be available.

What you're saying here is (apparently) that the full version number (0.1.9-2) isn't the version number of the software. Instead it's something else — an internal version number of the registry.

This means that you allow people to upload different packages and still give them the same version number (0.1.9). That should not be allowed and I think you also think so based on what you said earlier.

I think you should avoid over-thinking this part. Let users decide on version numbers and the semantics. Let the registry maintain a version->zapp mapping, with the constraint that the version numbers are unique per zapp. That is the semantics developers are used to from other package indexes.
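
The version->zapp mapping with its uniqueness constraint can be illustrated with a small in-memory model. This is a sketch only: the class and method names are hypothetical, and a real registry would keep this state in Swift rather than in a dictionary.

```python
class VersionExists(Exception):
    """Raised when a version of a zapp has already been published."""


class Registry:
    def __init__(self):
        self._zapps = {}  # zapp name -> {version string -> storage location}

    def publish(self, name, version, location):
        """Record a new version; version numbers are unique per zapp."""
        versions = self._zapps.setdefault(name, {})
        if version in versions:
            raise VersionExists(f"{name} {version} is already published")
        versions[version] = location

    def versions(self, name):
        """Return the published versions of a zapp in publication order."""
        return list(self._zapps.get(name, {}))
```

The registry attaches no meaning to the version strings themselves; deciding what counts as "newer" is left to users and clients.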

As for dependencies between zapps: we've talked about this before and zapps were designed to be self-contained. I would also like to see something like libraries in the future, but that's still far away. Even when we have some notion of libraries, I expect it to be the clients that download the dependencies. So let the clients decide how they want to resolve >= 0.1.9.

@larsbutler
Member Author

@mgeisler

Let the registry maintain a version->zapp mapping, with the constraint that the version numbers are unique per zapp. That is the semantics developers are used to from other package indexes.

Okay, fair enough.

@larsbutler
Member Author

Okay, I think I've received enough feedback to fix/rewrite some parts of the spec. Let me take another stab at this and see where we land.
