Registry design (blueprint) #1

larsbutler · 2014-06-30T16:58:25Z

I've created this issue to serve as a blueprint for the implementation for the "zapp registry".

For ease of editing and to show history, I've moved the spec to this gist: https://gist.github.com/larsbutler/10a6355169f2404d8959

I've updated and cleaned up the spec per the discussion on this page.

Name

I propose ZPA (ZeroVM Package Archive).

Platform

The assumption so far is that we will build this on Swift+ZeroCloud, and many of the functions of the ZPA will be written as zapps. Dogfooding is one of the reasons for this. Another reason is that Swift provides a horizontally scalable storage system that can store millions of files. Since the ZPA is intended to be the central repository for developers to publish their zapps, ZPA must be capable of operating at this kind of scale.

If one were to build the ZPA from scratch, probably ~80% of the work would be focused purely on storage. With a platform like Swift, a lot of that is solved for us.

To make this work, we will need to write some custom zapps to handle various client requests and backend processing tasks. Some changes to ZeroCloud middleware may also be required.

General requirements

All published zapps must be publicly and anonymously accessible

Similar to PyPI

ZPA should be able to store multiple versions of a given zapp/package

Also similar to PyPI

To publish a zapp, users must have an account and must authenticate

Each user shall have their own package namespace to publish zapps

This should be implemented as a Swift container, owned by the user

With a separate container/namespace, this sidesteps any potential issues with multiple users publishing packages with the same name

All published packages should be viewable/downloadable with a web browser or other basic HTTP client (cURL, etc.)

Publishing a zapp

Each user must have a well-known location to make publish requests to

Example: http://example.com/larsbutler/zpa

where larsbutler is the user's name or ID and zpa is a special Swift container

A user must authenticate (providing an auth token or similar credentials) when performing a zapp upload/publish action

To upload a zapp, the user must send a POST request to their personal ZPA URL (http://example.com/larsbutler/zpa)

The contents of the POST should be the zapp file (a tar.gz archive)

The filename doesn't matter

The file must contain a zapp.yaml file.

Setting some special headers in the request may be necessary (TBD)

When the POST is submitted, ZeroCloud should execute a special zapp to process the request.

The publish zapp shall do the following:

extract and read the zapp.yaml file

save the uploaded file to a location within the user's personal ZPA container

example: http://example.com/larsbutler/zpa/<zapp-name>/<zapp-name>-<version>-<timestamp>.zapp, where <zapp-name> and <version> are extracted from the zapp.yaml meta section, and <timestamp> is automatically computed by the publishing zapp, in the format YYYYmmddHHMMSS (20140630185610, for example)

instead of timestamp, we could simply use an auto-incrementing revision number; the important thing is that this number always increases and never decreases

if this is a new upload for the same <version>, the old one will be marked for deletion (which is a change which will take time to propagate through the system, because of the eventual consistency of Swift)

an existing zapp file should never be overwritten; we should create new files and delete old ones

Example:

User uploads http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140630123456.zapp

User uploads http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140707001122.zapp

User uploads http://example.com/larsbutler/zpa/geomet/geomet-0.2.0-20140801120000.zapp

The result will be:

http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140707001122.zapp (latest 0.1.9 package)
http://example.com/larsbutler/zpa/geomet/geomet-0.1.9-20140630123456.zapp (marked for deletion)
http://example.com/larsbutler/zpa/geomet/geomet-0.2.0-20140801120000.zapp (latest 0.2.0 package)
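The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name zapp_object_name is hypothetical, and the spec does not say which time zone the timestamp uses, so the caller is assumed to pass in whatever time the publishing zapp considers authoritative.

```python
from datetime import datetime


def zapp_object_name(name, version, published_at):
    """Build '<zapp-name>/<zapp-name>-<version>-<timestamp>.zapp'.

    `name` and `version` would come from the meta section of zapp.yaml;
    `published_at` is a datetime chosen by the publishing zapp (assumption).
    """
    # YYYYmmddHHMMSS, the format given in the spec.
    timestamp = published_at.strftime("%Y%m%d%H%M%S")
    return f"{name}/{name}-{version}-{timestamp}.zapp"


# Reproduces the first upload from the example above.
print(zapp_object_name("geomet", "0.1.9", datetime(2014, 6, 30, 12, 34, 56)))
```

Because the timestamp is fixed-width, plain string sorting of these names is also chronological sorting within one version.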

This opinionated file naming convention was inspired by headaches I've experienced in publishing packages to Launchpad PPAs. For example:

If the uploaded files are not perfectly named (including the package name/version), publishing fails

Package text files (changelog, etc.) need to be updated every single time a new package version is uploaded. This can be really annoying if you're making small tweaks and testing changes.

Pushing this complexity to the ZPA means that clients can be simpler and users/developers don't have to jump through quite so many hoops to publish a zapp.

The tradeoff is that we enforce a very opinionated way of doing things

Searching and listing packages

ZPA shall be searchable

There needs to be a special zapp which can aggregate all of the various packages published by all users into a single listing

This can be written as a MapReduce zapp

Searching/listing operations should be able to return all of the available versions for a given zapp

Because deletions of old revisions are not immediate, the zapp file name convention allows for easy sorting, which enables us to return only the latest package file for a given <version>.

Some changes to ZeroCloud may be required to support this

We should only return one result per user per package name per version
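As a rough sketch of the "one result per user per package name per version" rule, the filtering could look like the following. The function name latest_per_version is hypothetical, and the code assumes that neither <version> nor <timestamp> contains a hyphen, which the spec does not guarantee.

```python
def latest_per_version(object_names):
    """Return the newest object name for each (location, zapp, version)."""
    latest = {}
    for object_name in object_names:
        prefix, _, filename = object_name.rpartition("/")
        if not filename.endswith(".zapp"):
            continue  # not a published zapp file
        stem = filename[: -len(".zapp")]
        # Split from the right so hyphens inside the zapp name survive.
        name, version, timestamp = stem.rsplit("-", 2)
        key = (prefix, name, version)
        # Fixed-width YYYYmmddHHMMSS strings sort chronologically as text.
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, object_name)
    return sorted(entry[1] for entry in latest.values())


listing = [
    "larsbutler/zpa/geomet/geomet-0.1.9-20140630123456.zapp",
    "larsbutler/zpa/geomet/geomet-0.1.9-20140707001122.zapp",
    "larsbutler/zpa/geomet/geomet-0.2.0-20140801120000.zapp",
]
for name in latest_per_version(listing):
    print(name)
```

Old revisions that are still waiting to be deleted are simply ignored, so the listing does not depend on when the deletion actually propagates.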

Backend logic

Ideally, we would restrict user accounts to 1 container per deployment: the zpa container. The reason is so that users don't just put random files into Swift. As far as I know, this cannot be done in Swift itself, so this will probably require some custom middleware to implement.

POSTs to the zpa must trigger the "publishing zapp" (mentioned above). If the file uploaded is not a valid zapp (an archive containing a zapp.yaml), an error should be returned (probably a 4xx).
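The validity check described here (a tar.gz archive containing a zapp.yaml) might be sketched as follows, using only the Python standard library. The function names are hypothetical, the sketch looks for zapp.yaml only at the archive root, and 400 is just one possible choice for the "probably a 4xx" error.

```python
import io
import tarfile
import zlib


def read_zapp_yaml(upload):
    """Return the raw bytes of zapp.yaml from an uploaded tar.gz.

    Raises ValueError if the upload is not a readable tar.gz archive or
    does not contain zapp.yaml at its root.
    """
    try:
        with tarfile.open(fileobj=io.BytesIO(upload), mode="r:gz") as archive:
            for member in archive:
                name = member.name
                if name.startswith("./"):
                    name = name[2:]
                if member.isfile() and name == "zapp.yaml":
                    return archive.extractfile(member).read()
    except (tarfile.TarError, OSError, EOFError, zlib.error) as error:
        raise ValueError(f"not a readable tar.gz archive: {error}") from error
    raise ValueError("archive does not contain zapp.yaml")


def upload_status(upload):
    """Map the check onto an HTTP status code (400 is an assumption)."""
    try:
        read_zapp_yaml(upload)
    except ValueError:
        return 400
    return 200
```

Parsing the YAML itself (to get the name and version from the meta section) would be a further step and would need a YAML library; it is left out here.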

Some of this could be implemented by extending ZeroCloud, but it may be more appropriate to write this into a separate middleware application.

User interface

Most interaction, initially, should be done via HTTP

We should build/extend a client (such as zpm) which is capable of interacting with the ZPA

Eventually it would be good to have an HTML/JavaScript UI, but the priority for the moment is to build the registry and make it work with command-line tools/client first.

Client use cases

A client (a command-line client to start with) needs to be capable of the following actions:

publish a zapp

search for zapps by name

get all available versions for a given zapp

~~get metadata for a given zapp~~


mgeisler · 2014-07-01T08:44:10Z

One thought is that I think the registry should be app centric, not developer centric. That is, I prefer the way Ruby, Debian, Haskell, and Python do it:

Their namespaces are centered around the artifacts produced in the ecosystem. Each entry has a number of owners or admins who can update it.

Doing it like this avoids the situation where a project like ZeroVM creates a zerovm user (team) account only to have this user publish a zerovm artifact. This is what you see on GitHub, Bitbucket, and other source hosting sites.

mgeisler · 2014-07-01T09:35:21Z

About uploading more than one zapp with a given version number: I agree with @pkit when he said on IRC that we should be less permissive and not allow uploading more than one zapp with a given version number. PyPI will not allow you to do this and I believe developers in general will consider it a mistake if we allow it.

There are very strong traditions in Debian and other distributions for requiring a new version number with every change to the published source. One area where this is important is security updates: if package content can be updated without changing the version number, well then it becomes a nightmare to figure out if you have vulnerable software on your system.

larsbutler · 2014-07-01T09:51:35Z

About uploading more than one zapp with a given version number: I agree with @pkit when he said on IRC that we should be less permissive and not allow uploading more than one zapp with a given version number. PyPI will not allow you to do this and I believe developers in general will consider it a mistake if we allow it.

What @pkit suggested was incrementing the revision number (1.0.0-4, where 4 is the revision). In principle, I'm fine with that, and it would be a reasonable alternative to the timestamp. However, the timestamp might be better for managing revisions: with a timestamp, there is no need to figure out what the previous revision was and ensure that newer revisions have a higher number. The timestamp automatically sorts this out.

There are very strong traditions in Debian and other distributions for requiring a new version number with every change to the published source.

That's why I suggested for the ZPA to do the revision/timestamp incrementing automatically. If the developer wants to make minor changes without rolling a whole new version number, the new revision needs an updated number. We could make them do it manually, but why not automate it? Personally, I find this to be one of the more annoying aspects of Debian packaging, one which I automate as much as possible.

One area where this is important is security updates: if package content can be updated without changing the version number, well then it becomes a nightmare to figure out if you have vulnerable software on your system.

Agreed. In fact, I never suggested that we allow package changes without a new version/revision number.

larsbutler · 2014-07-01T09:55:57Z

@mgeisler

One thought is that I think the registry should be app centric, not developer centric.

That's a fair point, although I think that making that work on Swift/ZeroCloud would be more difficult; having separate namespaces for each user solves a lot of quota and permissions issues for us, almost completely out of the box. I'm open to suggestions on how the app-centric approach could be implemented, though.

I'd like to hear from some others. If there is an overwhelming consensus to do it this way, we can figure out how to make it work.

pkit · 2014-07-01T10:22:09Z

@larsbutler it's very important that revision number (be it timestamp or running integer) is assigned by client and not by server.
Server cannot know what revision is the current one, because it never checkpoints itself to any consistent state, it's the job of a client (or user) to produce a "cause-effect" relationship out of the server data.
It is also important that the dependencies are calculated including the revision number.

I'm open to suggestions on how the app-centric approach could be implemented, though.

It can be implemented by giving each account write permission to a specific app container.
That is, it could work like this:

1. We create a special account for "registry".
2. We publish a zapp on this account with execution rights for authenticated users only.
3. The zapp can have some simple interface, like: https://.../auth.nexe?project=geomet&action=add or https://.../auth.nexe?project=geomet&action=remove
4. This will add the user to, or remove the user from, the write ACL on the container (named geomet) in the "registry" account.

pkit · 2014-07-01T10:36:01Z

Hmm... on second thought, it was kind of stupid to just allow users to write any stuff to a container.
What we want is to enforce some rules:

Never overwrite.
Never write something that's not a zapp.
Probably check the zapp.yaml format.

Therefore we need some sort of service.
We can do it with "open", but implementing "open-with" would probably be even better.
So, we still do steps 1 and 2 from above.
But then:

3. User uploads a zapp to his/her account, and then does "open-with" our app, located in the "registry" account.
4. App does all the checks and, if they pass, writes the zapp to the specific container.

This way we also don't need to worry about permissions, because if the user could upload stuff to their own account and is an authenticated user, we are good to go.

mgeisler · 2014-07-01T11:15:56Z

Constantine Peresypkin writes:

Hmm...on the second thought it was kind of stupid to just allow users
to write any stuff to a container.
What we want is to enforce some rules:

Never overwrite.

Never write something that's not a zapp

Probably check zapp.yaml format
Therefore we need some sort of service.

Agreed. I never envisioned giving users Swift credentials and letting
them upload/download stuff themselves. Uploads would preferably go to a
zapp which puts its stdin into a Swift object after checking that it
has the right format.

We can do it with "open" but probably implementing "open-with" will be
even better. So, we still do steps 1. and 2. from above.

But then:
3. User uploads a zapp to his/her account, and then does "open-with"
our app, located in the "registry" account.
4. App does all the checks and if they pass: writes the zapp to the
specific container. This way we don't need to worry about permissions
also, because if user could upload stuff to own account, and is an
authenticated user - we are good to go.

I'm unsure what you're proposing here? The way I imagine the system
there would only be one account on Swift -- the internal one used by the
registry. Users will invoke zapps and these zapps will manipulate Swift
in the background, possibly using credentials for the registry user.
These credentials can be stored in the zapp.

What is lacking for this to work is mostly the ability to invoke a zapp
anonymously, right? Swift can already be configured to allow anonymous
container listings -- I've used that before. @rpedde talked about
extending the Swift ACLs with a proper permission bit (the "x" bit) and
I think that sounds like just the thing we would need.

mgeisler · 2014-07-01T11:18:14Z

@larsbutler You write

[...] having separate namespaces for each user solves a lot of quota and permissions issue for us, almost completely out of the box

It is my impression that the problems you see solved by Swift aren't the difficult or important problems. Implementing quotas can be done in many ways and I don't think that's the core problem we're solving here.

larsbutler · 2014-07-01T11:25:32Z

@pkit

...it's very important that revision number (be it timestamp or running integer) is assigned by client and not by server.
Server cannot know what revision is the current one, because it never checkpoints itself to any consistent state, it's the job of a client (or user) to produce a "cause-effect" relationship out of the server data.

Fair enough. So in that case, how do we enforce the rule that new versions bear a newer version/revision number? Can we enforce that (in the way that Launchpad does, for example)?

It is also important that the dependencies are calculated including the revision number.

Dependencies between what exactly? Dependencies between zapps?

pkit · 2014-07-01T11:28:40Z

Uploads would preferably go to a zapp which puts its stdin into a Swift object after checking that it has the right format.

Right now you cannot "upload through zapp" anything. But it can be arranged either on job-description level or on the "helper middleware" level.

The way I imagine the system there would only be one account on Swift -- the internal one used by the registry.

You still need users to authenticate themselves. And the trivial approach: users must have a Zebra account.

What is lacking for this to work is mostly the ability to invoke a zapp anonymously, right?

We have the ability to invoke zapp anonymously. We just need to sort out the correct permission level for that.

I think that sounds like just the thing we would need.

I think current Swift permissions are too limited and have too much legacy baggage, like "referrer" or "rlisting". On the other hand, we may want to have backwards compatibility here. On yet another hand, the whole auth is external to Swift, and "Swift ACL" is just a recommendation, as even keystone and tempauth slightly differ already.

mgeisler · 2014-07-01T11:30:18Z

Lars Butler writes:

There are very strong traditions in Debian and other distributions
for requiring a new version number with every change to the published
source.

That's why I suggested for the ZPA to do the revision/timestamp
incrementing automatically. If the developer wants to make minor
changes without rolling a whole new version number, the new revision
needs an updated number. We could make them do it manually, but why
not automate it?

The problem is simply that you cannot change the version number of a
piece of software by external means only. You need to edit some files
inside the package too: in our case at least the zapp.yaml file which
holds the version number. You will typically have to update other files
too, such as a changelog.

If there is a problem here, then I feel that solving it is outside the
scope of what a package registry should do.

One area where this is important is security updates: if package
content can be updated without changing the version number, well then
it becomes a nightmare to figure out if you have vulnerable software
on your system.

Agreed. In fact, I never suggested that we allow package changes
without a new version/revision number.

Then I misunderstood the example where you had geomet version 0.1.9
uploaded twice. You also mentioned "latest 0.1.9 package", which I took
to indicate that you operate with the idea that there can be more than
one package at a given version.

larsbutler · 2014-07-01T11:33:21Z

@mgeisler

It is my impression that the problems you see solved by Swift aren't the difficult or important problems.

I never said whether they were difficult or not, but I think they do need to be solved, and solving them in this way (with out-of-the-box functionality) reduces the amount of work we have to do. It is my impression that you underestimate the amount of work it will take to build something like that from scratch.

Implementing quotas can be done in many ways and I don't think that's the core problem we're solving here.

First of all, I never said it was the "core problem"; it's "a problem", one of the many which needs to be solved in order to build this thing.

What would you perceive to be the core problem, then? (Just saying "that's not the core problem" is not helpful or constructive, without offering your ideas about the core problem.)

pkit · 2014-07-01T11:34:02Z

@larsbutler

So in that case, how do we enforce the rule that new versions bear a newer version/revision number? Can we enforce that (in the way that Launchpad does, for example)?

If we store stuff in registry by invoking a zapp it can enforce "no overwrite" rule by using specific headers (If-None-Match for example).

Dependencies between what exactly? Dependencies between zapps?

Yep

larsbutler · 2014-07-01T11:41:37Z

@mgeisler

Then I misunderstood the example where you had geomet version 0.1.9
uploaded twice. You also mentioned "latest 0.1.9 package", which I took
to indicate that you operate with the idea that there can be more than
one package at a given version.

Right, there's only 1 official package per version (version being 0.1.9, for example). If I have a 0.1.9-1 and I upload a 0.1.9-2, 0.1.9-2 should be the new canonical package for the 0.1.9 version (it might include some security patches). When 0.1.9-2 is uploaded, it should effectively replace 0.1.9-1, BUT due to the lack of consistent state of Swift, 0.1.9-1 will be deleted at some point in time later with no guarantee about when that happens. So technically, multiple revisions of 0.1.9 can exist in the storage system at a given time, but for all intents and purposes, there is only one: the latest one. See the point about deleting old revisions in the section "Searching and listing packages".

pkit · 2014-07-01T11:45:03Z

@larsbutler

0.1.9-1 will be deleted at some point in time later with no guarantee about when that happens

There is no need to delete the old one in a generic case. We can just make sure that action "download package 0.1.9" will choose the latest one.

larsbutler · 2014-07-01T11:47:03Z

@pkit

There is no need to delete the old one in a generic case. We can just make sure that action "download package 0.1.9" will choose the latest one.

Yeah, that's technically true. I was thinking of doing that more as a housecleaning; if the download action will never grab an old version, why keep it around?

larsbutler · 2014-07-01T11:55:01Z

@mgeisler

The problem is simply that you cannot change the version number of a
piece of software by external means only. You need to edit some files
inside the package too: in our case at least the zapp.yaml file which
holds the version number. You will typically have to update other files
too, such as a changelog.

You don't change the entire version number only through external means; what I'm proposing is that the developer still chooses when to increment the version number (x.x.x), just not the revision. In this case, the revision would be more of an internal artifact to keep track of what is newer and what is older.

If there is a problem here, then I feel that solving it is outside the
scope of what a package registry should do.

A fair point. We can make developers do it themselves. As long as we have a clear rule about version increments and a way to enforce it, I'm fine with this.

larsbutler · 2014-07-01T12:01:29Z

@pkit

If we store stuff in registry by invoking a zapp it can enforce "no overwrite" rule by using specific headers (If-None-Match for example).

Can you please elaborate on that? If you can provide some more details, I'll edit the spec and put it in.

mgeisler · 2014-07-01T12:01:45Z

@larsbutler

Implementing quotas can be done in many ways and I don't think that's the core problem we're solving here.

First of all, I never said it was the "core problem"; it's "a problem", one of the many which needs to be solved in order to build this thing.

What would you perceive to be the core problem, then? (Just saying "that's not the core problem" is not helpful or constructive, without offering your ideas about the core problem.)

The biggest unknown I see is how to let the server-side code accept files, check them, and put them into Swift. The output objects from a given job are fixed before the job starts today, but the way I think about it, the ZeroVM job that inspects the tarball would need to decide on output objects the tarball is to be stored in. AFAIK, that isn't supported today, so I'm unsure how we would do this.

One option might be to use the tempurl feature: that way the registry zapp can give clients a token that allows them to make a specific PUT request. That could enforce that uploads go where we want them. We would have no idea what the user uploads, though. So something like my original scheme might be needed:

the client requests a tempurl token for a scratch area
the client uploads the zapp
the client invokes a registry zapp with the Swift path to the just uploaded zapp
if the registry zapp accepts the zapp, it writes a job description with the correct output object name
the client invokes this job description, which then installs the zapp in the right place
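For the first step of this scheme, a tempurl for a single PUT could be generated roughly as below. Swift's TempURL middleware signs the string "<METHOD>\n<expires>\n<path>" with an account-level secret key; this sketch uses HMAC-SHA1, and a given deployment may be configured to require a different digest. The host, path, and key values are placeholders, and make_temp_url is a hypothetical helper name.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode


def make_temp_url(host, path, key, method="PUT", ttl_seconds=300, now=None):
    """Build a Swift-style temporary URL allowing one method on one path.

    `now` can be passed explicitly to make the result reproducible.
    """
    expires = int((time.time() if now is None else now) + ttl_seconds)
    # The signed message is "<METHOD>\n<expires>\n<path>".
    message = f"{method}\n{expires}\n{path}".encode("utf-8")
    signature = hmac.new(key.encode("utf-8"), message, hashlib.sha1).hexdigest()
    query = urlencode({"temp_url_sig": signature, "temp_url_expires": expires})
    return f"{host}{path}?{query}"


# Placeholder values: a scratch container in a hypothetical registry account.
print(make_temp_url("http://example.com",
                    "/v1/AUTH_registry/scratch/upload.zapp",
                    "secret-key"))
```

The signature covers the method and the exact object path, so the client can only PUT to the scratch location the registry chose; it says nothing about what is uploaded, which is why the later checking steps are still needed.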

There are obvious pitfalls here: someone needs to clean up at various stages if the client goes away before finishing all steps. I'm also not sure if we can allow anonymous people to invoke zapps and still restrict them to only invoke zapps using pre-defined job descriptions.

So there are still some unknowns here: hence me thinking that this is where you'll end up with most of the effort.

pkit · 2014-07-01T12:03:59Z

@larsbutler

if the download action will never grab an old version, why keep it around?

If we will have dependencies it will matter. If we won't - why do we need a registry? :)

Can you please elaborate on that?

If you do a PUT with If-None-Match: * request header the PUT will succeed only if the object with that name does not exist already. We cannot enforce the header, because it's a request header, but we can allow storing files in registry only by invoking a zapp with specific parameters, and the zapp then can use proper request headers.
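A minimal sketch of such a create-only PUT, using Python's standard urllib. The URL, token, and function names are placeholders; the only parts taken from the discussion are the PUT method and the If-None-Match: * header. A 412 (Precondition Failed) response is treated as "an object with that name already exists".

```python
import urllib.error
import urllib.request


def build_create_only_put(url, body, auth_token):
    """Build a PUT that succeeds only if no object exists at `url`."""
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"If-None-Match": "*", "X-Auth-Token": auth_token},
    )


def store_once(request):
    """Send the request; return True if stored, False if the name is taken."""
    try:
        with urllib.request.urlopen(request) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as error:
        if error.code == 412:  # Precondition Failed: object already exists
            return False
        raise
```

In the design discussed here, this request would be issued by the registry zapp rather than by the end user, since a client talking to Swift directly could simply omit the header.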

larsbutler · 2014-07-01T12:09:59Z

@pkit

If we will have dependencies it will matter. If we won't - why do we need a registry? :)

It depends on how granular you want your dependency specification to be. If I want 0.1.9 as a dependency, would I be allowed to specify 0.1.9-2? I was thinking that we wouldn't do this; instead, one would specify just 0.1.9 and will get the latest revision of 0.1.9, whatever happens to be available.

pkit · 2014-07-01T12:12:57Z

@larsbutler

would I be allowed to specify 0.1.9-2

Yes, probably it's a good idea. And also any other variant.
Something like Debian: package >=0.1.9, package < 0.1.10 or package >= 0.1 or package = 0.1.9-5
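To make the idea concrete, here is a small sketch of how a client might evaluate constraints like these. It is not Debian's actual version comparison algorithm: versions are assumed to be dot-separated integers with an optional -<revision> suffix, a missing revision counts as 0, and all names are hypothetical.

```python
import operator

OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """'0.1.9-5' -> ((0, 1, 9), 5); a missing revision counts as 0."""
    version, _, revision = text.strip().partition("-")
    numbers = tuple(int(part) for part in version.split("."))
    return numbers, int(revision) if revision else 0


def satisfies(candidate, constraint):
    """Check one constraint such as '>= 0.1.9' or '= 0.1.9-5'."""
    constraint = constraint.strip()
    # Try two-character operators before one-character ones.
    for symbol in (">=", "<=", "=", ">", "<"):
        if constraint.startswith(symbol):
            wanted = constraint[len(symbol):]
            compare = OPERATORS[symbol]
            return compare(parse_version(candidate), parse_version(wanted))
    raise ValueError(f"no comparison operator in {constraint!r}")


# A range such as ">= 0.1.9, < 0.1.10" is the conjunction of two checks.
print(all(satisfies("0.1.9-2", c) for c in (">= 0.1.9", "< 0.1.10")))
```

One consequence of comparing tuples this way is that 0.1 and 0.1.0 are treated as different versions; a real implementation would need to decide how to normalize them.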

larsbutler · 2014-07-01T12:14:33Z

@pkit

Yes, probably it's a good idea. And also any other variant.
Something like Debian: package >=0.1.9, package < 0.1.10 or package >= 0.1 or package = 0.1.9-5

Okay, agreed.

mgeisler · 2014-07-01T12:37:15Z

@larsbutler

If I want 0.1.9 as a dependency, would I be allowed to specify 0.1.9-2? I was thinking that we wouldn't do this; instead, one would specify just 0.1.9 and will get the latest revision of 0.1.9, whatever happens to be available.

What you're saying here is (apparently) that the full version number (0.1.9-2) isn't the version number of the software. Instead it's something else — an internal version number of the registry.

This means that you allow people to upload different packages and still give them the same version number (0.1.9). That should not be allowed and I think you also think so based on what you said earlier.

I think you should avoid over-thinking this part. Let users decide on version numbers and the semantics. Let the registry maintain a version->zapp mapping, with the constraint that the version numbers are unique per zapp. That is the semantics developers are used to from other package indexes.

As for dependencies between zapps: we've talked about this before and zapps were designed to be self-contained. I would also like to see something like libraries in the future, but that's still far away. Even when we have some notion of libraries, I expect it to be the clients that download the dependencies. So let the clients decide how they want to resolve >= 0.1.9.
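The version->zapp mapping with unique version numbers per zapp can be illustrated with a small in-memory model. This is only a sketch of the semantics; the class and method names are hypothetical, and a real registry would keep this state in Swift rather than in a dictionary.

```python
class VersionTaken(Exception):
    """Raised when a zapp version has already been published."""


class Registry:
    def __init__(self):
        # zapp name -> {version string -> storage location}
        self._zapps = {}

    def publish(self, name, version, location):
        """Record a new version; refuse to replace an existing one."""
        versions = self._zapps.setdefault(name, {})
        if version in versions:
            raise VersionTaken(f"{name} {version} is already published")
        versions[version] = location

    def versions(self, name):
        """Return the published versions of a zapp, in publication order."""
        return list(self._zapps.get(name, {}))


registry = Registry()
registry.publish("geomet", "0.1.9", "geomet/geomet-0.1.9.zapp")
registry.publish("geomet", "0.2.0", "geomet/geomet-0.2.0.zapp")
print(registry.versions("geomet"))
```

The registry treats the version as an opaque string chosen by the developer; it only enforces that the same string is never published twice for the same zapp.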

larsbutler · 2014-07-01T13:21:56Z

@mgeisler

Let the registry maintain a version->zapp mapping, with the constraint that the version numbers are unique per zapp. That is the semantics developers are used to from other package indexes.

Okay, fair enough.

larsbutler · 2014-07-01T13:25:49Z

Okay, I think I've received enough feedback to fix/rewrite some parts of the spec. Let me take another stab at this and see where we land.
