use a URI-based scheme for package names like Java #20183


Closed
andrewrk opened this issue Jun 4, 2024 · 23 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. zig build system std.Build, the build runner, `zig build` subcommand, package management

@andrewrk
Member

andrewrk commented Jun 4, 2024

Related issues:

In order to detect that two different fetched packages represent different versions of the same project, with one intended to supersede the other, we need some kind of stable identifier. Zig package management is decentralized, so there is no global namespace that serves this purpose.

We have the name field of packages, but it is not globally unique.

... or is it? This proposal takes inspiration from Java's solution to this problem which is to use the domain name system and canonical URIs, so you end up with names such as com.sun.foo.bar.ArrayList. Although Zig supports URL mirrors, canonical URIs can still be used as globally unique package identifiers.

Example fully-qualified package names would be:

  • com.github.scottredig.JavascriptBridge
  • me.andrewkelley.groovebasin
  • org.codeberg.river.river
  • org.ziglang.zig

In order to discourage people from fighting over unqualified names and causing general chaos, Zig would emit an error if a package fetched by URL was missing a top level domain name.

Whether the person owns the domain names they use is not enforced. When you choose to add a dependency on a package, you can notice whether they use their own domain name or someone else's, and factor that into the human decision of whether to trust them. Maybe someone decides to take over an abandoned project and continue using the old maintainer's domain name for the project, for a seamless transition. If that new maintainer has built trust in the community, people would generally find this acceptable.

As for #14288, there would no longer be an id field introduced. The name field acts as the globally unique identifier that persists across versions.

As for #20178, only the last path segment in the name would be used. For example:

JavascriptBridge-1.2.3-iDYAACr46GhU
groovebasin-2.0.0-NQ8kAD5eWxrE
river-3.1.0-eTYAAFRBXp0H
zig-0.14.0-cW8vBMOZSHPt
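
A minimal sketch of how the short label could be derived from a fully-qualified name, assuming the label is simply `<last segment>-<version>-<hash>`. The function names are hypothetical, and the hash suffix is treated as an opaque string because its format is defined by #20178, not here.

```python
def short_name(qualified_name: str) -> str:
    """Return the last dot-separated segment of a fully-qualified package name."""
    return qualified_name.rsplit(".", 1)[-1]


def package_label(qualified_name: str, version: str, hash_suffix: str) -> str:
    """Build a `<name>-<version>-<hash>` label; `hash_suffix` is opaque here."""
    return f"{short_name(qualified_name)}-{version}-{hash_suffix}"


# Prints "river-3.1.0-eTYAAFRBXp0H"
print(package_label("org.codeberg.river.river", "3.1.0", "eTYAAFRBXp0H"))
```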

Finally, as for #20180, this could potentially provide a way to override an entire project across the full dependency tree, to quickly find out if a particular patch, for example, could be used to make all the different usage sites of the package satisfied, thus being a candidate for a global override.

@andrewrk andrewrk added the proposal and zig build system labels Jun 4, 2024
@andrewrk andrewrk added this to the 0.14.0 milestone Jun 4, 2024
@sno2
Contributor

sno2 commented Jun 4, 2024

Is there any way we could avoid having to read package names backwards and forwards to be able to find the actual URL? We could try a restricted version of Go's approach and use github.com/ziglang/zig where everything after the last / is used in #20178.

@andrewrk
Member Author

andrewrk commented Jun 4, 2024

In this proposed system, there is no actual URL corresponding to a package name. URLs are only a fetch mechanism, and the same package could be fetched via an arbitrary number of different URLs, each with different domain names and schemes. Why would you want such an "actual URL" anyway?

@squeek502
Collaborator

What's the benefit of the domain name over just a <namespace>.<name> format, so e.g. squeek502.resinator, ziglang.zig, etc?

@mlugg
Member

mlugg commented Jun 4, 2024

One criticism I would have of this is it functionally requires you to own a [sub]domain in order to create a package. I recognise this is pretty easy with services like GitHub Pages, but it's an extra roadblock which I don't think ought to exist. If Layperson Bob wants to make a high-quality package that others can use, which conforms to our typical naming convention, I don't think we should require him to set up a personal domain. Perhaps they could use com.github.username.repo to effectively include the path components of a URL, but then that ties the package name to the version control host the author happened to use when they first created the package.

@andrewrk
Member Author

andrewrk commented Jun 4, 2024

> What's the benefit of the domain name over just a <namespace>.<name> format, so e.g. squeek502.resinator, ziglang.zig, etc?

How do you ensure that <namespace> is unique? andrewrk is already taken on many sites I register on, for example.

@andrewrk
Member Author

andrewrk commented Jun 4, 2024

> If Layperson Bob wants to make a high-quality package that others can use, which conforms to our typical naming convention, I don't think we should require him to set up a personal domain.

In this example, what URL will users of Layperson Bob's package use to fetch it?

@sno2
Contributor

sno2 commented Jun 4, 2024

> In this proposed system, there is no actual URL corresponding to a package name. URLs are only a fetch mechanism, and the same package could be fetched via an arbitrary number of different URLs, each with different domain names and schemes. Why would you want such an "actual URL" anyway?

I'm not implying that it is better. I just find it an inconvenience to mentally turn your format into a URL. We can say that this format is not restricted to being a valid URL, but in almost every case (and example you wrote in this proposal) I'd argue it'd be one.

@mlugg
Member

mlugg commented Jun 4, 2024

> In this example, what URL will users of Layperson Bob's package use to fetch it?

I imagine a GitHub archive URL, like many people already use today (or perhaps a Git URI). However, there's a big difference between using this as the fetch URL and the package name: if Bob later decides that GitHub's pivot to AI is leading to too many poor business decisions, and moves his repos over to GitLab, I don't think he should be stuck with this legacy package name.

@squeek502
Collaborator

squeek502 commented Jun 4, 2024

> How do you ensure that <namespace> is unique? andrewrk is already taken on many sites I register on, for example.

Why does it need to be unique, or, alternatively, how is the uniqueness of the domain enforced/what is the uniqueness meant to represent? Couldn't I create my own org.ziglang.zig package and distribute it however I want?

@mlugg
Member

mlugg commented Jun 4, 2024

@squeek502 The idea isn't that it would be technically enforced, but that it would just give a unique name you could use. This is effectively a convention (the only enforcement would be the leading TLD requirement), but the benefit of the convention is that when you follow it, by using a [sub]domain you own, you'll get a package name which is unique amongst everyone else following the convention.

Packages need some kind of unique identifier for overrides, deduplication, and version management. This proposal is an alternative to the id field suggested by #14288.

@andrewrk
Member Author

andrewrk commented Jun 4, 2024

> if Bob later decides that GitHub's pivot to AI is leading to too many poor business decisions, and moves his repos over to GitLab, I don't think he should be stuck with this legacy package name.

I think this is the main downside of the proposal; people will be tempted to pointlessly rename their projects, causing completely unnecessary churn for package users.

> Why does it need to be unique, or, alternatively, how is the uniqueness of the domain enforced/what is the uniqueness meant to represent?

The build system needs a way to find out when the dependency tree has multiple package versions of the same project, so that it can implement features such as:

  • version selection
  • detecting when an updated version is available
  • many other kinds of tooling that I didn't think of right this second

Imagine the situation that would occur if two different projects were both named "abc" and the build system swapped the higher version number in for the lower version one. This is not a problem for centralized package managers because they have a single, global namespace.
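
A minimal sketch of why version selection depends on a stable identifier, assuming a "highest version wins" policy and plain `major.minor.patch` version strings (no pre-release tags). The function names and the `example` domains are hypothetical and not part of the Zig build system.

```python
from collections import defaultdict


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def select_versions(packages: list[tuple[str, str]]) -> dict[str, str]:
    """Group (name, version) pairs by name and keep the highest version of each."""
    versions_by_name: dict[str, list[str]] = defaultdict(list)
    for name, version in packages:
        versions_by_name[name].append(version)
    return {
        name: max(versions, key=parse_version)
        for name, versions in versions_by_name.items()
    }


# Two unrelated projects that both call themselves "abc" get merged:
print(select_versions([("abc", "1.0.0"), ("abc", "2.3.0")]))
# With qualified names, they stay separate:
print(select_versions([("org.example.abc", "1.0.0"), ("net.example.abc", "2.3.0")]))
```

In the first call the unrelated 2.3.0 package silently replaces 1.0.0, which is the failure mode described above; in the second call each project keeps its own version.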

The only alternative to this proposal is to generate a unique random identifier and either attach it to the name, or have another id field in the package manifest.

@nektro
Contributor

nektro commented Jun 4, 2024

as an example of using this in the wild, Zigmod went with the id field and uses a 40-character random hex[1] string and it has worked out great. additionally, zigmod init will generate one for you so the UX of it is nearly transparent.

edit: it got this idea from the original package manager thread [2] and has been using it for over 3 years, the entire life of the project

@squeek502
Collaborator

squeek502 commented Jun 4, 2024

> the only enforcement would be the leading TLD requirement

What type of enforcement would this be out of curiosity? Would this mean a dependency on something like the public suffix list data (a moving target)?

@andrewrk
Member Author

andrewrk commented Jun 4, 2024

Basically just that you have at least one byte followed by . followed by at least one byte followed by . followed by at least one byte.

And then an error message that suggests using a domain name when there are no dots.

In other words, just enough to make following the convention the path of least resistance.
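
A minimal sketch of such a check, assuming the rule is read as "at least three non-empty, dot-separated segments". This is slightly stricter than the literal byte pattern, which would also accept consecutive dots. The function name, the error wording, and the `org.example` suggestion are all hypothetical.

```python
def check_package_name(name: str) -> None:
    """Raise ValueError unless `name` has the shape `a.b.c` (two or more dots)."""
    segments = name.split(".")
    if len(segments) < 3 or any(segment == "" for segment in segments):
        # Hypothetical hint: nudge the author toward a domain-based name.
        suggestion = "org.example." + (segments[-1] or "mypackage")
        raise ValueError(
            f"package name {name!r} is not fully qualified; "
            f"consider a domain-based name such as {suggestion!r}"
        )


for candidate in ["org.ziglang.zig", "zig"]:
    try:
        check_package_name(candidate)
        print(f"{candidate}: ok")
    except ValueError as err:
        print(f"{candidate}: {err}")
```

Note that nothing here verifies ownership of the domain; the check only makes the convention the easiest option.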

@thejoshwolfe
Contributor

it's unclear to me what problem this proposal solves over using a random id. i'm also concerned along the same lines as @mlugg.

My own experience as a young aspiring software developer in college looking at java's convention was that i thought i had to buy a domain name to be a "real programmer", which we don't want to communicate with this. i wouldn't want the legitimacy of someone's opensource contributions to humanity to be implicitly tied to a domain name someone has to pay for. Either you pay 10USD per year, or thereabouts, or you decide which sugardaddy host platform is going to pay for you, like github.

but your hosting provider is just an implementation detail, unrelated to the identity of your project. sure using ICANN and related registries to resolve naming conflicts works, but why implicitly endorse any centralized authority when RNG is the only authority you really need? and just back to my original point, why is having a long name desirable over having a short name plus a random number?

@mlugg
Member

mlugg commented Jun 4, 2024

The issue I anticipate with a random number is unintentional duplication -- for instance, someone copying build.zig.zon from a template they have (I personally never use zig init, because I only ever care about build.zig.zon and some parts of build.zig; it's quite literally faster for me to just write it myself, but even better is to copy from a personal template) and forgetting to regenerate the ID.

A check which could help here is storing an association between known package IDs and names in the global cache; that way, when you first try to build your new project, Zig can notice that the ID is already known but under a different name, and emit an appropriate error. Ideally we'd also have a command to just regenerate the ID, perhaps zig init --generate-id.
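
A minimal sketch of that association check, assuming an in-memory dictionary as a stand-in for the global cache. The class name, method name, and the example ID string are hypothetical.

```python
class KnownPackageIds:
    """Remembers which package name each ID was first seen with."""

    def __init__(self) -> None:
        self._name_by_id: dict[str, str] = {}

    def check(self, package_id: str, name: str) -> None:
        """Record the (id, name) pair, or raise if the ID belongs to another name."""
        known_name = self._name_by_id.setdefault(package_id, name)
        if known_name != name:
            raise ValueError(
                f"ID {package_id!r} is already known for package {known_name!r}; "
                f"did you copy a manifest and forget to regenerate the ID for {name!r}?"
            )


cache = KnownPackageIds()
cache.check("3f9a", "groovebasin")   # first sighting: recorded
cache.check("3f9a", "groovebasin")   # same pair again: fine
try:
    cache.check("3f9a", "river")     # same ID under a different name
except ValueError as err:
    print(err)
```

The limitation is the one noted in the thread: the check only fires if the other package with that ID has already been seen on this machine.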

An alternative, which would certainly work for me personally, is something like zig init --minimal, which could sidestep the "manual template" issue altogether. But still, someone would inevitably come along with a weird workflow that caused strange breakages.

Duplicate IDs are a particularly big problem because they can be quite hard to catch. Ignoring the system I proposed above, if two packages are unintentionally given the same ID, then we might see no problems whatsoever -- up until those packages are used by the same project at some point. We'd really like to avoid this.

I've just come up with another idea to perhaps solve this duplicate id issue. What if the package ID contained a kind of checksum, such that a given ID is only valid for packages of the same name? I'm thinking a process like this:

  • Generate a 128-bit random number; for lack of a better term I'll call this the "main ID"
  • Run CRC32 or something over main_id ++ package_name
  • The id field is main_id ++ crc_sum

This has a similar effect to the global package ID <-> name association idea, but removes unnecessary global state, and consequently doesn't require the package with that ID to have been used on this system before. Thinking about it, I quite like this idea.

@judofyr

judofyr commented Jun 4, 2024

> The only alternative to this proposal is to generate a unique random identifier and either attach it to the name, or have another id field in the package manifest.

Oh, I have another alternative: Introduce a $version variable that can be used inside the URL in the dependant, and then say that the package is identified by its URL before the version expansion:

.{
    .dependencies = .{
        .hello = .{
            .url = "https://example.com/hello/v$version.tar.gz",
            .version = "1.2.8",
            .hash = "...",
        },
    },
}

This would enable the build system to use the latest semantic version of https://example.com/hello/v$version.tar.gz across the whole dependency tree, and we're guaranteed that it won't be mixed up with another package – without having to depend on any identifier inside the package itself.

This will work nicely with GitHub tags as well: https://github.com/OWNER/REPO/archive/v$version.tar.gz.
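
A minimal sketch of the idea, assuming `$version` is substituted by plain string replacement and that the unexpanded template is used verbatim as the identity key. The `Dependency` class and its members are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    url_template: str  # contains the literal placeholder "$version"
    version: str

    @property
    def identity(self) -> str:
        """The package is identified by its URL before version expansion."""
        return self.url_template

    def fetch_url(self) -> str:
        """Expand the placeholder to get the URL that is actually fetched."""
        return self.url_template.replace("$version", self.version)


old = Dependency("https://example.com/hello/v$version.tar.gz", "1.2.8")
new = Dependency("https://example.com/hello/v$version.tar.gz", "1.3.0")

print(old.fetch_url())               # https://example.com/hello/v1.2.8.tar.gz
print(old.identity == new.identity)  # True: same package, different versions
```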

@mlugg
Member

mlugg commented Jun 4, 2024

That solution disallows having multiple mirrors for one package, enforces a URL scheme, and in fact even disallows upstreams from ever changing their archive URLs - it's a non-starter.

@judofyr

judofyr commented Jun 4, 2024

> That solution disallows having multiple mirrors for one package, enforces a URL scheme, and in fact even disallows upstreams from ever changing their archive URLs - it's a non-starter.

Not sure why this makes it a "non-starter" instead of "an idea that we can expand and build on top of and then evaluate":

  • Support for mirrors can be included in a multitude of ways:
    • The dependency in build.zig.zon can have a .mirror = "https://…" field which overrides where the package is actually fetched from. The .url is then only used for identification.
    • We can add support in build.zig.zon for overriding the URL across the whole dependency tree: .mirrors = .{ .{.from = "<URL>", .to = "<URL>" } }.
    • The proposed solution of "ID inside the package itself" means that it's actually impossible for the build system to determine anything before it has actually fetched the packages. It's up to a package to decide the mirror for its dependency for all of its users.
    • I assume that some users would like "global" mirroring where all packages go through a local proxy. I don't think this is impacted by any of these solutions.
  • As for changing the upstream URL: I'm assuming that version selection will only happen inside the same major version which makes each separate major version conceptually completely "different" packages. If an upstream decides to change their canonical URL then this would be considered a new "package / major version". This doesn't sound too bad to me. Preferably the upstream will change the URL in a major version anyway. Either way, all dependants would have to update their dependencies to the new upstream URL.
  • And if the problem is "what if the URL goes away", then I would say that the currently proposed solution doesn't really provide a good experience for that. If Package A depends on Package B and Package B's URL starts returning 404, then an application developer (i.e. end-user of build.zig) would have no way of getting their build to work again without getting Package A to both update its dependency and upgrade to it. This scenario is much easier to handle if a package is identified by its URL instead of something inside itself.
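
A minimal sketch of how the tree-wide `.mirrors = .{ .{ .from, .to } }` override from the list above might behave, assuming exact-match rewriting where the first matching entry wins. The function name and the mirror host are hypothetical; the identifying URL stays untouched and only the fetch location changes.

```python
def resolve_fetch_url(identity_url: str, mirrors: list[tuple[str, str]]) -> str:
    """Return the mirror for `identity_url` if one is declared, else the URL itself."""
    for from_url, to_url in mirrors:
        if identity_url == from_url:
            return to_url
    return identity_url


mirrors = [
    (
        "https://example.com/hello/v$version.tar.gz",
        "https://mirror.example.org/hello/v$version.tar.gz",  # hypothetical mirror
    ),
]

print(resolve_fetch_url("https://example.com/hello/v$version.tar.gz", mirrors))
print(resolve_fetch_url("https://example.com/other/v$version.tar.gz", mirrors))
```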

EDIT: I realized that I might have misunderstood the current proposal here and that the full package ID is also part of the dependency definition. That would of course simplify matters quite a lot.

@thejoshwolfe
Contributor

> • Run CRC32 or something over main_id ++ package_name

@mlugg This means you can never rename a package, which seems fine maybe? renaming a package is indistinguishable from making the mistake you're trying to avoid. if we really wanted to support renaming, i can imagine something like an additional field called "original_name" and then hopefully the copypaste workflow would remember to remove that when making a new file. but that seems pretty low stakes by that point, and i agree that your suggestion to prevent the typo is pretty important.

@nissarin

nissarin commented Jun 4, 2024

Why not use an ID (a UUID?) as the namespace, i.e. <package_name>.<namespace_id>? This way it's not tied to a specific URL, and a fork or some completely unrelated package with the same name will just have a different namespace_id.
It's in that order to keep the name human-readable; you could imagine it on disk as /package_name/namespace_id/versions, for example.

Of course there can still be a collision if someone intentionally uses the same values, but there is little you can do about that without a centralised system, unless you want to use something like a crypto key hash as the ID and use the key for signing packages at the same time?

@ghost

ghost commented Jun 4, 2024

I agree with @mlugg on this. Since the naming mechanism is purely cooperative anyway, the best we can do is to prevent accidental name clashes. We don't need a lot of entropy for that, though, so a 128-bit base_ID may be overkill. We could settle for a 64-bit ID, the first half being a random seed, followed by 32 bits from sha256(seed ++ name). This way the package manager can enforce a freshly generated ID for any given package name without causing too much friction.
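
A minimal sketch of this 64-bit variant. The comment does not say which 32 bits of the digest to keep or how the name is encoded, so taking the first four digest bytes and encoding the name as UTF-8 are assumptions. The function names are hypothetical.

```python
import hashlib
import secrets

SEED_LEN = 4   # 32-bit random seed
CHECK_LEN = 4  # 32 bits taken from the SHA-256 digest


def _check_bits(seed: bytes, name: str) -> bytes:
    """First 32 bits of sha256(seed ++ name)."""
    return hashlib.sha256(seed + name.encode("utf-8")).digest()[:CHECK_LEN]


def generate_id64(name: str) -> bytes:
    """Return an 8-byte ID: random seed followed by the name-dependent check bits."""
    seed = secrets.token_bytes(SEED_LEN)
    return seed + _check_bits(seed, name)


def verify_id64(package_id: bytes, name: str) -> bool:
    """Check that the ID was generated for this package name."""
    if len(package_id) != SEED_LEN + CHECK_LEN:
        return False
    seed, check = package_id[:SEED_LEN], package_id[SEED_LEN:]
    return check == _check_bits(seed, name)


package_id = generate_id64("groovebasin")
print(package_id.hex(), verify_id64(package_id, "groovebasin"))
```

With only 32 random bits in the seed, this guards against accidental reuse rather than deliberate collisions, which matches the cooperative intent described above.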

@andrewrk
Member Author

andrewrk commented Jun 4, 2024

> • And if the problem is "what if the URL goes away", then I would say that the currently proposed solution doesn't really provide a good experience for that. If Package A depends on Package B and Package B's URL starts returning 404, then an application developer (i.e. end-user of build.zig) would have no way of getting their build to work again without getting Package A to both update its dependency and upgrade to it.

This is incorrect. Package A can declare a mirror for Package B's URL, and then the build proceeds as before.

@andrewrk andrewrk closed this as not planned Jun 4, 2024
@andrewrk andrewrk modified the milestones: 0.14.0, 0.13.0 Jun 4, 2024

8 participants