Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CEP: run_exports in sharded repodata #108

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 87 additions & 0 deletions cep-0016-2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
<table>
<tr><td> Title </td><td> Run-exports in sharded Repodata. </td>
<tr><td> Status </td><td> Draft </td></tr>
<tr><td> Author(s) </td><td> Bas Zalmstra &lt;[email protected]&gt;</td></tr>
<tr><td> Created </td><td> Jan 16, 2025</td></tr>
<tr><td> Updated </td><td> Jan 16, 2025</td></tr>
<tr><td> Discussion </td><td> </td></tr>
<tr><td> Implementation </td><td> </td></tr>
</table>

# Run-exports in sharded Repodata

We propose to add run-export information to the sharded repodata shards.

## Motivation

When building conda packages the build infrastructure needs to extract run-export information from conda packages in the host- and build environments.
Run-export information is stored in a package and can be extracted by downloading the package and extracting the `run_exports.json` file.
Even with the possibility to stream parts of `.conda` files this is a relatively resource-intensive operation.

[CEP-12](https://github.com/conda/ceps/blob/main/cep-0012.md) formalized a `run_exports.json` file that is stored next `repodata.json` file.
However, not all channels on the default server (conda.anaconda.org) provide this information which means falling back to downloading and extracting this information from the packages. It is possible to extract the data by only sparsly reading the file but the overhead is still relatively large.

Having two separate files also poses some problems as extra mechanisms have to be introduced in the build infrastructure to manage and sync both files on the build machines.

## Specification

CEP-12 mentions the following reasons for splitting the information into two files:

> * It would require extending the repodata schema, currently not formally standardized.
> * It would increase the size of the already heavy repodata.json files.
> * (Typed) repodata parsers would need to be updated to handle the new field.
We propose that these reasons no longer hold with [sharded repodata](https://github.com/conda/ceps/blob/main/cep-0016.md).

**It would require extending the repodata schema, currently not formally standardized.**

We propose to add a `run_export` field to each record that mimics the specification from CEP-12.

If the `run_export` field is not present in the record it means no `run_export` information is stored with the record, and a fallback mechanism should be used to acquire the run-export information.

Since *adding* a field will not break existing parsers we feel this is safe and does not require a schema change.

**It would increase the size of the already heavy repodata.json files.**

The size of the shards would grow if run-exports are added.

Let's take a look at the current sizes of run_exports.json and repodata.json files.

| channel + subdir | size of repo_data.json | size of run_exports.json | repodata.json.zst | run_exports.json.zst |
|------------------|------------------------|--------------------------|--------------------------|-----------------------------|
| conda-forge + linux-64 | 254 MB | 34.7 MB (11%) | 38.4 MB | 2.2MB (5%) |
| conda-forge + noarch | 107 MB | 13.6 MB (11%) | 16.7 MB | 0.9 MB (5%) |
| conda-forge + osx-arm64 | 99.8 MB | 11.8 MB (11%) | 12.6 MB | 0.8 MB (6%) |
| conda-forge + win-64 | 185 MB | 22.1 MB (11%) | 24.7 MB | 1.4 MB (5%) |

Since the repodata shards are also compressed we can conclude that in practice adding run exports information would increase the size of the repodata shards by roughly 5-6%.

With the introduction of sharded repodata in [CEP-16](https://github.com/conda/ceps/blob/main/cep-0016.md) the issues with size (and scale) have been effectively mitigated. Adding 5-6% to the total size of the shards will not pose a risk since all advantages of sharded repodata mentioned in the CEP still hold.

**(Typed) repodata parsers would need to be updated to handle the new field.**

Unless parsers are very strict about unknown fields (which was not a requirement for sharded repodata shards) this will not pose an issue. Since the absence of the field means that the state of the run-exports is unknown adding the field does not require a schema change.

## Patching

With the run-exports part of the repodata we propose to also allow repodata patching these fields. The original run-exports can still be extracted from the packages. We do not see a reason how the run-export information is different from other patchable information stored in the repodata (like the dependencies).

To facilitate these patches a new file is added to the repodata-patches package called `run_exports_patch_instructions.json` so that `run_exports` can be modified per package:

```json
{
"packages": {
"_libarchive_static_for_cph-3.3.3-h0921cf1_1.tar.bz2": {
"run_exports": {
"weak": ["libarchive >=3,<4"],
}
},
...
}
}
```

baszalmstra marked this conversation as resolved.
Show resolved Hide resolved
For backwards compatibility reasons we propose *not* to add the patches to the existing `patch_instructions.json` file because older implementations do not have to ability to filter certain instructions. Without that ability the `run_exports` patches would end up in the `repodata.json` file which is undesirable.

For a channel to provide run-export patches, it MUST contain a `run_exports.json` file.
For build tools to support patched run exports it MUST always query the sharded repodata, or use a `run_exports.json` file as the source of truth for the run exports.
If neither sharded repodata or a `run_exports.json` file is present build tools can assume no run export patches exist.