Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conda package #119

Closed
mara004 opened this issue Aug 22, 2023 · 62 comments
Closed

Conda package #119

mara004 opened this issue Aug 22, 2023 · 62 comments

Comments

@mara004
Copy link
Contributor

mara004 commented Aug 22, 2023

Hi @bblanchon,

There have been some requests (e.g. from doctr) that pypdfium2 should have a conda package.
Basically the pre-requisite for this would be a conda package for pdfium (see below).

The Readme says:

I can provide packages for your favorite package manager, but I need help from someone who knows the format.

Unfortunately, my knowledge about conda is very limited as I don't use it personally, but this would be my idea of the workflow:

  • Add a recipe to the repo that specifies metadata and files to include (conda/recipe/meta.yaml, spec)
    • Version may be set dynamically through an env var (version: {{ environ["VERSION"] }})
    • Files can be included through build.sh / bld.bat scripts.
  • Call conda build conda/ --output-folder conda/out/ in each runner after the binary was built
  • For cross-compiled binaries, call conda convert conda/out/$HOST_PLATFORM/*.tar.bz2 -p $TARGET_PLATFORM -o conda/out/ (for a list of conda platform names, see the choices for -p in conda convert --help)
  • Transfer the built package to the root runner (actions/upload-artifact, conda/out/$TARGET_PLATFORM/*.tar.gz)
  • Publish the gathered packages to anaconda. (ANACONDA_API_TOKEN=... anaconda upload $PACKAGES_DIR/*/*.tar.bz2 - not sure, probably something like this)

Presumably we need the action conda-incubator/setup-miniconda.


Why we need this

Initially I intended to bundle the binaries in pypdfium2's conda packages (as we do for PyPI wheels), but it turns out this is not really the way to go:

  • Bundling dynamic libraries is against the spirit of conda as a general package manager. Instead, they should be shareable across different dependents.
  • What's more, conda does not support packages to be platform specific and python version independent at the same time, but I'd prefer to avoid building factor n_py_versions - 1 unnecessary packages. If pdfium is packaged separately, pypdfium2 can just depend on that and cleanly specify noarch: python.
@mara004
Copy link
Contributor Author

mara004 commented Aug 22, 2023

CC @frgfm @felixT2K (mindee/doctr#113)

@bblanchon
Copy link
Owner

Hi @mara004,

I see that you already created the package 👍
https://anaconda.org/anaconda/pdfium-binaries

Let me know if you want to add it to the CI.

Best regards,
Benoit

@mara004
Copy link
Contributor Author

mara004 commented Aug 24, 2023

Hi @bblanchon,

oh, but that was not me! 🙃 When I searched before posting this issue I couldn't find that package yet.
I think it was @boldorider4, who mentioned he's packaging pypdfium2 as part of a customer project, but (unfortunately) intends to make the result private. However, apparently he publishes any dependencies of pypdfium2 that also need to be packaged, at least for the time being ... ?

However, his builds require manual interaction, and you can see from the https://anaconda.org/main/pdfium-binaries/files tab that he's not building the whole intersection of platforms supported by pdfium-binaries and conda, e.g. linux-32, linux-armv7l, win-32 and win-arm64 are missing.

So I think it would still be good to integrate conda into pdfium-binaries CI.
I can give this a try but am afraid my knowledge (and opinion) of bash scripts is poor.

@bblanchon
Copy link
Owner

Don't worry, I can handle the CI part 😎; I just need to know what to do.
@boldorider4, could you give us some direction on how to build this package?

@mara004
Copy link
Contributor Author

mara004 commented Aug 24, 2023

boldorider4, could you give us some direction on how to build this package?

@bblanchon I've just made a draft PR with packaging except for the CI part: #120
Unfortunately I'm not sure how to test it, the workflow doesn't want to run from a non-master branch in a fork...

@bblanchon
Copy link
Owner

I'll integrate this into the build and get back to you.

@bblanchon
Copy link
Owner

The first batch of Conda packages is available:
https://github.com/bblanchon/pdfium-binaries/actions/runs/5986613758#artifacts
How should we test them?

@mara004
Copy link
Contributor Author

mara004 commented Aug 27, 2023

Thanks a lot for your effort! We definitely need to think about testing.
However, even before that we can notice that something must be wrong with the osx packages, they're much smaller and just don't contain the binaries.

@bblanchon
Copy link
Owner

Nice catch! I’ll fix the glob.

@mara004
Copy link
Contributor Author

mara004 commented Aug 27, 2023

Also, while the windows packages do contain their binaries, I think they're not properly following conda conventions yet.

https://docs.conda.io/projects/conda-build/en/latest/user-guide/environment-variables.html says

Unix-style packages on Windows, which are usually statically linked to executables, are built in a special Library directory under the build prefix.

See also the table from #121 (comment).
(Not sure if this is strictly required, though.)

@bblanchon
Copy link
Owner

That part didn’t make any sense so I decided to use the same layout and see if it works.

@mara004
Copy link
Contributor Author

mara004 commented Aug 27, 2023

Separate thread: when trying to install the linux package locally using conda install "./linux-64/pdfium-binaries-118.0.5961-0.tar.bz2" it breaks with conda.auxlib.exceptions.ValidationError: Invalid value 0 for build_number.

@mara004
Copy link
Contributor Author

mara004 commented Aug 27, 2023

That part didn’t make any sense so I decided to use the same layout and see if it works.

Hmm, I think that's just some conda convention to put stuff in a Library/ subdirectory on Windows. I'm not sure if conda install maybe handles that specially. Personally I'd just try to lay out packages as similar to conda build as possible for robustness in case receiver tools rely on these peculiarities. But if we can test it and it works I'm of course OK with it too.

@bblanchon
Copy link
Owner

bblanchon commented Aug 27, 2023

I downloaded the packages for lxml and they share the same layout:
https://repo.anaconda.com/pkgs/main/win-64/lxml-4.9.2-py39h2bbff1b_0.tar.bz2
https://repo.anaconda.com/pkgs/main/linux-64/lxml-4.9.2-py39h5eee18b_0.tar.bz2
See? No Library\lib nonsense.

However, they don't have the same layout as ours.
For example, here are the locations of the shared lib:
lib\python3.9\site-packages\lxml\etree.cpython-311-x86_64-linux-gnu.so
Lib\site-packages\lxml\etree.cp39-win_amd64.pyd

PS: I just started another build that should fix the missing dylib and the build_number

@mara004
Copy link
Contributor Author

mara004 commented Aug 27, 2023

I downloaded the packages for lxml and they share the same layout:
See? No Library\lib nonsense.

Well, lxml is a python package (with API level bindings). I'm under the impression conda python packages might have a bit of an own layout. It would be interesting to find another "C library only" windows package and see what layout it uses.

PS: I just started another build that should fix the missing dylib and the build_number

Thanks!

@mara004

This comment was marked as outdated.

@mara004

This comment was marked as outdated.

@mara004
Copy link
Contributor Author

mara004 commented Aug 27, 2023

OK conda clean did the trick. The build_number error might have been for unrelated reasons, I'm not sure.
However, I now get a new installation error:

pdfium-binaries-118. | ############################################################################################################################################################################################################# | 100% 

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/usr/lib/python3.11/site-packages/conda/exceptions.py", line 1114, in __call__
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.11/site-packages/conda/cli/main.py", line 86, in main_subshell
        exit_code = do_call(args, p)
                    ^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.11/site-packages/conda/cli/conda_argparse.py", line 90, in do_call
        return getattr(module, func_name)(args, parser)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.11/site-packages/conda/cli/main_install.py", line 20, in execute
        install(args, parser, 'install')
      File "/usr/lib/python3.11/site-packages/conda/cli/install.py", line 177, in install
        explicit(args_packages, prefix, verbose=not context.quiet)
      File "/usr/lib/python3.11/site-packages/conda/misc.py", line 97, in explicit
        assert not any(spec_pcrec[1] is None for spec_pcrec in specs_pcrecs)
    AssertionError

@bblanchon
Copy link
Owner

Can you add a pdb.set_trace() and analyze what's going on?

@mara004
Copy link
Contributor Author

mara004 commented Aug 27, 2023

I'm not used to pdb, but I've now added the trace before the assert and then did p specs_pcrecs.
It is ([MatchSpec("packages/linux-64::pdfium-binaries==118.0.5961.0=1"), None],) (with packages/ being the parent folder, and the invocation being conda install --offline "./linux-64/pdfium-binaries-118.0.5961.0-1.tar.bz2").
I tried to take a look at the surrounding code. PackageCacheData.query_all(spec) apparently fails by returning an empty iterator. But I have no idea what this code is supposed to achieve.

That said, I don't think it leads us anywhere, and this is not very nice to debug because the code is in /usr.

@mara004

This comment was marked as outdated.

@mara004

This comment was marked as outdated.

@bblanchon
Copy link
Owner

With the same command, I get:

The current version of conda is too old to install this package. (This version only supports paths.json schema version 1.) Please update conda to install this package.

It turns out, it expects an undocumented paths_version at the root of paths.json 🙄

@bblanchon
Copy link
Owner

After adding paths_version, the package installs correctly, however, I'm a bit surprised by the location of the installed files.
On my Docker setup, it installs files in /opt/conda with all the other conda files.
It's acceptable for /opt/conda/lib/libpdfium.so but the .h files are mixed with hundreds of other ones.

@bblanchon
Copy link
Owner

@mara004
Copy link
Contributor Author

mara004 commented Aug 28, 2023

Interesting. Unfortunately I'm still stuck with the same error desipte the new build.
If you're able to install the same package it might just be something wrong with my setup.
I installed conda from fedora 37 repos (not sure if it's up-to-date), think I should better have used the official miniconda installer.

@bblanchon
Copy link
Owner

I tried again with the Docker images continuumio/anaconda3 and continuumio/miniconda3 and it seems to work.

C:\> docker run --rm -v %CD%:/host continuumio/miniconda3 conda install /host/pdfium-binaries-118.0.5975.0-1.tar.bz2

Downloading and Extracting Packages


Downloading and Extracting Packages

Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

@mara004
Copy link
Contributor Author

mara004 commented Aug 28, 2023

Nice.

I will uninstall distro conda (actually more commands randomly fail) and reinstall miniconda cleanly, then test the linux package again. It might take some time. You don't need to wait for me if you think this is ready.

PS: The full version (e.g. 118.0.5975.0) seems inconvenient for pinning or specifying bounds. The build tag (e.g. 5975) is already unique of its own and easier to work with. Could the conda packages just use that as version, similar to this repo's tags? IIRC the full version is a bit complicated to get, so changing this would aid integration with pypdfium2.

@bblanchon
Copy link
Owner

Can't you pin to ==118.* or >=118?

@mara004
Copy link
Contributor Author

mara004 commented Aug 28, 2023

That often stays the same across different builds. I probably have to pin to an exact build because that's what the bindings file is tied to.
Also I don't currently have the full version in pypdfium2 and am not sure how to get/integrate that nicely.

@mara004
Copy link
Contributor Author

mara004 commented Aug 28, 2023

OK nevermind about version, I think we can handle this.
I agree the full version is visually nicer and consistency matters. Sorry for the fuss.
(Extracting the full version and translating to that from a shorthand version should actually be pretty straightforward. Moreover it's an enhancement to also expose the full version to the user.)

@bblanchon
Copy link
Owner

I tested again with a Fedora image and it works.

Here is the Dockerfile:

FROM fedora:37

RUN mkdir -p ~/miniconda3
RUN curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o ~/miniconda3/miniconda.sh
RUN bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
RUN rm -rf ~/miniconda3/miniconda.sh
RUN ~/miniconda3/bin/conda init bash

Here is the command I ran:

# conda install --offline pdfium-binaries-118.0.5975.0-1.tar.bz2 

Downloading and Extracting Packages
                                                                                                                                                                   

Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

@boldorider4
Copy link

@bblanchon @mara004,

my apologies for my silence. This thread is getting very long very quickly. Could you give me a brief summary what you guys are trying to do?
To my understanding, you packaged pdfium-binaries-118.0.5975 yourself, supposedly using a recipe of your own, and then now you're trying to install it locally (so not pulling the package from the default repo of anaconda, but rather from your locally built package). Is my understanding correct?

I may also package that using our standard recipe methods so you can simply pull it from main and don't have to install it --offline. But I want to be sure that there isn't any ABI-/API-breakage with pypdfium2 (trying to save myself from more effort).

Answering a few questions:

@mara004 Could the conda packages just use that as version, similar to this repo's tags?

you could specify in the conda recipe something along the lines of:
source:
git_url: [email protected]:bblanchon/pdfium-binaries.git
git_rev: "chromium/5975"
but it is not really encouraged, it's more condiac to rely on a pre-tarballed source archive.
source:
url: https://github.com/bblanchon/pdfium-binaries/releases/download/chromium%2F5975/pdfium-v8-mac-arm64.tgz # [osx and arm64]
sha256: <sha256> # [osx and arm64]

@bblanchon can you transfer ownership to bblanchon?

If you're referring to the feedstock where we keep our recipes, unfortunately I cannot. Those recipes are meant to be managed by the anaconda organization (they're not community repos).
If you really do wish to proceed with your own flavor of pdfium-binaries, I'd suggest to upload it to the conda-forge channel. But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.

The good news is that I am very close to patching the setup.py and setupsrc scripts of pypdfium2 so that the setup process seamlessly integrates into the conda eco system. In order not to transform this into maintenance nightmare, @mara004 could have a look at the recipe and related patch (the patch will be part of the same feedstock) and implement a switch into the setup.py that allows switching from the self-contained setup (the one that packages libpdfium.[dylib|so|dll] into a portable package) and the setup that is meant to integrate when running conda build or conda install. I will comment as much as possible into the patch to make this transition easier.

@mara004
Copy link
Contributor Author

mara004 commented Aug 29, 2023

I tested again with a Fedora image and it works.

@bblanchon OK. This is good because it suggests the packages are fine. Maybe there is some broken leftover from my former distro install of conda. I'll check.

@boldorider4 Thanks for joining this thread (sorry it grew so long). I'll reply to the questions directed at me later when I have more time.

@mara004
Copy link
Contributor Author

mara004 commented Aug 29, 2023

Answer part 1:

Could you give me a brief summary what you guys are trying to do?

TLDR We're trying to integrate conda packaging into pdfium-binaries CI directly to avoid dependence on a third party and necessity of manual interaction, and to package for all supported platforms.

supposedly using a recipe of your own

@bblanchon wrote a custom script (not a recipe) because of problems with conda build/convert.
See the master...conda2 diff.

I may also package that using our standard recipe methods so you can simply pull it from main and don't have to install it --offline.

Due to the direct integration we're trying to achieve, it is not possible to "merge" this into anaconda-recipes/pdfium-binaries-feedstock. (FWIW pdfium-binaries-feedstock does a lot more CI work than necessary and still ends up supporting less platforms, not to mention the very extensive dependence on native hosts.)

and then now you're trying to install it locally (so not pulling the package from the default repo of anaconda, but rather from your locally built package). Is my understanding correct?

Local installation works fine for @bblanchon but fails for me. I'm currently investigating why. This may not be overly relevant for the purposes of this thread, though.

you could specify in the conda recipe something along the lines of: [...]

We were takling about the version specifier in pdfium-binaries and a possible pin in the upcoming requirements section in pypdfium2's recipe, not about the source section of a third-party style recipe. 😅
However, this is settled, I think I know now how to pin, should the necessity arise.
FWIW your first suggestion were definitely not applicable, noone wants to build the binaries again, but of course use the pre-built archives. 🤪

But I want to be sure that there isn't any ABI-/API-breakage with pypdfium2 (trying to save myself from more effort).

See the second paragraph of the Outdated tab of #119 (comment). This is still my current state of knowledge, just not relevant for this thread anymore.
TLDR If not pinning/bounding pdfium-binaries in pypdfium2's recipe, ABI breakage (or just a degree of mismatch that renders the bindings unusable) is possible in principle, but comparatively unlikely due to pdfium's API stability practices.

@boldorider4
Copy link

OK, I see what you're doing.
I'm afraid my work on the feedstock recipe cannot be of help here.
Nonetheless, when my patch for setup.py and setupsrc is ready, I'd like you to have a quick review of it and let me know your thoughts. Ultimately, it would be really cool if you could merge it to the master of pypdfium2 by protecting the code with some sort of switch. Something like:

if os.environ['CONDA_BUILD_ENABLED']:
    # do what my patch does
else:
    # do what setup_base.py used to do, pulling all prebuilt binaries
     from pdfium-binaries and bundling them in the pypdfium package

Where CONDA_BUILD_ENABLED is a new switch settable somewhere, maybe at the toml level. I used an env var because that would be relatively easy to enable in our recipe.

@mara004
Copy link
Contributor Author

mara004 commented Aug 29, 2023

@boldorider4 You may submit your patch as draft PR, then I can take a look (I'll leave it up to you whether to target main or devel_v5).

However, I doubt if this will actually be necessary. You can already set $PDFIUM_PLATFORM=none1 as described here to opt out of pypdfium2's bundling. One could generate the bindings file with correct runtime libdirs externally and copy it to src/pypdfium2/raw.py1 ahead of the pip install.

That said, as I plan to do the conda packaging myself properly anyway, I'm not particularly enthusiastic to aid a simultaneous paid third-party packaging project that will probably not even make the results public.
I think I'd only merge a patch if it's good and also needed for our purposes.

Footnotes

  1. rsp. PDFIUM_BINARY and raw_unsafe.py in v5 2

@bblanchon
Copy link
Owner

If you're referring to the feedstock where we keep our recipes, unfortunately I cannot. Those recipes are meant to be managed by the anaconda organization (they're not community repos).

I was talking about the package named "pdfium-binaries" on Conda, not about the GitHub repo.

If you really do wish to proceed with your own flavor of pdfium-binaries, I'd suggest to upload it to the conda-forge channel.

But then which package conda install pdfium-binaries will install?

But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.

Do you mean we cannot build the package in a GitHub workflow?

@mara004
Copy link
Contributor Author

mara004 commented Aug 29, 2023

But then which package conda install pdfium-binaries will install?

If -c $CHANNELNAME is specified, it would install from that channel. Otherwise, I believe it would depend on the user's channel configuration, with anaconda main being the default choice. (This is a guess, please correct me if I'm wrong.)

But in that case you need to open a conda-forge repo. You should not upload privately built packages to an official channel like main or conda-forge.

Do you mean we cannot build the package in a GitHub workflow?

I believe conda-forge, unlike anaconda, does not allow custom uploads, but requires everything to go through the feedstock repos. So we cannot use that. If we will get access to main/pdfium-binaries on anaconda, we'll use that. If we don't, this is bad, but then we can still upload under a different name (see above for suggestions) so dependent recipes can request the packages built here and not leave it up to the end user's channel config.

@mara004
Copy link
Contributor Author

mara004 commented Aug 31, 2023

BTW, this SO thread sounds related to my offline install problems.
I once ran conda clean --all in the previous install, but @kalefranz writes

But take note: if you ever want to do any type of --offline operations, don't use --all; be more selective.

@mara004
Copy link
Contributor Author

mara004 commented Sep 2, 2023

RE #119 (comment):

however, I'm a bit surprised by the location of the installed files.
On my Docker setup, it installs files in /opt/conda with all the other conda files.
It's acceptable for /opt/conda/lib/libpdfium.so but the .h files are mixed with hundreds of other ones.

@bblanchon That's a good point. I guess we could prevent this by nesting the headers in an own pdfium subdirectory in the tarball, so the layout would be include/pdfium/fpdf_*.h instead of include/fpdf_*.h?

Also, a loose cpp/ dir in the top-level conda include dir is problematic. I guess that's why you were reluctant to include it in the first place?

@bblanchon
Copy link
Owner

@boldorider4, what's the status of this package?
@mara004 and I spent hours on this, I think we deserve some closure.

@mara004
Copy link
Contributor Author

mara004 commented Sep 22, 2023

@bblanchon I have already been wondering what we are waiting for. What exactly do you want from @boldorider4? The ownership question, I suppose? As to that, I think he already indicated it's out of his hands.
I'd suggest we just go ahead and upload the builds to a custom anaconda channel...

@bblanchon
Copy link
Owner

I, too, wondered what we were waiting for.
I don't see the point of uploading to a custom Anaconda channel.
If @boldorider4 provides the "official" package, why should we create another one?

@mara004
Copy link
Contributor Author

mara004 commented Sep 22, 2023

If @boldorider4 provides the "official" package, why should we create another one?

Well, for the reasons stated in #119 (comment): it isn't build regularly and not for all platforms, it's not in our control (e.g. might go non-public or stop being updated in the future), and not implemented elegantly.

@mara004
Copy link
Contributor Author

mara004 commented Sep 22, 2023

Concerning custom channels, I think it's common practice. Anaconda main presumably is in the sole control of Anaconda Inc and tied to the feedstock repos which are integrated in their CI.
e.g. @frgfm's pylocron package is in a custom channel, and a future doctr package probably would be, too.

@felixdittrich92
Copy link

@mara004 @bblanchon correctly that's the plan after pypdfium2 is available i can go ahead on doctr 😁
And thanks a lot for all the work 🙌🤗

@mara004
Copy link
Contributor Author

mara004 commented Sep 22, 2023

@felixdittrich92 Question BTW: How can we enforce a dependency to be supplied from a specific channel in the meta.yaml config? I read something about channel_name::package_name - is that right?

@mara004
Copy link
Contributor Author

mara004 commented Sep 23, 2023

@bblanchon Thanks a lot!

@mara004
Copy link
Contributor Author

mara004 commented Sep 23, 2023

However, the first upload looks a bit odd?
The trailing - seems strange, and two packages are missing (linux-64 and linux-armv7l):
image

@mara004
Copy link
Contributor Author

mara004 commented Sep 23, 2023

And I still believe changing include/ to include/pdfium/ might be good to avoid the directory pollution.

@bblanchon
Copy link
Owner

The empty dash is because the "build" field is empty.
Specification sayes

The build string. May not contain "-". Differentiates builds of packages with otherwise identical names and versions, such as:

  • A build with other dependencies, such as Python 3.4 instead of Python 2.7.
  • A bug fix in the build process.
  • Some different optional dependencies, such as MKL versus ATLAS linkage. Nothing in conda actually inspects the build string. Strings such as np18py34_1 are designed only for human readability and conda never parses them.

I didn't know what to put there, so I kept it empty.

@mara004
Copy link
Contributor Author

mara004 commented Sep 24, 2023

Ah, I see. Apparently conda sets this to some "{hash}_{build_number}" format. Not sure how they're generating the hash, though...

The good news is, installing pdfium-binaries now works for me, even offline.
And the missing Linux wheels are now present, too 🎉

@bblanchon
Copy link
Owner

I reupload everything, I don't know why they where missing.

@felixdittrich92
Copy link

felixdittrich92 commented Sep 26, 2023

@mara004 Yes see: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-file-manually

This should work the same way for meta.yml

@bblanchon
Copy link
Owner

@mara004 I reuploaded all the packages with headers in include/pdfium/

@frgfm
Copy link

frgfm commented Oct 26, 2023

Coming in very late, but quite happy to see pypdfium is getting closer to a conda release !
Thanks for all the work @mara004 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants