Use JPype to call into jars directly #10
Comments
Sounds like a good idea. Can you please try the code in the use-jpype branch and let me know if you encounter any issues? The API is almost the same as what is in master. There currently isn't any control over content sent to stdout by the Java library.
IMHO we need a deeper integration: no temporary files, only in-memory blobs; no command-line arguments, just filling the structures directly. Ideally it would offer the same capabilities as using pdfbox as a library from Java, with all the necessary wrappers to take the burden of converting Python objects to Java ones off the programmer. (IDK if any of this applies to this lib, but I have some experience with apps dealing with immutable types. It was a pain: I had to write functions whose only purpose was to patch immutable objects by parsing them into dicts, patching the dicts, and then transforming the dicts back into immutable objects. The result was worth it, though: the app became much faster, I got rid of temporary files, and I got access to features not exposed via the CLI.)
I put together a quick wrapper for the PDF to image functionality that may be what you are looking for; it returns the extracted pages as RGB numpy arrays. I don't have time to create a full-blown Python interface to the pdfbox Java API, but I can add the above gist to python-pdfbox as a separate function (or perhaps combine it with the jar download code and submit it to camelot as a PR).
For a start we don't need a full-blown interface; just keep the existing python-pdfbox one, but overcome the limitations of the CLI by changing the way pdfbox is called.
IMHO: it shouldn't download and install jars. Downloading and/or installing jars is the burden of either the user, a system-wide package manager (such as apt, portage, brew, nix or conda), or an installer. Not ours. Not camelot's.
Since a major design goal of python-pdfbox is enabling users to quickly access pdfbox features regardless of their jar management preferences, I don't wish to remove the automated download feature. Moreover, python-pdfbox permits one to specify the location of the jar file via an environment variable if one does not want to rely upon the automated download.
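The lookup order described in this comment (an explicit environment variable first, the automated download otherwise) could be sketched roughly as follows. This is only an illustration: the variable name `PDFBOX_JAR`, the cache directory, the jar file name and the `resolve_jar` helper are all assumptions made for the example, not python-pdfbox's actual names or API.

```python
import os
from pathlib import Path

# Hypothetical names, for illustration only; not python-pdfbox's actual API.
ENV_VAR = "PDFBOX_JAR"
DEFAULT_CACHE = Path.home() / ".cache" / "pdfbox-example"


def resolve_jar(environ=None, cache_dir=DEFAULT_CACHE):
    """Return (jar_path, needs_download) for the jar that should be used."""
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR)
    if override:
        # An explicit user setting always wins; nothing is downloaded.
        return Path(override), False
    # Otherwise fall back to a cached copy, which the caller downloads
    # only if it is not on disk yet.
    jar = Path(cache_dir) / "pdfbox-app.jar"
    return jar, not jar.is_file()
```

With the variable set, the second element of the result is always `False`, so a user or distro packager who manages the jar themselves never triggers a download.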
Not being aware of this thread, I wrote a similar post and gist independently 😅: The gist targets Pillow and otherwise differs slightly, so it may still be worth taking a look at.
Same problem for me (lack of time), but incorporating the gist code as a new function for direct rendering sounds like a good idea.
I see both points, but getting a jar out of the box is indeed much more convenient than having to supply one manually, especially since this is a library. An app end user shouldn't have to supply any binaries. However, I'd recommend moving the jar handling logic to the setup stage (e.g. similar to how pypdfium2 bundles binaries on setup). This should not be library runtime code.
@KOLANICH I think there may be a misunderstanding. Calling
Or from sdists. Actually it was the main intent behind the wheel format to bundle binaries; for the source alone there's the sdist format. You can pass |
You cannot call code on package installation because packages are installed from wheels, not sdists. Again, packages are NOT installed from sdists. When one "installs" a package from an sdist, pip builds a wheel and installs the package from that wheel. This way every file left in the system by
No, they should not. Bundling anything is bloatware. If someone wants to install |
That may be correct, but is not relevant here. The main distinction relevant for this context is that installing an sdist runs setup code, while a wheel does not (it's already a frozen file set). If you do not want a bundled pdfbox, you can install via sdist, opt out with that env var and point to a system/caller-provided pdfbox binary instead. That's what a linux distro packager could use, for instance.
I would not download pdfbox somewhere into the system but into the source tree and flag it as package data, which is officially supported and not a breakage risk. I clearly disagree on technical grounds with what you're writing and doubt there's much point in discussing this further with you.
FWIW, bundling is also desired by the camelot project:
source: @vinayak-mehta in atlanhq/camelot#346 (comment)
IMHO it should be opt-in rather than opt-out. And by default it should use the package from the system.
This way the talk about using sdists is irrelevant. Wheels with a jar inside them should work fine.
Hmm, yeah, a smart setup logic like "use system pdfbox if available, download pdfbox otherwise" would be a possibility.
I don't think I understand you here. Why should sdists be irrelevant now? Isn't that contradictory to what you wrote above? Do you mean you actually agree with bundling pdfbox in wheels now?
should be encapsulated into a separate lib because it is needed for pretty much any Python wrapper for a Java lib.
I think system pdfbox should be picked automatically and by default. And jars in other wrapper libs should be picked the same way. Automatically and by default. That's why we need a special lib to discover jars in various system-specific places.
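As a rough illustration of what such a discovery helper might look like: the sketch below scans a list of candidate directories for a jar whose file name starts with a given prefix. The `find_jar` name and the default directory list are assumptions made for the example; real distributions and package managers place jars in differing locations, so a real library would need per-platform rules.

```python
from pathlib import Path

# Assumed search locations; real systems and package managers differ.
DEFAULT_JAR_DIRS = ("/usr/share/java", "/usr/local/share/java")


def find_jar(name, search_dirs=DEFAULT_JAR_DIRS):
    """Return the first jar whose file name starts with `name`, or None."""
    for directory in map(Path, search_dirs):
        if not directory.is_dir():
            continue  # skip locations that do not exist on this system
        # Sort so the result is deterministic when several versions exist.
        matches = sorted(directory.glob(f"{name}*.jar"))
        if matches:
            return matches[0]
    return None
```

A wrapper could call `find_jar("pdfbox")` first and fall back to a bundled or downloaded jar only when it returns `None`.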
No description provided.