Discussion: Minimising code execution at import time / the downsides of the JPype import system #933
@pelson Happy to discuss this. First, there have been many changes since the original PR, most of which have been documented in Most notably, we now support importing of a TLD (i.e. "import org") without issues. In the old version we had to import all the way down to the one item to have proper error checking. In addition, JPackage and JImports are merged into a common object so they have the same caching process. The only difference between them is that JImport has an import handler that produces ImportError rather than AttributeError. Lastly, at some point we did add support for Python 2.6 and 2.7, which likely included 3.5. However, this may have been removed when 2.7 was dropped.
About the comment
The comment about failing on import is simply that using the style
is better in that it will fail fast if the Java classes cannot be located immediately. This is a huge advantage when the Java code gets refactored, as you don't need to wait for a specific branch to be hit. Also, at the time JPackage did not check if the class actually existed. This was a huge defect. If you did "JPackage('org').pkg.MyObject" and MyObject did not exist, it gave a meaningless error that JPackage cannot be called. This is because JPackage was basically an implementation of Mock which just added attributes whenever they were requested. The new version of JPackage does not have any of these problems. It now will correctly error when an object does not exist, and it actually has a dir function so you can see what is available. I still strongly encourage the fail-fast method of importing though, as it is much less error prone.
Order and Side effects
As far as the side effect at import time, this is unfortunately a product of how the import system hooks work. You can't use an import hook until it is installed, and you can't install a hook without importing a module with the hook. So for example, if you wanted to use maven to install import hooks for Python so that it automatically pulls an import, then the order of the imports matters. This is separate from the requirement to start the JVM, so let's deal with the startup process in general as another subject.
Startup
The jpype startup pattern predates me by a lot. I have studied other packages, but unfortunately they all have similar problems. The main issue is that Python does not allow you to pass parameters into an import, so there is no way to configure the JVM. Let's consider two different styles. JPype current style... (for simplicity let's assume that TLD and import registration was automatic)
versus the pyjnius method
Same: Worse: Better:
I do think that people have a preference towards the passive loading, and we could add a jpype.config module which would enable automatic starting of the JVM on the first call to JClass or JPackage. But that really doesn't solve the order issue. If Python had a well-defined method of how to configure a package or pass options to it, I think we should conform. But the convention is that we should have as few side effects as possible so
Programming by String
The other key advantage of using the imports is that it is actually much faster. Calling JClass with a string is going to call
and I see this formulation or similar frequently when I review code that uses JPype.
TLDs
I debated a lot on the TLDs. I believe it is acceptable for these reasons.
Unfortunately a lot of Java packages don't use TLDs. They are unfortunately on their own. Language conventions were defined for a reason and package names are not some stylistic thing.
Conclusion
The JImports system was meant to be a bolt-on because I was unwilling to force the TLD registration on the user. It has improved a lot since it was introduced. I often debate if it is sugar or a necessity. With the merge of JPackage it is more of a nicety, as now JPackage does everything that JImport does in terms of safety. But it does mirror the same system in Jython and JEP, so it is unlikely to get removed. Prior to the merge my feeling was that it should be the default so that "jpype.imports" is not necessary. But to do so the JVM startup issue needs to be resolved. The order issue is more of a startup method, which is another topic. |
Agreed entirely. I doubt that it will ever be possible to pass parameters at import - so we are left having to figure out a workable solution with the tools in our box... the trick that we can play is that we get a "startup" execution when a module is imported - so long as we don't blow that trick on code execution / major side-effects. In a non-interactive context it is also entirely reasonable to assume that all imports should happen at the start (that is recommended in PEP8: "Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants"). Given this, we could gather together all of our JVM requirements before actually starting the JVM:
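A minimal sketch of the "gather requirements first, start later" idea in plain Python. The `JVMRequirements` class and its `require` and `start` methods are hypothetical names for illustration and are not part of JPype; a real `start` would presumably hand the merged settings to the JVM launcher.

```python
class JVMRequirements:
    """Collects JVM settings declared by modules before the JVM starts (hypothetical)."""

    def __init__(self):
        self.classpath = []
        self.options = []
        self.started = False

    def require(self, classpath=(), options=()):
        """Declare needs at import time; nothing is started here."""
        new_paths = [p for p in classpath if p not in self.classpath]
        new_opts = [o for o in options if o not in self.options]
        if self.started and (new_paths or new_opts):
            # Too late to reconfigure, so fail loudly rather than silently.
            raise RuntimeError(
                f"JVM already started without: {new_paths + new_opts}")
        self.classpath.extend(new_paths)
        self.options.extend(new_opts)

    def start(self):
        """Stand-in for the real start: returns the merged configuration."""
        self.started = True
        return {"classpath": list(self.classpath),
                "options": list(self.options)}


jvm = JVMRequirements()
# Two independent modules declare what they need, in any order.
jvm.require(classpath=["libs/a.jar"])
jvm.require(classpath=["libs/b.jar"], options=["-Xmx1g"])
config = jvm.start()
```

A declaration that arrives after the start succeeds only if it is already satisfied, which matches the "raise if we can't do anything about it" behaviour described above.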
Many modules can do this kind of declaration, and the If the JVM has already started before the module is imported (because it happens, especially in an interactive environment), then we have the context to know that the JVM config needed to be in a certain way. We can raise if we can't do anything about it, or if we have clever tricks up our sleeve for classpath etc., we can apply them automagically. One thing I don't know: How the type annotation would work-out. You need the JVM running to be able to resolve the annotation 🤔. For Python before 3.10 that annotation has an import-time side-effect, so definitely wouldn't work without the JVM running. This is addressed in PEP563. With that in place there is a little trick we might be able to play such that inside JPype itself:
As discussed in #900, I would go one step further for packages - I would expose hooks to allow Java dependency declaration as part of the package metadata, rather than as runtime-metadata. This would mean that JPype can know about installed packages which need a running JVM to function, without having to import them before the JVM is started. This would dramatically help in the REPL environment, because in that context users often don't import things in a certain order. In interactive mode the above looks like:
Great to hear about the significant improvements to For what it is worth, and just in case this is also one of the hurdles for convincing you, I don't love the spelling of |
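On the annotation question raised above, a small sketch of how a deferred annotation behaves. It uses a quoted annotation, which is the manual equivalent of what PEP 563 (`from __future__ import annotations`) does for every annotation in a module. `JavaString` is a hypothetical stand-in for a class that only becomes available once the JVM is running.

```python
import typing


def parse(text: str) -> "JavaString":
    # The quoted annotation is stored as a string and is not evaluated
    # when the function is defined, so the name need not exist yet.
    return text


stored = parse.__annotations__["return"]  # still the plain string


class JavaString(str):
    """Stand-in for a type that only exists after the JVM has started."""


# Resolution happens only when something explicitly asks for it.
hints = typing.get_type_hints(parse, localns={"JavaString": JavaString})
```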
Unfortunately we still have too many issues on the table for me to resolve in this thread. My initial thoughts are there were two abandoned PR that I worked on long ago that may be relevant. The first was The second was Perhaps there are solutions to these problems as both seem to be in the direction you are requesting. |
Here is an example of the onload that I had been working on. (recreated from memory)
Here the symbols in deferred don't actually get executed until after the JVM starts. The symbols appear in the correct module. But they can't be accessed ahead of time nor appear in The use of this pattern would be in
The modules would define their classpaths and JVM parameters using As no Java code gets executed on the initial import, this satisfies the requirement that imports and import order do not produce side effects. The actual evaluation of the deferred code happens at a defined and explicit location. Things I don't like here, we have an explicit call to return |
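A rough sketch of the deferred pattern described above, in plain Python. `DeferredLoader` and `on_start` are hypothetical names rather than JPype API, and a dict stands in for the module namespace (a real module would pass `globals()`).

```python
class DeferredLoader:
    """Runs registered functions only once start() is called (hypothetical)."""

    def __init__(self):
        self._pending = []
        self.started = False

    def on_start(self, namespace):
        """Decorator: after start, copy the function's returned symbols into namespace."""
        def register(func):
            if self.started:
                namespace.update(func())
            else:
                self._pending.append((namespace, func))
            return func
        return register

    def start(self):
        self.started = True
        for namespace, func in self._pending:
            namespace.update(func())
        self._pending.clear()


loader = DeferredLoader()
module_namespace = {}  # a real module would pass globals()


@loader.on_start(module_namespace)
def deferred():
    # In real code this is where Java classes would be looked up.
    return {"ArrayList": "resolved java.util.ArrayList"}


before = "ArrayList" in module_namespace
loader.start()
after = module_namespace["ArrayList"]
```

The explicit `return` of the symbol table is the part flagged as unattractive above; the sketch keeps it to stay close to the described pattern.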
Perhaps Python needs a
|
Any thoughts on the deferred module loading as a solution? |
The Your
Your The only difference between my straw-man above and the
In this case,
To see how it might look, I took a shot at implementing something which can catch the issue in the following code:
My prototype doesn't actually work in that:
What is nice though is that the exception is rather good considering we don't have immediate (fail-fast) exceptions as a result of our "no import-time side-effects" rule:
The implementation of this prototype looks like:
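One way such a helpful exception could be produced, as a sketch under stated assumptions: record the stack at declaration time and replay the declaring frame when resolution fails. `Deferred` and `resolve` are hypothetical names, and a dict stands in for the set of classes visible to the JVM.

```python
import traceback


class Deferred:
    """Remembers where it was declared so a late failure can point there."""

    def __init__(self, name):
        self.name = name
        # Drop the last frame (this __init__) so the record ends at the caller.
        self.declared_at = traceback.extract_stack()[:-1]

    def resolve(self, available):
        if self.name not in available:
            where = "".join(traceback.format_list(self.declared_at[-1:]))
            raise ImportError(
                f"{self.name} could not be resolved; declared at:\n{where}")
        return available[self.name]


good = Deferred("java.util.ArrayList")
bad = Deferred("org.foo.Missing")
classes = {"java.util.ArrayList": object()}
resolved = good.resolve(classes)
try:
    bad.resolve(classes)
    message = ""
except ImportError as exc:
    message = str(exc)
```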
|
Just for clarification, jpype.classes contains packages, so it is just like what you are describing. So let's see if I can boil this down a bit. You can't check whether something is a package or a class, nor whether it exists, until the JVM is started. The auto JPackage simply creates JPackage instances mock style until it hits a valid class. As we can't unmake the object once it is created, this is a problem. Radical solution... make JClass and JPackage have the same memory layout. Then we can make them do mock behavior until the JVM is started. Then we check which are classes, and then we have to somehow polymorph an existing class object. I have tried this sort of magic in the past, but it is really meta magic as you dangerously write over the pointers of an existing type object. I was planning this trick to make JString into java.lang.String when the JVM goes live. But as it was the only one, I decided to shelve it. The downside is we only get fast fail after the JVM is started. |
I wanted to round off the ideas that are flying around with a couple of simple examples. In each case I don't explicitly state where the JVM starts because it can be in a number of places (before import, during import or after import) and it is also plausible that we could move to an auto-start (non explicit) model, should we so wish. (note that if we don't have auto-start then it would be probably sensible to have a decorator to allow us to declare that a function/class needs the JVM to be started in order to work) Examples (for googlers, these don't work, they are pseudo-code):
It is entirely conceivable that we can support both the attribute form and the import form. note:
Example:
The above examples are both entirely type-annotate-able I believe (using the same stubs), giving us static analysis that will warn if we try to access things that don't exist, and offering auto-completion (and in the latter case, auto-import and package identification for a class name). (note: I probably need to look in more detail at exactly how the Reminder: both of the above examples can co-exist. The preferred style is entirely down to the developer. There are pros and cons to each, and I can see an advantage in being able to mix-and-match. The key thing though is that neither approach requires the JVM to be running at import time and both styles can be used by scripts, applications, and libraries without the concern for having to have the substantial side-effect. |
I looked at morphing the thing too, but in truth, isn't
Agreed. This is the logical conclusion of avoiding import-time behaviour whilst still allowing "Pythonic" code (e.g. import order isn't important, imports at top of code, type-annotate-able, ). Either way you have to have the JVM running to get the exception, the major difference is that the exception won't happen at the moment the erroneous code is executed, so we'd have to be careful to provide a helpful traceback (hence my prototype). |
Proxying has some big speed implications over a morph. There are a lot of implications with referencing, as the typing model for isinstance would have major implications. This is hit very hard during method resolution. |
Also my resistance to the name jpype.jvm is that I consider that to be where the JVM controls belong. I couldn't move them due to compatibility. But I consider anything that doesn't have a capital J as currently being out of place, with the exception of subpackages in the jpype module. JPype itself already contains java and javax and should have had all TLDs as well as a package called notld for non-conforming packages. |
I'd like to explore that a little bit. You definitely wouldn't want to be doing a lookup on each attribute access, but if you save a successful lookup then the next lookup can be at the same speed as normal Python method resolution. I've confirmed this with the following code:
The results are pretty clear that there is no measurable cost to the proxy approach when done in this way:
Agreed. I think the conclusion that we'd have to draw is that you can't have an instance until the JVM is running, so a What is slightly uncomfortable is the distinction between packages, classes and instances (or anything else you can reference in Java). If a class has a static variable, then it is impossible for us to know that we are accessing an instance from a class vs accessing a class from a package. As a result we need to be able to handle any reference-able thing from Java in our proxy. However there is a clear difference in Python between a type and an instance, and it isn't possible to morph from one to the other. |
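A minimal sketch of the "save a successful lookup" idea discussed above. `LazyProxy` and its `resolver` argument are hypothetical; the point is only that `__getattr__` runs when normal lookup fails, so caching the result on the instance means later accesses never reach it. This is pure Python and says nothing about the cost inside C-level method resolution.

```python
class LazyProxy:
    """Resolves attributes on first access, then caches them on the instance."""

    def __init__(self, resolver):
        self._resolver = resolver
        self.lookups = 0

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        self.lookups += 1
        value = self._resolver(name)
        # Cache in the instance dict so the next access bypasses __getattr__.
        setattr(self, name, value)
        return value


proxy = LazyProxy(lambda name: name.upper())
first = proxy.foo
second = proxy.foo  # served from the instance dict, no resolver call
```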
I should point out the lookups for method resolution are in C++ currently and use much faster methods than you can do when testing from within Python. It really isn't possible to mock up tests of the cost of proxies for method resolution to test does is this a java object or a java proxy. We were 4 to 20 times slower when we used pure Python lookups that were using attribute based lookups. When I switched to a dedicated slot, it was much faster. Unfortunately if we have more than one path to lookup the type during method resolution. I did a lot of testing at the time and determined that supporting a secondary path was as bad as if all objects were pure python. The problem is when doing a method resolution (not simply finding the name of the method but choosing the overload by matching each JPype/Python type to each Java type ) often has to fall through unless the overload was the first found. Thus doing two paths would hit the full cost of pure python 90% of the time or more. Thus I had to reject the old proxy method on matching as a backup path. Not saying that we can't do a proxy of some kind but it would have to use the same slot mechanism to have reasonable efficiencies. We would simply have to copy the slots over to proxy so that it can be a direct lookup. Important speedups that we are running under the hood.
I know this are black magic. Python internally does most of this using a 1 hot encoded bit field for the types that need acceleration. I also found one case of using the slot comparison trick. The Java slot trick uses the slot trick on the allocator followed by "extra" memory on the end of the object (invisible to Python). It is basically a replication of the dict and weakref system but hidden as there is no type slots to support it. I can do these sort of tricks on certain items but it would require the Proxy object to be in C but at that point morphing objects is just as easy. It would take me a week to study the current details (memory footprint) to be sure. I know that I can make JClass and JPackage work, but it would get much harder if I have to deal with enum values or static fields as well. Many objects have different layouts as they have to derive from exceptions, methods, int, long, object, and str. They simply can't morph. They are very polymorphic in memory layout. The best I can do there would be a C reference that has a Javaslot reserve and then a proxy to real class. When they get referenced they would polymorph by setting the Java slot to the deferred item (which makes Java think they are the proper type) and then proxy their methods to a second copy which is built late (which makes Python think they are the proper type). But there are a large number of edge cases (such as isinstance and issubclass) that have to be considered. Another problem, the memory footprint of the object would be huge if they are used as type they must be type objects. JImplements may be able to handle operating without actual type object, but once you can extend objects that won't be true. There may be ways to be work around this (intercepting the base classes when the JClass meta is building). Though lets not let perfect be the enemy of the good and I will see what I can cook up first. Ultimately I will be limited by the Python object model. 
The more deeply I patch it the more likely it will break in some future version of Python (unless they formalize some of my tricks). |
It isn't before the JVM is running that I am worried about. Most of the behaviors that happen when you mix objects before or after the JVM is created. For example....
|
Some great detail here, thank you!
Indeed, I could have updated the prototype to use My guess is that overall, JIT method resolution (slow-ish) + caching would be faster than class slots (super fast C++) for every single method of every single imported class in most cases (unless you are building a stubgenerator or something and need to access all methods that are available).
Agreed. I think inevitably the conclusion is that the "pre-JVM proxy type" has to be the same thing as the "post-JVM proxy type", i.e. I think this sounds like your "C reference that has a Javaslot reserve and then a proxy to real class" description, but I'm not sure.
Yep. Perhaps this should raise when the JVM starts (we can know then if the thing was a class or an instance), and we have some explicit syntax to access a morphable reference (e.g. for the default value of my unit kwarg of the function defined in an earlier post, you'd have to always declare it as a proxy since it can't be morphed automatically. You'd end up typing:
What we are talking about here is quite a bit of effort. I suspect a few things I've said aren't new, and indeed I have the feeling that some of the things already exist in JPype (apologies for my re-invention of the wheel in those cases). I think we should decide whether this is this something you'd like to explore further? Are there any red-lines that are being crossed for you (aside from the If you wanted to proceed, then I suggest we could thrash out a pure-Python prototype in fairly short order to fully understand the implications of the decisions, and then we can look at optimising the hell out of it later on. I don't mind leading on the prototype if that is helpful, but truth is that you'd probably have to lead on the "make this thing go like the clappers" given you JNI/JPype experience. |
I worked a bit on the prototype last night to see if it is workable. I am going with JForward as proxy is a very different thing with a specific meaning in Java. I still think that you are a bit unclear on the details of the method resolution process. (Or maybe I misunderstand) It seems like you are viewing it as
Only here The issue is actually some where unexpected.
Seem innocent... the call to append is not involving the JForward. And we didnt even use a Java type. Except append is highly overloaded. Thus each and every type match for each potential overload must check if the argument is type JForward. So what is the deal here. Well isinstanceof for all but bit vector mapped types means getting checking if there is a meta class, look for This is the exact same issue that the Python folks would have with I can likely solve this. I cant make it a bit field type but i have other tricks to make type checking fast. And rather than placing the check on each type check, if instead the check is on a common point such as when the arguments get unpacked that reduces the burden. Now its cost does not scale with the overload count. But that leaves about 20 edge cases like when the forward is buried in a list. And the short cutting checks for some basic container types. So still doable but clearly would be a major feature requiring a large test suite. And as the JVM is a start once thing this requires a big subprocess type test bench. Perhaps there is another primary point where i can catch all points once without paying the cost on every use. I think the requirements you listed are about right. And it doesnt look like morphing is going to work so it is likely going to have to be a proxy object instance as we dont want to pay the cost of a heap type. That means we have to proxy every potential slot used in every type in JPype. There is a certral place for that, the list of slots in pyjp_class constructor switch table. |
Example code:
|
Seems like a total of 8 pointers should cover the needs for most morphs. 2 for base object, 2 for weak and dict, 2 for payload, and 2 hidden for java slot. This just leaves class, jchar, and throwable as resolved proxies. |
See https://github.com/Thrameos/jpype/blob/forward/forward.py for the current state. |
I posted a tweak to the branch in Thrameos#55.
In terms of "does this look reasonable", the answer is a resounding yes! The devil is in the detail, but I've just added a tweak that hooks us into the import machinery. With this, it is entirely possible to write a module (library) which has Java elements (even type annotations) which don't require the JVM to be running at import time. For example, the following works as expected without needing the JVM to be running until we actually want to execute the function:
In the prototype currently we can't do things like implementing interfaces:
But I think this one is perfectly resolvable given we know how to do deferred interface validation. Furthermore, the prototype doesn't yet:
From your perspective how is it feeling? Do you think it is workable in a sustainable fashion? Do you see this as something that could become a primary path for JPype usage? |
It may be possible to just have JImplements check if it is currently a forward and set the deferred flag accordingly. |
If it appears reasonable that I can start cutting code on the morphing which will change it from pure Python to actually mucking with the CPython internals. That will cut the speed penalty for all but @marscher with regard to the root directory for java packages, we already have |
Is the C implementation purely about performance, or are there other advantages? It seems to me that it will get a whole lot more complex (and therefore harder to maintain & less portable), so we should be sure that we can't hit the desired performance on the pure Python side. |
There is a critical difference between the Python and C versions. In Python, the best we can achieve is to make an object that appears to proxy to the real thing. Depending on how many slots we implement and how those slots are accessed by the internals of Python it may or may not appear to be the same. To verify this we would have to repeat the entire test bench to verify every behavior on the resolved object. In C, we get a very different result. For every object which is the same or small memory foot print we will be rewriting the memory of the object to be the real deal. Every behavior we have will be the same because the object will be the same object that exists after the JVM is started. There is little need for additional testing except for the few object types that we can't actually morph, but those are very limited. If there is a change in how Python visits the slots or some new behavior that we add to an existing object, then we get exactly the behavior with the resolved object. We would even satisfy the contracts like The reason this is possible has to do with how Python objects exist in memory. They are an arbitrary blob of memory which come in two flavors (gc or no gc) in which the first pointer is to their identity (the class). The actual interpretation of that data is inherited from pointer in the first position. Unlike C++ there is no vtable of private pieces that are internal and immutable. All of our types are gc so we don't need to worry about changing the collection policy. So if you change the class pointer you change its identity entirely. The Python version does something similar by replacing the They can of course add checks for this which makes morphing of objects more difficult in the future within Python. In the C version, there is no limitations. We just have to make sure the memory footprint of the new object matches. 
If the dict is the 4th pointer in the struct and we change it to an object in which it is the third pointer, then we just have to find the old and new location and relocate it. If the dict doesn't exist then we simply have to free it. So ultimately, the number of portability problems and edge cases is actually a whole lot lower with a C version than with a Python version. The only difficulty is the coding the first time. When we implement it in C we have to manually code each slot and its behavior. And C is much more verbose (and laborious) than Python. So we are trading a bunch of two-line Python implementations for 20-line C implementations. But then we only need two resolved classes and those are both pretty small. Performance is not really the main driver. Unless the user made a whole lot of forwards or used those forwards in critical sections of code, it would be difficult to get a significant performance hit. Does that clear it up? |
Definitely, thank you! Though I have one follow-up (sorry!) since this isn't about performance and more about fiddling with the underlying object pointers: how plausible is it to expose the morphing part (written in C) as a function in Python, thereby giving us the ability to continue to write the class based logic in Python, but still benefit from the low-level morphing. I'm essentially saying: can we get away with writing those 2 line Python implementations and still benefit from the C-level pointer morphing, or is it not possible to have our cake and eat it? 🍰 |
It is possible. We can define only the base class in C and then derive it in Python to define most of its behavior in the derived class. That is usually the starting point for any implementation. However, this usually involves adding hooks for the C version to call, such as those that you find for implementing docstrings. That is, if something in C needs to access a portion of the derived behavior then there must be a defined path to do so. Depending on how many hooks are required, sometimes it is just easier to push it all into C rather than leave it halfway. But this is really just implementation details. Most of the C slots are just like those in Python. There is some boilerplate code, then a call to the real version. So if I need to convert just one or two slots to avoid a hook, then cutting and pasting many slots and just changing the real call is likely just as easy. I won't really know until I finish whether I can get away with most in Python or most in C. |
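For readers unfamiliar with the identity swap being described, a pure-Python sketch of morphing by reassigning `__class__`. This assumes that is the mechanism the Python version relies on; `Forward` and `ResolvedClass` are hypothetical stand-ins. CPython only permits the assignment when the two classes have compatible layouts, which is the restriction a C implementation would sidestep by rewriting the object's memory directly.

```python
class Forward:
    """Placeholder created before the real type is known."""

    def __init__(self, name):
        self.name = name


class ResolvedClass:
    """The type the placeholder should become once resolution is possible."""

    def describe(self):
        return f"resolved {self.name}"


forward = Forward("java.lang.String")
alias = forward  # an existing reference held elsewhere

# Swap the identity in place; every existing reference sees the new type.
forward.__class__ = ResolvedClass
description = alias.describe()
```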
The new(ish) JPype import hook (added in #224) is a neat trick to allow us to declare Java imports in a Pythonic way rather than using string based package access (e.g.
from java.utils import Object
vs
Object = jpype.JClass('java.utils.Object')
). There are some major downsides to using it though, so I wanted to open this ticket to discuss (and hopefully mitigate) them.
Pros / Cons
Downsides
Python imports should be side-effect free & import order shouldn't matter
In Python it is frowned upon to have a major side-effect at import-time. I'm no Java expert, but I've seen a few Java packages in which side-effects are common (starting a background thread, initialising static members, etc.), so perhaps this is a significant cultural difference between the Python & Java languages?
Unfortunately the JPype import mechanism forces us to either start the JVM in our code before importing Java packages, or to import things in a specific order such that another module has a JVM starting side-effect. For example:
Or
Both examples demonstrate import side-effects (the JVM gets started!) and the import order is critical to the successful operation of the code. This is more than just a theoretical "you shouldn't do that": it puts our code at odds with essential development tools such as linters, and our helpful IDE/isort (to fix import order automatically) will actively break both examples by "fixing" our import order for us.
The import side-effects / order requirements alone mean that I've been unable to recommend the use of the JPype import system for anything other than end-user applications (and categorically not libraries). In all honesty, this is the single reason why I'm writing this issue - I'd love to find a single solution that is workable for all types of code which use JPype.
JPype takes ownership of Java TLD namespaces
The JPype import hook involves commandeering the top-level names used in Java. This is done by opt-in, except for a few pre-registered names. If a registered name collides with a Python package, the Java package wins if
jpype.imports
is enabled (and not otherwise). I believe it is possible to create an alias/special-prefix if there is a collision you particularly want to avoid, but this mechanism seems to be undocumented (the whole of
jpype.imports
is undocumented currently, including
registerDomain
).
Lots of imports
This is a fairly minor point, but in order to use something from Java using the new import system, you have to import it. This is a fair reflection of what you have to do in both Python and in Java, but it does indeed represent more LOC than the old mechanism. You could argue this is an advantage, as it is much more explicit, and allows things like import aliases etc. (see advantages below).
Only works on py3.6+
Honestly, this doesn't bother me in the slightest. Python 3.5 was end-of-life in September 2020.
Upsides
No more programming by string 🎉
The old
jpype.JClass('java.lang.String')
approach of accessing a class by string rather than using the import mechanism feels a little hacky, and is certainly not something that is easy to validate statically (I had an idea on that, but hit a brick wall in python/mypy#10004, more to follow). Tab completion in an IDE is also not possible.
I don't fully understand the comment in the motivating PR though:
Could you provide more detail on this please @Thrameos? I'm a little confused as JPype could/does manage the imports for us as we access names on a
JPackage
instance.
Fail fast
If we have the JVM running when we define classes/interfaces we can validate the implementation there and then, rather than deferring any exception to later on. This follows the fail-fast philosophy (definitely Pythonic style!), and means that exceptions are raised at the point of issue, rather than some other place such as at the constructor.
Take the following example:
If
MyImplementation
doesn't implement the full set of features defined in
org.foo.JavaInterface
then we get an exception at import time. This is a good thing - we haven't defined the interface correctly, and the behaviour is just like Python's own ABC which validates the implementation at import-time.
There are means to avoid doing this at import-time in JPype. I added the
deferred
flag for JImplements in #659, and the previous example can be written as the following (example from the docs):
This time, since JPype doesn't yet have a running JVM, it cannot find out more about the
org.foo.JavaInterface
and therefore cannot know if the implementation is good or not. The result is that we end up getting the exception at instantiation:
This can be highly non-obvious to a user. Ideally a user would never see such an exception (even in the non-deferred case) - in Java for example this error would be seen by the
MyImplementation
developer, as their code would fail to compile (thought: perhaps we are missing a stage that would help us validate our implementation at a pseudo-compile-time, somehow).
Type annotations possible
Having the JVM running at import time means we can use things like Java classes as return type annotations:
In #714 there is a prototype which generates stubs for Java packages/classes which would give us the ability to run static analysis (using tools like mypy) on our Java interactions, and this will also enable IDEs such as PyCharm to give us very convenient tab-completion, for example. In the example above this would work by generating a stub-package for the
org
(pseudo-)package exposed by JPype's import hook.
Summary
For full transparency: The reason I'm opening this discussion is because I find it hard to get fully behind the JPype imports mechanism - for me the requirement to have the JVM running at import time is a deal-breaker, despite the numerous very attractive benefits that the mechanism brings.
As a result, I'm curious if there is another approach that might be able to give us the benefits, without the drawbacks. I have one or two prototypes (mostly extending the ideas in #714 (stubgenj)) and believe there might be mileage in extending JPype's use of typing / annotations as a form of pseudo-compile step. It currently is based on the idea of accessing
jp.JPackage(top_level_only).<thing_on_top_level>
which can be fully type-checked by providing an exhaustive set of
typing.Literal[top_level_name]
overload annotations for
jpype.JPackage
. I don't really want this discussion to go too far down this route - I'm more than happy to open up another issue for that. Instead...
My objective for this issue is to have an exhaustive set of pros/cons such that any alternative proposal to the JPype import mechanism could ensure that it is hitting the various requirements. Please comment if there is something I've missed, or if there is detail that can be added to any of the points. This issue doesn't need to remain open indefinitely, so please feel free to close as soon as the discussion has taken its course.