-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add package extension to support InlineStrings in Arrow.jl #66
Conversation
Fixes apache/arrow-julia#196. This utilizes the new package extension feature of Julia 1.9 to add a conditional dependency on the ArrowTypes.jl package. With ArrowTypes.jl, it adds the necessary overloads to allow round- tripping of inline strings through the arrow format. Other language implementations will read them as normal strings, but in the Julia implementation, the additional type metadata signal that these strings were originally inline strings and can be deserialized as such. I'm explicitly not using the Requires.jl hack for backwards compat w/ older Julia versions because I like the idea of this being sort of a "beta" feature for users already using 1.9 to see if there are any unexpected issues that pop up for inline strings in the arrow format.
Codecov Report
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. @@ Coverage Diff @@
## main #66 +/- ##
==========================================
+ Coverage 93.04% 93.07% +0.03%
==========================================
Files 1 2 +1
Lines 676 679 +3
==========================================
+ Hits 629 632 +3
Misses 47 47
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 very cool
i've a couple questions about the set-up. it looks fine to me, but curious if it's "best practice" as i've never seen this stuff before. i had a quick look at https://pkgdocs.julialang.org/v1/creating-packages/#Weak-dependencies but i didn't see answers there
|
||
# only test package extension on >= 1.9.0 | ||
if VERSION >= v"1.9.0" | ||
include(joinpath(dirname(pathof(InlineStrings)), "../ext/tests.jl")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
include(joinpath(dirname(pathof(InlineStrings)), "../ext/tests.jl")) | |
include(joinpath(pkgdir(InlineStrings), "ext", "tests.jl")) |
@@ -0,0 +1,15 @@ | |||
module ArrowTypesExt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first time review a package extension
is this "flat" structure:
ext/
- MyExt.jl
- tests.jl
the standard/recommended structure for package extensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they are new enough that there aren't any standards or recommendations. I looked at CategoricalArrays, but couldn't actually find any tests for their extensions. It felt helpful IMO to keep the tests next to the extension code, but I guess you could also do a /test/ext/ArrowTypesExt.jl
file that would also be clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool - what you have here seems fine to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would find it more logical to put tests under tests/ or test/ext/. In particular, that allows using a separate test/Project.toml instead of using [targets]
in Project.toml (which is more practical to instantiate the project manually when debugging).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And FWIW, in CategoricalArrays tests for extensions are in test files that cover "standard" (non-extension) features. But these were written even before we used Requires.jl to make some dependencies optional IIRC.
@@ -0,0 +1,8 @@ | |||
using Test, Arrow, InlineStrings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, is adding the "weak dep" as a test dep then include("ext/test.jl")
the standard/recommended approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, no standards yet that I'm aware of. CategoricalArrays did the pattern of extensions as a test dep, but like I said elsewhere, I couldnt' actually find tests for those extensions.
|
||
for sz in (1, 4, 8, 16, 32, 64, 128, 256) | ||
nm = Symbol(:InlineString, max(1, sz - 1)) | ||
arrow_nm = Symbol("JuliaLang.InlineStrings.", nm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we instead store the size in bytes using ArrowTypes.arrowmetadata? It seems more natural to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying in addition to have the symbol name? What would putting the size in arrowmetadata give us?
using Test, Arrow, InlineStrings | ||
|
||
@testset "basic Arrow.jl interop" begin | ||
t = (x = inlinestrings(["a", "b", "sailor"]),) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add tests along something like:
for i in 0:255
s = InlineString("a" ^ i)
t = Arrow.Table(Arrow.tobuffer((x = [s], ))
@test t.x[1] === s
end
Any thoughts about putting this in Arrow? At leat most often it seems that the extension goes into the package where the functions that get extended are defined. |
Generally looks good (though I am not an expert on extensions either 😞). |
huh, that's surprising to me. Seems the opposite to usual, where package can implement an interface by adding new methods in their own package. i.e. if this weren't a weakdep, i'd expect a package to be able to opt-in to custom Arrow serialization by that package extending Arrow methods, rather than having every package wanting custom Arrow serialization having to make a PR to Arrow. e.g. DataFrames.jl defines new |
I think this reasoning is based on which package is more upstream/downstream (which sometimes can be hard to judge). Clearly Tables.jl extensions should be added in DataFrames.jl as you say. However, I treat InlineStrings.jl as a more low-level package than Arrow.jl (e.g. it is imaginable that InlineStrings.jl could become part of Base - I do not say that it would but it could). The hard case is e.g. CategoricalArrays.jl vs DataFrames.jl. It is hard to say which one is more level (at least for me), but probably CategoricalArrays.jl, so if we want to have support of CategoricalArrays.jl with DataFrames.jl we should add it to DataFrames.jl. As a second thought, it is related to the level of maturity/change dynamics in the package. Extensions should be added to packages that are changing more often (if we have a stable package we do not want to constantly release it to keep it consistent with an actively developed package). @KristofferC - as a general comment (maybe you could suggest where I should put it for a discussion 😄). I feel that package extensions are super useful, but I find the explanations given in https://pkgdocs.julialang.org/dev/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions) slightly imprecise. The points that I lack the precision for are:
Thank you! |
When all the "triggers" of it are loaded (those weak dependencies that are listed to the right of the extension in the Project file.
They work pretty much exactly as they would if the extension was a normal package that depended on the parent and the triggers.
Add the triggers as test dependencies.
Yes, you add it. |
Yeah, I agree with @nickrobinson251 here; putting the extension in Arrow feels like going the direction of "type piracy" where it would feel like Arrow is defining for InlineStrings how InlineStrings work w/ Arrow. Whereas this setup feels more natural, InlineStrings gets to define how the integration works (and gets updated/versioned!). But I also agree with everything @bkamins has said. In this case, and IMO, ArrowTypes.jl (the interface package of Arrow.jl) and InlineStrings.jl are equally low-level in my mind, which is why I then go to the 2nd item that I mentioned above, i.e. it feels more natural for InlineStrings to "own" the method extensions while not needing to add an actual dependency. |
Well, I can just list some examples where this is how it is done:
But there might be some other criteria that better determines where to put the extension. |
Ok, I think there's been some good discussion here and I can see reasons why extensions may want to live in one place or another. In this case where the same JuliaData people are the same maintainers over both ArrowTypes and InlineStrings, it's not really a question of keeping the code closer to those most capable of maintaining, so I'd like to go with the current setup as-is and we'll see how it plays out. (will do a follow up investigation into unrelated CI nightly failures) |
Fixes apache/arrow-julia#196.
This utilizes the new package extension feature of Julia 1.9 to add a conditional dependency on the ArrowTypes.jl package. With ArrowTypes.jl, it adds the necessary overloads to allow round- tripping of inline strings through the arrow format. Other language implementations will read them as normal strings, but in the Julia implementation, the additional type metadata signal that these strings were originally inline strings and can be deserialized as such.
I'm explicitly not using the Requires.jl hack for backwards compat w/ older Julia versions because I like the idea of this being sort of a "beta" feature for users already using 1.9 to see if there are any unexpected issues that pop up for inline strings in the arrow format.