Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add package extension to support InlineStrings in Arrow.jl #66

Merged
merged 2 commits into from
Jun 20, 2023

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Jun 14, 2023

Fixes apache/arrow-julia#196.

This utilizes the new package extension feature of Julia 1.9 to add a conditional dependency on the ArrowTypes.jl package. With ArrowTypes.jl, it adds the necessary overloads to allow round- tripping of inline strings through the arrow format. Other language implementations will read them as normal strings, but in the Julia implementation, the additional type metadata signal that these strings were originally inline strings and can be deserialized as such.

I'm explicitly not using the Requires.jl hack for backwards compat w/ older Julia versions because I like the idea of this being sort of a "beta" feature for users already using 1.9 to see if there are any unexpected issues that pop up for inline strings in the arrow format.

Fixes apache/arrow-julia#196.

This utilizes the new package extension feature of Julia 1.9 to
add a conditional dependency on the ArrowTypes.jl package. With
ArrowTypes.jl, it adds the necessary overloads to allow round-
tripping of inline strings through the arrow format. Other language
implementations will read them as normal strings, but in the Julia
implementation, the additional type metadata signal that these strings
were originally inline strings and can be deserialized as such.

I'm explicitly not using the Requires.jl hack for backwards compat w/
older Julia versions because I like the idea of this being sort of a
"beta" feature for users already using 1.9 to see if there are any
unexpected issues that pop up for inline strings in the arrow format.
@codecov-commenter
Copy link

codecov-commenter commented Jun 14, 2023

Codecov Report

Merging #66 (0807e29) into main (b7923a2) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main      #66      +/-   ##
==========================================
+ Coverage   93.04%   93.07%   +0.03%     
==========================================
  Files           1        2       +1     
  Lines         676      679       +3     
==========================================
+ Hits          629      632       +3     
  Misses         47       47              
Impacted Files Coverage Δ
ext/ArrowTypesExt.jl 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Collaborator

@nickrobinson251 nickrobinson251 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀 very cool

i've a couple questions about the set-up. it looks fine to me, but curious if it's "best practice" as i've never seen this stuff before. i had a quick look at https://pkgdocs.julialang.org/v1/creating-packages/#Weak-dependencies but i didn't see answers there


# only test package extension on >= 1.9.0
if VERSION >= v"1.9.0"
include(joinpath(dirname(pathof(InlineStrings)), "../ext/tests.jl"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
include(joinpath(dirname(pathof(InlineStrings)), "../ext/tests.jl"))
include(joinpath(pkgdir(InlineStrings), "ext", "tests.jl"))

@@ -0,0 +1,15 @@
module ArrowTypesExt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first time review a package extension

is this "flat" structure:

ext/
  - MyExt.jl
  - tests.jl

the standard/recommended structure for package extensions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are new enough that there aren't any standards or recommendations. I looked at CategoricalArrays, but couldn't actually find any tests for their extensions. It felt helpful IMO to keep the tests next to the extension code, but I guess you could also do a /test/ext/ArrowTypesExt.jl file that would also be clear.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool - what you have here seems fine to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would find it more logical to put tests under tests/ or test/ext/. In particular, that allows using a separate test/Project.toml instead of using [targets] in Project.toml (which is more practical to instantiate the project manually when debugging).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And FWIW, in CategoricalArrays tests for extensions are in test files that cover "standard" (non-extension) features. But these were written even before we used Requires.jl to make some dependencies optional IIRC.

@@ -0,0 +1,8 @@
using Test, Arrow, InlineStrings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, is adding the "weak dep" as a test dep then include("ext/test.jl") the standard/recommended approach?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, no standards yet that I'm aware of. CategoricalArrays did the pattern of extensions as a test dep, but like I said elsewhere, I couldnt' actually find tests for those extensions.


for sz in (1, 4, 8, 16, 32, 64, 128, 256)
nm = Symbol(:InlineString, max(1, sz - 1))
arrow_nm = Symbol("JuliaLang.InlineStrings.", nm)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we instead store the size in bytes using ArrowTypes.arrowmetadata? It seems more natural to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying in addition to have the symbol name? What would putting the size in arrowmetadata give us?

using Test, Arrow, InlineStrings

@testset "basic Arrow.jl interop" begin
t = (x = inlinestrings(["a", "b", "sailor"]),)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add tests along something like:

for i in 0:255
    s = InlineString("a" ^ i)
    t = Arrow.Table(Arrow.tobuffer((x = [s], ))
    @test t.x[1] === s
end

@KristofferC
Copy link
Member

KristofferC commented Jun 15, 2023

Any thoughts about putting this in Arrow? At leat most often it seems that the extension goes into the package where the functions that get extended are defined.

@bkamins
Copy link

bkamins commented Jun 15, 2023

Generally looks good (though I am not an expert on extensions either 😞).

@nickrobinson251
Copy link
Collaborator

nickrobinson251 commented Jun 15, 2023

the extension goes into the package where the functions that get extended are defined

huh, that's surprising to me. Seems the opposite to usual, where package can implement an interface by adding new methods in their own package. i.e. if this weren't a weakdep, i'd expect a package to be able to opt-in to custom Arrow serialization by that package extending Arrow methods, rather than having every package wanting custom Arrow serialization having to make a PR to Arrow.

e.g. DataFrames.jl defines new Tables.foo methods, it doesn't make a PR to Tables.jl

@bkamins
Copy link

bkamins commented Jun 15, 2023

huh, that's surprising to me.

I think this reasoning is based on which package is more upstream/downstream (which sometimes can be hard to judge).

Clearly Tables.jl extensions should be added in DataFrames.jl as you say.

However, I treat InlineStrings.jl as a more low-level package than Arrow.jl (e.g. it is imaginable that InlineStrings.jl could become part of Base - I do not say that it would but it could).

The hard case is e.g. CategoricalArrays.jl vs DataFrames.jl. It is hard to say which one is more level (at least for me), but probably CategoricalArrays.jl, so if we want to have support of CategoricalArrays.jl with DataFrames.jl we should add it to DataFrames.jl.


As a second thought, it is related to the level of maturity/change dynamics in the package. Extensions should be added to packages that are changing more often (if we have a stable package we do not want to constantly release it to keep it consistent with an actively developed package).


@KristofferC - as a general comment (maybe you could suggest where I should put it for a discussion 😄). I feel that package extensions are super useful, but I find the explanations given in https://pkgdocs.julialang.org/dev/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions) slightly imprecise. The points that I lack the precision for are:

  1. exact description of a rule when the optional module is loaded and how it is looked for.
  2. it is not clear how extensions work with precompilation (i.e. if I use PrecompileTools.jl does/when the code in the extension get precompiled and how) + how do extensions play with method invalidation between the original interacting packages
  3. what is the recommended way to write tests for extensions.
  4. A more clear instruction how to ensure backward compatibility of an extension in https://pkgdocs.julialang.org/dev/creating-packages/#Transition-from-normal-dependency-to-extension part. In particular, if I add a package to both [deps] and [weakdeps] then do I also add [extensions] part or I do not add it and instead do the if !isdefined(Base, :get_extension) part (this is what I assume, but it is not clear).

Thank you!

@KristofferC
Copy link
Member

exact description of a rule when the optional module is loaded and how it is looked for.

When all the "triggers" of it are loaded (those weak dependencies that are listed to the right of the extension in the Project file.

it is not clear how extensions work with precompilation

They work pretty much exactly as they would if the extension was a normal package that depended on the parent and the triggers.

what is the recommended way to write tests for extensions.

Add the triggers as test dependencies.

if I add a package to both [deps] and [weakdeps] then do I also add [extensions] part

Yes, you add it.

@quinnj
Copy link
Member Author

quinnj commented Jun 15, 2023

Any thoughts about putting this in Arrow? At leat most often it seems that the extension goes into the package where the functions that get extended are defined.

Yeah, I agree with @nickrobinson251 here; putting the extension in Arrow feels like going the direction of "type piracy" where it would feel like Arrow is defining for InlineStrings how InlineStrings work w/ Arrow. Whereas this setup feels more natural, InlineStrings gets to define how the integration works (and gets updated/versioned!).

But I also agree with everything @bkamins has said. In this case, and IMO, ArrowTypes.jl (the interface package of Arrow.jl) and InlineStrings.jl are equally low-level in my mind, which is why I then go to the 2nd item that I mentioned above, i.e. it feels more natural for InlineStrings to "own" the method extensions while not needing to add an actual dependency.

@KristofferC
Copy link
Member

KristofferC commented Jun 15, 2023

Well, I can just list some examples where this is how it is done:

But there might be some other criteria that better determines where to put the extension.

@quinnj
Copy link
Member Author

quinnj commented Jun 20, 2023

Ok, I think there's been some good discussion here and I can see reasons why extensions may want to live in one place or another. In this case where the same JuliaData people are the same maintainers over both ArrowTypes and InlineStrings, it's not really a question of keeping the code closer to those most capable of maintaining, so I'd like to go with the current setup as-is and we'll see how it plays out.

(will do a follow up investigation into unrelated CI nightly failures)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Roundtrippability of special strings
7 participants