Roundtrippability of special strings #196

Closed
nilshg opened this issue May 13, 2021 · 5 comments · Fixed by JuliaStrings/InlineStrings.jl#66

nilshg commented May 13, 2021

As discussed on Slack, it would be nice if special string types such as ShortString7 round-tripped through Arrow. Currently the column is read back as String:

julia> using Arrow, DataFrames, ShortStrings

julia> df = DataFrame(stringcol = ShortString7.(["abcde", "fghij"]))
2×1 DataFrame
 Row │ stringcol 
     │ ShortStr… 
─────┼───────────
   1 │ abcde
   2 │ fghij

julia> Arrow.write("test.arrow", df);

julia> DataFrame(Arrow.Table("test.arrow"))
2×1 DataFrame
 Row │ stringcol 
     │ String    
─────┼───────────
   1 │ abcde
   2 │ fghij
quinnj added a commit to JuliaStrings/InlineStrings.jl that referenced this issue Jun 14, 2023
Fixes apache/arrow-julia#196.

This utilizes the new package extension feature of Julia 1.9 to
add a conditional dependency on the ArrowTypes.jl package. With
ArrowTypes.jl, it adds the necessary overloads to allow round-
tripping of inline strings through the arrow format. Other language
implementations will read them as normal strings, but in the Julia
implementation, the additional type metadata signal that these strings
were originally inline strings and can be deserialized as such.

I'm explicitly not using the Requires.jl hack for backwards compat w/
older Julia versions because I like the idea of this being sort of a
"beta" feature for users already using 1.9 to see if there are any
unexpected issues that pop up for inline strings in the arrow format.
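
For readers unfamiliar with the ArrowTypes.jl overloads mentioned in the commit message, here is a minimal sketch of the general mechanism. It is not the code from the InlineStrings.jl extension: the `Tag` type and the name `JuliaLang.Example.Tag` are hypothetical stand-ins, and the real extension may define its methods differently.

```julia
using Arrow, ArrowTypes

# Hypothetical wrapper type standing in for a "special" string type.
struct Tag
    value::String
end

# Hypothetical extension name; it is written into the column's metadata.
const TAG_NAME = Symbol("JuliaLang.Example.Tag")

# Serialize Tag values as an ordinary Arrow string column.
ArrowTypes.ArrowType(::Type{Tag}) = String
ArrowTypes.toarrow(t::Tag) = t.value

# Record the original type on write and map the name back to it on read.
ArrowTypes.arrowname(::Type{Tag}) = TAG_NAME
ArrowTypes.JuliaType(::Val{TAG_NAME}) = Tag

# Rebuild a Tag from the stored string.
ArrowTypes.fromarrow(::Type{Tag}, s::AbstractString) = Tag(String(s))

# Round-trip through an in-memory buffer.
io = IOBuffer()
Arrow.write(io, (col = [Tag("abcde"), Tag("fghij")],))
seekstart(io)
tbl = Arrow.Table(io)
```

With these methods defined, `eltype(tbl.col)` should be `Tag` rather than `String`. Readers in other languages, or Julia sessions without the methods loaded, would see a plain string column.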

quinnj (Member) commented Jun 14, 2023

PR up to add support for InlineStrings round-tripping in arrow: JuliaStrings/InlineStrings.jl#66

quinnj added a commit to JuliaStrings/InlineStrings.jl that referenced this issue Jun 20, 2023
* Add package extension to support InlineStrings in Arrow.jl
* Only test package extension on 1.9

Moelf (Contributor) commented Jun 20, 2023

So if a Julia user doesn't want to depend on ArrowTypes.jl, they will get back a normal String, right?

quinnj (Member) commented Jun 20, 2023

Correct (though it's whether the user has InlineStrings loaded or not, not ArrowTypes):

julia> t = Arrow.Table("/Users/quinnj/.julia/dev/inlinestrings.arrow")
┌ Warning: unsupported ARROW:extension:name type: "JuliaLang.InlineStrings.InlineString7", arrow type = String
└ @ Arrow ~/.julia/dev/Arrow/src/eltypes.jl:53
Arrow.Table with 3 rows, 1 columns, and schema:
 :x  String

Moelf (Contributor) commented Jun 20, 2023

I see... to a certain degree I feel like this kind of support is a bit of a rabbit hole, but maybe InlineString is special enough to justify it.

But in large applications, it might be a footgun that users get a different type in the schema depending on whether another package is loaded, which can be invisible to the user because some other dependency might load it.

ericphanson (Member) commented

Btw Legolas.jl will error if validate=true and an unknown schema is present, and if a known schema is used it will check the types against it. So Legolas can be used to mitigate the “silent” aspect to some extent.
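
Independently of Legolas.jl, the "different type depending on what is loaded" concern can also be caught with an explicit check after reading. The following is only a sketch using the generic Tables.jl interface; `require_eltype` is a hypothetical helper, not part of Arrow.jl or Legolas.jl.

```julia
using Tables

# Hypothetical helper: throw if a column's element type is not a subtype
# of the expected type; otherwise return the table unchanged.
function require_eltype(table, col::Symbol, ::Type{T}) where {T}
    actual = eltype(Tables.getcolumn(Tables.columns(table), col))
    actual <: T || throw(ArgumentError(
        "column $col has eltype $actual, expected $T"))
    return table
end

# Works on any Tables.jl-compatible source; a NamedTuple of vectors here.
tbl = (stringcol = ["abcde", "fghij"],)
require_eltype(tbl, :stringcol, String)
```

Calling it on the result of `Arrow.Table(...)` with the expected inline string type would turn a silent fallback to `String` into an error at load time. Note that a column containing `missing` has a `Union{Missing, ...}` element type and would need a correspondingly wider expected type.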
