Update pyarrow package to 2.0.0 #12164
As discussed in #11227 (comment), defaults is more than a year behind pyarrow releases -- in addition to material new features, we also have been working with other projects like fsspec / s3fs and Dask to keep things in sync, so ideally these projects would also track reasonably close to the latest releases for the best user experience.

Comments
You are right, we are quite out of date for pyarrow. We plan to update it in the coming weeks.
Gently bumping this issue. In the meantime, pyarrow 3.0.0 has been released.
Yes, indeed. Note that pyarrow is at 11.2M downloads/month on PyPI (https://pypistats.org/packages/pyarrow). Comparative numbers:
I think you can infer from this that keeping this package up to date is important from a production stability point of view.
I recently added pyarrow/arrow-cpp versions 2.0.0 and 3.0.0 to defaults. I would anyway love to learn a bit more about the "common" pyarrow feature-set used for 3.0.0/2.0.0.
Could you clarify what you mean by "common"? Note that from here on out more or less every release of pyarrow will be a "major" release from a SemVer perspective, even if there are no backwards incompatible API changes.
@katietz it seems that the newly added 3.0.0 packages are somewhat broken (or at least not enabling all essential parts). They also seem to be based on a quite old version of the conda-forge recipe, which has changed a lot since arrow 0.15. For example, the Dataset submodule was not built (which means that reading partitioned parquet datasets will error). A few example StackOverflow questions related to this: [1] https://stackoverflow.com/questions/66017811/python-error-using-pyarrow-arrownotimplementederror-support-for-codec-snappy
I would personally suggest removing the packages again (if possible) until those issues are fixed, as it seems to be causing quite a few problems for users.
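For reference, a minimal reproduction of the failure mode described above (a sketch only; the file paths and table contents are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write and read a snappy-compressed parquet file; on the broken builds this
# raises "ArrowNotImplementedError: Support for codec 'snappy'".
table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})
pq.write_table(table, "example.parquet", compression="snappy")
print(pq.read_table("example.parquet"))

# Reading a partitioned dataset goes through the Dataset layer, which errors
# (or fails to import) when the Dataset submodule was not built.
import pyarrow.dataset as ds

pq.write_to_dataset(table, root_path="example_dataset", partition_cols=["year"])
print(ds.dataset("example_dataset", format="parquet").to_table())
```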
Hello Joris, I am open-minded and will happily add additional features to the 3.0.0 recipe. That is exactly why I asked for more details on which additional features are needed. So as said above, it would be good to open new tickets here for such items; we can elaborate on this quickly. Just as a side note, we won't merge in conda-forge's recipe directly. First of all, we would need to investigate (and collect more details on) some of the additional packages used there. An even stronger reason is that we are not happy about merging the Python and C++ parts within one feedstock; we prefer to see native and embedded languages kept separate here. I admit that arrow ships all those other languages within one tarball, but that is in our opinion not enough reasoning here.
Thanks for the answer. See also the reports at AnacondaRecipes/pyarrow-feedstock#2 from @xhochy. Generally speaking, I think it's best to enable almost all features. Previously many parts were turned on by default when building arrow/pyarrow, but those have all been turned off in recent releases so that the default build is more minimal. So to restore the behaviour of the older packages, all those features have to be enabled manually: the different compression codecs for parquet, datasets, hdfs, etc.
Sidenote: The
What can we do to help defaults get sorted out? A LOT of people depend on this package, so every day that it remains broken is doing harm to users.
Thanks for offering to help. I am looking into the differences between the recipes right now; I might come up with some questions on that.

- For arrow-cpp, the aws-sdk dependency is the one thing I see right now that we lack (we don't currently have a working aws-sdk package on the anaconda-recipes side … I will keep that on my pile).
- What about the parquet-cpp 1.5.1* run requirement? What is the background of that?
- Are the numpy >=1.16 requirements essential? If so, why?
- About build.sh/bld.bat I will follow up. Is -DARROW_SIMD_LEVEL=NONE required, or is it still enough to turn off SSE42? Are the issues about SIMD instructions on other archs?

So I worked a bit on adding some features to arrow-cpp. Please take a look. I will upload the package update as soon as the last bits are compiled. I haven't touched our 'py-arrow-feedstock' for now, but that will be next.
The comment in the recipe says "empty parquet-cpp metapackage, force old versions to be uninstalled"
Yes, pyarrow 3.0.0 has a minimum numpy dependency of 1.16; see https://issues.apache.org/jira/browse/ARROW-10861 for context.
Is there a PR at https://github.com/AnacondaRecipes/pyarrow-feedstock/pulls ? (I also don't see changes committed to the master branch)
I changed arrow-cpp first (we keep the Python and native parts split into separate recipes). See https://github.com/AnacondaRecipes/arrow-cpp-feedstock
Yeah, I was looking at the wrong repo. An additional comment looking at your changes: I think it is important to compile with support for the different compression codecs. The conda-forge recipe has:
but in the defaults one I don't see e.g. snappy in the windows build script?
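For what it's worth, one quick way to check which codecs a given pyarrow build was compiled with is from Python itself (a sketch; the codec list below is just the usual set, not taken from either recipe):

```python
import pyarrow as pa

# Report which compression codecs this pyarrow build supports; on a build
# without snappy/lz4/brotli the corresponding entries print False.
for codec in ["snappy", "gzip", "brotli", "lz4", "zstd"]:
    print(codec, pa.Codec.is_available(codec))
```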
Ok, I see that for windows we lack LZ4 and BROTLI. I will test adding them for windows too. For unix they are already there. I will also test snappy, but AFAIR I ran into issues with it. Thanks for the pointers.
There is still the BMI2 issue I manually patched out of Snappy: https://github.com/conda-forge/snappy-feedstock/blob/master/recipe/disable_bmi.patch (yes, I need to upstream that :( )
Any updates on this?
Yes, I will need to visit the pyarrow feedstock to add some tests, but in general it should work already after doing a rebuild.
I don't think so; the build options also need to be enabled for pyarrow. See e.g. https://github.com/conda-forge/arrow-cpp-feedstock/blob/c97f5ca5deffc89316702054de792cd968ca7e60/recipe/build-pyarrow.sh#L12-L24 So e.g. reading parquet datasets will not yet work with pyarrow from defaults.
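As a rough way to verify which optional components a given pyarrow build enabled, the optional submodules can simply be imported (a sketch; the module list here is illustrative, not the authoritative set of build options):

```python
import importlib

# Optional pyarrow submodules that only import cleanly when the matching
# build option was enabled; the list below is illustrative.
optional_modules = [
    "pyarrow.parquet",
    "pyarrow.dataset",
    "pyarrow.csv",
    "pyarrow.fs",
    "pyarrow.flight",
]

for name in optional_modules:
    try:
        importlib.import_module(name)
        print(f"{name}: available")
    except ImportError as exc:
        print(f"{name}: missing ({exc})")
```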
I see. I was just looking at the meta.yaml and its deps. Yes, I will get to it soon. Thanks for the pointer.
So I just pushed an updated arrow-cpp with all the features we want to enable for now.
So the final recipe for pyarrow is pushed, and it passes all tests. I will put the packages on the defaults channel. Thanks for your assistance!