Update pyarrow package to 2.0.0 #12164

Closed

wesm opened this issue Nov 15, 2020 · 24 comments
Labels
package_request Package build requests (new, updates, and rebuilds) type-packaging

Comments

@wesm

wesm commented Nov 15, 2020

As discussed in #11227 (comment), defaults is more than a year behind pyarrow releases -- in addition to material new features, we also have been working with other projects like fsspec / s3fs and Dask to keep things in sync, so ideally these projects would also track reasonably close to the latest releases for the best user experience.

@csoja
Contributor

csoja commented Nov 17, 2020

You are right, we are quite out of date for pyarrow. We plan to update it in the coming weeks.

@csoja csoja added package_request Package build requests (new, updates, and rebuilds) type-packaging labels Nov 17, 2020
@jorisvandenbossche

Gently bumping this issue. In the meantime, pyarrow 3.0.0 is released.

@wesm
Author

wesm commented Jan 31, 2021

Yes, indeed.

Note that pyarrow is at 11.2M downloads/month on PyPI

https://pypistats.org/packages/pyarrow

Comparative numbers:

  • pandas: 18.5M
  • NumPy: 29.8M
  • scikit-learn: 8M
  • tensorflow: 3.95M

I think you can infer from this that keeping this package up to date is important from a production stability point of view.

@katietz

katietz commented Feb 1, 2021

I recently added pyarrow/arrow-cpp versions 2.0.0 and 3.0.0 to defaults.

In any case, I would love to learn a bit more about the "common" pyarrow feature set used with 2.0.0/3.0.0.

@csoja csoja closed this as completed Feb 1, 2021
@wesm
Author

wesm commented Feb 1, 2021

Could you clarify what you mean by "common"? Note that from here on out more or less every release of pyarrow will be a "major" release from a SemVer perspective, even if there are no backwards incompatible API changes.

@jorisvandenbossche

@katietz it seems that the newly added 3.0.0 packages are somewhat broken (or at least not enabling all essential parts). They also seem to be based on a quite old version of the conda-forge recipe, which has changed a lot since arrow 0.15.

For example, the Dataset submodule was not built (which means that reading partitioned parquet datasets will error).

A few example StackOverflow questions related to this:

[1] https://stackoverflow.com/questions/66017811/python-error-using-pyarrow-arrownotimplementederror-support-for-codec-snappy
[2] https://stackoverflow.com/questions/66026964/pyarrow-3-0-0-and-1-0-0-reading-a-parquet-file-fails-when-passing-the-cont
[3] https://stackoverflow.com/questions/66047129/pyarrow-library-is-not-installed-but-i-am-able-to-import-it-downloading-data-f
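
As a quick illustration of the failure mode (a minimal sketch of my own, not taken from those questions), simply importing the submodule shows whether a given build includes the dataset module:

    # Smoke test: pyarrow.dataset can only be imported if the dataset
    # module was enabled when pyarrow was built.
    try:
        import pyarrow.dataset  # noqa: F401
        print("dataset module available")
    except ImportError as exc:
        print(f"dataset module missing: {exc}")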

@jorisvandenbossche

I would personally suggest removing the packages again (if possible) until those issues are fixed, as they seem to be causing quite a few problems for users.

@katietz

katietz commented Feb 8, 2021

Hello Joris,

I am open-minded and will happily add additional features to the 3.0.0 recipe. That is exactly why I asked for more details about which additional features are needed. So, as said above, it would be good to open new tickets here for such items; we can elaborate on them quickly. Just as a side note, we won't merge in conda-forge's recipe directly. First of all, we would need to investigate (and collect more details on) some of the additional packages used there. An even stronger reason is that we are not happy about merging the Python and C++ parts within one feedstock; we prefer to see native and embedded languages split. I admit that Arrow ships all those other languages within one tarball, but in our opinion that is not enough reasoning here.

@jorisvandenbossche

Thanks for the answer. See also the reports at AnacondaRecipes/pyarrow-feedstock#2 from @xhochy

Generally speaking, I think it's best to enable almost all features. Previously, many parts were turned on by default when building arrow/pyarrow, but those have all been turned off in recent releases so that the default build is more minimal. So to restore the behaviour of the older packages, all those features have to be enabled manually: the different compression codecs for parquet, datasets, HDFS, ...
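
For example (a sketch of my own, assuming the pyarrow.Codec.is_available helper available in recent pyarrow versions), you can check at runtime which codecs a given build actually supports:

    # Report which compression codecs this pyarrow build supports.
    import pyarrow as pa

    for codec in ["brotli", "bz2", "gzip", "lz4", "snappy", "zstd"]:
        print(codec, pa.Codec.is_available(codec))

On a build compiled without codec support, the affected entries print False.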

@xhochy

xhochy commented Feb 9, 2021

An even stronger reason is that we are not happy about merging the Python and C++ parts within one feedstock.

Side note: the arrow-cpp package has had a Python dependency since the beginning. Splitting the Arrow C++ parts upstream into C++-without-Python and C++-with-Python is something for the future, but that work is still looking for a volunteer.

@wesm
Author

wesm commented Feb 9, 2021

What can we do to help defaults get sorted out? A LOT of people depend on this package, so every day that it remains broken is doing harm to users.

@katietz

katietz commented Feb 10, 2021

Thanks for the offer of help. I am looking into the differences between the recipes right now; I might come up with some questions on that. For arrow-cpp, the aws-sdk dependency is the only thing I see right now that we lack (right now we don’t have a working aws-sdk package on the anaconda-recipes side … I will keep that on my pile).

What about the parquet-cpp 1.5.1* run requirement? What is the background of that?

Is the numpy >=1.16 requirement essential? If so, why?

About build.sh/bld.bat, I will follow up.

Is -DARROW_SIMD_LEVEL=NONE required, or is it still enough to turn off SSE42? Are the issues about SIMD instructions on other architectures?

So I worked a bit on adding some features to arrow-cpp. Please take a look. I will upload the package update as soon as the last bits have finished compiling.

I haven’t touched our ‘pyarrow-feedstock’ for now, but that will be next.

@jorisvandenbossche

What about the parquet-cpp 1.5.1* run requirement? What is the background of that?

The comment in the recipe says "empty parquet-cpp metapackage, force old versions to be uninstalled".
In the past, parquet-cpp was a separate package, but it is now included in arrow-cpp.

Is the numpy >=1.16 requirement essential? If so, why?

Yes, pyarrow 3.0.0 has a minimum numpy requirement of 1.16; see https://issues.apache.org/jira/browse/ARROW-10861 for context.

@jorisvandenbossche

So I worked a bit on adding some features to arrow-cpp. Please take a look. I will upload the package update as soon as the last bits have finished compiling.

Is there a PR at https://github.com/AnacondaRecipes/pyarrow-feedstock/pulls? (I also don't see changes committed to the master branch.)

@katietz

katietz commented Feb 10, 2021

So I worked a bit on adding some features to arrow-cpp. Please take a look. I will upload the package update as soon as the last bits have finished compiling.

Is there a PR at https://github.com/AnacondaRecipes/pyarrow-feedstock/pulls? (I also don't see changes committed to the master branch.)

I changed arrow-cpp first (we keep the Python and native parts split into separate recipes). See https://github.com/AnacondaRecipes/arrow-cpp-feedstock

@jorisvandenbossche

Yeah, I was looking at the wrong repo.

An additional comment after looking at your changes: I think it is important to compile with support for the different compression codecs. The conda-forge recipe has:

    -DARROW_WITH_BROTLI=ON \
    -DARROW_WITH_BZ2=ON \
    -DARROW_WITH_LZ4=ON \
    -DARROW_WITH_SNAPPY=ON \
    -DARROW_WITH_ZLIB=ON \
    -DARROW_WITH_ZSTD=ON \

but in the defaults one I don't see e.g. snappy in the Windows build script?

@katietz

katietz commented Feb 10, 2021

Yeah, I was looking at the wrong repo.

An additional comment after looking at your changes: I think it is important to compile with support for the different compression codecs. The conda-forge recipe has:

    -DARROW_WITH_BROTLI=ON \
    -DARROW_WITH_BZ2=ON \
    -DARROW_WITH_LZ4=ON \
    -DARROW_WITH_SNAPPY=ON \
    -DARROW_WITH_ZLIB=ON \
    -DARROW_WITH_ZSTD=ON \

OK, I see that for Windows we lack LZ4 and BROTLI. I will test adding them for Windows too. For Unix they are already there.

but in the defaults one I don't see e.g. snappy in the Windows build script?

I will test snappy, but AFAIR I ran into issues with it. But thanks for the pointers.

@xhochy

xhochy commented Feb 10, 2021

There is still the BMI2 issue I manually patched out of Snappy: https://github.com/conda-forge/snappy-feedstock/blob/master/recipe/disable_bmi.patch (yes, I need to upstream that :( )

@jorisvandenbossche

Any updates on this?
(arrow-cpp is now updated to include more features, but I think pyarrow is not yet updated for that?)

@katietz

katietz commented Feb 12, 2021

Any updates on this?
(arrow-cpp is now updated to include more features, but I think pyarrow is not yet updated for that?)

Yes, I will need to visit the pyarrow feedstock to add some tests. But in general it should already work after a rebuild.

@jorisvandenbossche

Yes, I will need to visit the pyarrow feedstock to add some tests. But in general it should already work after a rebuild.

I don't think so; the build options also need to be enabled for pyarrow itself. See e.g. https://github.com/conda-forge/arrow-cpp-feedstock/blob/c97f5ca5deffc89316702054de792cd968ca7e60/recipe/build-pyarrow.sh#L12-L24

So, for example, reading partitioned parquet datasets will not yet work with pyarrow from defaults.
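
To make that concrete, here is a small round-trip sketch of my own (hypothetical /tmp path) that exercises exactly this code path and should fail on a build without dataset support:

    # Write and read back a partitioned parquet dataset; this errors on
    # pyarrow builds that lack the dataset feature.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})
    pq.write_to_dataset(table, root_path="/tmp/pq_smoke", partition_cols=["part"])
    print(pq.read_table("/tmp/pq_smoke").to_pydict())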

@katietz

katietz commented Feb 12, 2021 via email

@katietz

katietz commented Feb 12, 2021

So I just pushed an updated arrow-cpp with all the features we want to enable for now.
Additionally, I modified the pyarrow feedstock to add the new features; I am about to test them, and I will let you know when they are ready.

@katietz

katietz commented Feb 12, 2021

So the final recipe for pyarrow is pushed, and it passes all tests. I will put the packages on the defaults channel. Thanks for your assistance!
