Update pyarrow package to 2.0.0 #12164

Closed

wesm opened this issue Nov 15, 2020 · 24 comments
Labels
package_request Package build requests (new, updates, and rebuilds) type-packaging

Comments

@wesm

wesm commented Nov 15, 2020

As discussed in #11227 (comment), defaults is more than a year behind pyarrow releases -- in addition to material new features, we also have been working with other projects like fsspec / s3fs and Dask to keep things in sync, so ideally these projects would also track reasonably close to the latest releases for the best user experience.

@csoja
Contributor

csoja commented Nov 17, 2020

You are right, we are quite out of date for pyarrow. We plan to update it in the coming weeks.

@csoja csoja added package_request Package build requests (new, updates, and rebuilds) type-packaging labels Nov 17, 2020
@jorisvandenbossche

Gently bumping this issue. In the meantime, pyarrow 3.0.0 is released.

@wesm
Author

wesm commented Jan 31, 2021

Yes, indeed.

Note that pyarrow is at 11.2M downloads/month on PyPI

https://pypistats.org/packages/pyarrow

Comparative numbers:

  • pandas: 18.5M
  • NumPy: 29.8M
  • scikit-learn: 8M
  • tensorflow: 3.95M

I think you can infer from this that keeping this package up to date is important from a production stability point of view.

@katietz

katietz commented Feb 1, 2021

I recently added pyarrow/arrow-cpp versions 2.0.0 and 3.0.0 to defaults.

In any case, I would love to learn a bit more about the "common" pyarrow feature set used with 2.0.0/3.0.0.

@csoja csoja closed this as completed Feb 1, 2021
@wesm
Author

wesm commented Feb 1, 2021

Could you clarify what you mean by "common"? Note that from here on out more or less every release of pyarrow will be a "major" release from a SemVer perspective, even if there are no backwards incompatible API changes.

@jorisvandenbossche

@katietz it seems that the newly added 3.0.0 packages are somewhat broken (or at least not enabling all essential parts). They also seem to be based on a quite old version of the conda-forge recipe, which has changed a lot since arrow 0.15.

For example, the Dataset submodule was not built (which means that reading partitioned parquet datasets will error).

A few example StackOverflow questions related to this:

[1] https://stackoverflow.com/questions/66017811/python-error-using-pyarrow-arrownotimplementederror-support-for-codec-snappy
[2] https://stackoverflow.com/questions/66026964/pyarrow-3-0-0-and-1-0-0-reading-a-parquet-file-fails-when-passing-the-cont
[3] https://stackoverflow.com/questions/66047129/pyarrow-library-is-not-installed-but-i-am-able-to-import-it-downloading-data-f
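
As a quick illustration of the failure mode (a minimal sketch of my own, not taken from those questions), simply importing the submodule shows whether a given build includes the dataset module:

    # Smoke test: pyarrow.dataset can only be imported if the dataset
    # module was enabled when pyarrow was built.
    try:
        import pyarrow.dataset  # noqa: F401
        print("dataset module available")
    except ImportError as exc:
        print(f"dataset module missing: {exc}")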

@jorisvandenbossche

I would personally suggest removing the packages again (if possible) until those issues are fixed, as they seem to be causing quite a few problems for users.

@katietz

katietz commented Feb 8, 2021

Hello Joris,

I am open-minded and will happily add additional features to the 3.0.0 recipe. That is exactly why I asked for more details about which additional features are needed. So, as said above, it would be good to open new tickets here for such items; we can elaborate on them quickly. Just as a side note, we won't merge in conda-forge's recipe directly. First of all, we would need to investigate (and collect more details on) some of the additional packages used there. An even stronger reason is that we are not happy about merging the Python and C++ parts within one feedstock; we prefer to see native and embedded languages split. I admit that Arrow ships all those other languages within one tarball, but in our opinion that is not enough reasoning here.

@jorisvandenbossche

Thanks for the answer. See also the reports at AnacondaRecipes/pyarrow-feedstock#2 from @xhochy

Generally speaking, I think it's best to enable almost all features. Previously, many parts were turned on by default when building arrow/pyarrow, but those have all been turned off in recent releases so that the default build is more minimal. So to restore the behaviour of the older packages, all those features have to be enabled manually: the different compression codecs for parquet, datasets, HDFS, ...
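
For example (a sketch of my own, assuming the pyarrow.Codec.is_available helper available in recent pyarrow versions), you can check at runtime which codecs a given build actually supports:

    # Report which compression codecs this pyarrow build supports.
    import pyarrow as pa

    for codec in ["brotli", "bz2", "gzip", "lz4", "snappy", "zstd"]:
        print(codec, pa.Codec.is_available(codec))

On a build compiled without codec support, the affected entries print False.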

@xhochy

xhochy commented Feb 9, 2021

An even stronger reason is that we are not happy about merging the Python and C++ parts within one feedstock.

Side note: the arrow-cpp package has had a Python dependency since the beginning. Splitting the Arrow C++ parts upstream into C++-without-Python and C++-with-Python is something for the future, but that work is still looking for a volunteer.

@wesm
Author

wesm commented Feb 9, 2021

What can we do to help defaults get sorted out? A LOT of people depend on this package, so every day that it remains broken is doing harm to users.

@katietz

katietz commented Feb 10, 2021

Thanks for the offer of help. I am looking into the differences between the recipes right now; I might come up with some questions on that. For arrow-cpp, the aws-sdk dependency is the only thing I see right now that we lack (right now we don’t have a working aws-sdk package on the anaconda-recipes side … I will keep that on my pile).

What about the parquet-cpp 1.5.1* run requirement? What is the background of that?

Is the numpy >=1.16 requirement essential? If so, why?

About build.sh/bld.bat, I will follow up.

Is -DARROW_SIMD_LEVEL=NONE required, or is it still enough to turn off SSE42? Are the issues about SIMD instructions on other architectures?

So I worked a bit on adding some features to arrow-cpp. Please take a look. I will upload the package update as soon as the last bits have finished compiling.

I haven’t touched our ‘pyarrow-feedstock’ for now, but that will be next.

@jorisvandenbossche

What about the parquet-cpp 1.5.1* run requirement? What is the background of that?

The comment in the recipe says "empty parquet-cpp metapackage, force old versions to be uninstalled".
In the past, parquet-cpp was a separate package, but it is now included in arrow-cpp.

Is the numpy >=1.16 requirement essential? If so, why?

Yes, pyarrow 3.0.0 has a minimum numpy requirement of 1.16; see https://issues.apache.org/jira/browse/ARROW-10861 for context.

@jorisvandenbossche

So I worked a bit on adding some features to arrow-cpp. Please take a look. I will upload the package update as soon as the last bits have finished compiling.

Is there a PR at https://github.com/AnacondaRecipes/pyarrow-feedstock/pulls? (I also don't see changes committed to the master branch.)

@katietz

katietz commented Feb 10, 2021

So I worked a bit on adding some features to arrow-cpp. Please take a look. I will upload the package update as soon as the last bits have finished compiling.

Is there a PR at https://github.com/AnacondaRecipes/pyarrow-feedstock/pulls? (I also don't see changes committed to the master branch.)

I changed arrow-cpp first (we keep the Python and native parts split into separate recipes). See https://github.com/AnacondaRecipes/arrow-cpp-feedstock

@jorisvandenbossche

Yeah, I was looking at the wrong repo.

An additional comment after looking at your changes: I think it is important to compile with support for the different compression codecs. The conda-forge recipe has:

    -DARROW_WITH_BROTLI=ON \
    -DARROW_WITH_BZ2=ON \
    -DARROW_WITH_LZ4=ON \
    -DARROW_WITH_SNAPPY=ON \
    -DARROW_WITH_ZLIB=ON \
    -DARROW_WITH_ZSTD=ON \

but in the defaults one I don't see e.g. snappy in the Windows build script?

@katietz

katietz commented Feb 10, 2021

Yeah, I was looking at the wrong repo.

An additional comment after looking at your changes: I think it is important to compile with support for the different compression codecs. The conda-forge recipe has:

    -DARROW_WITH_BROTLI=ON \
    -DARROW_WITH_BZ2=ON \
    -DARROW_WITH_LZ4=ON \
    -DARROW_WITH_SNAPPY=ON \
    -DARROW_WITH_ZLIB=ON \
    -DARROW_WITH_ZSTD=ON \

OK, I see that for Windows we lack LZ4 and BROTLI. I will test adding them for Windows too. For Unix they are already there.

but in the defaults one I don't see e.g. snappy in the Windows build script?

I will test snappy, but AFAIR I ran into issues with it. But thanks for the pointers.

@xhochy

xhochy commented Feb 10, 2021

There is still the BMI2 issue I manually patched out of Snappy: https://github.com/conda-forge/snappy-feedstock/blob/master/recipe/disable_bmi.patch (yes, I need to upstream that :( )

@jorisvandenbossche

Any updates on this?
(arrow-cpp is now updated to include more features, but I think pyarrow is not yet updated for that?)

@katietz

katietz commented Feb 12, 2021

Any updates on this?
(arrow-cpp is now updated to include more features, but I think pyarrow is not yet updated for that?)

Yes, I will need to visit the pyarrow feedstock to add some tests. But in general it should already work after a rebuild.

@jorisvandenbossche

Yes, I will need to visit the pyarrow feedstock to add some tests. But in general it should already work after a rebuild.

I don't think so; the build options also need to be enabled for pyarrow itself. See e.g. https://github.com/conda-forge/arrow-cpp-feedstock/blob/c97f5ca5deffc89316702054de792cd968ca7e60/recipe/build-pyarrow.sh#L12-L24

So, for example, reading partitioned parquet datasets will not yet work with pyarrow from defaults.
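
To make that concrete, here is a small round-trip sketch of my own (hypothetical /tmp path) that exercises exactly this code path and should fail on a build without dataset support:

    # Write and read back a partitioned parquet dataset; this errors on
    # pyarrow builds that lack the dataset feature.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})
    pq.write_to_dataset(table, root_path="/tmp/pq_smoke", partition_cols=["part"])
    print(pq.read_table("/tmp/pq_smoke").to_pydict())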

@katietz

katietz commented Feb 12, 2021 via email

@katietz

katietz commented Feb 12, 2021

So I just pushed an updated arrow-cpp with all the features we want to enable for now.
Additionally, I modified the pyarrow feedstock to add the new features; I am about to test them, and I will let you know when they are ready.

@katietz

katietz commented Feb 12, 2021

So the final recipe for pyarrow is pushed, and it passes all tests. I will put the packages on the defaults channel. Thanks for your assistance!
