-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] Allow to pass Arrow array as groups #6166
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me, just left a few small requests.
I've also removed the WIP:
from the title since this is now ready to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good to me! I'm +1 on supporting Arrow arrays and chunked arrays for group
, and good with the proposed changes to the Python package and added tests.
I just left one more suggestion, around also testing for chunked arrays with empty chunks.
@borchero could you please update this branch to latest master
to get the newest changes from there?
@shiyu1994 @guolinke could you help with another review?
|
||
|
||
@pytest.mark.parametrize( | ||
["array_type", "group_data"], [(pa.array, [2, 3]), (pa.chunked_array, [[2], [3]])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add test cases covering the case where a chunked array with some empty chunks is given?
At least these two:
- the first chunk of the chunked array is empty
- one or more other non-first chunk of the chunked array is empty
And then could you:
- put up a separate PR adding such test cases for the other
Dataset
components you've already added Arrow support for - add such test cases in the remaining Arrow PRs
Sorry I didn't think of this earlier. It does seem to me that it's possible to have empty chunks in a chunked array, e.g.
import pyarrow as pa
arr = pa.chunked_array(
[[2, 3, 4], []],
type=pa.int16()
)
pa.compute.sum(arr)
# <pyarrow.Int64Scalar: 9>
And I can imagine situations where such arrays could end up being passed into LightGBM. I'm thinking use cases similar to how Dask Arrays are commonly created by concatenating the results of multiple separate function calls, some of which may be empty, e.g.:
- reading chunks from multiple files (and maybe applying row filters while reading them)
- executing multiple database queries
- paging over results from a REST API
- streaming over chunks returned from a database (e.g. like in
snowflake-connector-python
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha! I didn't know that this was possible, will add tests!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ part LGTM!
Here we finally go @jameslamb 😄 let me know if this kind of test suffices and I will put up another PR to add these kinds of tests for the dataset, labels, and weights. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New tests look great to me, thanks! I'll merge this once it builds.
Thank you! Let's go to the next one 😁 |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Motivation
This is part of a series of PRs towards #6022 and depends on #6164.
For a legible diff, see borchero/LightGBM@arrow-support-weights...arrow-support-groups.