Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Allow to pass Arrow array as groups #6166

Merged
merged 72 commits into from
Nov 22, 2023
Merged

[python-package] Allow to pass Arrow array as groups #6166

merged 72 commits into from
Nov 22, 2023

Conversation

borchero
Copy link
Collaborator

Motivation

This is part of a series of PRs towards #6022 and depends on #6164.

For a legible diff, see borchero/LightGBM@arrow-support-weights...arrow-support-groups.

@borchero borchero marked this pull request as ready for review November 14, 2023 14:25
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, just left a few small requests.

I've also removed the WIP: from the title since this is now ready to review.

src/io/metadata.cpp Outdated Show resolved Hide resolved
src/io/metadata.cpp Outdated Show resolved Hide resolved
src/io/metadata.cpp Outdated Show resolved Hide resolved
tests/python_package_test/test_arrow.py Outdated Show resolved Hide resolved
@jameslamb jameslamb changed the title WIP: [python-package] Allow to pass Arrow array as groups [python-package] Allow to pass Arrow array as groups Nov 14, 2023
@borchero borchero requested a review from jameslamb November 16, 2023 17:38
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me! I'm +1 on supporting Arrow arrays and chunked arrays for group, and good with the proposed changes to the Python package and added tests.

I just left one more suggestion, around also testing for chunked arrays with empty chunks.

@borchero could you please update this branch to latest master to get the newest changes from there?

@shiyu1994 @guolinke could you help with another review?



@pytest.mark.parametrize(
["array_type", "group_data"], [(pa.array, [2, 3]), (pa.chunked_array, [[2], [3]])]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add test cases covering the case where a chunked array with some empty chunks is given?

At least these two:

  • the first chunk of the chunked array is empty
  • one or more other non-first chunk of the chunked array is empty

And then could you:

  • put up a separate PR adding such test cases for the other Dataset components you've already added Arrow support for
  • add such test cases in the remaining Arrow PRs

Sorry I didn't think of this earlier. It does seem to me that it's possible to have empty chunks in a chunked array, e.g.

import pyarrow as pa

arr = pa.chunked_array(
    [[2, 3, 4], []],
    type=pa.int16()
)

pa.compute.sum(arr)
# <pyarrow.Int64Scalar: 9>

And I can imagine situations where such arrays could end up being passed into LightGBM. I'm thinking use cases similar to how Dask Arrays are commonly created by concatenating the results of multiple separate function calls, some of which may be empty, e.g.:

  • reading chunks from multiple files (and maybe applying row filters while reading them)
  • executing multiple database queries
  • paging over results from a REST API
  • streaming over chunks returned from a database (e.g. like in snowflake-connector-python)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! I didn't know that this was possible, will add tests!

Copy link
Collaborator

@guolinke guolinke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ part LGTM!

@borchero
Copy link
Collaborator Author

Here we finally go @jameslamb 😄 let me know if this kind of test suffices and I will put up another PR to add these kinds of tests for the dataset, labels, and weights.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New tests look great to me, thanks! I'll merge this once it builds.

@jameslamb jameslamb merged commit 516bde9 into microsoft:master Nov 22, 2023
39 checks passed
@jameslamb
Copy link
Collaborator

Thank you! Let's go to the next one 😁

Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants