
[python-package] Infer feature names from pyarrow.Table #6781

Open · wants to merge 7 commits into master

Conversation

@mlondschien (Author)

Bugfix for #6780.

@mlondschien (Author)

xref #6782. This bugfix works only if cffi is installed.

@mlondschien mlondschien marked this pull request as ready for review January 9, 2025 20:17
@mlondschien (Author)

@microsoft-github-policy-service agree

@borchero (Collaborator)

Once CI passes, this looks good to me, thank you for the fix!

@jameslamb jameslamb added the fix label Jan 9, 2025
@jameslamb jameslamb changed the title Infer feature names from pyarrow.Table [python-package] Infer feature names from pyarrow.Table Jan 9, 2025
@jameslamb (Collaborator)

Thanks for this! I disagree with @borchero though; I don't think this is ready to merge yet. Please see my first round of suggestions.



Review thread on:

    def test_arrow_categorical():
        data = generate_random_arrow_table(10, 10000, 42)

Collaborator:

Suggested change:

    -    data = generate_random_arrow_table(10, 10000, 42)
    +    data = generate_random_arrow_table(
    +        num_columns=3,
    +        num_datapoints=1_000,
    +        seed=42
    +    )

Since this is just about column names and not related to the data shape, can we please use a smaller dataset, to save some time and memory in tests?

Review thread on:

    data = generate_random_arrow_table(10, 10000, 42)
    dataset = lgb.Dataset(
        data,
        label=generate_random_arrow_array(10000, 43, generate_nulls=False, values=np.arange(4)),

Collaborator:

Suggested change:

    -        label=generate_random_arrow_array(10000, 43, generate_nulls=False, values=np.arange(4)),
    +        label=generate_random_arrow_array(num_datapoints=data.shape[0], seed=43, generate_nulls=False, values=np.arange(4)),

Let's please have this reference the size of the feature data, instead of repeating it.

        categorical_feature=["col_0"]
    )
    booster = lgb.train({"num_leaves": 7}, dataset, num_boost_round=5)
    assert_equal_predict_arrow_pandas(booster, data)

Collaborator:

This PR is about ensuring that the feature names are derived correctly, but the test only checks that predicting on pandas and pyarrow gives the same result.

Could you please modify the test so that the input data has custom feature names (not col_0, col_1, etc., similar to what LightGBM automatically assigns internally)? Like red, black, green or something (any random nonsense is fine).

And then explicitly test that booster.feature_name() returns the exact expected values?

That would be a stronger test that this is working as expected.

@borchero (Collaborator)

> This PR is about ensuring that the feature names are derived correctly

IIRC, I think this PR is really only about ensuring that categorical_feature is picked up correctly. Feature names were already stored correctly (and tested for) beforehand 🤔

@jameslamb (Collaborator), Jan 9, 2025

I don't think so. The issue says:

> the docs [say that] if feature_name="auto", a lightgbm.Dataset can infer feature names from the column names of a pyarrow table. It appears that this does not happen.

And this PR is titled "Infer feature names from pyarrow.Table".

I also don't see any tests on the correct handling for feature names in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_arrow.py (but please correct me if I'm wrong).

@borchero (Collaborator), Jan 9, 2025

> I also don't see any tests on the correct handling for feature names

I think this is implicitly covered by calls to

    def assert_datasets_equal(tmp_path: Path, lhs: lgb.Dataset, rhs: lgb.Dataset):
        lhs._dump_text(tmp_path / "arrow.txt")
        rhs._dump_text(tmp_path / "pandas.txt")
        assert filecmp.cmp(tmp_path / "arrow.txt", tmp_path / "pandas.txt")

@jameslamb (Collaborator)

Hmmm, maybe, but that's related to what I'm asking for in this thread... more explicit tests.

Bugs like "feature names for both pandas and pyarrow tables are ignored" wouldn't be caught by a test like that. In other words, these two things are related but not identical:

  • "handling of pyarrow tables is consistent with its handling of pandas dataframes"
  • "handling of pyarrow tables is correct"

I'd prefer we explicitly test both.

@borchero (Collaborator)

Yup, fair enough 👍🏼

@mlondschien (Author)

The feature_name values were correctly assigned before. It's just that they were assigned after the check for categorical_feature, so that check failed.

I had just copied tests from test_pandas and aligned them with other test_arrow tests. I can make the test check feature_name().
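The ordering problem described here can be illustrated with a hypothetical simplification (this is not LightGBM's actual code, and `resolve_categorical` is an invented name): mapping categorical feature names to column indices only works if the feature-name list has already been expanded from the table's columns.

```python
def resolve_categorical(categorical_feature, feature_name):
    # Map categorical feature names to column indices. This fails if
    # feature_name is still the unexpanded "auto" sentinel, i.e. the
    # column names have not been filled in yet.
    if feature_name == "auto":
        raise ValueError("feature names have not been inferred yet")
    return [feature_name.index(c) for c in categorical_feature]

# Works once the names have been inferred first:
names = ["col_0", "col_1", "col_2"]
resolve_categorical(["col_0"], names)  # -> [0]
```

Running the lookup before the names are inferred (i.e. while feature_name is still "auto") raises, which mirrors the failure mode the patch fixes by assigning names before the categorical_feature check.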

Review thread on:

    @@ -2126,6 +2126,8 @@ def _lazy_init(
                    categorical_feature=categorical_feature,
                    pandas_categorical=self.pandas_categorical,
                )
            elif _is_pyarrow_table(data):

@jameslamb (Collaborator)

Suggested change:

    -        elif _is_pyarrow_table(data):
    +        elif isinstance(feature_name, str) and feature_name == "auto" and _is_pyarrow_table(data):

As currently written, your patch would overwrite whatever was passed in via the feature_name keyword argument. We only want that behavior when feature_name="auto", just as with pandas:

    # determine feature names
    if feature_name == "auto":
        feature_name = [str(col) for col in data.columns]

Please ensure this use of data.column_names only happens when feature_name="auto", and add a unit test case confirming that.
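The guarded behavior requested here can be sketched as a standalone helper (a hypothetical simplification mirroring the pandas logic quoted above; `infer_feature_names` is not a real LightGBM function):

```python
def infer_feature_names(feature_name, column_names):
    # Only fall back to the table's column names when the caller asked
    # for automatic inference; an explicitly passed list is untouched.
    if isinstance(feature_name, str) and feature_name == "auto":
        return [str(col) for col in column_names]
    return feature_name

infer_feature_names("auto", ["a", "b"])      # -> ["a", "b"]
infer_feature_names(["x", "y"], ["a", "b"])  # -> ["x", "y"]
```

The isinstance check matters because feature_name may be either the string sentinel "auto" or a list of names; comparing a list against "auto" is always False, so explicit names pass through unchanged.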

@borchero (Collaborator)

Ah phew, I think I misinterpreted how feature_name is supposed to work in my original Arrow-related PRs 🤔 I think this PR fixes a broader issue then 😄 Passing feature_name has always been ignored for pyarrow tables, right?

@jameslamb (Collaborator)

Yeah, I think you're right; I never noticed this either!

But there are legitimate reasons for wanting to use the feature_name argument instead of automatic feature names from the columns. For example, if you have very long feature names and want to reduce the size of the model file. Or if the feature names in your source data contain sensitive information that you don't want encoded in the model.

Those are slightly niche, and "just change the names in the input data" would be a fair response to them... but since we've supported this with pandas for a while, I think we should support it for all dataframe-like inputs.

@borchero (Collaborator), Jan 9, 2025

> Those are slightly niche and "just change the names in the input data" would be a fair response to them... but since we've supported that with pandas for a while I think we should support it for all dataframe-like inputs.

I agree! Given the long-standing behavior for pandas, we should adapt the pyarrow table support to do the same as pandas.

@jameslamb jameslamb self-requested a review January 9, 2025 21:13

@mlondschien (Author)

I didn't find any changelog to update.
