[python-package] Infer feature names from `pyarrow.Table` #6781

mlondschien · 2025-01-09T17:38:09Z

Bugfix for #6780.

mlondschien · 2025-01-09T17:47:47Z

xref #6782. This bugfix works only if cffi is installed.

mlondschien · 2025-01-09T20:26:38Z

@microsoft-github-policy-service agree

borchero

Once CI passes, this looks good to me, thank you for the fix!

jameslamb

Thanks for this! I disagree with @borchero though, I don't think this is ready to merge yet. Please see my first round of suggestions.

jameslamb · 2025-01-09T20:46:49Z

tests/python_package_test/test_arrow.py

+
+
+def test_arrow_categorical():
+    data = generate_random_arrow_table(10, 10000, 42)


Suggested change

data = generate_random_arrow_table(10, 10000, 42)

data = generate_random_arrow_table(

num_columns=3,

num_datapoints=1_000,

seed=42

)

Since this is just about column names and not related to data shape, can we please use a smaller dataset? To save some time and memory in tests.

jameslamb · 2025-01-09T20:48:09Z

tests/python_package_test/test_arrow.py

+    data = generate_random_arrow_table(10, 10000, 42)
+    dataset = lgb.Dataset(
+        data,
+        label=generate_random_arrow_array(10000, 43, generate_nulls=False, values=np.arange(4)),


Suggested change

label=generate_random_arrow_array(10000, 43, generate_nulls=False, values=np.arange(4)),

label=generate_random_arrow_array(num_datapoints=data.shape[0], seed=43, generate_nulls=False, values=np.arange(4)),

Let's please have this reference the size of the feature data, instead of repeating it.

jameslamb · 2025-01-09T20:50:37Z

tests/python_package_test/test_arrow.py

+        categorical_feature=["col_0"]
+    )
+    booster = lgb.train({"num_leaves": 7}, dataset, num_boost_round=5)
+    assert_equal_predict_arrow_pandas(booster, data)


This PR is about ensuring that the feature names are derived correctly, but the test is just checking that predicting on pandas and pyarrow is the same.

Could you please modify the test such that the input data has custom feature names (not col_0, col_1, etc. similar to what LightGBM automatically assigns internally)? Like red, black, green or something (any random nonsense is fine).

And then explicitly test that booster.feature_name() returns the exact expected values?

That'd be a stronger test that this is working as expected.

This PR is about ensuring that the feature names are derived correctly

IIRC, I think this PR is really only about ensuring that categorical_feature is picked up correctly. Feature names were already stored correctly (and tested for) beforehand 🤔

I don't think so. The issue says:

the docs [say that] if feature_name="auto", a lightgbm.Dataset can infer feature names from the column names of a pyarrow table. It appears that this does not happen.

And this PR is titled "Infer feature names from pyarrow.Table".

I also don't see any tests on the correct handling for feature names in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_arrow.py (but please correct me if I'm wrong).

I also don't see any tests on the correct handling for feature names

I think this is implicitly covered by calls to

LightGBM/tests/python_package_test/test_arrow.py

Lines 138 to 141 in e0c34e7

def assert_datasets_equal(tmp_path: Path, lhs: lgb.Dataset, rhs: lgb.Dataset):

lhs._dump_text(tmp_path / "arrow.txt")

rhs._dump_text(tmp_path / "pandas.txt")

assert filecmp.cmp(tmp_path / "arrow.txt", tmp_path / "pandas.txt")

hmmm maybe, but that's related to what I'm asking for in this thread... more explicit tests.

Bugs like "feature names for both pandas and pyarrow tables are ignored" wouldn't be caught by a test like that. In other words, these two things are related but not identical:

"handling of pyarrow tables is consistent with its handling of pandas dataframes"

"handling of pyarrow tables is correct"

I'd prefer we explicitly test both.

Yup, fair enough 👍🏼

The feature_name were correctly assigned before. It's just that they were assigned after the check for categorical_feature's, so this failed.

I had just copied tests from test_pandas and aligned them with other test_arrow tests. I can make the test check feature_name()

jameslamb · 2025-01-09T20:52:38Z

python-package/lightgbm/basic.py

@@ -2126,6 +2126,8 @@ def _lazy_init(
                categorical_feature=categorical_feature,
                pandas_categorical=self.pandas_categorical,
            )
+        elif _is_pyarrow_table(data):


Suggested change

elif _is_pyarrow_table(data):

elif isinstance(str, feature_name) and feature_name == "auto" and _is_pyarrow_table(data):

As currently written, your patch would overwrite whatever was passed in via the feature_name keyword argument. We only want that behavior when feature_name="auto", just as with pandas:

LightGBM/python-package/lightgbm/basic.py

Lines 840 to 842 in e0c34e7

# determine feature names

if feature_name == "auto":

feature_name = [str(col) for col in data.columns]

Please ensure this use of data.column_names only happens when feature_name="auto", and add a unit test case confirming that.

Ah phew, I think I misinterpreted how feature_name is supposed to work in my original Arrow-related PRs 🤔 I think this PR fixes a broader issue then 😄 passing feature_name has always been ignored for PyArrow tables, right?

Yeah I think you're right, I never noticed this either!

But there are legitimate reasons for wanting to use the feature_name argument instead of automatic feature names from the columns. For example, if you want have very-long feature names and want to reduce the size of the model file. Of if your feature names in the source data contain sensitive information that you don't want to be encoded in the model.

Those are slightly niche and "just change the names in the input data" would be a fair response to them... but since we've supported that with pandas for a while I think we should support it for all dataframe-like inputs.

Those are slightly niche and "just change the names in the input data" would be a fair response to them... but since we've supported that with pandas for a while I think we should support it for all dataframe-like inputs.

I agree! Given the long-standing behavior for pandas, we should adapt the PyArrow table support to do the same as pandas.

jameslamb · 2025-01-09T21:13:22Z

python-package/lightgbm/basic.py

@@ -2126,6 +2126,8 @@ def _lazy_init(
                categorical_feature=categorical_feature,
                pandas_categorical=self.pandas_categorical,
            )
+        elif _is_pyarrow_table(data):


Yeah I think you're right, I never noticed this either!

But there are legitimate reasons for wanting to use the feature_name argument instead of automatic feature names from the columns. For example, if you want have very-long feature names and want to reduce the size of the model file. Of if your feature names in the source data contain sensitive information that you don't want to be encoded in the model.

Those are slightly niche and "just change the names in the input data" would be a fair response to them... but since we've supported that with pandas for a while I think we should support it for all dataframe-like inputs.

mlondschien · 2025-01-10T09:18:02Z

I didn't find any changelog to update.

Infer feature names from .

54cda90

mlondschien added 2 commits January 9, 2025 21:17

Assigning once is enough.

7ba4fd6

Add test.

2bfe09e

mlondschien marked this pull request as ready for review January 9, 2025 20:17

mlondschien requested review from guolinke, jameslamb, shiyu1994, jmoralez, borchero and StrikerRUS as code owners January 9, 2025 20:17

Use _is_pyarrow_table

9ea6d2e

borchero approved these changes Jan 9, 2025

View reviewed changes

jameslamb added the fix label Jan 9, 2025

jameslamb changed the title ~~Infer feature names from pyarrow.Table~~ [python-package] Infer feature names from pyarrow.Table Jan 9, 2025

jameslamb requested changes Jan 9, 2025

View reviewed changes

jameslamb self-requested a review January 9, 2025 21:13

jameslamb requested changes Jan 9, 2025

View reviewed changes

mlondschien added 3 commits January 10, 2025 10:05

Only reassign feature_name if feature_name == 'auto'

c989d9d

Update tests.

b316f07

Format.

86ead3f

mlondschien requested review from borchero and jameslamb January 10, 2025 09:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python-package] Infer feature names from `pyarrow.Table` #6781

[python-package] Infer feature names from `pyarrow.Table` #6781

mlondschien commented Jan 9, 2025

mlondschien commented Jan 9, 2025

mlondschien commented Jan 9, 2025

borchero left a comment •

edited

Loading

jameslamb left a comment

jameslamb Jan 9, 2025

jameslamb Jan 9, 2025

jameslamb Jan 9, 2025

borchero Jan 9, 2025

jameslamb Jan 9, 2025 •

edited

Loading

borchero Jan 9, 2025 •

edited

Loading

jameslamb Jan 9, 2025

borchero Jan 9, 2025

mlondschien Jan 10, 2025

jameslamb Jan 9, 2025

borchero Jan 9, 2025

jameslamb Jan 9, 2025

borchero Jan 9, 2025 •

edited

Loading

jameslamb Jan 9, 2025

mlondschien commented Jan 10, 2025



		def test_arrow_categorical():
		data = generate_random_arrow_table(10, 10000, 42)

	label=generate_random_arrow_array(10000, 43, generate_nulls=False, values=np.arange(4)),
	label=generate_random_arrow_array(num_datapoints=data.shape[0], seed=43, generate_nulls=False, values=np.arange(4)),

	def assert_datasets_equal(tmp_path: Path, lhs: lgb.Dataset, rhs: lgb.Dataset):
	lhs._dump_text(tmp_path / "arrow.txt")
	rhs._dump_text(tmp_path / "pandas.txt")
	assert filecmp.cmp(tmp_path / "arrow.txt", tmp_path / "pandas.txt")

	elif _is_pyarrow_table(data):
	elif isinstance(str, feature_name) and feature_name == "auto" and _is_pyarrow_table(data):

	# determine feature names
	if feature_name == "auto":
	feature_name = [str(col) for col in data.columns]

[python-package] Infer feature names from pyarrow.Table #6781

Are you sure you want to change the base?

[python-package] Infer feature names from pyarrow.Table #6781

Conversation

mlondschien commented Jan 9, 2025

mlondschien commented Jan 9, 2025

mlondschien commented Jan 9, 2025

borchero left a comment • edited Loading

Choose a reason for hiding this comment

jameslamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jameslamb Jan 9, 2025 • edited Loading

Choose a reason for hiding this comment

borchero Jan 9, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

borchero Jan 9, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlondschien commented Jan 10, 2025

[python-package] Infer feature names from `pyarrow.Table` #6781

[python-package] Infer feature names from `pyarrow.Table` #6781

borchero left a comment •

edited

Loading

jameslamb Jan 9, 2025 •

edited

Loading

borchero Jan 9, 2025 •

edited

Loading

borchero Jan 9, 2025 •

edited

Loading