-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] Infer feature names from pyarrow.Table
#6781
base: master
Are you sure you want to change the base?
Conversation
xref #6782. This bugfix works only if |
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once CI passes, this looks good to me, thank you for the fix!
pyarrow.Table
pyarrow.Table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I disagree with @borchero though, I don't think this is ready to merge yet. Please see my first round of suggestions.
|
||
|
||
def test_arrow_categorical(): | ||
data = generate_random_arrow_table(10, 10000, 42) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data = generate_random_arrow_table(10, 10000, 42) | |
data = generate_random_arrow_table( | |
num_columns=3, | |
num_datapoints=1_000, | |
seed=42 | |
) |
Since this is just about column names and not related to data shape, can we please use a smaller dataset? To save some time and memory in tests.
data = generate_random_arrow_table(10, 10000, 42) | ||
dataset = lgb.Dataset( | ||
data, | ||
label=generate_random_arrow_array(10000, 43, generate_nulls=False, values=np.arange(4)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
label=generate_random_arrow_array(10000, 43, generate_nulls=False, values=np.arange(4)), | |
label=generate_random_arrow_array(num_datapoints=data.shape[0], seed=43, generate_nulls=False, values=np.arange(4)), |
Let's please have this reference the size of the feature data, instead of repeating it.
categorical_feature=["col_0"] | ||
) | ||
booster = lgb.train({"num_leaves": 7}, dataset, num_boost_round=5) | ||
assert_equal_predict_arrow_pandas(booster, data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is about ensuring that the feature names are derived correctly, but the test is just checking that predicting on pandas
and pyarrow
is the same.
Could you please modify the test such that the input data has custom feature names (not col_0
, col_1
, etc. similar to what LightGBM automatically assigns internally)? Like red
, black
, green
or something (any random nonsense is fine).
And then explicitly test that booster.feature_name()
returns the exact expected values?
That'd be a stronger test that this is working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is about ensuring that the feature names are derived correctly
IIRC, I think this PR is really only about ensuring that categorical_feature
is picked up correctly. Feature names were already stored correctly (and tested for) beforehand 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. The issue says:
the docs [say that] if feature_name="auto", a lightgbm.Dataset can infer feature names from the column names of a pyarrow table. It appears that this does not happen.
And this PR is titled "Infer feature names from pyarrow.Table
".
I also don't see any tests on the correct handling for feature names in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_arrow.py (but please correct me if I'm wrong).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't see any tests on the correct handling for feature names
I think this is implicitly covered by calls to
LightGBM/tests/python_package_test/test_arrow.py
Lines 138 to 141 in e0c34e7
def assert_datasets_equal(tmp_path: Path, lhs: lgb.Dataset, rhs: lgb.Dataset): | |
lhs._dump_text(tmp_path / "arrow.txt") | |
rhs._dump_text(tmp_path / "pandas.txt") | |
assert filecmp.cmp(tmp_path / "arrow.txt", tmp_path / "pandas.txt") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm maybe, but that's related to what I'm asking for in this thread... more explicit tests.
Bugs like "feature names for both pandas
and pyarrow
tables are ignored" wouldn't be caught by a test like that. In other words, these two things are related but not identical:
- "handling of
pyarrow
tables is consistent with its handling ofpandas
dataframes" - "handling of
pyarrow
tables is correct"
I'd prefer we explicitly test both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, fair enough 👍🏼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feature_name
were correctly assigned before. It's just that they were assigned after the check for categorical_feature
's, so this failed.
I had just copied tests from test_pandas
and aligned them with other test_arrow
tests. I can make the test check feature_name()
python-package/lightgbm/basic.py
Outdated
@@ -2126,6 +2126,8 @@ def _lazy_init( | |||
categorical_feature=categorical_feature, | |||
pandas_categorical=self.pandas_categorical, | |||
) | |||
elif _is_pyarrow_table(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif _is_pyarrow_table(data): | |
elif isinstance(str, feature_name) and feature_name == "auto" and _is_pyarrow_table(data): |
As currently written, your patch would overwrite whatever was passed in via the feature_name
keyword argument. We only want that behavior when feature_name="auto"
, just as with pandas
:
LightGBM/python-package/lightgbm/basic.py
Lines 840 to 842 in e0c34e7
# determine feature names | |
if feature_name == "auto": | |
feature_name = [str(col) for col in data.columns] |
Please ensure this use of data.column_names
only happens when feature_name="auto"
, and add a unit test case confirming that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah phew, I think I misinterpreted how feature_name
is supposed to work in my original Arrow-related PRs 🤔 I think this PR fixes a broader issue then 😄 passing feature_name
has always been ignored for PyArrow tables, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think you're right, I never noticed this either!
But there are legitimate reasons for wanting to use the feature_name
argument instead of automatic feature names from the columns. For example, if you want have very-long feature names and want to reduce the size of the model file. Of if your feature names in the source data contain sensitive information that you don't want to be encoded in the model.
Those are slightly niche and "just change the names in the input data" would be a fair response to them... but since we've supported that with pandas
for a while I think we should support it for all dataframe-like inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are slightly niche and "just change the names in the input data" would be a fair response to them... but since we've supported that with pandas for a while I think we should support it for all dataframe-like inputs.
I agree! Given the long-standing behavior for pandas, we should adapt the PyArrow table support to do the same as pandas.
python-package/lightgbm/basic.py
Outdated
@@ -2126,6 +2126,8 @@ def _lazy_init( | |||
categorical_feature=categorical_feature, | |||
pandas_categorical=self.pandas_categorical, | |||
) | |||
elif _is_pyarrow_table(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think you're right, I never noticed this either!
But there are legitimate reasons for wanting to use the feature_name
argument instead of automatic feature names from the columns. For example, if you want have very-long feature names and want to reduce the size of the model file. Of if your feature names in the source data contain sensitive information that you don't want to be encoded in the model.
Those are slightly niche and "just change the names in the input data" would be a fair response to them... but since we've supported that with pandas
for a while I think we should support it for all dataframe-like inputs.
I didn't find any changelog to update. |
Bugfix for #6780.