Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate scikit-learn's set_output method into TransactionEncoder #1087

Merged
merged 20 commits into from
Mar 30, 2024

Conversation

it176131
Copy link
Contributor

@it176131 it176131 commented Mar 25, 2024

Code of Conduct

Description

This defines the :method:get_feature_names_out in :class:TransactionEncoder to expose the :method:set_output.

Related issues or pull requests

Fixes #1085

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
    • 20 preexisting unit tests did not pass before work was started. I added two new unit tests to cover the work requested. Both new tests pass and the new work does not affect the prior tests.
  • Checked for style issues by running flake8 ./mlxtend
    • Prior work outside of this PR does not pass flake8 v7.0.0 guidelines. See mlxtend/feature_selection/column_selector.py:81

	- Added two new tests, `test_get_feature_names_out` and `test_set_output`. Passing these tests is a step towards the output of `TransactionEncoder` being formatted as a pandas.DataFramed by default.
	- Added `get_feature_names_out` method to `TransactionEncoder` to expose the `set_output` method.
	- Updated test to include more checks. It is now back in a failing state.
	- Updated test_set_output docstring to be more explicit.
	- Added numpy assertion to check that the transformed output columns match the original columns_ attribute for test_set_output.
	- Added numpy assertion to check that the get_feature_names_out output match the original columns_ attribute for test_get_feature_names_out.
	- Added logic similar to that in `sklearn.base.ClassNamePrefixFeaturesOutMixin` and `sklearn.base.OneToOneFeatureMixin` for the get_feature_names_out method.
	- Updated the user guide to show both the get_feature_names_out method and the set_output method.
	- Updated changelog to reflect new features.
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

	- Updated issue number.
	- Updated issue number (again) to reflect the PR link instead of the issue link.
	- Ran isort over imports to fix failing check in PR.
@rasbt
Copy link
Owner

rasbt commented Mar 26, 2024

Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

	- Increased scikit-learn version to minimum required for set_output to work.
@it176131
Copy link
Contributor Author

Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

Sure thing! I wasn't sure what to put regarding newer version release dates so I bumped the patch number by one and set the release date to "TBD". Lmk if I should change anything.

@it176131
Copy link
Contributor Author

Hey @rasbt it looks like the pipeline checks keep failing for files unrelated to this PR. Should I log a separate issue and open a new PR to fix those too?

@rasbt
Copy link
Owner

rasbt commented Mar 28, 2024

I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.

From a quick look, there seems to be a more major problem though:

      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.

Screenshot 2024-03-28 at 8 35 52 AM

And thanks again for your efforts ... I wish I could be more responsive but it's a busy week

@it176131
Copy link
Contributor Author

I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.

From a quick look, there seems to be a more major problem though:

      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.

Screenshot 2024-03-28 at 8 35 52 AM And thanks again for your efforts ... I wish I could be more responsive but it's a busy week

I'll do some digging. If I can replicate the issue and come up with a solution I'll submit it in a separate PR.

@rasbt
Copy link
Owner

rasbt commented Mar 28, 2024

No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

@rasbt
Copy link
Owner

rasbt commented Mar 28, 2024

Hm, I think the new failures could be due to the sklearn version bump

@it176131
Copy link
Contributor Author

No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

I noticed that scikit-learn 1.1.3 is being installed in the github workflow. Can we bump it to 1.2.2 as that is required for set_output to work?

@rasbt
Copy link
Owner

rasbt commented Mar 28, 2024

Yes, please feel free to bump it up. We probably need to fix some other places that have not been adjusted for the most recent version though

	- Bumped scikit-learn version up to 1.2.2 to match requirements.txt.
	- Bumped scikit-learn version up to 1.2.2 to match environment.yml and requirements.txt.
@it176131
Copy link
Contributor Author

Running test_inverse_transform locally, it appears the error is raised when np.array(data_sorted) is called. This is because the data_sorted is list of lists where the nested lists may have varying lengths (same for oht.inverse_transform(expect)). I see two potential ways around this:

  1. Use a simpler assertion like assert data_sorted == oht.inverse_transform(expect)
  2. Add dtype="object" to the np.array constructors

Both result in the test passing. I'm in favor of the first option as it doesn't require any changes to the data_sorted and the expected output type is a list rather than an np.ndarray.

	- Updated `test_inverse_transform` to passing state by removing conversion to numpy array.
@it176131
Copy link
Contributor Author

Turns out scikit-learn version 1.2.2 had a bug in the set_output API that failed when the input was not a pandas.DataFrame. See this PR for details. The fix came in scikit-learn version 1.3.1. Bumping the scikit-learn version to it fixes the issue. Commit(s) to follow.

	- Updated scikit-learn version to 1.3.1 to integerate fix from scikit-learn/scikit-learn#27044

modified:   environment.yml
	- Updated scikit-learn version to 1.3.1 to integerate fix from scikit-learn/scikit-learn#27044

modified:   requirements.txt
	- Updated scikit-learn version to 1.3.1 to integerate fix from scikit-learn/scikit-learn#27044
@it176131
Copy link
Contributor Author

Of course bumping the version results in more failed tests 🤦

I'm going to make a separate PR to handle the scikit-learn version bump and the failed unit tests. THEN maybe this will work. Didn't realize it was going to take so much work 😅

@rasbt
Copy link
Owner

rasbt commented Mar 29, 2024

Arg sorry. Yeah, a sklearn bump was overdue but I recently didn't have the time to look into it since I wasn't using the affected features. If this is too much work, don't worry about it, I can understand if you want to drop this. I could revisit the version bump in the upcoming weeks then, address this, and then merge your PR once it's addressed.

Copy link

codecov bot commented Mar 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.32%. Comparing base (e82c9c5) to head (f018a8d).

❗ Current head f018a8d differs from pull request most recent head 44961b5. Consider uploading reports for the commit 44961b5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1087      +/-   ##
==========================================
+ Coverage   78.29%   78.32%   +0.03%     
==========================================
  Files         196      196              
  Lines       11140    11157      +17     
  Branches     1404     1404              
==========================================
+ Hits         8722     8739      +17     
  Misses       2200     2200              
  Partials      218      218              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@it176131
Copy link
Contributor Author

@rasbt looks like all checks passed. Ready to merge 😎

Copy link
Owner

@rasbt rasbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thanks so much for the efforts! And sorry again about the unit test hassles!

mlxtend/preprocessing/transactionencoder.py Outdated Show resolved Hide resolved
mlxtend/preprocessing/transactionencoder.py Outdated Show resolved Hide resolved
mlxtend/preprocessing/transactionencoder.py Outdated Show resolved Hide resolved
@rasbt rasbt merged commit 506a4d5 into rasbt:master Mar 30, 2024
2 checks passed
@it176131 it176131 deleted the issue_1085 branch March 30, 2024 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate scikit-learn's set_output method into TransactionEncoder
2 participants