
FPGrowth/FPMax and Association Rules with the existence of missing values (#1004) #1106

Merged
6 commits merged on Oct 23, 2024

Conversation

zazass8
Contributor

@zazass8 zazass8 commented Oct 7, 2024

Description

This update implements the FP-Growth and FP-Max algorithms from frequent_patterns so that they can handle missing values in the input dataset. The code keeps the original structure and logic of the algorithms, while computing the support metric by ignoring the missing values in the data. This gives a more realistic indication of how frequently the generated items/itemsets actually occur. Given the output of the algorithms, the corresponding association rules and metrics are also updated to take the missing values into account.

The input accepted by this implementation is a pandas.DataFrame containing binary values as 0/1 or True/False (as originally), as well as 0/1/NaN or True/False/NaN, where NaN is numpy.nan.

Please also find attached the paper this implementation follows, on association rule mining in the presence of missing values:
ragel1998.pdf
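To make the idea concrete, here is a minimal, hypothetical sketch (not the actual mlxtend implementation) of support computed while ignoring missing values: rows in which any item of the itemset is NaN are dropped from the denominator instead of being counted as absences.

```python
import numpy as np
import pandas as pd

def support_ignoring_nan(df: pd.DataFrame, itemset: list) -> float:
    """Support of `itemset`, ignoring rows where any of its items is NaN.

    Hypothetical helper for illustration; not part of mlxtend's API.
    """
    sub = df[itemset]
    known = sub.notna().all(axis=1)          # rows with no missing value for this itemset
    if known.sum() == 0:
        return 0.0
    present = (sub[known] == 1).all(axis=1)  # known rows where every item is present
    return present.sum() / known.sum()

df = pd.DataFrame({
    "beer":   [1, 0, 1, np.nan],
    "diaper": [1, 1, np.nan, 1],
})
# "beer" is known in 3 rows and present in 2 of them -> support 2/3,
# rather than 2/4 when the NaN row is treated as an absence.
print(support_ignoring_nan(df, ["beer"]))
print(support_ignoring_nan(df, ["beer", "diaper"]))
```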

Related issues or pull requests

Please use this link for the corresponding discussion of this new feature in the issue tracker.

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

@zazass8 zazass8 force-pushed the fpgrowth-nan branch 6 times, most recently from e39df31 to 7e1aab4 Compare October 13, 2024 20:55

@zazass8 zazass8 force-pushed the fpgrowth-nan branch 3 times, most recently from a671f6a to 9cae476 Compare October 14, 2024 12:57
@zazass8
Contributor Author

zazass8 commented Oct 15, 2024

@rasbt For metrics such as Zhang's and Jaccard, setting null_values=True can produce values that fall outside their defined ranges; this sometimes happens with the certainty metric as well. I haven't found anywhere in the literature a modified formula for these metrics that also accounts for the existence of NaNs. If one exists, I'm happy to update the code; otherwise, if you prefer, I could exclude these metrics from the final array that is created. Let me know :)
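As an illustration of why a bounded metric can leave its range here (made-up numbers, not mlxtend output): the standard Jaccard value for a rule A → C is only guaranteed to lie in [0, 1] when all supports are fractions of the same row count, and NaN-ignoring supports give each itemset its own denominator.

```python
# Jaccard(A -> C) = support(A∪C) / (support(A) + support(C) - support(A∪C)).
# The bound [0, 1] relies on all three supports sharing one denominator;
# with NaN-ignoring supports, each itemset may use a different one.
def jaccard(s_ac, s_a, s_c):
    return s_ac / (s_a + s_c - s_ac)

# Consistent supports (same denominator): value stays in [0, 1]
in_range = jaccard(0.3, 0.5, 0.4)        # 0.3 / 0.6 = 0.5
# NaN-adjusted supports with differing denominators can exceed 1
out_of_range = jaccard(0.6, 0.5, 0.55)   # 0.6 / 0.45 ≈ 1.33
```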

@zazass8 zazass8 force-pushed the fpgrowth-nan branch 2 times, most recently from 7d83053 to 260517e Compare October 15, 2024 12:35
@rasbt
Owner

rasbt commented Oct 22, 2024

Just wanted to say thanks for the PR! I first bookmarked it 2 weeks ago, but then I got sick and am just slowly catching up.

@zazass8 zazass8 force-pushed the fpgrowth-nan branch 2 times, most recently from ab9e6e4 to 99dad90 Compare October 22, 2024 21:25
@zazass8
Contributor Author

zazass8 commented Oct 22, 2024

No worries at all, it's my pleasure to contribute to your project! I'm glad you are feeling better now. I resolved and made the changes you mentioned above; feel free to have another look and let me know if there's anything else!

@rasbt
Owner

rasbt commented Oct 22, 2024

> No worries at all, my pleasure to contribute on your project! No worries at all, I am glad you are feeling better now. I resolved and made the changes you mentioned above, feel free to have another look and let me know if there's anything else!

Thanks for addressing these! I just had 2 follow-up comments above, but no rush.

Owner

@rasbt rasbt left a comment


Awesome work! And thanks so much for addressing all the minor concerns I have.

@rasbt rasbt merged commit 11a295e into rasbt:master Oct 23, 2024
2 checks passed
@zazass8
Contributor Author

zazass8 commented Oct 23, 2024

> Awesome work! And thanks so much for addressing all the minor concerns I have.

No worries at all, it was my pleasure to contribute. Until next time!

@d-kleine
Contributor

d-kleine commented Nov 2, 2024

fixed #1004

@jraymartinez

Memory usage regression after PR #1106

We noticed a significant memory usage increase in our production pipeline using the fpgrowth algorithm after recent updates.

Details:

  • Pipeline was running successfully weekly until November 8, 2024
  • Same input data size and parameters throughout (no changes in data volume or fpgrowth parameters)
  • Originally running on ml.c5n.9xlarge (96GB RAM)
  • After failure, upgraded to ml.c5n.18xlarge (192GB RAM) but still getting memory errors
  • Error message: "Please use an instance type with more memory, or reduce the size of job data processed on an instance"
  • When pinned to mlxtend==0.23.1 (pre-PR version), pipeline runs successfully again on original instance size

This suggests a potential memory regression introduced by PR #1106 (merged Oct 23, 2024). Could you please investigate whether these changes unintentionally increased memory usage?

Environment:

  • AWS SageMaker
  • Python 3.10
  • Input data shape: (1832691, 11161)
  • FPGrowth parameters: min_support=0.0005, max_len=15
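A back-of-envelope check (my arithmetic, not a confirmed diagnosis) shows how tight this input shape already is: one dense bool copy is roughly 20 GB, and a single float64 temporary of the same shape (the kind that can appear when NaN handling promotes a bool array to float) would on its own exceed even the upgraded 192 GB instance.

```python
# Rough memory arithmetic for the reported input shape (1832691, 11161).
rows, cols = 1_832_691, 11_161
bool_gb = rows * cols / 1e9          # NumPy bool: 1 byte per entry
float64_gb = rows * cols * 8 / 1e9   # one float64 temporary of the same shape
print(f"bool copy: {bool_gb:.1f} GB, float64 temporary: {float64_gb:.1f} GB")
```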

@rasbt
Owner

rasbt commented Nov 16, 2024

Thanks for letting us know. I will take a look at this in the next few weeks and see what's causing this. In the meantime, I recommend sticking to mlxtend==0.23.1 as you suggested.

If you have any ideas @zazass8 let me know

@zazass8
Contributor Author

zazass8 commented Nov 16, 2024

I investigated the issue. It seems that in fpcommon.py, where the disabled array is initialised, multiple temporary arrays are created in order to build the desired one. I have already worked on an alternative approach that could fix this.

Furthermore, on line 40, using np.sum together with np.logical_or consumes a lot of memory for the same reason. When I was working on the PR I originally used np.nansum() directly, as in the next line; in theory that solves the memory allocation issue, since it computes the desired result without intermediates. However, several unit tests failed on CI with that version, and I couldn't figure out why.

Finally, in generate_itemsets, the multiple if statements may again cause excessive memory allocation, and I couldn't think of another way to simplify those lines while still obtaining the desired output.

@rasbt if you have any ideas, let me know as well. Thank you for pointing this out @jraymartinez.
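To illustrate the allocation pattern being discussed (toy data and illustrative variable names, not the actual fpcommon.py code): each chained NumPy call materialises a full temporary array before the reduction runs, whereas writing into a preallocated buffer via the ufunc `out=` argument reuses memory.

```python
import numpy as np

x = np.array([[1.0, np.nan, 0.0],
              [np.nan, 1.0, 1.0]])

# Chained version: np.isnan(x), (x == 0) and np.logical_or each allocate
# a full boolean temporary of x's shape before the sum runs.
disabled = np.logical_or(np.isnan(x), x == 0)
n_disabled = disabled.sum(axis=0)

# Allocation-lighter version: reuse one buffer with the ufunc `out=` argument.
buf = np.isnan(x)
np.logical_or(buf, x == 0, out=buf)
n_disabled_out = buf.sum(axis=0)

assert np.array_equal(n_disabled, n_disabled_out)
```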

@rasbt
Owner

rasbt commented Nov 16, 2024

Thanks for the analysis @zazass8! I would say we should investigate why np.nansum didn't work; maybe that is the main culprit here.
I currently only have access to a small laptop, but I can help try some things in December when I am back.

4 participants