I used FPGrowth to make this recommender web app: wackysubs.com #896
geoffreya started this conversation in Show and tell
Replies: 1 comment · 3 replies
-
Glad to hear! And thanks so much for sharing your experience regarding the performance considerations.
-
My new website wackysubs.com was built using association rule mining over all the subreddit names that people posted comments to on reddit.com during two months of 2018. Soon I will get a more recent dataset, train on that, and deploy it to the website. The purpose of my web app is to help people find obscure but interesting subreddits they might enjoy but didn't know existed.
Computer memory during model training was the main limitation on building a model that includes the newest and least-popular subreddits. I had to set a floor of min_confidence = 0.0003 (no lower than this) to fit into my 128 GB of RAM plus 128 GB of NVMe-backed swap.
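To put the RAM pressure in perspective, here's a back-of-envelope sketch with invented sizes (not my actual dataset numbers): even the dense one-hot commenters-by-subreddits input frame blows past 128 GB quickly, before the mined model itself is counted.

```python
# Illustrative sizes only; the real two-month Reddit dump differs.
n_users = 5_000_000       # hypothetical number of commenters
n_subreddits = 100_000    # hypothetical number of subreddit columns
bytes_per_cell = 1        # pandas bool dtype uses 1 byte per cell

dense_gb = n_users * n_subreddits * bytes_per_cell / 1e9
print(f"dense bool frame: ~{dense_gb:.0f} GB")  # ~500 GB, far over 128 GB
```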
After training I was able to delete many rows and columns of the dataframe that are not relevant for my app. I did not need any antecedents or consequents of length greater than 3. So, during postprocessing I removed those, shrinking the model to 10% of its original size with no loss of functionality for my app, and greatly speeding up the query that runs on click of the Find button. I also did some other tricks to speed up querying the trained model, notably eliminating the frozensets completely and replacing them with simple 1/0 columns in a very wide dataframe of all the subreddits. I found top query performance this way using pandas' Feather I/O (via pyarrow).
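The postprocessing steps above can be sketched roughly like this, on a toy stand-in for the mined rules table (the rows, subreddit names, and confidence values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for an association-rules table: frozenset
# antecedents/consequents plus a confidence score (invented rows).
rules = pd.DataFrame({
    "antecedents": [frozenset({"AskReddit"}),
                    frozenset({"gaming", "pcmasterrace", "buildapc", "hardware"})],
    "consequents": [frozenset({"AskOuija"}), frozenset({"nvidia"})],
    "confidence": [0.0007, 0.0012],
})

# Drop rules longer than 3 on either side, as described above.
short = rules[rules["antecedents"].map(len).le(3)
              & rules["consequents"].map(len).le(3)].reset_index(drop=True)

# Replace the antecedent frozensets with 1/0 indicator columns, one per
# subreddit, so queries become vectorized column lookups.
onehot = (pd.DataFrame([{s: 1 for s in a} for a in short["antecedents"]])
            .fillna(0).astype("int8"))
flat = pd.concat([short.drop(columns="antecedents"), onehot], axis=1)

# Feather round-trips this wide frame quickly (requires pyarrow):
# flat.to_feather("rules.feather")
print(sorted(flat.columns))
```

The second (length-4) rule is filtered out, and the surviving rule's antecedent becomes a plain `AskReddit` integer column instead of a frozenset.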
As a proposal: people with requirements like mine could build a better, more inclusive model with an even lower min_confidence if mlxtend's FPGrowth or association_rules let me specify a limit on rule length in advance, so that the search never puts things I won't use (long rules) into the model tree, and into RAM, in the first place. Just my 2 cents!
Hope you enjoy the app! It was enjoyable using mlxtend, which I found very bug-free compared to other open-source libraries in my experience. Thanks for making mlxtend!