Feedback on best_layer selection #8

zilch42 opened this issue Jul 30, 2024 · 1 comment
zilch42 commented Jul 30, 2024

Hi there,

I'm really enjoying looking through this library, and I wanted to share some thoughts on how the 'best layer' returned by fit_predict is selected. First, I want to check my understanding: is the layer that produces the fewest outliers assumed to be the one that best fits the data, and is that therefore the layer fit_predict returns?
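To make the question concrete, here is a minimal sketch of the selection rule as I understand it. It is not taken from the library source: the function name fewest_outliers_layer is hypothetical, and it assumes outliers are labelled -1 and that each layer is a plain sequence of labels.

```python
def fewest_outliers_layer(cluster_layers, outlier_label=-1):
    """Return the index of the layer with the fewest outlier labels.

    Hypothetical helper: assumes outliers are marked with `outlier_label`.
    """
    outlier_counts = [
        sum(1 for label in layer if label == outlier_label)
        for layer in cluster_layers
    ]
    # min() keeps the first minimum, so ties go to the lower (more granular) index.
    return min(range(len(outlier_counts)), key=outlier_counts.__getitem__)


# Toy layers, most granular first (made-up labels, not real output).
layers = [
    [0, 1, -1, 2, -1, 3],   # 2 outliers
    [0, 0, -1, 1, 1, 1],    # 1 outlier
    [0, 0, 0, 0, -1, -1],   # 2 outliers
]
best = fewest_outliers_layer(layers)
print(best)  # index of the layer with the fewest outliers
```

With these toy labels the middle layer wins, which matches the behaviour I saw (cluster_layers[1] being returned rather than cluster_layers[0]).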

Initially, I thought it was a bug that fit_predict wasn't returning the most granular layer in cluster_layers (in my case it was returning cluster_layers[1]), until I looked through the code and found the best_layer calculation.

It wasn't intuitive or clear to me that the most granular layer isn't necessarily the one returned, so I think this should be documented in the docstrings of both EVoC and fit_predict.

Secondly, I think it would be good to have an option controlling which layer fit_predict returns. It is easy enough to get the most granular layer explicitly from cluster_layers[0], but in some use cases (e.g. using EVoC as a drop-in clusterer in BERTopic) that isn't possible: BERTopic is just going to call fit_predict and use whatever layer EVoC considers best. If the user sets base_min_cluster_size to control the level of granularity they expect in the resulting clusters, but EVoC then chooses a different layer as the best layer, the user won't get the granularity they expect.

It might be nice to introduce something like a layer_selection parameter accepting 'best', 'bottom', or 'top', so the user can force fit_predict to return the most granular layer if desired. 'best' could be called 'fewest_outliers' or something similar, to be more explicit about how 'best' is determined. The parameter could also take an integer to select a given layer directly (with the top layer returned if the integer is out of range).
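As a rough sketch of what I have in mind (a proposal, not existing library behaviour): select_layer is a hypothetical helper, 'bottom' is assumed to mean index 0 (most granular), 'top' the last index, and outliers are assumed to be labelled -1. Sending negative integers to the top layer is my own arbitrary choice.

```python
def select_layer(cluster_layers, layer_selection="fewest_outliers", outlier_label=-1):
    """Pick one layer from `cluster_layers` according to `layer_selection`.

    Hypothetical sketch of the proposed option, not part of any library.
    """
    if len(cluster_layers) == 0:
        raise ValueError("cluster_layers is empty")
    top = len(cluster_layers) - 1

    if isinstance(layer_selection, int) and not isinstance(layer_selection, bool):
        # Any integer outside the valid range falls back to the top layer.
        index = layer_selection if 0 <= layer_selection <= top else top
    elif layer_selection == "bottom":
        index = 0
    elif layer_selection == "top":
        index = top
    elif layer_selection in ("best", "fewest_outliers"):
        counts = [
            sum(1 for label in layer if label == outlier_label)
            for layer in cluster_layers
        ]
        index = counts.index(min(counts))  # first minimum wins on ties
    else:
        raise ValueError(f"Unknown layer_selection: {layer_selection!r}")
    return cluster_layers[index]


# Toy layers, most granular first (made-up labels).
toy_layers = [
    [0, 1, -1, 2, -1, 3],
    [0, 0, -1, 1, 1, 1],
    [0, 0, 0, 0, -1, -1],
]
print(select_layer(toy_layers, "bottom"))
print(select_layer(toy_layers, 7))  # out of range, so the top layer is returned
```

fit_predict could then call something like this on its layers, with the default left as the current behaviour so existing code is unaffected.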

Just some ideas.

@lmcinnes
Contributor

Yes, I think a few more options there might be useful. I haven't really gotten to the point of properly documenting everything, and have been travelling and on vacation for the past month. I'll see what I can do when I get back to working on this and have time available.
