Hi there,
I'm really enjoying looking through this library. I wanted to share some thoughts on how the 'best layer' returned by fit_predict is selected. First, I just wanted to check my understanding: is the approach that the layer producing the fewest outliers is assumed to be the best fit for the data, and is therefore the one returned by fit_predict?
Initially I thought it was a bug that fit_predict wasn't returning the most granular layer contained in cluster_layers (in my case it was returning cluster_layers[1]) until I went looking through the code and found the best_layer calculation.
It wasn't intuitive or clear to me that the most granular layer isn't necessarily the one returned, so I think that needs to be spelled out in the docstrings of both EVoC and fit_predict.
Secondly, I think it would be good to have an option controlling which layer is returned by fit_predict. While it is easy enough to get the most granular layer from cluster_layers[0] explicitly, in some use cases (e.g. using EVoC as a drop-in clusterer in BERTopic), BERTopic is just going to call fit_predict and use whatever EVoC decides is best. If the user sets base_min_cluster_size to control the level of granularity they expect in the resulting clusters, but EVoC then chooses a different layer as the best layer, the user won't get the granularity they expect.
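In the meantime, a thin wrapper could pin the returned layer for the drop-in case. This is only a sketch under assumptions: FixedLayerClusterer is a hypothetical name, the layers are assumed to be available on the fitted clusterer under an attribute passed as layers_attr (the default "cluster_layers_" is a guess at the attribute name), and index 0 is assumed to be the most granular layer, as described above. It exposes both fit_predict and a fit/labels_ pair, since callers differ in which they use.

```python
class FixedLayerClusterer:
    """Hypothetical wrapper that always returns one chosen layer.

    `clusterer` is any object with a fit_predict(X) method that, after
    fitting, exposes a sequence of label arrays under `layers_attr`.
    """

    def __init__(self, clusterer, layer=0, layers_attr="cluster_layers_"):
        self.clusterer = clusterer
        self.layer = layer              # 0 is assumed to be the most granular layer
        self.layers_attr = layers_attr  # assumed attribute name; adjust as needed

    def fit(self, X, y=None):
        # Run the wrapped clusterer, ignoring the layer it chose itself.
        self.clusterer.fit_predict(X)
        layers = getattr(self.clusterer, self.layers_attr)
        self.labels_ = layers[self.layer]
        return self

    def fit_predict(self, X, y=None):
        return self.fit(X, y).labels_
```

With this, FixedLayerClusterer(EVoC(base_min_cluster_size=...)) could be handed to whatever expects a clusterer, though I haven't checked it against BERTopic itself.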
It might be nice to introduce something like layer_selection = ['best', 'bottom', 'top'] so the user can force fit_predict to return the most granular layer if desired. 'best' could be renamed 'fewest_outliers' or similar, to be more explicit about how 'best' is determined. The parameter could also accept an integer to select a specific layer (with the top layer returned if the integer is out of range).
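To make the proposal concrete, here is a rough sketch of how such a parameter might be resolved. None of this is existing library code: select_layer and the option names are hypothetical, outliers are assumed to be labelled -1, and the layers are assumed to be ordered from most granular (index 0) to coarsest (last index).

```python
OUTLIER = -1  # assumed label for unclustered points


def select_layer(cluster_layers, layer_selection="fewest_outliers"):
    """Pick one label array from cluster_layers (hypothetical helper)."""
    if not cluster_layers:
        raise ValueError("cluster_layers is empty")
    top = len(cluster_layers) - 1

    if isinstance(layer_selection, bool):
        raise ValueError("layer_selection must be a string or an integer")
    if isinstance(layer_selection, int):
        # Out-of-range integers fall back to the top (coarsest) layer.
        index = layer_selection if 0 <= layer_selection <= top else top
    elif layer_selection == "bottom":
        index = 0
    elif layer_selection == "top":
        index = top
    elif layer_selection == "fewest_outliers":
        counts = [sum(1 for label in layer if label == OUTLIER)
                  for layer in cluster_layers]
        # On ties, list.index keeps the first, i.e. the more granular layer.
        index = counts.index(min(counts))
    else:
        raise ValueError(f"unknown layer_selection: {layer_selection!r}")
    return cluster_layers[index]
```

fit_predict could then end with something like return select_layer(layers, self.layer_selection), keeping the current behaviour as the default.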
Just some ideas.
Yes, I think a few more options there might be useful. I haven't really gotten to the point of properly documenting everything, and have been travelling and on vacation for the past month. I'll see what I can do when I get back to working on this and have time available.