-
Notifications
You must be signed in to change notification settings - Fork 780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for cuml hdbscan membership_vector #1324
base: master
Are you sure you want to change the base?
add support for cuml hdbscan membership_vector #1324
Conversation
cuml version 23.04 now has membership_vector, so this allows topic_model.transform() to calculate the probability matrix if using a cuml-based hdbscan model
Thanks for the PR! Did you check whether the |
6f2f48e
to
2fce29b
Compare
Good catch, that should be an I'm fine removing the batch_size param, but I added it so you can use a custom batch size if necessary. I think it's important to leave the check in there, because telling people this function won't work for cuml 23.04 or .06 isn't great. I just left it as it was, and yes, |
Apologies for the late reply. I am wondering whether it would not be better to skip over handling the |
I am currently experiencing Cuda out_of_memory (OOM) errors on my (admittedly large) dataset when |
@HeadCase what version of cuML are you using? I think version 23.08 uses a batch size of 4096 which likely will fix your problem. You could also take a look at my first commit, which does include a batch_size parameter, if 4096 is too big for you |
@MaartenGr sorry this dropped off my radar. I get it causing future problems, but rather than removing it I can add in version tests. I think it's 23.04 and 23.06 that need the batch_size patch eg |
Seeing as the newest version of cuML resolves the issue. Would it not be straightforward to just mention that users will need to use the latest version? Especially since it was fixed in their latest release. |
For me, it's a lot easier to update the BERTopic version in my environment than it is to update cuML. I've had major dependency conflicts when trying to install cuML with some of my other packages. My point being, other people might not be able to update cuML in their environments just to get a feature that technically does work. But I get that you don't want the code cluttered. Your call. We could just leave the commit as is, and not merge it, and people can patch their copy of BERTopic if needed... |
Hmmm, in that case, it might indeed be better to check for those specific versions and only implement the fix for these versions. I think that would be more stable and indeed still allows users to keep using the current version. As long as it does not affect newer versions, that should suffice. |
cuml version 23.04 now has membership_vector, so this allows topic_model.transform() to calculate the probability matrix if using a cuml-based hdbscan model
As I was preparing to submit this PR, I saw #1317 and that some of you have already figured this out 😄
I had included batch_size as a new parameter to hdbscan_delegator just in case someone needed to pass in a batch_size somehow