[NO MERGE] Putting data on host for HDBSCAN using mst optimize #6044

Draft
wants to merge 1 commit into
base: branch-24.10

Contributor

@jinsolp jinsolp commented Aug 23, 2024

This PR is kept for future reference: it puts the data on host for HDBSCAN so that it scales to large datasets. No review needed.

reachability.cuh currently uses an optimize() function from cuvs to ensure that the knn graph is connected. Note that cuML does not support building with cuvs yet, so the related functions are copy-pasted into the mst_opt.cuh file.
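
For context on why connectedness matters: a knn graph built over well-separated groups can split into several components, and a spanning tree cannot be built across them. The sketch below is plain Python, not the cuvs optimize() code; `count_components` and the list-of-neighbor-lists layout are hypothetical and only illustrate how a disconnected knn graph can be detected.

```python
def count_components(knn_indices):
    """Count connected components of an undirected knn graph.

    knn_indices[i] lists the neighbor ids of point i (hypothetical layout).
    """
    n = len(knn_indices)
    parent = list(range(n))

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for i, neighbors in enumerate(knn_indices):
        for j in neighbors:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Merging two distinct components reduces the count by one.
                parent[ri] = rj
                components -= 1
    return components


# Two tight pairs far apart: with k=1 each point only links to its partner.
disconnected = [[1], [0], [3], [2]]
# One extra cross edge (the kind a connecting step would add) joins them.
connected = [[1], [0, 2], [3], [2]]
print(count_components(disconnected), count_components(connected))
```

A result greater than 1 means the graph needs extra edges before a single spanning tree can cover all points.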

The implementation supports NN Descent (NND) batching and keeping the data on host. It can be run like this:

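from cuml.cluster import HDBSCAN

# data_on_host and the nnd_* build_kwds are the options from this PR's branch
hdbscan_nnd = HDBSCAN(min_samples=16, build_algo="nn_descent", build_kwds={"nnd_return_distances": True, "nnd_n_clusters": 4})
labels = hdbscan_nnd.fit(data, data_on_host=True).labels_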

@jinsolp jinsolp requested review from a team as code owners August 23, 2024 20:40
@jinsolp jinsolp marked this pull request as draft August 23, 2024 20:41
@github-actions github-actions bot added Cython / Python Cython or Python issue CUDA/C++ labels Aug 23, 2024
Labels
CUDA/C++ Cython / Python Cython or Python issue