Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Enable HDBSCAN to build knn graph using NN Descent #5939

Open
wants to merge 35 commits into
base: branch-24.10
Choose a base branch
from

Conversation

jinsolp
Copy link
Contributor

@jinsolp jinsolp commented Jun 17, 2024

Description

Adds build_algo option to HDBSCAN, and allow HDBSCAN to build knn graphs using nn descent
Now user can choose the knn graph build algorithm between "brute_force_knn" and "nn_descent"
Defaults to "auto", in which case decides to run with brute force knn or nn descent depending on the given dataset size.
"auto" decides to run with brute_force_knn if data has <= 50K rows. Otherwise decides to run with nn_descent.

agg_nnd = HDBSCAN(min_samples=16, build_algo="nn_descent")
agg_nnd.fit(data)

Running Benchmarks

Screenshot 2024-08-22 at 3 16 55 PM

@jinsolp jinsolp changed the title Enable HDBSCAN to build knn graph using NN Descent [WIP] Enable HDBSCAN to build knn graph using NN Descent Jun 25, 2024
rapids-bot bot pushed a commit to rapidsai/raft that referenced this pull request Jul 11, 2024
Adding distance epilogue to NN Descent.
Planning to use them for the following use cases as of now;
- Calculating mutual reachability distance for HDBSCAN (possibly for usage with HDBSCAN - [related PR here](rapidsai/cuml#5939))
- Enabling `L2SqrtExpanded` distance metric, by `sqrt`-ing the current supported metric (`L2Expanded`) of NN Descent in the distance epilogue. (for usage with UMAP - [related PR here](rapidsai/cuml#5910))

Authors:
  - Jinsol Park (https://github.com/jinsolp)

Approvers:
  - Tarang Jain (https://github.com/tarang-jain)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Divye Gala (https://github.com/divyegala)

URL: #2364
@jinsolp jinsolp changed the title [WIP] Enable HDBSCAN to build knn graph using NN Descent [FEA] Enable HDBSCAN to build knn graph using NN Descent Jul 12, 2024
@jinsolp jinsolp changed the title [FEA] Enable HDBSCAN to build knn graph using NN Descent [WIP] Enable HDBSCAN to build knn graph using NN Descent Jul 24, 2024
@jinsolp jinsolp removed request for a team and bdice August 20, 2024 17:09
@jinsolp jinsolp marked this pull request as draft August 20, 2024 20:04
@github-actions github-actions bot added the CMake label Aug 21, 2024
@jinsolp jinsolp added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 21, 2024
@jinsolp
Copy link
Contributor Author

jinsolp commented Aug 22, 2024

algo.cuh and get_raft.cmake will be changed after this PR in cuML and this PR in raft is merged

@jinsolp jinsolp marked this pull request as ready for review August 22, 2024 22:21
@jinsolp jinsolp changed the title [WIP] Enable HDBSCAN to build knn graph using NN Descent [FEA] Enable HDBSCAN to build knn graph using NN Descent Aug 22, 2024
@github-actions github-actions bot removed the CMake label Aug 23, 2024
@jinsolp jinsolp self-assigned this Aug 23, 2024
Copy link
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looked the Python side, LGTM. Congrats Jinsol!

Comment on lines +859 to +887
if self.build_algo == "auto":
if self.n_rows <= 50000:
# brute force is faster for small datasets
logger.warn("Building knn graph using brute force")
self.build_algo = "brute_force_knn"
else:
logger.warn("Building knn graph using nn descent")
self.build_algo = "nn_descent"

if self.build_algo == 'brute_force_knn':
params.build_algo = GRAPH_BUILD_ALGO.BRUTE_FORCE_KNN
elif self.build_algo == 'nn_descent':
params.build_algo = GRAPH_BUILD_ALGO.NN_DESCENT
if self.build_kwds is None:
params.nn_descent_params.graph_degree = <size_t> 64
params.nn_descent_params.intermediate_graph_degree = <size_t> 128
params.nn_descent_params.max_iterations = <size_t> 20
params.nn_descent_params.termination_threshold = <float> 0.0001
params.nn_descent_params.return_distances = <bool> True
else:
params.nn_descent_params.graph_degree = <size_t> self.build_kwds.get("nnd_graph_degree", 64)
params.nn_descent_params.intermediate_graph_degree = <size_t> self.build_kwds.get("nnd_intermediate_graph_degree", 128)
params.nn_descent_params.max_iterations = <size_t> self.build_kwds.get("nnd_max_iterations", 20)
params.nn_descent_params.termination_threshold = <float> self.build_kwds.get("nnd_termination_threshold", 0.0001)
params.nn_descent_params.return_distances = <bool> self.build_kwds.get("nnd_return_distances", True)
else:
raise ValueError("Build algo not supported. "
"Must one of {'brute_force_knn', 'nn_descent'}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to create a function to avoid repeating this section of code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree w/ @viclafargue here.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry Jinsol! I reviewed this awhile back and didn't submit. I think I was still working my way through all of your changes but let me submit this to start.

}
};

template <typename T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create an issue in raft to eventually move these functions over to raft::matrix? They are certainly useful. Please reference the issue in a comment here.

cpp/src/hdbscan/detail/reachability.cuh Outdated Show resolved Hide resolved

DI value_t operator()(value_t value, value_idx row, value_idx col) const
{
return max(core_dists[col], max(core_dists[row], powf(fabsf(alpha * value), 0.5)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqrt is pretty expensive, which is why we like to hold off on doing until the end (and only if a user needs it). The thing about the sqrt is that it since it affects all the distances and since it doesn't change the relative ordering of the distances, we can usually avoid having to do it in place. Just a thougt- you might be able to save some precious compute cycles here.

cpp/src/hdbscan/detail/reachability.cuh Outdated Show resolved Hide resolved
"n_neighbors should be smaller than the graph degree computed by nn descent");

auto epilogue = ReachabilityPostProcessSqrt<value_idx, value_t>(core_dists, alpha);
build_params.return_distances = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost wonder if we should add this to all the nn algos eventually. Scikit-learn has this exact option, for example, which applies to all the nn estimators. Can you create a cuVS issue for this? You don't have to reference it in the code here.

python/cuml/cuml/cluster/hdbscan/hdbscan.pyx Show resolved Hide resolved
cdef index_params build_params

if self.build_algo == "auto":
if self.n_rows <= 50000:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this in multiple places is a good indicator that we should create a new function for it so we can consolidate the similarities (and differences).

Copy link

copy-pr-bot bot commented Sep 21, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet
Copy link
Member

cjnolet commented Sep 21, 2024

/ok to test

@dantegd
Copy link
Member

dantegd commented Sep 23, 2024

/ok to test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants