Feature/hdbscan change #799

Jacks0nJ · 2025-01-16T13:40:58Z

Changed HDBSCAN to use scikit-learn's version.

Previously the HDBSCAN clustering algorithm was provided by a separate package (simply called HDBSCAN), which is now mostly redundant because the algorithm has been incorporated into scikit-learn. With this change, fewer packages are needed, reducing the chance of dependency conflicts. For example, the original HDBSCAN package has issues with the installed version of numpy and with compiling its own Cython code.

As scikit-learn's HDBSCAN uses the same input format as the other algorithms in the clustering module, some of the unit tests have been changed. The most significant change is that cluster_persistence is no longer available.

Both manual and unit tests have shown minimal changes to the clusters found. Only one unit test needed to change, because the cluster labels had been renumbered while the clusters themselves were the same.
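A test can be made insensitive to this kind of relabelling by comparing the groupings rather than the label values. The following is only a sketch: the helper names `as_partition` and `same_clusters` are hypothetical and are not part of this PR. It assumes the usual convention that the label -1 marks noise.

```python
def as_partition(labels, noise_label=-1):
    """Return (set of clusters as frozensets of indices, frozenset of noise indices)."""
    groups = {}
    noise = set()
    for index, label in enumerate(labels):
        if label == noise_label:
            noise.add(index)
        else:
            groups.setdefault(label, set()).add(index)
    clusters = {frozenset(members) for members in groups.values()}
    return clusters, frozenset(noise)


def same_clusters(labels_a, labels_b, noise_label=-1):
    """True when both labellings group the points identically, ignoring label numbering."""
    return as_partition(labels_a, noise_label) == as_partition(labels_b, noise_label)


# Same grouping, labels 0 and 1 swapped.
relabelled = same_clusters([0, 0, 1, 1, -1], [1, 1, 0, 0, -1])

# Different grouping of the same points.
regrouped = same_clusters([0, 0, 1, 1, -1], [0, 1, 1, 0, -1])
```

Noise points are compared separately so that a point moving from noise into a cluster of its own is still reported as a difference.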

Finally, the condensed_tree_.plot is no longer available. Therefore the notebook How_Clustering_Works.ipynb was changed to use static .png images to explain the HDBSCAN algorithm. Some minor typos were also fixed.

CKrawczyk

All changes look good, feel free to merge when you are ready. A note about the removal of cluster persistence will need to be included in the release notes.

Further testing needs to be done to make sure the docker containers (now using a newer Python) are working as expected. These tests can be done on the update-dependencies branch. We should also check whether Python versions 3.12 and 3.13 work yet. If they do, it might be worth bumping the docker containers to 3.13, as that will give the longest time before they need to be updated again.

The release notes will also need to address the new supported Python versions.

“JoeJ” added 9 commits January 6, 2025 17:01

changed to scikit hdbscan

7944d29

Removed by serach and replace cluster persistance

9c7218f

Removed p as a keyword argument for HDBSCAN

78ebb2d

changed algorithm from best to auto in default for hdbscan. Other algorithm names have also changed

8cf38db

changed order of unit test to match changed order of data in scikit-learn hdbscan

d775791

additional metric parameters added as dictionary instead of individually

7d52570

Updated How Clustering Works Notebook

778943d

Updated how_clustering_works notebook

21c16ab

Fixed typos in clustering notebook

f50a4f3

Jacks0nJ requested a review from CKrawczyk January 16, 2025 13:40

“JoeJ” added 2 commits January 16, 2025 14:42

Changed dependancy to be python 3.10 and 3.11

0a2e69b

Updated documentation to sklean HDBSCAN

45c2e39

CKrawczyk approved these changes Jan 27, 2025

Jacks0nJ merged commit 9bd444e into feature/update-dependencies Jan 27, 2025
4 checks passed
