Feature/hdbscan change #799

Merged: 11 commits, Jan 27, 2025

Conversation

Jacks0nJ
Collaborator

Changed HDBSCAN to use scikit-learn's version.

Previously, the HDBSCAN clustering algorithm was provided by a separate package (simply called HDBSCAN), which is now mostly redundant because the algorithm has been incorporated into scikit-learn. This change means fewer packages are needed, reducing the chance of dependency conflicts. For example, the original HDBSCAN package has issues with the installed version of numpy and with compiling its own Cython code.

As scikit-learn's HDBSCAN uses the same input format as the other algorithms in the clustering module, some of the unit tests have been changed. The most significant change is that cluster persistence is no longer available.

Both manual and unit tests have shown minimal changes to the clusters found. Only one unit test needed to change: the same clusters were found, but their labels were assigned differently.
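When only the label numbering differs, a test can compare the groupings rather than the raw label values. The helper below is a hypothetical sketch (the name `same_partition` and the example labels are not from this PR): it checks that two label lists describe the same clusters up to renumbering, while requiring the noise label `-1` to stay noise.

```python
def same_partition(labels_a, labels_b):
    """Return True if both label lists group the points identically,
    ignoring how the clusters are numbered. Noise (-1) must match noise."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        # A noise point in one labelling must be noise in the other.
        if (a == -1) != (b == -1):
            return False
        # Each label in A must map to exactly one label in B, and vice versa.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


# Illustrative labels: the same clusters with swapped numbering.
old_labels = [0, 0, 1, 1, -1]
new_labels = [1, 1, 0, 0, -1]
relabelled_only = same_partition(old_labels, new_labels)

# A genuinely different grouping is rejected.
different = same_partition([0, 0, 1, 1], [0, 1, 1, 1])
```

A check like this keeps tests stable across library changes that only reorder cluster labels.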

Finally, the condensed_tree_.plot is no longer available. Therefore the notebook How_Clustering_Works.ipynb was changed to use static .png images to explain the HDBSCAN algorithm. Some minor typos were also fixed.

@Jacks0nJ Jacks0nJ requested a review from CKrawczyk January 16, 2025 13:40
Collaborator

@CKrawczyk CKrawczyk left a comment

All changes look good, feel free to merge when you are ready. A note about the removal of cluster persistence will need to be included in the release notes.

Further testing needs to be done to make sure the docker containers (now using a newer Python) are working as expected. These tests can be done on the update-dependencies branch. We should also check whether Python versions 3.12 and 3.13 work yet. If they do, it might be worth bumping the docker containers to 3.13, as that will give the longest time before they need to be updated again.

The release notes will also need to address the new supported Python versions.

@Jacks0nJ Jacks0nJ merged commit 9bd444e into feature/update-dependencies Jan 27, 2025
4 checks passed