Features/1674 optimization of k means initialization #1754

Open
wants to merge 3 commits into
base: main


Hakdag97 (Collaborator) commented Dec 17, 2024

Description

The runtime bottleneck of k-means clustering is the initialization of the centroids, which was previously based on a costly serial algorithm. This pull request replaces that algorithm with the more sophisticated k-means|| initialization of centroids.
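For readers unfamiliar with k-means||, the idea can be illustrated with a minimal single-process NumPy sketch. This is not the code from this pull request: the names `squared_distances` and `kmeans_parallel_init`, the parameters `oversampling` and `rounds`, and their defaults (2k and 5) are assumptions made for illustration only. The sketch oversamples candidates over a few rounds, weights them by how many points they attract, and then reduces them to k centers with weighted k-means++ seeding.

```python
import numpy as np


def squared_distances(points, centers):
    """Return the (n_points, n_centers) matrix of squared Euclidean distances.

    Builds a dense intermediate array, so this is only suitable for small inputs.
    """
    diff = points[:, None, :] - centers[None, :, :]
    return np.einsum("ijk,ijk->ij", diff, diff)


def kmeans_parallel_init(points, k, oversampling=None, rounds=5, seed=0):
    """Hypothetical k-means||-style initialization (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    # Assumed default: oversampling factor of 2k candidates per round.
    ell = 2 * k if oversampling is None else oversampling

    # Step 1: start from a single point chosen uniformly at random.
    candidates = points[rng.integers(n)][None, :]

    # Step 2: in each round, sample every point independently with probability
    # proportional to its squared distance to the current candidates.
    for _ in range(rounds):
        d2 = squared_distances(points, candidates).min(axis=1)
        cost = d2.sum()
        if cost == 0.0:
            break  # every point already coincides with a candidate
        picked = rng.random(n) < np.minimum(1.0, ell * d2 / cost)
        candidates = np.vstack([candidates, points[picked]])

    # Step 3: weight each candidate by the number of points closest to it.
    nearest = squared_distances(points, candidates).argmin(axis=1)
    weights = np.bincount(nearest, minlength=len(candidates)).astype(float)

    # Step 4: reduce the candidates to k centers with weighted k-means++ seeding.
    first = rng.choice(len(candidates), p=weights / weights.sum())
    centers = candidates[first][None, :]
    while len(centers) < k:
        d2 = squared_distances(candidates, centers).min(axis=1)
        score = weights * d2
        if score.sum() == 0.0:
            # Fewer than k usable candidates: fill up with random data points.
            extra = points[rng.choice(n, size=k - len(centers), replace=False)]
            centers = np.vstack([centers, extra])
            break
        nxt = rng.choice(len(candidates), p=score / score.sum())
        centers = np.vstack([centers, candidates[nxt]])
    return centers
```

In a distributed setting, the per-round sampling in step 2 is the part that can run independently on each process, which is the reason this scheme tends to parallelize better than a strictly sequential seeding.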

Issue/s resolved:

Changes proposed:

  • Appropriate handling of an additional edge case in the function nonzero
  • Completely new implementation of the centroid initialization used for k-means, k-medians, and k-medoids
  • Adjustment of classes (such as KMeans) to match the new implementation

Type of change

  • Bug fix
  • New feature

Performance

  • Reduces the initialization runtime of the clustering algorithms by at least an order of magnitude, in both distributed and non-distributed mode and for both split=None and split not None; the exact speedup depends on the setting, e.g., the size of the data and the chosen parameters

Does this change modify the behaviour of other functions? If so, which?

  • Yes: the classes KMeans, KMedoids, and KMedians and the function where are affected

Contributor

Thank you for the PR!
