Batch-parallel K-means and K-medians #1288

Merged
mrfh92 merged 39 commits into main from features/1287-Batch-parallel_K-means_and_K-medians on Apr 12, 2024

Collaborator

@mrfh92 mrfh92 commented Nov 30, 2023

Due Diligence

  • General:
    • base branch must be main for new features, latest release branch (e.g. release/1.3.x) for bug fixes
    • title of the PR is suitable to appear in the Release Notes
  • Implementation:
    • unit tests: all split configurations tested (here: only split=0 allowed)
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

This PR implements a variant of K-Means that takes advantage of Heat's data distribution. The algorithm is described in the paper https://doi.org/10.1016/j.cie.2020.107023. In essence, the first step performs K-Means on each process-local chunk of data separately and in parallel. In the second step, the resulting cluster centers are "merged": the set of all process-local cluster centers is collected, and K-Means is applied once more to determine the final "global" cluster centers as the cluster centers of the set of "local" cluster centers.
This idea generalizes to K-Medians in a straightforward way. To keep the approach scalable even for very high numbers of processes, the "merging" can also be done in a hierarchical manner.

Caveat: This does not necessarily yield the same results as the classical K-Means.
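
The two-step idea described above can be illustrated with a minimal plain-Python sketch. This is not the code added in this PR and does not use Heat: the helper names `kmeans` and `batch_parallel_kmeans`, the toy data, and the sequential loop that stands in for the MPI processes are all assumptions made for illustration only.

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iteration on a list of equal-length tuples; returns k centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise with k of the input points
    for _ in range(n_iter):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def batch_parallel_kmeans(chunks, k):
    """Step 1: cluster each chunk on its own; step 2: cluster the local centers."""
    local_centers = []
    # Hypothetical stand-in for parallelism: in a distributed setting each
    # process would handle its own chunk at the same time.
    for rank, chunk in enumerate(chunks):
        local_centers.extend(kmeans(chunk, k, seed=rank))
    # "Merging": K-means on the collected process-local centers.
    return kmeans(local_centers, k)


# Toy data: two well-separated blobs, split into 4 "process-local" chunks.
rng = random.Random(42)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)]
rng.shuffle(data)
chunks = [data[i::4] for i in range(4)]

global_centers = sorted(batch_parallel_kmeans(chunks, k=2))
```

On data this cleanly separated, the merged centers should land close to the two blob means. On less benign data, the merged result can differ from a single K-means run over the full data set, which is the caveat stated above.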

Issue/s resolved: #1287

Changes proposed:

  • New classes heat.cluster.BatchParallelKMeans and heat.cluster.BatchParallelKMedians

Type of change

new feature

Memory requirements

TBD

Performance

Example with ~130 GB of data on up to 8 GPU nodes of the Terrabyte cluster. The left plot shows standard K-Means (4 clusters); the right plot shows BatchParallelKMeans (4 and 40 clusters).

Does this change modify the behaviour of other functions? If so, which?

no

@mrfh92 mrfh92 self-assigned this Dec 1, 2023
@mrfh92 mrfh92 added cluster enhancement New feature or request labels Dec 1, 2023
Contributor

github-actions bot commented Dec 1, 2023

Thank you for the PR!

codecov bot commented Dec 1, 2023

Codecov Report

Attention: Patch coverage is 99.54955%, with 1 line in your changes missing coverage. Please review.

Project coverage is 92.06%. Comparing base (850dcfb) to head (6e9e9ec).

Files                        Patch %   Lines
heat/cluster/_kcluster.py    94.73%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1288      +/-   ##
==========================================
+ Coverage   91.92%   92.06%   +0.14%     
==========================================
  Files          79       80       +1     
  Lines       11721    11941     +220     
==========================================
+ Hits        10774    10993     +219     
- Misses        947      948       +1     
Flag Coverage Δ
unit 92.06% <99.54%> (+0.14%) ⬆️


@mrfh92 mrfh92 added the high-level functions High-level machine-learning algorithms label Dec 1, 2023
Hoppe added 2 commits December 4, 2023 14:02
… for storing the value of the K-Clustering functional after predict
@ClaudiaComito ClaudiaComito added this to the 1.4.0 milestone Dec 6, 2023
… is now also possible in a hierarchical manner that scales better to high numbers of processes
@mrfh92 mrfh92 marked this pull request as ready for review December 8, 2023 08:25
Contributor

@ClaudiaComito ClaudiaComito left a comment

@mrfh92 this is great.

Would it be possible or advisable to make batch-parallel clustering available from within a general ht.cluster.KMeans or ht.cluster.KMedians call? I'm thinking something like:

kmeans = ht.cluster.KMeans(n_clusters=k, parallel="batch")

where parallel can be "batch" or "standard".

@mrfh92 mrfh92 added the PR talk label Mar 4, 2024
@mrfh92 mrfh92 removed the PR talk label Mar 25, 2024
Contributor

@ClaudiaComito ClaudiaComito left a comment

Looks great to me, @mrfh92. I only have a few edits to the documentation; otherwise we can merge.

heat/cluster/batchparallelclustering.py: 3 review threads (outdated, resolved)
Collaborator Author

mrfh92 commented Apr 11, 2024

@ClaudiaComito I have committed all suggested changes.

Contributor

@ClaudiaComito ClaudiaComito left a comment

👏 @mrfh92 !

@mrfh92 mrfh92 merged commit ef44340 into main Apr 12, 2024
56 checks passed
@mrfh92 mrfh92 deleted the features/1287-Batch-parallel_K-means_and_K-medians branch April 12, 2024 12:34
Labels
cluster enhancement New feature or request high-level functions High-level machine-learning algorithms merge queue
2 participants