Clustering freezing when assigning noise points: #1190
Comments
I also get this on a Mac M3. I tried clustering the same data on CPU on Linux, but there it failed differently, at a later step.
Same for me. It's stuck at "noise points (42.3%) will be assigned to nearest cluster." Also, bge m3 can't be used as the clustering model: Lilac defaults to a weak model trained only on English, regardless of my preferred embedding model, and instead of reusing the embeddings already created for the same dataset field, it creates new ones. I hope both get fixed.
We just added an option that, when set to True, skips assigning noisy points to the nearest cluster to speed up clustering. This will be available in the next release (in a couple of days). For much faster (100x) clustering, please apply for our Lilac Garden pilot.
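For readers unfamiliar with the step being discussed: density-based clusterers typically label outliers as noise (label -1), and a post-processing pass can then move each noise point into its nearest cluster. The following is a minimal sketch of that idea, not Lilac's actual implementation. The function name, the `skip_noise_assignment` flag name, and the nearest-centroid rule are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

NOISE = -1  # label commonly used for noise points by density-based clusterers


def assign_noise_to_nearest(
    points: Sequence[Sequence[float]],
    labels: Sequence[int],
    skip_noise_assignment: bool = False,  # hypothetical flag name, for illustration only
) -> List[int]:
    """Return a copy of labels with each noise point moved to its nearest cluster centroid."""
    result = list(labels)
    if skip_noise_assignment:
        # Leave noise points labeled as noise and avoid the extra pass entirely.
        return result

    # Accumulate per-cluster coordinate sums so centroids can be computed.
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for point, label in zip(points, labels):
        if label == NOISE:
            continue
        if label not in sums:
            sums[label] = [0.0] * len(point)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], point)]
        counts[label] += 1

    if not sums:
        # Every point is noise, so there is no cluster to assign to.
        return result

    centroids = {c: [s / counts[c] for s in total] for c, total in sums.items()}

    # Assign each noise point to the cluster whose centroid is closest (Euclidean).
    for i, (point, label) in enumerate(zip(points, labels)):
        if label == NOISE:
            result[i] = min(centroids, key=lambda c: math.dist(point, centroids[c]))
    return result
```

The cost of this pass grows with the number of noise points times the number of clusters, so a dataset with a large noise fraction makes it proportionally more expensive. That is the motivation for being able to skip it.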
Will I be able to use bge m3 in clustering?
I'm attempting to cluster ~600k short texts (reviews). The process goes ok up until it logs that it's assigning noise points to clusters. It spends close to an hour embedding and clustering.
I successfully clustered a much smaller sample of this data (1000 items).
I'm running Lilac in Docker, using the latest tag, which currently appears to be lilacai/lilac:0.3.5. I'm running it on a system with a GeForce 3090. Here are the logs:
After this point it freezes, and the UI and server become very slow. The main thread appears to be very busy.
Looking at the code, it seems like it's skipping assigning labels or something?
https://github.com/lilacai/lilac/blob/8e7418d533e6fba64ef1854e4112e66c035321bf/lilac/data/clustering.py#L386
The num_noisy count here seems to be quite high, and so I assume this condition isn't true, so it's actually skipping assigning labels. My read-through of the code here falls apart, and I'm unsure where the process is spending its time.
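One generic way to find out where a busy Python process is spending its time is to dump the current stack of every thread. The sketch below uses only the standard library. The helper name `dump_thread_stacks` is hypothetical, and it assumes you can run code inside the stuck process (for example from a debug hook or a background thread started before clustering), which may not be straightforward in the Docker setup described above.

```python
import sys
import threading
import traceback


def dump_thread_stacks() -> str:
    """Return a formatted stack trace for every live thread in this process."""
    # Map thread ids to readable names where the threading module knows them.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(f"Thread {names.get(thread_id, '?')} ({thread_id}):")
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)


if __name__ == "__main__":
    # Print the stacks once; calling this repeatedly shows whether the
    # main thread stays in the same function over time.
    print(dump_thread_stacks())
```

If the same frame shows up at the top of the main thread's stack across several dumps, that is a strong hint about which step is slow. The standard library's `faulthandler.dump_traceback_later(timeout, repeat=True)` gives similar output on a timer without needing a helper.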