FastThresholdClustering

中文文档

Introduction

FastThresholdClustering is an efficient vector clustering algorithm based on FAISS, particularly suitable for large-scale vector data clustering tasks. The algorithm uses cosine similarity as the distance metric and supports GPU acceleration.

Key Features

GPU acceleration support
Automatic parameter optimization
Memory usage optimization
Performance monitoring and logging
Batch processing for large-scale data
Noise point detection

Quick Start

from fast_clustering import fast_cluster_embeddings

# Use the convenience function for clustering
labels = fast_cluster_embeddings(
    embeddings,
    similarity_threshold=0.8,
    min_samples=5,
    use_gpu=True
)

FastThresholdClustering Parameter Details

Core Parameters

similarity_threshold

Property	Description
Type	float
Default	0.8
Range	[0, 1]
Function	Similarity threshold for determining if two vectors belong to the same cluster

Detailed Description:

Higher values lead to stricter clustering and more clusters
Lower values result in looser clustering and fewer clusters
Recommended values:
- 0.7-0.8: Suitable for general text vectors
- 0.8-0.9: Suitable for high-precision matching scenarios
- 0.9: Extremely strict matching requirements
Performance impact: Higher threshold leads to faster computation

min_samples

Property	Description
Type	int
Default	5
Range	>= 2
Function	Minimum number of samples required to form a valid cluster

Detailed Description:

Clusters with fewer samples are marked as noise points (label -1)
Parameter setting recommendations:
- Small datasets (<1000): 2-5
- Medium datasets (1000-10000): 5-10
- Large datasets (>10000): 10-20
Key parameter affecting noise point determination
Higher values result in more noise points and higher cluster quality

Performance Parameters

use_gpu

Property	Description
Type	bool
Default	True
Function	Whether to use GPU acceleration

Detailed Description:

True: Use GPU acceleration
False: Use CPU computation
Performance impact:
- GPU mode: Suitable for large-scale data (>100k entries)
- CPU mode: Suitable for small-scale data (<100k entries)
Memory usage:
- GPU mode limited by VRAM
- CPU mode limited by RAM
Recommendation: Prefer GPU mode when available

nprobe

Property	Description
Type	int
Default	8
Range	[1, nlist]
Function	Number of cluster units to visit during FAISS index search

Detailed Description:

Balances search accuracy and speed
Recommended values:
- Small datasets (<10k): 4-8
- Medium datasets (10k-100k): 8-16
- Large datasets (>100k): 16-32
Higher values:
- Pros: More accurate search results
- Cons: Slower search speed
Automatically adjusted based on data scale

batch_size

Property	Description
Type	int
Default	1000
Range	[100, dataset size]
Function	Batch size affecting memory usage and computation speed

Detailed Description:

Recommended values:
- GPU mode: 500-2000
- CPU mode: 200-1000
Memory impact:
- Larger values use more memory
- Smaller values increase computation time
Auto-adjustment:
- Small datasets: Smaller batch_size
- Large datasets: Larger batch_size
Adjust based on available memory

n_workers

Property	Description
Type	int
Default	None
Range	[1, CPU cores]
Function	Number of parallel processing worker threads

Detailed Description:

When None, automatically set to min(CPU cores, 8)
Recommended values:
- Small datasets: 2-4 threads
- Medium datasets: 4-8 threads
- Large datasets: 8-16 threads
Considerations:
- Too many threads may cause resource contention
- Consider resource needs of other system processes
- Less impact in GPU mode

Parameter Combination Recommendations

Small Dataset Optimization (<10k samples)

FastThresholdClustering(
    similarity_threshold=0.75,
    min_samples=3,
    use_gpu=False,
    nprobe=4,
    batch_size=500,
    n_workers=4
)

Large Dataset Optimization (>100k samples)

FastThresholdClustering(
    similarity_threshold=0.85,
    min_samples=10,
    use_gpu=True,
    nprobe=32,
    batch_size=2000,
    n_workers=8
)

High Precision Scenarios

FastThresholdClustering(
    similarity_threshold=0.9,
    min_samples=5,
    use_gpu=True,
    nprobe=64,
    batch_size=1000,
    n_workers=8
)

Main Methods

def fit(self, embeddings: np.ndarray) -> FastThresholdClustering:
    """
    Perform clustering on input vectors
    
    Parameters:
        embeddings: numpy array with shape (n_samples, n_features)
        
    Returns:
        self: Returns the clustering instance
    """

Convenience Function

def fast_cluster_embeddings(
    embeddings: np.ndarray,
    similarity_threshold: float = 0.8,
    min_samples: int = 5,
    use_gpu: bool = True,
    nprobe: int = None,
    batch_size: int = None,
    n_workers: int = None
) -> np.ndarray:
    """
    Quick clustering interface function
    
    Parameters:
        embeddings: Input vectors with shape (n_samples, n_features)
        similarity_threshold: Clustering similarity threshold
        min_samples: Minimum sample count
        use_gpu: Whether to use GPU
        nprobe: FAISS index nprobe parameter (optional)
        batch_size: Batch size (optional)
        n_workers: Number of worker threads (optional)
    
    Returns:
        labels: Clustering label array with shape (n_samples,)
    """

Return Value Description

Clustering results stored in labels_ attribute
Label -1 indicates noise points
Other labels numbered consecutively from 0

Performance Monitoring

The algorithm includes a built-in performance monitoring system that automatically records:

Time spent in each phase
Memory usage
Clustering progress
Final clustering statistics

Usage Example

import numpy as np
from fast_clustering import FastThresholdClustering

# Prepare data
embeddings = np.random.random((10000, 768))

# Create clusterer
clusterer = FastThresholdClustering(
    similarity_threshold=0.8,
    min_samples=5,
    use_gpu=True
)

# Perform clustering
clusterer.fit(embeddings)

# Get clustering results
labels = clusterer.labels_

See example.py for detailed usage examples

Notes

Input vectors are automatically L2 normalized
GPU acceleration recommended for large-scale datasets
Parameters are automatically optimized based on data scale
Memory usage increases with data scale

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

FastThresholdClustering

中文文档

Introduction

Key Features

Quick Start

FastThresholdClustering Parameter Details

Core Parameters

similarity_threshold

min_samples

Performance Parameters

use_gpu

nprobe

batch_size

n_workers

Parameter Combination Recommendations

Small Dataset Optimization (<10k samples)

Large Dataset Optimization (>100k samples)

High Precision Scenarios

Main Methods

Convenience Function

Return Value Description

Performance Monitoring

Usage Example

See example.py for detailed usage examples

Notes

Files

README.md

Latest commit

History

README.md

File metadata and controls

FastThresholdClustering

中文文档

Introduction

Key Features

Quick Start

FastThresholdClustering Parameter Details

Core Parameters

similarity_threshold

min_samples

Performance Parameters

use_gpu

nprobe

batch_size

n_workers

Parameter Combination Recommendations

Small Dataset Optimization (<10k samples)

Large Dataset Optimization (>100k samples)

High Precision Scenarios

Main Methods

Convenience Function

Return Value Description

Performance Monitoring

Usage Example

See example.py for detailed usage examples

Notes