CUDA-Hungarian-Clustering

A GPU-Accelerated Clustering Algorithm that uses the Hungarian method

Written in CUDA and C++

Introduction:

  • An (almost) parameterless clustering algorithm
  • Input is a single CSV file; the output is a file named 'output.csv' containing the full original data plus an extra 'label' column that specifies the cluster/group each row belongs to (see the sketch after this list)
  • Does not need any prior knowledge of the number of clusters/groups present in the dataset
  • Produces results similar to those obtained from Spectral Clustering, but without requiring the number of clusters as a parameter
  • Combines the work of two research papers
  • Mainly used to find the number of groups in the dataset, each group being a set of 'similar' rows, comparable to DBSCAN
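
Once 'output.csv' has been produced, the cluster assignments can be inspected with a few lines of Python. This is a minimal sketch, not part of the repository; it assumes pandas is installed and that the label column is named 'label' as described above.

# Minimal sketch: inspect the clusters written to output.csv.
# Assumes pandas is installed and that output.csv has a 'label' column.
import pandas as pd

df = pd.read_csv("output.csv")

# Number of distinct clusters the algorithm found
print("clusters found:", df["label"].nunique())

# How many rows were assigned to each cluster
print(df["label"].value_counts().sort_index())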

Execution instructions:

nvcc Clustering.cu

./a.out [INPUT_FILE] [PARAMETER] [Number of Columns from right to skip/ignore] [Number of Rows from top to skip/ignore]

python3 plot_output.py 

INPUT FILE - Any file (.xlsx, .csv) that can be opened in spreadsheet software such as LibreOffice Calc or MS Excel.

PARAMETER - Integer value, typically in the range [0,8] for most inputs; it must be manually tuned, and 7 works for most datasets. It is independent of the real number of clusters in the dataset.

The other two command-line arguments filter out the label column(s) and the column-header row(s), respectively, before the raw data is passed to the model; the sketch below previews their effect.
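
To see exactly what those two arguments remove, the trimming can be previewed in Python before the clusterer is run. This is purely an illustrative sketch, not part of the tool; it assumes pandas, and 'input.csv' is a hypothetical file name.

# Illustrative sketch (not part of the tool): preview the data the CUDA
# program will see after skipping rows from the top and columns from the
# right. Assumes pandas; 'input.csv' is a hypothetical file name.
import pandas as pd

ROWS_TO_SKIP = 1     # e.g. one column-header row at the top
COLS_TO_SKIP = 1     # e.g. one label column on the right

raw = pd.read_csv("input.csv", header=None, skiprows=ROWS_TO_SKIP)
trimmed = raw.iloc[:, : raw.shape[1] - COLS_TO_SKIP]

print(trimmed.head())   # the purely numeric data the clusterer works on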

Constraints:

  • The input file should only have numeric columns (float/integer)
  • The input file should not have any NaN or null values - dataset cleaning must be done beforehand (see the sketch after this list)
  • Parameter tuning is only possible if a rough estimate of the number of values the label can take is known; otherwise, pure unsupervised clustering without any tuning can be done by simply setting the parameter to 7
  • Sensitive to noise
  • The parameter, being fully independent of the dataset, cannot be estimated in advance and is mostly tuned by trial and error, but it almost always takes a value in the range [0,10]
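
A typical cleaning pass might look like the following. This is a minimal sketch, assuming pandas; the input and output file names are placeholders.

# Minimal cleaning sketch (placeholder file names, assumes pandas):
# keep only numeric columns and drop rows containing NaN/null values,
# as required by the constraints above.
import pandas as pd

df = pd.read_csv("raw_dataset.csv")

numeric = df.select_dtypes(include="number")   # drop non-numeric columns
clean = numeric.dropna()                       # drop rows with NaN/null values

clean.to_csv("cleaned_dataset.csv", index=False)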

Working Example:

nvcc Clustering.cu
./a.out data_banknote_authentication.csv 10 1 1

This uses parameter value 10 to cluster the input .csv file into some number of groups and writes a file named 'output.csv', which contains an additional column called 'label' representing the group ID of the group each row belongs to.
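
The included plot_output.py script visualizes the result; its contents are not reproduced here. As a rough sketch of what such a visualization might look like (assuming matplotlib and pandas, and plotting only the first two feature columns), the points can be coloured by their assigned label:

# Rough visualization sketch (assumes matplotlib and pandas; the real
# plot_output.py script may work differently): scatter the first two
# feature columns of output.csv, coloured by the assigned cluster label.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("output.csv")

plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df["label"], cmap="tab10", s=10)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.title("Clusters found by CUDA-Hungarian-Clustering")
plt.savefig("clusters.png")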

Sample output images - using datasets in the TestedDataset directory:

[Cluster plots data0 through data20, one per tested dataset]