This project implement K-means clustering algorithm. We will be using the NSF Abstract dataset in this project.
These instructions will get run different stages of the project. The project is written in Jupyter Notebook using Python3 language.
Python3 packages needed to install
▪ numpy (Data structuring) ▪ pandas (Data structuring) ▪ scipy (distance calculations between vectors) ▪ sklearn (PCA) ▪ seaborn (Visualization) ▪ matplotlib (Visualization ▪ jupyter notebook (python code execution in browser) ▪ nltk (Tokenization)
Running jupyter notebook in project directory
Terminal: jupyter notebook
Then open the required notebook file in the browser.
You can use Menu->Kernel->Restart & Run All option to execute all cells. You can also use Shift+Enter to execute a single cell.
Data-Set: NSF Grant documents: Part1 https://archive.ics.uci.edu/ml/machine-learning-databases/nsfabs-mld/
Word2Vec Library File: GoogleNews-vectors-negative300.bin.gz
(Note: Takes a while to read data from word2vec pre-trained data.) Estimated Time: 10 mins
• Includes Word2Vec library from gensim model to convert doc to vector • Read all the grant proposals in the 'Part1' folder and extract document identifier 'Award Number' and abstract content • Filter empty abstracts • Save data to ’docs_vector.csv’ file for post-processing
(Note: Takes a while to read all grant proposal files from the directory tree of 'Part1' folder.) Estimated Time: 10 mins
2D visualization of the data with PCA 2-component decomposition. Drop outliers and save final data to 'docs_vector_final.csv'.
K-Means Clustering for K = (1, 2, 3, 4, 5, 6). Post-visualization of labels assigned to docs and distribution of docs in each cluster. Also shows the SSE Profile for different values of K.
K-Means Clustering using scikit learn.
Khagendra Adhikari