This is the third project of the Machine Learning course (MO444). It applies unsupervised learning techniques to the clustering of news.
pip install -r requirements.txt
The first experiments were run with both L2-normalized and non-normalized descriptors.
The XX prefix below stands for 01 when the descriptors are normalized and 02 when they are not.
XX Elbow Kmeans - BoW.ipynb
- Uses the elbow method to select the number of clusters with Bag of Words as the descriptor (a minimal sketch follows this list).
XX Elbow Kmeans - Word2Vec.ipynb
- Uses the elbow method to select the number of clusters with Word2Vec as the descriptor.
XX Hierarchical - BoW.ipynb
- Uses hierarchical clustering with dendrograms to select the number of clusters with Bag of Words as the descriptor.
XX Hierarchical - Word2Vec.ipynb
- Uses hierarchical clustering with dendrograms to select the number of clusters with Word2Vec as the descriptor.
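A minimal sketch of the elbow step, assuming the BoW features are loaded from bags.csv; the file path, the k range, and the variable names are assumptions rather than the notebooks' exact choices. The hierarchical notebooks follow the same idea, replacing the inertia loop with scipy.cluster.hierarchy.linkage and dendrogram to pick the cut.

```python
# Minimal sketch of the elbow method for K-means on the (optionally
# L2-normalized) BoW features. Path, k range, and names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

bow = pd.read_csv("health-dataset/bags.csv").values  # path is an assumption
bow_l2 = normalize(bow, norm="l2")                   # the "01" variant; skip for "02"

k_values = range(2, 31)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(bow_l2)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("Elbow curve, K-means on L2-normalized BoW")
plt.show()
```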
We gathered the results generated by the optimizations and compared them using the metrics selected.
03 Comparison BoW.ipynb
- Compares the clustering results using bag of words as the descriptor.
03 Comparison Word2Vec.ipynb
- Compares the clustering results using word2vec as the descriptor.
04 Visualizing Results BoW.ipynb
- Visualizes the clustering results using bag of words as the descriptor (a plotting sketch follows this list).
04 Visualizing Results Word2Vec.ipynb
- Visualizes the clustering results using word2vec as the descriptor.
04 Result Analysis.ipynb
- Selects the best results across the clustering methods.
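A minimal sketch of the visualization step, assuming `features` and `labels` are available from the clustering runs above; the two-component PCA projection here is only for plotting, not for the clustering itself.

```python
# Project the descriptors to two principal components purely for plotting and
# color each point by its cluster label. `features`/`labels` are assumed.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_clusters(features, labels, title="Clusters on a 2D PCA projection"):
    points = PCA(n_components=2).fit_transform(features)
    plt.figure(figsize=(8, 6))
    plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="tab20")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title(title)
    plt.show()
```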
Since the best results observed during the experiments were obtained with Bag of Words, the same experiments were rebuilt using PCA.
05 PCA variance analysis - BOW
- Validates the BoW data after PCA while retaining different amounts of variance (a PCA sketch follows this list).
05 PCA variance analysis - Word2Vec
- Validates the Word2Vec data after PCA while retaining different amounts of variance.
05 Elbow - Kmeans - BoW PCA.ipynb
- Optimizes K-means for the BoW descriptor after the dimensionality reduction.
05 Visualizing Results BoW PCA.ipynb
- Visualizes the K-means results with the BoW descriptor after the dimensionality reduction.
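A minimal sketch of the PCA variance analysis; the 80/90/95% cut-offs and the random stand-in data are illustrative, not necessarily the values or inputs used in the notebooks.

```python
# Keep the smallest number of components whose cumulative explained variance
# reaches a target, then cluster the reduced data. Cut-offs are illustrative.
import numpy as np
from sklearn.decomposition import PCA

def reduce_to_variance(features, variance_target):
    # A float in (0, 1) makes PCA keep just enough components to reach the
    # target explained-variance ratio (requires the full SVD solver).
    pca = PCA(n_components=variance_target, svd_solver="full")
    reduced = pca.fit_transform(features)
    print(f"{variance_target:.0%} variance -> {pca.n_components_} components")
    return reduced, pca

if __name__ == "__main__":
    features = np.random.rand(500, 1203)  # stand-in for the BoW matrix
    for target in (0.80, 0.90, 0.95):
        reduce_to_variance(features, target)
```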
The results from these optimizations were then compared using the selected metrics.
- Discover the number of groups present in the data or a reliable range of possible values. Do some experiments in this regard.
- Analyze the medoids of some groups (for example, 3 groups) and their closest neighbors in the groups. Do they make sense? Are they talking about the same type of documents? (A medoid-inspection sketch follows this list.)
- Think of possible ways of checking the validity/quality of your clusters.
- Re-do the best experiment above considering the PCA dimensionality reduction. Consider different energies (variance) to cut and reduce dimensionality. What are the conclusions when using PCA in this problem?
- Prepare a 4-page (max.) report with all your findings. It is UP TO YOU to convince the reader that you are proficient in Unsupervised Learning Techniques and the choices they entail.
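A minimal sketch for the medoid inspection task, assuming `features`, `labels`, and `centroids` come from a fitted K-means model and `texts` holds the raw tweets; all names and defaults are illustrative.

```python
# Take the document closest to each K-means centroid as the cluster "medoid"
# and list its nearest neighbors so the raw tweets can be read side by side.
import numpy as np
from sklearn.metrics import pairwise_distances

def inspect_medoids(features, labels, centroids, texts, clusters=(0, 1, 2), n_neighbors=5):
    for c in clusters:
        members = np.where(labels == c)[0]
        to_centroid = pairwise_distances(features[members], centroids[c][None, :]).ravel()
        medoid = members[to_centroid.argmin()]
        to_medoid = pairwise_distances(features[members], features[medoid][None, :]).ravel()
        neighbors = members[np.argsort(to_medoid)[1:n_neighbors + 1]]  # skip the medoid itself
        print(f"Cluster {c} medoid: {texts[medoid]}")
        for n in neighbors:
            print(f"  neighbor: {texts[n]}")
```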
Health News in Twitter (https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter#), from the UCI Machine Learning Repository, is a dataset collected in 2015 using the Twitter API. Dataset information:
- The health news items come from BBC ('bbchealth.txt'), CNN ('cnnhealth.txt'), Fox News ('foxnewshealth.txt') and Everyday Health ('everydayhealth.txt').
- Each file is related to one Twitter account of a news agency; for example, bbchealth.txt is related to BBC health news. Each line contains tweet id|date and time|tweet, with '|' as the separator (a loading sketch follows this list).
- There are 13,229 health news items ('health.txt'). The bag-of-words feature vectors (1,203 dimensions) representing each news item are also available ('bags.csv'), as well as the word2vec vectors (128 dimensions, 'word2vec.csv'). You should choose one.
- The data is available at: https://www.dropbox.com/s/ahkim9u103v0q9i/health-dataset.zip
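A minimal sketch for loading one of the raw tweet files described above; the 'health-dataset/' folder name is an assumption based on the archive name.

```python
# Load a pipe-separated tweet file into a DataFrame.
import pandas as pd

def load_tweets(path):
    # Each line is: tweet id|date and time|tweet. A tweet may itself contain
    # '|', so only the first two separators are split on.
    rows = []
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("|", 2)
            if len(parts) == 3:
                rows.append(parts)
    return pd.DataFrame(rows, columns=["tweet_id", "datetime", "text"])

bbc = load_tweets("health-dataset/bbchealth.txt")  # path is an assumption
```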
- Develop an EDA to understand the basics of the dataset.
notebooks\exploratory\
- Define metrics for model evaluation and create an evaluation flow (see the scoring sketch at the end of this list).
notebooks\exploratory\
- Elbow: for defining the number of clusters.
- Silhouette: measures the cohesion of the clusters around their centroids.
- Davies-Bouldin: measures the separation between centroids.
- Scatter plots: used to visualize the cluster separations and the instances belonging to each cluster.
- References: http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schulte/theses/phd/algorithm.pdf
- Apply PCA as a dimensionality reduction step.
- Redo all above experiments.
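A minimal sketch of the scoring part of the evaluation flow, assuming each experiment's label vector has been stored; the elbow curve and the scatter plots are sketched in the sections above, so only the two score-based metrics appear here.

```python
# Silhouette for cohesion and Davies-Bouldin for separation, collected for a
# set of stored clusterings. `features` and `labelings` are assumptions.
import pandas as pd
from sklearn.metrics import silhouette_score, davies_bouldin_score

def score_clusterings(features, labelings):
    """labelings: dict mapping an experiment name to its label vector."""
    rows = []
    for name, labels in labelings.items():
        rows.append({
            "experiment": name,
            "silhouette": silhouette_score(features, labels),          # higher is better
            "davies_bouldin": davies_bouldin_score(features, labels),  # lower is better
        })
    return pd.DataFrame(rows)
```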