Unsupervised Learning

This is the third project of the Machine Learning course (MO444). It applies unsupervised learning techniques to cluster news items.

Reproducibility:

Requirements

pip install -r requirements.txt

Experiments

Optimizations

The first experiments were run on both L2-normalized and non-normalized descriptors.

In the notebook names below, XX stands for 01 when the descriptors are normalized and 02 when they are not.

XX Elbow Kmeans - BoW.ipynb - Uses the elbow method to select the number of clusters with Bag of Words as the descriptor.
XX Elbow Kmeans - Word2Vec.ipynb - Uses the elbow method to select the number of clusters with Word2Vec as the descriptor.
XX Hierarchical - BoW.ipynb - Uses hierarchical clustering with dendrograms to select the number of clusters with Bag of Words as the descriptor.
XX Hierarchical - Word2Vec.ipynb - Uses hierarchical clustering with dendrograms to select the number of clusters with Word2Vec as the descriptor.
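
The two selection strategies listed above can be sketched roughly as follows. This is a minimal illustration and not the code in the notebooks: it assumes scikit-learn and SciPy are installed, uses synthetic blobs in place of the real descriptors, and the helper names (`prepare`, `elbow_inertias`, `hierarchical_labels`) as well as the Ward linkage are hypothetical choices made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import normalize

# Synthetic stand-in for the descriptors; this is NOT the project data.
X, _ = make_blobs(n_samples=200, centers=4, n_features=10, random_state=0)


def prepare(X, l2_normalize):
    """Return L2-normalized rows (the 01 variant) or the raw data (the 02 variant)."""
    return normalize(X, norm="l2") if l2_normalize else X


def elbow_inertias(X, k_values, seed=0):
    """Fit K-means for each k and return the inertia (within-cluster sum of squares)."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def hierarchical_labels(X, n_clusters, method="ward"):
    """Build the linkage matrix and cut the tree into at most n_clusters flat clusters."""
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return Z, labels


data = prepare(X, l2_normalize=True)

# Elbow method: look for the k after which the inertia stops dropping sharply.
k_values = list(range(2, 9))
inertias = elbow_inertias(data, k_values)
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.4f}")

# Hierarchical clustering: the dendrogram shows where large merges happen.
Z, labels = hierarchical_labels(data, n_clusters=4)
# no_plot=True returns the dendrogram layout without requiring matplotlib.
tree = dendrogram(Z, no_plot=True)
print("flat clusters found:", len(np.unique(labels)))
```

Plotting `inertias` against `k_values` gives the usual elbow curve. Passing `l2_normalize=False` to `prepare` reproduces the non-normalized variant of the same run.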

Comparison

We gathered the results produced by the optimizations and compared them using the selected metrics.

03 Comparison BoW.ipynb - Compares the clustering results using Bag of Words as the descriptor.
03 Comparison Word2Vec.ipynb - Compares the clustering results using Word2Vec as the descriptor.

Analysis and Visualization

04 Visualizing Results BoW.ipynb - Visualizes the clustering results using Bag of Words as the descriptor.
04 Visualizing Results Word2Vec.ipynb - Visualizes the clustering results using Word2Vec as the descriptor.
04 Result Analysis.ipynb - Selects the best results for each clustering method.

PCA Trials

Since the best result observed in the experiments used Bag of Words, the same experiment was repeated with PCA.

05 PCA variance analysis - BOW - Validates the BoW data after PCA while retaining different fractions of the variance.
05 PCA variance analysis - Word2Vec - Validates the Word2Vec data after PCA while retaining different fractions of the variance.
05 Elbow - Kmeans - BoW PCA.ipynb - Optimizes K-means for the BoW descriptor after dimensionality reduction.
05 Visualizing Results BoW PCA.ipynb - Visualizes the K-means results with the BoW descriptor after dimensionality reduction.
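
As a rough sketch of what retaining different fractions of variance means in practice, the snippet below asks scikit-learn's `PCA` for the smallest number of components that keeps a given fraction. The data is synthetic, the function name `components_for_variance` is hypothetical, and the fractions shown are illustrative rather than the ones used in the notebooks.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic low-rank data plus noise; this is NOT the project data.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))


def components_for_variance(X, fractions):
    """Map each variance fraction to (components kept, variance actually retained)."""
    result = {}
    for fraction in fractions:
        # A float n_components in (0, 1) selects the smallest number of
        # components whose cumulative explained variance reaches it.
        pca = PCA(n_components=fraction, svd_solver="full").fit(X)
        retained = float(pca.explained_variance_ratio_.sum())
        result[fraction] = (int(pca.n_components_), retained)
    return result


fractions = [0.80, 0.90, 0.95, 0.99]  # illustrative values
summary = components_for_variance(X, fractions)
for fraction, (n_components, retained) in summary.items():
    print(f"target={fraction:.2f}: {n_components} components, retained={retained:.4f}")
```

The reduced data for a chosen fraction can then be obtained with `PCA(n_components=fraction, svd_solver="full").fit_transform(X)` and passed to the same clustering pipeline.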

Metrics Comparison

The best results from the optimizations were compared using the selected metrics.

Activities

Discover the number of groups present in the data or a reliable range of possible values. Do some experiments in this regard.
Analyze the medoids of some groups (for example, 3 groups) and their closest neighbors in the groups. Do they make sense? Are they talking about the same type of documents?
Think of possible ways of checking the validity/quality of your clusters.
Re-do the best experiment above considering the PCA dimensionality reduction. Consider different energies (variance) to cut and reduce dimensionality. What are the conclusions when using PCA in this problem?
Prepare a 4-page (max.) report with all your findings. It is UP TO YOU to convince the reader that you are proficient in Unsupervised Learning Techniques and the choices they entail.
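
For the medoid activity above, one possible approach is to take, within each cluster, the member with the smallest total distance to all other members, then list its closest neighbors. The sketch below uses only NumPy, Euclidean distance, and a toy array; the function name `medoid_and_neighbors` and the choice of distance are assumptions for illustration.

```python
import numpy as np


def medoid_and_neighbors(X, labels, cluster, n_neighbors=3):
    """Return the row index of the cluster medoid and of its nearest cluster members."""
    members = np.flatnonzero(labels == cluster)
    points = X[members]
    # Pairwise Euclidean distances inside the cluster (O(n^2) memory).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # The medoid minimizes the summed distance to every other member.
    medoid_pos = int(np.argmin(dists.sum(axis=1)))
    order = np.argsort(dists[medoid_pos])
    neighbors = [int(members[i]) for i in order if i != medoid_pos][:n_neighbors]
    return int(members[medoid_pos]), neighbors


# Toy data: three points on a line in cluster 0, two distant points in cluster 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 0, 1, 1])

medoid, neighbors = medoid_and_neighbors(X, labels, cluster=0, n_neighbors=2)
print("medoid row:", medoid, "nearest rows:", neighbors)
```

The returned row indices can be used to look up the corresponding tweets and judge by eye whether the medoid and its neighbors talk about the same topic. For large clusters the full pairwise matrix may not fit in memory, so a chunked computation would be needed.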

Dataset

Health News in Twitter (https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter#), from the UCI Machine Learning Repository, is a dataset collected in 2015 using the Twitter API. Dataset Information:

The health news items come from BBC ('bbchealth.txt'), CNN ('cnnhealth.txt'), Fox News ('foxnewshealth.txt') and Everyday News ('everydayhealth.txt').
Each file is related to one Twitter account of a news agency. For example, bbchealth.txt is related to BBC health news. Each line contains tweet id|date and time|tweet. The separator is '|'.
There are 13,229 health news items ('health.txt'). The bag-of-words feature vectors (with 1,203 dimensions) representing each news item are also available ('bags.csv'), as well as the word2vec vectors (with 128 dimensions, 'word2vec.csv'). You should choose one.
The data is available at: https://www.dropbox.com/s/ahkim9u103v0q9i/health-dataset.zip
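
A line in the `tweet id|date and time|tweet` layout described above could be parsed along these lines. This is a sketch using only the standard library; the `Tweet` type, the `parse_line` helper, and the sample line (including its date format) are made up for illustration and are not taken from the dataset.

```python
from typing import NamedTuple


class Tweet(NamedTuple):
    tweet_id: str
    timestamp: str  # kept as raw text; the exact date format is not assumed
    text: str


def parse_line(line: str) -> Tweet:
    """Split one 'id|date and time|tweet' line into its three fields."""
    # maxsplit=2 keeps any '|' that appears inside the tweet text itself.
    tweet_id, timestamp, text = line.rstrip("\n").split("|", 2)
    return Tweet(tweet_id, timestamp, text)


# Hypothetical sample line, NOT copied from the dataset.
sample = "123|Thu Apr 09 01:31:50 +0000 2015|Example headline | with a pipe http://example.com\n"
tweet = parse_line(sample)
print(tweet.tweet_id, "->", tweet.text)
```

Limiting the split to the first two separators matters because the tweet text is free-form and may contain the separator character.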

Planning

Develop an EDA to understand the basics of the dataset. notebooks\exploratory\
Define metrics for model evaluation and create a workflow for applying them. notebooks\exploratory\
- Elbow: to define the number of clusters.
- Silhouette: to assess how cohesive the clusters are.
- Davies-Bouldin: to measure how well separated the clusters are.
- Scatter plots: to visualize the separation between clusters and the instances belonging to each one.
- References: http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schulte/theses/phd/algorithm.pdf
Apply PCA for dimensionality reduction.
- Redo all the experiments above.
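
The silhouette and Davies-Bouldin metrics in the plan above are both available in scikit-learn, so an evaluation step might look like the sketch below. It assumes scikit-learn is installed, uses synthetic blobs with hand-picked centers instead of the real descriptors, and the helper name `score_clustering` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Three well-separated synthetic clusters; this is NOT the project data.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)


def score_clustering(X, k, seed=0):
    """Cluster with K-means and report both internal validity metrics."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return {
        "k": k,
        # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
        "silhouette": float(silhouette_score(X, labels)),
        # Davies-Bouldin is non-negative; lower means better separation.
        "davies_bouldin": float(davies_bouldin_score(X, labels)),
    }


scores = [score_clustering(X, k) for k in (2, 3, 4, 5)]
best = max(scores, key=lambda s: s["silhouette"])
for s in scores:
    print(s)
print("best k by silhouette:", best["k"])
```

The two metrics point in opposite directions (maximize silhouette, minimize Davies-Bouldin), so it helps to report both side by side when comparing runs.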
