In this project, several NLP techniques were applied to a set of Amazon reviews written by buyers. The documents were preprocessed by removing URLs, emojis, numbers, punctuation, and stopwords. Each word was then POS-tagged and lemmatized. The resulting texts were represented with three methods: Tf-Idf, Word2Vec, and Doc2Vec. These vector representations were then used for two tasks:
- Text classification: three machine learning methods were used (Naive Bayes, Support Vector Machine, Logistic Regression), and the results were compared using ROC curves and AUC scores. The best results were obtained with the Tf-Idf representation combined with the SVM and Logistic Regression algorithms, with an AUC score of 0.90 and an accuracy of 0.83.
- Text clustering: clusters were built with the K-means algorithm, the optimal number of clusters was found with the elbow method, and cluster topics were extracted from the cluster centroid coordinates and visualizations. Five clusters were identified, described by the following topics: Videogames, Movies, Products, Music, Books.
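
As an illustration of the classification step, here is a minimal sketch using scikit-learn. It is not the project's actual code: the toy documents, the binary labels (1 = positive, 0 = negative), and the choice of Logistic Regression with default settings are all assumptions made for the example. The documents stand in for reviews that have already been cleaned and lemmatized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import make_pipeline

# Hypothetical, already-preprocessed reviews (not the project's dataset).
train_docs = [
    "great game fun level graphic",
    "love album song voice",
    "boring movie bad plot",
    "terrible book waste money",
    "amazing story great character",
    "bad product break fast",
]
train_labels = [1, 1, 0, 0, 1, 0]  # assumed binary target: 1 = positive, 0 = negative

test_docs = ["great song fun", "bad plot waste"]
test_labels = [1, 0]

# Tf-Idf features feeding a Logistic Regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

# Probability of the positive class is the score used for the ROC curve / AUC.
scores = model.predict_proba(test_docs)[:, 1]
auc = roc_auc_score(test_labels, scores)
acc = accuracy_score(test_labels, model.predict(test_docs))
print(f"AUC={auc:.2f} accuracy={acc:.2f}")
```

Swapping `LogisticRegression` for another scikit-learn estimator such as `MultinomialNB` or `SVC(probability=True)` in the same pipeline would allow the kind of comparison described above.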
For further details see the report in the repository.