In this project, we explore the emotions and opinions expressed in ideas and comments using sentiment analysis. The script is written in Python, leveraging libraries like pandas, BeautifulSoup, and NLTK; a code sketch of the full workflow follows the outline below.
- Data Preparation:
- Load ideas and comments from JSON files.
- Convert HTML content in the descriptions to plain text for clarity.
- Sentiment Analysis:
- Utilize NLTK's VADER tool to analyze the sentiments of ideas and comments.
- Calculate a sentiment score for each idea and comment, indicating positive, neutral, or negative sentiment.
- Data Aggregation:
- Group comments by their associated idea and calculate the average sentiment score for comments on each idea.
- Data Merging:
- Combine the ideas and their average comment sentiment scores into one DataFrame for comprehensive analysis.
- Visualization:
- Create a bar chart using matplotlib and numpy to compare the sentiment scores.
- Display idea sentiments in blue and average comment sentiments in yellow for a clear visual comparison.
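A minimal sketch of this workflow is shown below. The file names (ideas.json, comments.json) and field names (description, text, id, idea_id) are assumptions for illustration and may differ from the actual innovation.py script:

```python
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # lexicon required by VADER

# Data preparation: load ideas and comments (file/field names assumed).
ideas = pd.read_json("ideas.json")
comments = pd.read_json("comments.json")

# Convert the HTML descriptions to plain text.
ideas["text"] = ideas["description"].astype(str).apply(
    lambda html: BeautifulSoup(html, "html.parser").get_text(separator=" ")
)

# Sentiment analysis: VADER compound score in [-1, 1] per idea and per comment.
sia = SentimentIntensityAnalyzer()
ideas["idea_sentiment"] = ideas["text"].apply(lambda t: sia.polarity_scores(t)["compound"])
comments["comment_sentiment"] = comments["text"].astype(str).apply(
    lambda t: sia.polarity_scores(t)["compound"]
)

# Data aggregation: average comment sentiment per idea; then merge with the ideas.
avg_comment = comments.groupby("idea_id")["comment_sentiment"].mean().rename("avg_comment_sentiment")
merged = ideas.merge(avg_comment, left_on="id", right_index=True, how="left")

# Visualization: grouped bar chart, idea sentiment in blue, average comment sentiment in yellow.
x = np.arange(len(merged))
width = 0.4
plt.bar(x - width / 2, merged["idea_sentiment"], width, color="blue", label="Idea sentiment")
plt.bar(x + width / 2, merged["avg_comment_sentiment"], width, color="yellow", label="Avg comment sentiment")
plt.xlabel("Idea")
plt.ylabel("Sentiment score")
plt.legend()
plt.show()
```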
The resulting plot provides a visual representation of the sentiments associated with each idea and its comments. This analysis helps in understanding public opinion and emotions towards these ideas, offering valuable insights.
The graphical analysis suggests that the sentiment scores of idea descriptions are more variable compared to the average sentiment of the comments. This observation might stem from the tendency of idea descriptions to be more expressive or exaggerated in nature, resulting in a wider range of sentiment scores. This contrast highlights how different modes of expression (ideas vs. comments) can vary in emotional intensity and variability.
- "We could involve the medical order and the nursing order in the identification of their professionals who are in quarantine and available to do teleconsultations."
- Negative Sentiment Probability: 0.0596
- Neutral Sentiment Probability: 0.8874
- Positive Sentiment Probability: 0.0529
- Interpretation: The sentiment of the English text is predominantly neutral.
- "Poderíamos envolver a ordem dos médicos e a ordem dos enfermeiros na identificação dos seus profissionais que estão de quarentena e disponíveis para fazer teleconsultas."
- Negative Sentiment Probability: 0.0489
- Neutral Sentiment Probability: 0.9181
- Positive Sentiment Probability: 0.0329
- Interpretation: The sentiment of the Portuguese text is predominantly neutral.
For this analysis, I utilized the pretrained model pysentimiento/bertweet-pt-sentiment (Research Paper). This decision was motivated by the need to ensure accuracy when analyzing Portuguese text: my objective was to verify whether the existing sentiment analysis tool (nltk.sentiment.SentimentIntensityAnalyzer) was appropriate for the Portuguese descriptions. The pretrained model's results were very similar to the ones I had obtained previously, confirming that the earlier sentiment analysis was accurate and that no changes to innovation.py were needed.
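As a sketch of how this check can be run with the pysentimiento library (its create_analyzer helper loads the Portuguese sentiment model when lang="pt"; the exact API may vary slightly across versions):

```python
from pysentimiento import create_analyzer

# Loads a Portuguese sentiment model (bertweet-pt-sentiment) under the hood.
pt_analyzer = create_analyzer(task="sentiment", lang="pt")

text = ("Poderíamos envolver a ordem dos médicos e a ordem dos enfermeiros "
        "na identificação dos seus profissionais que estão de quarentena e "
        "disponíveis para fazer teleconsultas.")

result = pt_analyzer.predict(text)
print(result.output)   # predicted label, e.g. "NEU"
print(result.probas)   # probabilities per class, e.g. {"NEG": ..., "NEU": ..., "POS": ...}
```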
This project involves classifying a set of innovation ideas into various categories such as products, services, business models, and others. The initial approach was to use clustering techniques to group similar ideas.
- Clustering with KMeans:
- Applied TF-IDF vectorization to preprocess the idea descriptions.
- Used KMeans clustering to group ideas into 6 clusters.
- Visualized the clusters with a scatter plot.
- Identified the different clusters and manually analyzed each group to label them (e.g., services, products).
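A minimal sketch of this clustering step, assuming the descriptions are read from ideas.json (field name assumed) and projected to 2-D with TruncatedSVD purely for plotting:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Plain-text idea descriptions (file and field names are assumptions).
descriptions = pd.read_json("ideas.json")["description"].astype(str).tolist()

# TF-IDF vectorization of the idea descriptions.
X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(descriptions)

# KMeans clustering into 6 clusters.
labels = KMeans(n_clusters=6, random_state=42, n_init=10).fit_predict(X)

# 2-D projection for the scatter plot of the clusters.
coords = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.title("KMeans clusters of idea descriptions (TF-IDF)")
plt.show()
```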
One of the clusters, identified by IDs [100, 101, 130, 135, 155, 153, 186, 193, 194, 196, 197, 226, 227, 228, 230, 258, 262, 268, 289, 293, 295, 299, 301, 321, 322, 324, 358, 362, 385, 386, 417, 418, 423, 424, 425, 427, 428, 452, 453, 455, 456, 457, 482, 483, 481, 484, 487, 516, 546, 547, 548, 549, 611, 643, 737, 962, 1025, 1089], was labeled as 'service'; the remaining clusters were analyzed manually in the same way to determine and assign an appropriate innovation type.
After the initial clustering process, I found myself dissatisfied with the results of the unsupervised learning approach. This led me to pivot towards a supervised learning strategy.
I embarked on creating a new dataset, utilizing GPT for the generation process (innovation_ideas.txt), in which each generated idea was labeled with an innovation type. This approach was taken with the intent to train a model more effectively, tailored to the specific requirements of the task at hand.
With the newly created dataset in hand, I opted to train my model using the Support Vector Classifier (SVC) method. This decision was driven by SVC's known efficacy in handling similar classification tasks.
The trained model was then tested on the ideas.json file. Despite the rigorous process, the results did not align with my expectations. In retrospect, a larger and more diverse dataset might have significantly enhanced the model's performance.
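A minimal sketch of this supervised step; the tab-separated "description, label" layout assumed for innovation_ideas.txt is a guess, as is the linear kernel:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumed format: one "description<TAB>label" pair per line in innovation_ideas.txt.
data = pd.read_csv("innovation_ideas.txt", sep="\t", names=["text", "label"])

X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.2, random_state=42
)

# TF-IDF features feeding a Support Vector Classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), SVC(kernel="linear"))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Apply the trained model to the real ideas (field name assumed).
ideas = pd.read_json("ideas.json")
ideas["predicted_type"] = model.predict(ideas["description"].astype(str))
```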
Still dissatisfied with the results, the project shifted towards using advanced NLP models for better classification.
To further enhance the classification, the project adopted zero-shot learning techniques. Zero-shot learning models, like facebook/bart-large-mnli, can classify text into categories without explicit training on those categories. This approach offers flexibility and reduces the need for a large labeled dataset.
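A minimal sketch using the Hugging Face transformers zero-shot pipeline; the candidate label set below is an assumption based on the categories mentioned earlier (products, services, business models, and others):

```python
from transformers import pipeline

# Zero-shot classifier backed by facebook/bart-large-mnli.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

candidate_labels = ["product", "service", "business model", "process", "other"]  # assumed label set

idea = ("We could involve the medical order and the nursing order in the "
        "identification of their professionals who are in quarantine and "
        "available to do teleconsultations.")

result = classifier(idea, candidate_labels)
print(result["labels"][0], result["scores"][0])  # top predicted innovation type and its score
```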
These were some of the results; they were much better, and I was finally happy with the labelling. The full results are in Bert_results.txt. I can finally say I can move on to the next one :)
This exercise involves analyzing a dataset of Amazon product reviews. The dataset is in JSON format and includes various fields such as the reviewer's ID, product ID, review text, ratings, and votes for helpfulness.
The analysis aims to answer the following questions using the dataset:
- Is there a correlation between the product's rating and the review's helpfulness?
- Who are the most helpful reviewers?
- Have reviews been getting more or less helpful over time?
The dataset is structured with the following columns:
- reviewerID: ID of the reviewer (e.g., A2SUAM1J3GNN3B)
- asin: ID of the product (e.g., 0000013714)
- reviewerName: Name of the reviewer
- vote: Helpful votes of the review
- style: A dictionary of the product metadata (e.g., format)
- reviewText: Text of the review
- overall: Rating of the product
- summary: Summary of the review
- unixReviewTime: Time of the review (unix time)
- reviewTime: Time of the review (raw)
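A minimal sketch of loading and preparing this dataset; the file name reviews.json is a placeholder, and the JSON Lines format is assumed. The vote field in the raw Amazon data is a string that may contain thousands separators and is missing when a review has no votes, so it is cleaned to numeric here:

```python
import pandas as pd

# Load the Amazon reviews (JSON Lines format and file name assumed).
df = pd.read_json("reviews.json", lines=True)

# Clean `vote`: strip commas, coerce non-numeric/missing values to 0.
df["vote"] = pd.to_numeric(
    df["vote"].astype(str).str.replace(",", "", regex=False), errors="coerce"
).fillna(0).astype(int)

# Convert the unix timestamp to a datetime and extract the year for trend analysis.
df["reviewDate"] = pd.to_datetime(df["unixReviewTime"], unit="s")
df["year"] = df["reviewDate"].dt.year
```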
- Correlation between Product Rating and Review Helpfulness: The corr function in Pandas calculates the statistical relationship between product ratings (overall) and review helpfulness (vote). The result indicates the nature of this relationship.
- Most Helpful Reviewers: Reviewers are grouped by their ID, and their votes are summed to identify the most helpful ones.
- Trend of Review Helpfulness Over Time: The dataset is grouped by year, and the average helpfulness votes are calculated to observe trends over time.
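A minimal sketch of these three analyses, using the same assumed file name and cleaning as in the loading sketch above (the exact groupings are assumptions about the implementation):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Prepare the DataFrame as in the loading sketch (file name assumed).
df = pd.read_json("reviews.json", lines=True)
df["vote"] = pd.to_numeric(
    df["vote"].astype(str).str.replace(",", "", regex=False), errors="coerce"
).fillna(0).astype(int)
df["year"] = pd.to_datetime(df["unixReviewTime"], unit="s").dt.year

# 1. Correlation between product rating (overall) and helpfulness (vote).
print("Correlation:", df["overall"].corr(df["vote"]))

# 2. Most helpful reviewers: total helpful votes per reviewer, top 10.
print(df.groupby("reviewerID")["vote"].sum().sort_values(ascending=False).head(10))

# 3. Trend of helpfulness over time: average helpful votes per year.
df.groupby("year")["vote"].mean().plot(kind="line")
plt.xlabel("Year")
plt.ylabel("Average helpful votes")
plt.show()
```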
- Output: Correlation: -0.017070877080897048
- Interpretation: The correlation coefficient of approximately -0.017 indicates a very weak negative linear relationship between the product's rating and the review's helpfulness. This suggests that there is virtually no significant linear relationship between these two factors in the dataset.
The analysis identified the top 10 reviewers with the most helpful votes. Here are their IDs along with the total number of helpful votes they received:
- A1MRPX3RM48T2I - 2375 votes
- A5JLAU2ARJ0BO - 2063 votes
- A2D1LPEUCTNT8X - 2033 votes
- A3MQAQT8C6D1I7 - 1846 votes
- A15S4XW3CRISZ5 - 1470 votes
- A1N40I9TO33VDU - 1142 votes
- A1UED9IWEXZAVO - 1132 votes
- A250AXLRBVYKB4 - 1108 votes
- A680RUE1FDO8B - 1101 votes
- A2IIN2NFYXHC4J - 1092 votes
These reviewers are considered the most helpful based on the total number of helpful votes their reviews have received.
The trend of review helpfulness over time was plotted to observe how the average helpful votes changed year by year. The plot provides a visual representation of whether reviews have been getting more or less helpful over time.
Overall Decline: Following the initial peak, there is a general downward trend in the average number of helpful votes over time. This trend suggests that reviews are considered less helpful by users or that users are less inclined to vote on the helpfulness of reviews as time progresses.
A binary classification model was built to predict whether an Amazon product review will be considered helpful. The helpfulness label was defined as binary, with reviews receiving more than 5 votes considered helpful (labeled as 1), and all others considered not helpful (labeled as 0).
The model was trained using the following steps:
- Reviews with more than 5 helpful votes are labeled as 1 (helpful), else 0 (not helpful).
- The feature used for prediction is the review text.
- TF-IDF vectorization is applied to convert text to numeric features, limited to 3 features for simplicity.
- A RandomForestClassifier is used for training.
- The dataset is split into 80% training and 20% testing sets.
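A minimal sketch of this classifier, following the steps listed above; the file name and missing-value handling are assumptions, while the 3-feature TF-IDF limit, the RandomForestClassifier, and the 80/20 split follow the description:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_json("reviews.json", lines=True)
df["vote"] = pd.to_numeric(
    df["vote"].astype(str).str.replace(",", "", regex=False), errors="coerce"
).fillna(0).astype(int)

# Label: helpful (1) if the review has more than 5 helpful votes, else not helpful (0).
df["helpful"] = (df["vote"] > 5).astype(int)

# Feature: the review text, TF-IDF limited to 3 features for simplicity.
X = TfidfVectorizer(max_features=3, stop_words="english").fit_transform(df["reviewText"].fillna(""))
y = df["helpful"]

# 80/20 train/test split and a RandomForest classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```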
The classification report from the model is as follows:
              precision    recall  f1-score   support

           0       0.89      0.99      0.94     81337
           1       0.45      0.07      0.13     10551

    accuracy                           0.88     91888
   macro avg       0.67      0.53      0.53     91888
weighted avg       0.84      0.88      0.84     91888
The model shows high precision for class 0 (not helpful), but low precision and recall for class 1 (helpful), indicating a model bias towards predicting not helpful reviews.
- Balancing the Dataset: The dataset is highly imbalanced, with most reviews having 0 votes. This imbalance can lead to biased results, which is evident from the precision and recall scores. Techniques like SMOTE, undersampling, or assigning class weights in the model could be explored to address this imbalance (see the sketch after this list).
- Incorporating More Features: Although the current model uses only 3 features due to computational constraints, including more features such as the length of the review, the sentiment score, and the time of the review could potentially improve the model's performance.
- Preventing Overfitting: Care must be taken not to overfit the model when adding more features. Cross-validation and regularization techniques should be used to ensure the model generalizes well to unseen data.
- Additional Resources: With more computational power and time, a more thorough grid search for hyperparameter tuning could be conducted, and more complex models or ensemble methods could be tested.
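As a brief illustration of the first three points, here is one possible variant, reusing the same assumed file and cleaning as above: class weights to counter the imbalance, a larger TF-IDF vocabulary plus review length as an extra feature, and cross-validation to check generalization. This is a sketch of one option, not the project's implementation:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

df = pd.read_json("reviews.json", lines=True)
df["vote"] = pd.to_numeric(
    df["vote"].astype(str).str.replace(",", "", regex=False), errors="coerce"
).fillna(0).astype(int)
df["helpful"] = (df["vote"] > 5).astype(int)
text = df["reviewText"].fillna("")

# More text features than before, plus review length as an extra numeric feature.
X_text = TfidfVectorizer(max_features=1000, stop_words="english").fit_transform(text)
X_len = csr_matrix(text.str.len().to_numpy().reshape(-1, 1).astype(float))
X = hstack([X_text, X_len]).tocsr()
y = df["helpful"]

# class_weight="balanced" counteracts the heavy class imbalance;
# 5-fold cross-validated F1 gives a less optimistic picture than plain accuracy.
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print("Cross-validated F1 (helpful class):", np.round(scores, 3))
```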
By addressing these points, we could improve the model's ability to accurately predict the helpfulness of reviews.