The model I implemented took a dataset from Kaggle in which the articles were pre-labeled as True or False. A groupmate of mine used the same model, but his detected left and right bias; I was feeding new data into the same model to see whether it would yield the same performance and results. In addition, I applied the algorithm to both the titles and the texts of the articles to find the most common n-grams within the true and false articles, whereas his model used only the titles.
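A minimal sketch of what that n-gram extraction step could look like, assuming the Kaggle dataset ships as two CSV files (here called True.csv and Fake.csv) with "title" and "text" columns; the file and column names are assumptions, not confirmed by this report, and CountVectorizer is one plausible way to count n-grams rather than necessarily the exact tool we used:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# File names and column names below are assumed for illustration.
true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")

def top_ngrams(texts, n, k=10):
    """Return the k most frequent n-grams across a collection of strings."""
    vec = CountVectorizer(ngram_range=(n, n), stop_words="english")
    counts = vec.fit_transform(texts.fillna(""))
    totals = counts.sum(axis=0).A1          # total frequency of each n-gram
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, totals), key=lambda p: p[1], reverse=True)[:k]

# Apply to both titles and article bodies, separately for true and false articles.
for label, df in [("true", true_df), ("false", fake_df)]:
    combined = df["title"] + " " + df["text"]
    print(label, "top bigrams:", top_ngrams(combined, n=2))
```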
The first challenge I ran into was that the dataset was too large: I had to cut it down significantly, from 20,000 articles per file to about 100, for it to run and be readable. I generated word clouds for the true and the false articles, along with unigram, bigram, and trigram analyses of the most common words and phrases extracted from the text. We then trained a decision tree on the unigram and bigram features; the accuracies of the resulting trees ranged from about 78% to 86%, and the cross-validation accuracy for both was similarly high.
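A sketch of the decision-tree step under the same assumptions as above (file and column names assumed; the subsample size mirrors the roughly 100 articles per file mentioned here, and DecisionTreeClassifier stands in for whatever tree implementation was actually used):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load and label the two files, then subsample to keep the run manageable.
true_df = pd.read_csv("True.csv").assign(label=1)
fake_df = pd.read_csv("Fake.csv").assign(label=0)
data = pd.concat([true_df, fake_df], ignore_index=True).sample(200, random_state=42)

X_text = (data["title"] + " " + data["text"]).fillna("")
y = data["label"]

# Unigram features; switch ngram_range to (2, 2) for the bigram tree.
vec = CountVectorizer(ngram_range=(1, 1), stop_words="english")
X = vec.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("held-out accuracy:", tree.score(X_test, y_test))
print("5-fold CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
```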
At the end, I introduced two new data points: the unigram model predicted them incorrectly, while the bigram model predicted one of them correctly. This is likely because these articles did not cover the same topics as the articles in the training set, so the algorithm could not find matching n-grams. If we retested with a larger set of training articles, there would be a better chance of it detecting those n-grams.
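For illustration, two unseen articles could be scored by continuing the sketch above, reusing the fitted vec and tree (the article strings here are hypothetical placeholders, not the ones actually tested):

```python
new_articles = [
    "Senate passes budget bill after late-night negotiations",
    "Shocking secret cure that doctors don't want you to know",
]
# New text must pass through the SAME fitted vectorizer; any n-grams the
# tree never saw in training are simply dropped, which is why articles on
# unfamiliar topics can fall back to near-chance predictions.
X_new = vec.transform(new_articles)
print(tree.predict(X_new))  # 1 = true, 0 = false
```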
Overall, it is reasonable to conclude that this model and algorithm are best suited to news articles centered on a single topic; applying it in practice to such articles would likely yield better results, since the algorithm would be able to detect familiar n-grams in randomly selected new articles.