Web crawling and NLP engines for clustering of same-event news articles

sachaschwab/NLP-Clustering


NLP Events Clustering in Python

Web crawling and NLP engines for clustering of same-event news articles

Assessment 3 of MA5851 (Data Science Master Class 1) at James Cook University

Author: Sacha Schwab

MIT license

Quick outline

  • Crawl Yahoo Finance cryptocurrency news articles
  • Preprocess the raw text and embed it with TF-IDF
  • Run the NLP engine: keyword extraction based on TF-IDF weights, named entity extraction and sentiment analysis
  • Cluster the articles with the HDBSCAN algorithm; effectiveness is currently moderate (to be enhanced in upcoming versions)
See the architecture outline at the bottom of this page.
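
To make the TF-IDF keyword step more concrete, here is a minimal standard-library sketch. It is not the code from the notebooks, which may well rely on a library vectorizer. The tokenizer rule, the smoothed IDF variant, the function names and the three sample headlines are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ characters (assumed rule)."""
    return re.findall(r"[a-z]{3,}", text.lower())


def tfidf_weights(documents):
    """Return one {term: weight} dict per document.

    Weight = raw term count * smoothed IDF, with
    idf = ln((1 + n_docs) / (1 + doc_freq)) + 1 (an assumed variant).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def cosine_similarity(a, b):
    """Cosine similarity of two sparse {term: weight} vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up headlines for illustration only: the first two describe the same event.
articles = [
    "Bitcoin price surges as exchange reports record bitcoin trading volume",
    "Bitcoin rallies after record trading volume on major exchange",
    "Regulators propose new stablecoin rules for issuers",
]

weights = tfidf_weights(articles)
for doc_weights in weights:
    print(top_keywords(doc_weights))

# Same-event articles share weighted terms, so they score higher than unrelated ones.
print(cosine_similarity(weights[0], weights[1]) > cosine_similarity(weights[0], weights[2]))
```

The cosine comparison at the end only illustrates why same-event articles end up close together in TF-IDF space. It is not a substitute for the HDBSCAN clustering the project uses.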

For class Tutors

  • Reports are in /main as 'A3_DocumentNumber_X_sacha_schwab', as per the assessment outline
  • Code files: (1) 'code_webcralwer.ipynb', (2) 'code_nlp.ipynb'
  • Model available under /main/model
  • For privacy reasons, the audio-annotated PowerPoint presentation is not available here; it is in the assessment folder in JCU Learn

Base requirements

  • Git
  • Python 3.7
  • Any IDE supporting Jupyter Notebook files

Deploy

  • Schedule daily_jobs/webcrawler.py to run daily (the ipynb version is for grading)
  • Schedule daily_jobs/model_update.py to run daily
  • TBD: Get the articles connected to a new article by running get_cluster from model_run.py
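
One way to schedule the two daily jobs without an external scheduler is a small Python loop, sketched below. This is an assumption-laden illustration and not part of the repository: the 06:00 run time, the function names, and the choice to run the crawler before the model update are all hypothetical. In practice cron or the operating system's task scheduler may be the simpler option.

```python
import subprocess
import sys
import time
from datetime import datetime, timedelta
from datetime import time as clock_time

# Script paths as named in the Deploy list; assumed to be run from the repository root.
# Running the crawler before the model update is an assumed ordering.
DAILY_JOBS = ["daily_jobs/webcrawler.py", "daily_jobs/model_update.py"]


def next_run_at(now, run_time):
    """Return the next datetime, strictly after `now`, at which the clock reads `run_time`."""
    candidate = datetime.combine(now.date(), run_time)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def run_jobs(scripts=DAILY_JOBS):
    """Run each script in order with the current interpreter; raise on the first failure."""
    for script in scripts:
        subprocess.run([sys.executable, script], check=True)


def run_daily(run_time=clock_time(6, 0)):
    """Sleep until `run_time` each day, then run the jobs. Blocks forever."""
    while True:
        wait = (next_run_at(datetime.now(), run_time) - datetime.now()).total_seconds()
        time.sleep(max(wait, 0))
        run_jobs()
```

Calling `run_daily()` from a long-lived process starts the loop. `next_run_at` is kept separate so the date arithmetic can be checked without waiting or launching anything.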

Architecture


[Figure: architecture outline]
