NLP Events Clustering in Python

Web crawling, NLP engines, and clustering of same-event news articles

Assessment 3 of MA5851 (Data Science Master Class 1) at James Cook University

Author: Sacha Schwab

MIT license

Quick outline

Crawl Yahoo Finance Cryptocurrency news articles
Raw text data is preprocessed and embedded (TF-IDF)
NLP engine runs keyword extraction based on TF-IDF weights, named entity extraction and sentiment analysis
The HDBSCAN algorithm is used for clustering, currently with moderate effectiveness (to be enhanced in upcoming versions).
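
The keyword extraction step in the outline above can be illustrated with a minimal, standard-library-only sketch. It is not the repository's code: the function names `tokenize` and `tfidf_keywords`, the smoothed IDF formula, and the sample headlines are all assumptions chosen for illustration, and the actual notebooks may use a library vectorizer instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, top_n=3):
    """Return the top_n highest-weighted TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights = {}
        for term, count in counts.items():
            # Smoothed IDF (an assumption), so common terms are damped, not zeroed.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            weights[term] = (count / total) * idf
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        keywords.append([term for term, _ in ranked[:top_n]])
    return keywords


# Made-up sample headlines, not taken from the crawled data.
articles = [
    "Bitcoin price rallies as bitcoin futures volume climbs",
    "Ethereum upgrade delayed as ethereum developers find a bug",
    "Regulators discuss stablecoin rules as markets watch",
]
top_terms = tfidf_keywords(articles, top_n=1)
```

Terms that recur within one article but are rare across the corpus rank highest, which is why a word shared by every headline (here "as") never becomes a keyword.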

See architecture outline at the bottom of this page.

For class Tutors

Reports are in /main as 'A3_DocumentNumber_X_sacha_schwab', as per the assessment outline
Code files: (1) 'code_webcralwer.ipynb', (2) 'code_nlp.ipynb'
Model available under /main/model
For privacy reasons, the audio-annotated PowerPoint presentation is not available here; it is in the assessment folder in JCU Learn

Base requirements

Git
Python 3.7
Any IDE supporting Jupyter Notebook files

Deploy

Schedule daily_jobs/webcrawler.py to run daily (the ipynb version is for grading)
Schedule daily_jobs/model_update.py to run daily
TBD: Get the articles connected to a new article by running get_cluster from model_run.py
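
One possible way to chain the two daily jobs is a small runner script that a scheduler (cron, Task Scheduler, or similar) invokes once a day. This is a sketch under assumptions: the helper names `build_commands` and `run_daily_jobs` are hypothetical, and running the crawler before the model update is an assumed ordering, not something the repository states.

```python
import subprocess
import sys
from pathlib import Path

# Script paths come from the Deploy steps; the execution order is an assumption.
DAILY_JOBS = ("daily_jobs/webcrawler.py", "daily_jobs/model_update.py")


def build_commands(repo_root, jobs=DAILY_JOBS):
    """Build one interpreter command per job, resolved against repo_root."""
    root = Path(repo_root)
    return [[sys.executable, str(root / job)] for job in jobs]


def run_daily_jobs(repo_root):
    """Run each job in order; check=True stops at the first failing script."""
    for command in build_commands(repo_root):
        subprocess.run(command, check=True)
```

Calling `run_daily_jobs(".")` from the repository root would run the crawler and then the model update, raising `subprocess.CalledProcessError` if either script exits with a non-zero status.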

Folders and files

code
data
model
pdf_reports
reports_jupyter_notebooks
.gitattributes
.gitignore
README.md


sachaschwab/NLP-Clustering
