Update documentation #12

Merged
merged 3 commits into from
Oct 12, 2020
2 changes: 2 additions & 0 deletions README.rst
@@ -28,6 +28,8 @@ TwitStat
:License: MIT
:Pycharm: Yes

Twitstat is a simple web application that analyses Twitter data to provide insights into trending hashtags
and topics. It clusters and charts the data to make trends around the world easier to understand!

Basic Project Structure
-----------------------
11 changes: 11 additions & 0 deletions docs/source/about.rst
@@ -0,0 +1,11 @@
About Twitstat
==============

Twitstat is a simple web application that analyses Twitter data to provide insights into trending hashtags
and topics. It clusters and charts the data to make trends around the world easier to understand!

Twitstat is split into multiple modules:

- :ref:`scrape`

- :ref:`nlp`
4 changes: 4 additions & 0 deletions docs/source/contributors.rst
@@ -0,0 +1,4 @@
Contributors
============

Made with love by `Aditya Raman <https://github.com/ramanaditya>`_ and `Garima Singh <https://github.com/grimmmyshini>`_!
13 changes: 13 additions & 0 deletions docs/source/future.rst
@@ -0,0 +1,13 @@
Future Iterations
=================

**Twitter + Statistics = Amazing information!**

That is why we want to keep improving. Future iterations of Twitstat will include (but are not limited to):

- A new and improved clustering algorithm to cluster data with higher fidelity

- Better insights into the data by geo-locating tweets and building heat maps

- Gists of each modelled topic for a quick look at what is most talked about in real time

9 changes: 8 additions & 1 deletion docs/source/index.rst
@@ -6,10 +6,17 @@
Welcome to twitstat's documentation!
====================================

Contents:

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   about
   scrape
   nlp
   future
   resources
   contributors


Indices and tables
35 changes: 35 additions & 0 deletions docs/source/nlp.rst
@@ -0,0 +1,35 @@
.. _nlp:

Analysis Module
===============

Twitstat uses three major modules to facilitate its data analysis:

- Preprocessing module
- Clustering module
- Sentiment analysis module


Preprocessing Module
--------------------

Before data can be loaded into any of the analyser functions, it has to be preprocessed, or *cleaned*. The preprocessing module removes unwanted text such as emoticons,
line breaks and punctuation from the tweets. Certain common words *(called stop-words)* are also removed because they add little meaning to the text. Finally, the remaining text is tokenized
*(split into individual words)* and *stemmed* *(reduced to word stems)*. These tasks are done with the help of `nltk's <https://github.com/nltk/nltk>`_ algorithms.
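
The cleaning steps described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not Twitstat's actual code: the ``STOP_WORDS`` set, ``naive_stem`` and ``preprocess`` are hypothetical names, the stop-word list is deliberately tiny, and the suffix stripper is only a crude stand-in for a real stemmer such as the ones nltk provides.

```python
import string

# Hypothetical, deliberately tiny stop-word list (nltk ships a far larger one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "this", "to"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer, not an equivalent."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Clean one tweet and return its list of stemmed tokens."""
    text = tweet.lower()
    # Drop emoji and other non-ASCII symbols.
    text = text.encode("ascii", "ignore").decode()
    # Replace punctuation (including ASCII emoticons such as ":)") with spaces.
    text = text.translate(_PUNCT_TO_SPACE)
    # split() with no arguments also swallows line breaks and repeated spaces.
    tokens = text.split()
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("Trending now:\nthe cats are jumping!! :)"))
```

The order matters in this sketch: stop-words are filtered before stemming so that the lookup is done on the unmodified words.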


Clustering Module
-----------------

Twitstat's clustering module uses `scikit-learn's <https://github.com/scikit-learn/scikit-learn>`_ `DBSCAN <https://scikit-learn.org/stable/modules/clustering.html#dbscan>`_
clustering algorithm to cluster tweets falling under the trending categories. **Density-based spatial clustering of applications with noise** *(DBSCAN)* is a density-based
clustering algorithm: given a set of points in some space, it groups together points that are closely packed. Points that lie in sparse regions are classified
as outliers.
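
To make the idea concrete, here is a small pure-Python sketch of the DBSCAN procedure. It is an illustration only and not scikit-learn's implementation. The 2-D points stand in for tweet feature vectors (how tweets are turned into vectors is not covered in this section), and the ``eps`` and ``min_samples`` values are arbitrary assumptions chosen for the toy data.

```python
import math

NOISE = -1  # label used for outliers


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is also a core point
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))
```

With scikit-learn, the equivalent call would be ``DBSCAN(eps=1.5, min_samples=3).fit_predict(points)``, which also labels outliers as ``-1``.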


Sentiment Analysis Module
-------------------------

Finally, after the tweets are split into clusters, the most popular tweet of each cluster is identified. These *popular* tweets are then fed into `TextBlob's <https://github.com/sloria/TextBlob>`_
sentiment analysis module, which decides the tone (positive, negative or neutral) of each tweet.
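
The selection and labelling steps might look roughly like the sketch below. Several details are assumptions rather than facts about Twitstat: popularity is taken to be likes plus retweets, outliers (label ``-1``) are skipped, and the ``threshold`` used to call a tweet neutral is arbitrary. The function names and the tweet dictionary layout are hypothetical as well.

```python
def most_popular_per_cluster(tweets, labels):
    """Return {cluster_label: tweet}, keeping the highest-scoring tweet per cluster."""
    best = {}
    for tweet, label in zip(tweets, labels):
        if label == -1:  # assumption: outliers are not analysed
            continue
        score = tweet["likes"] + tweet["retweets"]  # assumed popularity measure
        if label not in best or score > best[label][0]:
            best[label] = (score, tweet)
    return {label: tweet for label, (_, tweet) in best.items()}


def tone(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a tone label (threshold is an assumption)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


tweets = [
    {"text": "meh", "likes": 1, "retweets": 0},
    {"text": "love it", "likes": 5, "retweets": 2},
    {"text": "unrelated", "likes": 9, "retweets": 9},
]
popular = most_popular_per_cluster(tweets, labels=[0, 0, -1])
print(popular[0]["text"])
```

With TextBlob installed, the polarity passed to ``tone`` can be obtained from ``TextBlob(tweet["text"]).sentiment.polarity``, which is a float between -1 and 1.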

12 changes: 12 additions & 0 deletions docs/source/resources.rst
@@ -0,0 +1,12 @@
Resources and References
========================

Twitstat and this documentation would not have been possible without these amazing resources!

- `Scikit-learn clustering documentation <https://scikit-learn.org/stable/modules/clustering.html>`_
- `Tweepy documentation <http://docs.tweepy.org/en/latest/>`_
- `This insightful paper! <https://github.com/heerme/twitter-topics/blob/master/insight-snow14dc-final.pdf>`_
- `'Text Mining and Clustering of Tweets Based on Context' by Toly Novik <https://www.dezyre.com/student-project/toly-novik-text-mining-and-clustering-of-tweets-based-on-context/2>`_
- `Tutorial on Scikit-learn TF-IDF with nltk preprocessing <https://www.bogotobogo.com/python/NLTK/tf_idf_with_scikit-learn_NLTK.php>`_

`And all the amazing open source software! <https://github.com/MLH-Fellowship/twitstat/blob/main/requirements/base.txt>`_
9 changes: 9 additions & 0 deletions docs/source/scrape.rst
@@ -0,0 +1,9 @@
.. _scrape:

Scraping Module
===============

Twitstat uses `tweepy <https://github.com/tweepy/tweepy>`_, a Python library for the Twitter API, to get all the tweets for the analysis.
Tweepy is first used to fetch the trending topics around a specified geographical location; these topics are then
fed into the API's search method. The search method returns the tweets (and other important information such as
the likes and retweets for each tweet) matching the search query.
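
The two-step flow described above could be sketched as below. This is not Twitstat's actual code. It assumes an already authenticated ``tweepy.API`` object from tweepy 3.x, where the relevant methods are ``trends_place`` and ``search`` (tweepy 4 renamed them to ``get_place_trends`` and ``search_tweets``). The function name, the ``per_topic`` parameter and the returned dictionary layout are hypothetical.

```python
def fetch_trending_tweets(api, woeid, per_topic=100):
    """Return {topic_name: [tweet dicts]} for the topics trending at a location.

    api   -- an authenticated tweepy.API instance (tweepy 3.x assumed)
    woeid -- the "Where On Earth ID" of the location to query
    """
    # Step 1: trending topics around the given location.
    trends = api.trends_place(woeid)[0]["trends"]

    results = {}
    for trend in trends:
        # Step 2: feed each topic's query string into the search method.
        statuses = api.search(q=trend["query"], count=per_topic)
        results[trend["name"]] = [
            {
                "text": status.text,
                "likes": status.favorite_count,
                "retweets": status.retweet_count,
            }
            for status in statuses
        ]
    return results
```

Passing the ``api`` object in as a parameter keeps credentials out of the function and makes it easy to substitute a stub when testing without network access.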