Merge pull request #12 from grimmmyshini/cluster-dev

Update documentation
MLH-Fellowship · Oct 12, 2020 · 4579a32
2 parents e7f86a4 + 34e3eb6
commit 4579a32

Showing 8 changed files with 94 additions and 1 deletion.
diff --git a/README.rst b/README.rst
@@ -31,6 +31,8 @@ TwitStat
 :License: MIT
 :Pycharm: Yes
 
+Twitstat is a simple web application that analyses Twitter data to provide insights into trending hashtags
+and topics. It clusters and charts the data to make trends around the world easier to understand.
 
 Basic Project Structure
 -----------------------

diff --git a/docs/source/about.rst b/docs/source/about.rst
@@ -0,0 +1,11 @@
+About Twitstat
+==============
+
+Twitstat is a simple web application that analyses Twitter data to provide insights into trending hashtags
+and topics. It clusters and charts the data to make trends around the world easier to understand.
+
+Twitstat is split into multiple modules:
+
+- :ref:`scrape`
+
+- :ref:`nlp`
diff --git a/docs/source/contributors.rst b/docs/source/contributors.rst
@@ -0,0 +1,4 @@
+Contributors
+============
+
+Made with love by `Aditya Raman <https://github.com/ramanaditya>`_ and `Garima Singh <https://github.com/grimmmyshini>`_!
diff --git a/docs/source/future.rst b/docs/source/future.rst
@@ -0,0 +1,13 @@
+Future Iterations
+=================
+
+**Twitter + Statistics = Amazing information!**
+
+And that is why we want to keep improving. Future iterations of Twitstat will include (but are not limited to):
+
+- A new and improved clustering algorithm to cluster data with higher fidelity
+
+- Better insights into the data from geo-locating tweets and building heat maps
+
+- Gists of each modelled topic for a quick look at what is most talked about in real time
+
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -6,10 +6,17 @@
 Welcome to twitstat's documentation!
 ====================================
 
+Contents:
+
 .. toctree::
    :maxdepth: 2
-   :caption: Contents:
 
+   about
+   scrape
+   nlp
+   future
+   resources
+   contributors
 
 
 Indices and tables

diff --git a/docs/source/nlp.rst b/docs/source/nlp.rst
@@ -0,0 +1,35 @@
+.. _nlp:
+
+Analysis Module
+===============
+
+Twitstat uses three major modules to facilitate its data analysis:
+
+- Preprocessing module
+- Clustering module
+- Sentiment analysis module
+
+
+Preprocessing Module
+--------------------
+
+Before data can be loaded into any of the analyser functions, it has to be preprocessed, or *cleaned*. The preprocessing module removes unwanted text such as emoticons,
+line breaks and punctuation from the tweets. Certain common words *(called stop-words)* are also removed, as they add little meaning to the text. Finally, the text is tokenized
+*(split into individual words)* and each word is *stemmed* *(reduced to its root form)*. These tasks are done with the help of `nltk's <https://github.com/nltk/nltk>`_ algorithms.
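The cleaning steps described in this paragraph could look roughly like the sketch below. It is not Twitstat's actual code: the `preprocess` name, the URL-stripping step and the tiny hard-coded `STOP_WORDS` set are assumptions made for illustration (a real pipeline would more likely load nltk's stop-word corpus, which needs a separate download). Only `RegexpTokenizer` and `PorterStemmer` come from nltk.

```python
import re

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Illustrative stop-word list (an assumption); nltk ships a fuller corpus.
STOP_WORDS = {"a", "an", "and", "are", "is", "it", "of", "the", "to"}

_URL = re.compile(r"https?://\S+")
# Keeping only runs of letters drops emoticons, punctuation and line breaks.
_tokenizer = RegexpTokenizer(r"[a-z]+")
_stemmer = PorterStemmer()


def preprocess(tweet):
    """Clean one tweet and return its stemmed, stop-word-free tokens."""
    text = _URL.sub(" ", tweet.lower())
    tokens = _tokenizer.tokenize(text)
    return [_stemmer.stem(token) for token in tokens if token not in STOP_WORDS]
```

Calling `preprocess` on a raw tweet returns a list of stems that can be joined back into a string before vectorising.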
+
+
+Clustering Module
+-----------------
+
+Twitstat's clustering module uses `scikit-learn's <https://github.com/scikit-learn/scikit-learn>`_ `DBSCAN <https://scikit-learn.org/stable/modules/clustering.html#dbscan>`_
+clustering algorithm to cluster tweets falling under the trending categories. **Density-based spatial clustering of applications with noise** *(DBSCAN)* is a density-based
+clustering algorithm: given a set of points in some space, it groups together points that are closely packed. Points that lie in sparse, low-density regions are classified
+as outliers.
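As a rough illustration of density-based clustering on text, the sketch below turns tweets into TF-IDF vectors and runs scikit-learn's `DBSCAN` over them. The `cluster_tweets` helper, the TF-IDF vectorisation, the cosine metric and the `eps` and `min_samples` values are all assumptions chosen for the example; the source does not say how Twitstat vectorises tweets or which parameters it uses.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_tweets(texts, eps=0.5, min_samples=2):
    """Return one cluster label per tweet; -1 marks an outlier."""
    # Represent each tweet as a TF-IDF vector (an assumed choice).
    vectors = TfidfVectorizer().fit_transform(texts)
    # Tweets within cosine distance `eps` of each other count as neighbours.
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return model.fit_predict(vectors).tolist()
```

Tweets that share most of their vocabulary end up with the same non-negative label, while a tweet with no close neighbours is labelled `-1`.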
+
+
+Sentiment Analysis Module
+-------------------------
+
+Finally, after splitting tweets into clusters, the most popular tweet of each cluster is identified. These *popular* tweets are then fed into `TextBlob's <https://github.com/sloria/TextBlob>`_
+sentiment analysis module, where the tone (positive, negative or neutral) of each tweet is determined.
+
diff --git a/docs/source/resources.rst b/docs/source/resources.rst
@@ -0,0 +1,12 @@
+Resources and References
+========================
+
+Twitstat and this documentation would not have been possible without these amazing resources!
+
+- `Scikit-learn clustering documentation <https://scikit-learn.org/stable/modules/clustering.html>`_
+- `Tweepy documentation <http://docs.tweepy.org/en/latest/>`_
+- `This insightful paper! <https://github.com/heerme/twitter-topics/blob/master/insight-snow14dc-final.pdf>`_
+- `'Text Mining and Clustering of Tweets Based on Context' by Toly Novik <https://www.dezyre.com/student-project/toly-novik-text-mining-and-clustering-of-tweets-based-on-context/2>`_
+- `Tutorial on Scikit-learn TF-IDF with nltk preprocessing <https://www.bogotobogo.com/python/NLTK/tf_idf_with_scikit-learn_NLTK.php>`_
+
+`And all the amazing open source software! <https://github.com/MLH-Fellowship/twitstat/blob/main/requirements/base.txt>`_
diff --git a/docs/source/scrape.rst b/docs/source/scrape.rst
@@ -0,0 +1,9 @@
+.. _scrape:
+
+Scraping Module
+===============
+
+Twitstat uses `tweepy <https://github.com/tweepy/tweepy>`_, a Python client library for Twitter's API, to get all the tweets for the analysis.
+Tweepy is first used to fetch the trending topics around a specified geographical location. The fetched topics are then
+fed into the API's search method, which returns the tweets (and other important information such as
+the likes and retweets for each tweet) corresponding to each search query.
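The two-step flow described here (trends first, then a search per trend) might be sketched as follows. The `fetch_trending_tweets` helper and its return shape are hypothetical. The method names `trends_place` and `search` are those of tweepy 3.x, current when this commit was made; tweepy 4.x renamed them to `get_place_trends` and `search_tweets`. The location is given as a WOEID, and `api` is assumed to be an authenticated `tweepy.API` instance.

```python
def fetch_trending_tweets(api, woeid, per_topic=100):
    """Return {topic name: [tweet dicts]} for topics trending at a WOEID."""
    # trends_place returns a one-element list; its dict holds a "trends" list.
    trends = api.trends_place(woeid)[0]["trends"]
    results = {}
    for trend in trends:
        statuses = api.search(q=trend["name"], count=per_topic)
        results[trend["name"]] = [
            {
                "text": status.text,
                "likes": status.favorite_count,
                "retweets": status.retweet_count,
            }
            for status in statuses
        ]
    return results


# Assumed setup with placeholder credentials (not run here):
#   auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#   auth.set_access_token(access_token, access_token_secret)
#   tweets_by_topic = fetch_trending_tweets(tweepy.API(auth), woeid=1)
```

Passing the client in as a parameter keeps the function easy to exercise with a stub object, so the flow can be checked without real credentials.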