
SS_Yab

Tasks

  • Sahamyab Crawler - NSQ Producer
  • Tweet Preprocessing - Cassandra DBMS
  • Elastic - Kibana dashboard - Redis
  • Flask dashboard
  • ML model
  • Clickhouse DBMS - Superset visualization

Prerequisites

NSQ
pynsq (pip package)
colorama (pip package)
requests (pip package)
openjdk-8
Cassandra
cassandra-driver (pip package)
hazm (pip package)
nltk (pip package)
elasticsearch (pip package)
redis (pip package)
wordcloudfa (pip package)
jwt (pip package)
psutil (pip package)
flask-login (pip package)
Docker

Installing

- NSQ

1- Download the latest NSQ binaries from the official site.
2- Extract the archive and add the bin folder to the system PATH variable.
3- Install the prerequisite libraries:

$ pip install pynsq colorama requests
$ pip install hazm
$ pip install https://github.com/sobhe/hazm/archive/master.zip --upgrade

- Preprocess

1- Install prerequisites libraries:

$ pip install hazm (this also installs nltk as a prerequisite)
$ pip install https://github.com/sobhe/hazm/archive/master.zip --upgrade

2- Download and extract the NLP prerequisite resource.zip into the project folder.
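The exact preprocessing lives in the project's scripts, but a minimal sketch of the kind of tweet cleanup applied before storage might look like the following (stdlib-only and illustrative; the real pipeline also uses hazm's Persian normalizer):

```python
import re

def clean_tweet(text):
    """Rough tweet cleanup: drop URLs, @mentions, and extra whitespace.

    Illustrative only -- the project's actual preprocessing additionally
    applies hazm's Normalizer for Persian-specific normalization.
    """
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove mentions
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    return text.strip()
```

For example, `clean_tweet("سلام @user https://t.co/x دنیا")` yields `"سلام دنیا"`.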

- Cassandra

1- Install jdk-8

$ sudo apt-get install openjdk-8-jdk
$ export JAVA_HOME=path_to_java_home

2- Install Cassandra:

$ echo "deb https://downloads.apache.org/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
$ curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install cassandra

3- Install Cassandra Python Driver (cassandra-driver):

$ pip install cassandra-driver
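A minimal connection sketch with cassandra-driver; the keyspace, table, and column names here are illustrative assumptions, not necessarily the schema the project uses:

```python
# Illustrative schema; the project's actual keyspace/table may differ.
KEYSPACE_CQL = (
    "CREATE KEYSPACE IF NOT EXISTS sahamyab "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
TABLE_CQL = (
    "CREATE TABLE IF NOT EXISTS sahamyab.tweets ("
    "id text PRIMARY KEY, sendTime text, senderName text, content text)"
)

def main():
    # Imported here so the CQL above can be inspected without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # the local node installed above
    session = cluster.connect()
    session.execute(KEYSPACE_CQL)
    session.execute(TABLE_CQL)
    session.execute(
        "INSERT INTO sahamyab.tweets (id, content) VALUES (%s, %s)",
        ("1", "sample tweet"),
    )
    cluster.shutdown()

# With Cassandra up (see Usage below), call main() to create the schema.
```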

- Elasticsearch & Kibana

1- Install Elasticsearch & Kibana (we are using version 7.8.0).

For Ubuntu, follow these steps:

  • Elasticsearch:
$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ sudo apt-get install apt-transport-https
$ echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
$ sudo apt-get update && sudo apt-get install elasticsearch
  • Kibana:
$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ sudo apt-get install apt-transport-https
$ echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
$ sudo apt-get update && sudo apt-get install kibana

2- Install Python Elasticsearch Client

$ python -m pip install elasticsearch
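A minimal indexing sketch with the Python client; the index name and document fields are illustrative assumptions, not necessarily what consumer.py indexes:

```python
def tweet_to_doc(tweet):
    """Map a raw tweet dict to the fields we index (illustrative schema)."""
    return {
        "id": tweet["id"],
        "content": tweet.get("content", ""),
        "sendTime": tweet.get("sendTime"),
    }

def main():
    # Imported here so tweet_to_doc stays usable without the client installed.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # defaults to localhost:9200
    tweet = {"id": "1", "content": "sample", "sendTime": "2020-08-16"}
    es.index(index="tweets", id=tweet["id"], body=tweet_to_doc(tweet))

# With Elasticsearch running, call main() to index one sample document.
```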

- Redis

1- Install Redis

$ sudo apt update
$ sudo apt install redis-server

2- In the config file, change supervised no to supervised systemd so Redis is managed by systemd and starts with the system.
Finally, restart the server:

$ sudo systemctl restart redis.service

3- Install redis-py:

$ pip install redis
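A quick sanity-check sketch with redis-py; the counter key naming is an illustrative assumption, not the project's actual key scheme:

```python
def symbol_key(symbol):
    """Redis key used to count mentions of a stock symbol (illustrative naming)."""
    return "symbol_count:" + symbol

def main():
    import redis  # imported here so symbol_key stays usable without redis-py

    r = redis.Redis(host="localhost", port=6379, db=0)
    r.incr(symbol_key("فولاد"))           # count one mention
    print(r.get(symbol_key("فولاد")))     # current count as bytes

# With the Redis server running, call main() to bump and read a counter.
```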

- Clickhouse

Create and run the ClickHouse container (the original command was missing the image name; yandex/clickhouse-server is the standard image):

$ docker network create -d bridge sahamyab
$ docker run -d -p 8123:8123 -p 9000:9000 --network="sahamyab" --name clickhouse --ulimit nofile=262144:262144 yandex/clickhouse-server
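Once the container is up, ClickHouse answers plain-text queries over its HTTP interface on port 8123; a quick connectivity check (using requests, already in the prerequisites):

```python
def http_url(host="localhost", port=8123):
    """URL of the ClickHouse HTTP interface exposed by the container."""
    return "http://{}:{}/".format(host, port)

def main():
    import requests  # already listed as a pip prerequisite

    # ClickHouse's HTTP interface accepts the query as the POST body
    # and returns the result as plain text.
    resp = requests.post(http_url(), data="SELECT 1")
    print(resp.text.strip())

# With the container running, call main(); it should print "1".
```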

- Apache Superset

Create and run the Superset container:

$ docker run --detach -p 8080:8088 --name superset --network="sahamyab" amancevice/superset

Usage

1- In one shell, start nsqlookupd:

$ nsqlookupd

2- In another shell, start nsqd:

$ nsqd --lookupd-tcp-address=127.0.0.1:4160

3- In another shell, start nsqadmin:

$ nsqadmin --lookupd-http-address=127.0.0.1:4161

4- Now run Sahamyab tweet crawler/producer:

$ python sahamyab_producer.py
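The producer polls Sahamyab's public tweet feed and publishes each tweet to NSQ. A stripped-down sketch of that loop, using nsqd's HTTP publish endpoint (port 4151); the topic name, API URL, and "items" response key are assumptions and may differ from sahamyab_producer.py:

```python
import json
import time

# Assumed Sahamyab endpoint (as commonly used by such crawlers circa 2020).
API_URL = "https://www.sahamyab.com/guest/twiter/list?v=0.1"

def pub_url(topic, host="127.0.0.1", port=4151):
    """nsqd's HTTP publish endpoint for a topic."""
    return "http://{}:{}/pub?topic={}".format(host, port, topic)

def main():
    import requests  # already a pip prerequisite

    headers = {"User-Agent": "Mozilla/5.0"}  # assumption: default clients may be rejected
    while True:
        tweets = requests.get(API_URL, headers=headers).json().get("items", [])
        for tweet in tweets:
            requests.post(pub_url("tweets"), data=json.dumps(tweet))
        time.sleep(5)  # poll interval, illustrative

# With nsqd running, call main() to start publishing.
```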

5- You can run an example program for consuming tweets:

$ python sahamyab_consumer_example.py
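A consumer can also be sketched directly with pynsq's Reader, discovering nsqd through the nsqlookupd HTTP address started above; the topic and channel names are assumptions:

```python
import json

def parse_tweet(body):
    """Decode one NSQ message body (a JSON-encoded tweet) into a dict."""
    return json.loads(body.decode("utf-8"))

def main():
    import nsq  # pynsq; imported here so parse_tweet stays importable without it

    def handler(message):
        tweet = parse_tweet(message.body)
        print(tweet.get("content", ""))
        return True  # marks the message as finished

    nsq.Reader(
        message_handler=handler,
        lookupd_http_addresses=["http://127.0.0.1:4161"],
        topic="tweets",    # assumption: must match the producer's topic
        channel="example",
    )
    nsq.run()

# With nsqlookupd/nsqd running, call main() to start consuming.
```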

To use consumer.py, you must first start the following services:
Cassandra:

$ sudo cassandra -R

Elasticsearch:

$ sudo /bin/systemctl daemon-reload
$ sudo /bin/systemctl enable elasticsearch.service
$ sudo systemctl start elasticsearch.service

Kibana:

$ sudo /bin/systemctl daemon-reload
$ sudo /bin/systemctl enable kibana.service
$ sudo systemctl start kibana.service

You also need to import dashboard.ndjson into Kibana (Saved Objects).

Redis:
If you applied the config change above, Redis should already be running; if not:

$ sudo systemctl start redis.service

Clickhouse:

$ python3 clickhouse_consumer.py

Superset:

Go to localhost:8080

Add ClickHouse under Sources -> Databases using the URI clickhouse://clickhouse, and add the sahamyab table under Sources -> Tables.

Go to Manage -> Import Dashboards and import superset_dashboard.json, found in the project's resources folder.

Flask dashboard:

$ cd flask_dashboard
$ sudo python3 run.py

License

This project is licensed under the GPLv2 - see the LICENSE.md file for details.