Data manipulation and feature extraction with Spark; network calculations with Neo4j.
Setting the MongoDB connection and paths.
Setting up the Spark and Neo4j connections, importing and exporting data, and handling preprocessing and normalization.
Calculating Twitter user features by manipulating and aggregating data with Spark methods.
Running ETL methods for Neo4j and calculating graph centrality measures with Cypher queries.
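The Neo4j step above can be sketched roughly as follows. This is only an illustration, not the module's actual queries: the "User" label, the "RELATES_TO" relationship type and the "id" node property are hypothetical names, and the algo.* procedure names are those of the legacy "Graph Algorithms" plug-in, which may differ between plug-in versions.

```python
# Hypothetical graph model: (:User {id})-[:RELATES_TO]->(:User {id}).
# Label, relationship type and property name are assumptions for illustration.
LOAD_EDGES_QUERY = """
UNWIND $edges AS edge
MERGE (s:User {id: edge.Source})
MERGE (t:User {id: edge.Target})
MERGE (s)-[:RELATES_TO]->(t)
"""

# Streaming procedures of the legacy Graph Algorithms plug-in and the
# name of the score column each one yields (may vary by plug-in version).
CENTRALITY_PROCEDURES = {
    "pageRank": ("algo.pageRank.stream", "score"),
    "degreeCentrality": ("algo.degree.stream", "score"),
    "betweennessCentrality": ("algo.betweenness.stream", "centrality"),
    "closenessCentrality": ("algo.closeness.stream", "centrality"),
}


def centrality_query(measure, label="User", rel_type="RELATES_TO"):
    """Build a Cypher query that streams one centrality measure per node."""
    procedure, column = CENTRALITY_PROCEDURES[measure]
    return (
        "CALL {proc}('{label}', '{rel}', {{}}) "
        "YIELD nodeId, {col} "
        "MATCH (n) WHERE id(n) = nodeId "
        "RETURN n.id AS id, {col} AS value"
    ).format(proc=procedure, label=label, rel=rel_type, col=column)


def run_query(driver, query, **params):
    """Run a query on an open neo4j driver and return the rows as dicts."""
    with driver.session() as session:
        return [record.data() for record in session.run(query, **params)]
```

With an open driver, `run_query(driver, LOAD_EDGES_QUERY, edges=rows)` would load a list of `{"Source": ..., "Target": ...}` rows, and `run_query(driver, centrality_query("pageRank"))` would return one row per node.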
- install pyspark --> pip install pyspark
- install mongo-spark-connector --> pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
  or download it with its dependencies: https://jar-download.com/artifacts/org.mongodb.spark/mongo-spark-connector_2.11/2.4.1/source-code
- install neo4j-desktop --> https://neo4j.com/download/
- create an empty Neo4j graph and install the plug-ins "Graph Algorithms" and "APOC"; set the auth key to "12345678"
- install neo4j-driver --> pip install neo4j-driver
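Once the packages above are installed, the connections could be set up along these lines. This is a minimal sketch, not the module's own code: the localhost hosts, default ports, the "twitter" database and "tweets" collection names, the application name and the "neo4j" user name are all assumptions; only the connector coordinates and the "12345678" auth key come from the steps above.

```python
# Connector coordinates from the install step above.
MONGO_CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1"


def build_spark_conf(mongo_host="localhost", mongo_port=27017,
                     database="twitter", collection="tweets"):
    """Return the Spark settings read by mongo-spark-connector 2.4.x.

    Host, port, database and collection defaults are assumptions.
    """
    uri = "mongodb://{}:{}/{}.{}".format(mongo_host, mongo_port, database, collection)
    return {
        "spark.jars.packages": MONGO_CONNECTOR,
        "spark.mongodb.input.uri": uri,
        "spark.mongodb.output.uri": uri,
    }


def create_spark_session(conf, app_name="twitter-features"):
    """Build a SparkSession from the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helpers above work without pyspark
    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def create_neo4j_driver(uri="bolt://localhost:7687", user="neo4j", password="12345678"):
    """Open a Bolt driver to the local graph (requires neo4j-driver).

    The URI and user name are assumed defaults; the password is the auth key set above.
    """
    from neo4j import GraphDatabase  # imported lazily
    return GraphDatabase.driver(uri, auth=(user, password))
```

A session built this way should be able to read the collection with `spark.read.format("com.mongodb.spark.sql.DefaultSource").load()`, provided the connector jar resolves.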
- the tweets database should be labeled with a "category" column.
- the tweets database should also contain the "id", "user.id", "user.screen_name" and "created_at" columns.
- the user database should contain the "id" and "screen_name" columns.
- the edges file is a JSON file with the columns "Source" and "Target"; each row defines a relationship between the two.
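The input requirements above can be checked before running the pipeline. The helper names below are hypothetical, and the sketch assumes each record is available as a nested Python dict (for example, a document read from MongoDB or a parsed JSON row).

```python
# Column lists taken from the requirements above.
REQUIRED_TWEET_COLUMNS = ["category", "id", "user.id", "user.screen_name", "created_at"]
REQUIRED_USER_COLUMNS = ["id", "screen_name"]
REQUIRED_EDGE_COLUMNS = ["Source", "Target"]


def has_path(record, dotted_name):
    """Return True if a nested dict contains the dotted path, e.g. 'user.id'."""
    current = record
    for part in dotted_name.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return True


def missing_columns(record, required):
    """List the required columns that are absent from one record."""
    return [name for name in required if not has_path(record, name)]


# Example record (made-up values) that lacks "user.screen_name".
tweet = {
    "id": "1",
    "category": "category1",
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "user": {"id": "42"},
}
print(missing_columns(tweet, REQUIRED_TWEET_COLUMNS))  # ['user.screen_name']
```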
root
|-- id: string (nullable = true)
|-- user_features: struct (nullable = true)
| |-- dict_activeness_1: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
| |-- dict_activeness_2: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
| |-- dict_activeness_3: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
| |-- dict_days_posted_by_topic: struct (nullable = true)
| | |-- category1: long (nullable = true)
| | |-- category2: long (nullable = true)
| |-- dict_focus_rate: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
| |-- dict_tweet_by_topic: struct (nullable = true)
| | |-- category1: long (nullable = true)
| | |-- category2: long (nullable = true)
| |-- tweets_total: long (nullable = true)
|-- centralities: struct (nullable = true)
| |-- betweennessCentrality: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
| |-- closenessCentrality: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
| |-- degreeCentrality: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
| |-- pageRank: struct (nullable = true)
| | |-- category1: double (nullable = true)
| | |-- category2: double (nullable = true)
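To make the count-based fields of the schema above concrete, here is a plain-Python sketch for a single user's tweets. It reads "dict_tweet_by_topic" as tweets per category, "dict_days_posted_by_topic" as distinct posting days per category, and "tweets_total" as the overall count; that reading of the field names, and the "created_at" date layout, are assumptions. The module itself performs its aggregation with Spark, so this only illustrates the intended result.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Assumed layout of "created_at", e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def user_features(tweets):
    """Aggregate one user's tweets into count-based features."""
    tweet_by_topic = Counter()
    days_by_topic = defaultdict(set)
    for tweet in tweets:
        category = tweet["category"]
        day = datetime.strptime(tweet["created_at"], TWITTER_DATE_FORMAT).date()
        tweet_by_topic[category] += 1
        days_by_topic[category].add(day)
    return {
        "dict_tweet_by_topic": dict(tweet_by_topic),
        "dict_days_posted_by_topic": {c: len(d) for c, d in days_by_topic.items()},
        "tweets_total": sum(tweet_by_topic.values()),
    }


# Made-up sample: two tweets on the same day in category1, one in category2.
sample = [
    {"category": "category1", "created_at": "Wed Oct 10 20:19:24 +0000 2018"},
    {"category": "category1", "created_at": "Wed Oct 10 21:00:00 +0000 2018"},
    {"category": "category2", "created_at": "Thu Oct 11 08:00:00 +0000 2018"},
]
features = user_features(sample)
```

For this sample, `features["tweets_total"]` is 3 and category1 has two tweets posted on one distinct day.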