Preprocessing Pipeline with MongoDB - Spark - Neo4j

Data manipulation and feature extraction with Spark, network calculations with Neo4j.

Main.py

Setting MongoDB connection and paths.

RunCalculators.py

Setting Spark and Neo4j connections, importing and exporting data, handling preprocessing and normalization.

UserFeatures.py

Calculating Twitter user features by manipulating and aggregating data with Spark methods.
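To make the aggregation concrete, here is a plain-Python reference for three of the output fields. It is only a sketch and not the Spark code in UserFeatures.py. It assumes that tweets_total counts all tweets of a user, dict_tweet_by_topic counts tweets per category, and dict_days_posted_by_topic counts distinct calendar days with at least one tweet per category. These meanings are inferred from the field names. The "created_at" format is assumed to be the classic Twitter API format, and the function name aggregate_user_features is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: "created_at" uses the classic Twitter API timestamp format.
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def aggregate_user_features(tweets):
    """Aggregate per-user tweet counts and distinct posting days by category."""
    counts = defaultdict(lambda: defaultdict(int))
    days = defaultdict(lambda: defaultdict(set))
    for tweet in tweets:
        user_id = str(tweet["user"]["id"])
        category = tweet["category"]
        day = datetime.strptime(tweet["created_at"], TWITTER_TIME).date()
        counts[user_id][category] += 1
        days[user_id][category].add(day)

    features = {}
    for user_id, by_topic in counts.items():
        features[user_id] = {
            "tweets_total": sum(by_topic.values()),
            "dict_tweet_by_topic": dict(by_topic),
            "dict_days_posted_by_topic": {
                category: len(day_set)
                for category, day_set in days[user_id].items()
            },
        }
    return features


# Made-up sample tweets, for illustration only.
tweets = [
    {"user": {"id": 1}, "category": "category1",
     "created_at": "Mon Apr 06 22:19:45 +0000 2009"},
    {"user": {"id": 1}, "category": "category1",
     "created_at": "Mon Apr 06 23:01:10 +0000 2009"},
    {"user": {"id": 1}, "category": "category2",
     "created_at": "Tue Apr 07 08:00:00 +0000 2009"},
]
features = aggregate_user_features(tweets)
```

In Spark the same result would typically come from groupBy and aggregate calls on the tweets DataFrame. The loop above only shows what the numbers mean under the stated assumptions.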

CentralityMeasures.py

Running ETL methods for Neo4j and calculating graph centrality measures with Cypher queries.

PREREQUISITES FOR SPARK

install pyspark --> pip install pyspark
install mongo-spark-connector --> pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
or download with dependencies https://jar-download.com/artifacts/org.mongodb.spark/mongo-spark-connector_2.11/2.4.1/source-code
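As a minimal sketch, the connector can also be attached from Python instead of the pyspark command line. The MongoDB address, the database name "twitter", the collection name "tweets", and both function names are assumptions for illustration. The configuration keys are the ones used by the 2.x series of mongo-spark-connector.

```python
def build_spark_conf(mongo_uri, database, collection,
                     connector="org.mongodb.spark:mongo-spark-connector_2.11:2.4.1"):
    """Return Spark settings for reading and writing one MongoDB collection."""
    full_uri = f"{mongo_uri.rstrip('/')}/{database}.{collection}"
    return {
        "spark.jars.packages": connector,      # fetched from Maven at startup
        "spark.mongodb.input.uri": full_uri,   # default source collection
        "spark.mongodb.output.uri": full_uri,  # default target collection
    }


def create_session(conf, app_name="preprocessing-pipeline"):
    """Create a SparkSession from the settings above (requires pyspark)."""
    # Imported lazily so build_spark_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical local MongoDB instance, database, and collection.
conf = build_spark_conf("mongodb://127.0.0.1:27017", "twitter", "tweets")
```

With a session from create_session(conf), the collection could then be loaded with spark.read.format("mongo").load(). Note that the _2.11 artifact targets Spark builds on Scala 2.11 (Spark 2.4.x), so the installed pyspark version may need to match.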

PREREQUISITES FOR NEO4J

install neo4j-desktop --> https://neo4j.com/download/
create an empty Neo4j graph, install the plug-ins "Graph Algorithms" and "APOC", and set the auth key to "12345678"
pip install neo4j-driver
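A quick way to confirm the setup is to open a driver connection and run a trivial query. This is a sketch, not the connection code in RunCalculators.py. The Bolt address "bolt://localhost:7687" and the user name "neo4j" are assumed Neo4j Desktop defaults, and both function names are hypothetical. The import path below matches recent driver releases; older 1.x releases expose GraphDatabase under neo4j.v1 instead.

```python
def neo4j_settings(uri="bolt://localhost:7687", user="neo4j", password="12345678"):
    """Bundle the connection settings; URI and user name are assumed defaults."""
    return {"uri": uri, "auth": (user, password)}


def check_connection(settings):
    """Return True if the database answers a trivial Cypher query."""
    # Imported lazily so neo4j_settings works without the driver installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(settings["uri"], auth=settings["auth"])
    try:
        with driver.session() as session:
            return session.run("RETURN 1 AS ok").single()["ok"] == 1
    finally:
        driver.close()


settings = neo4j_settings()
```

Calling check_connection(settings) with the graph running should return True. An authentication error usually means the auth key differs from the one set above.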

STRUCTURES OF DATABASES

Every record in the tweets database should be labeled through a "category" column.
The tweets database should also contain the columns "id", "user.id", "user.screen_name" and "created_at".
The user database should contain the columns "id" and "screen_name".
The edges file is a JSON file with the columns "Source" and "Target"; each row defines a relationship between the two.
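The requirements above can be checked before the pipeline runs. The sketch below assumes that the edges file holds one JSON object per line (the layout Spark reads by default) and that dotted names such as "user.id" refer to nested fields. Both are assumptions, and the helper names are hypothetical.

```python
import json

TWEET_FIELDS = ["id", "category", "user.id", "user.screen_name", "created_at"]


def load_edges(lines):
    """Parse JSON-lines text into (source, target) pairs, skipping blank lines."""
    edges = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = {"Source", "Target"}.difference(record)
        if missing:
            raise ValueError(f"line {number}: missing {sorted(missing)}")
        edges.append((str(record["Source"]), str(record["Target"])))
    return edges


def get_path(document, dotted):
    """Follow a dotted path such as "user.id" through nested dicts."""
    current = document
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def missing_tweet_fields(tweet):
    """List the required tweet fields that are absent from one record."""
    return [field for field in TWEET_FIELDS if get_path(tweet, field) is None]


# Made-up sample input, for illustration only.
sample_lines = ['{"Source": "1", "Target": "2"}', "", '{"Source": "2", "Target": "3"}']
edges = load_edges(sample_lines)
incomplete = {"id": "10", "category": "category1",
              "user": {"id": "1"}, "created_at": "Mon Apr 06 22:19:45 +0000 2009"}
```

Here missing_tweet_fields(incomplete) reports "user.screen_name", the only required field the sample record lacks.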

STRUCTURE OF OUTPUT JSON

root  
 |-- id: string (nullable = true)  
 |-- user_features: struct (nullable = true)  
 |    |-- dict_activeness_1: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_activeness_2: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_activeness_3: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_days_posted_by_topic: struct (nullable = true)  
 |    |    |-- category1: long (nullable = true)  
 |    |    |-- category2: long (nullable = true)  
 |    |-- dict_focus_rate: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- dict_tweet_by_topic: struct (nullable = true)  
 |    |    |-- category1: long (nullable = true)  
 |    |    |-- category2: long (nullable = true)  
 |    |-- tweets_total: long (nullable = true)  
 |-- centralities: struct (nullable = true)  
 |    |-- betweennessCentrality: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- closenessCentrality: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- degreeCentrality: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)  
 |    |-- pageRank: struct (nullable = true)  
 |    |    |-- category1: double (nullable = true)  
 |    |    |-- category2: double (nullable = true)
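Because every measure is nested per category, downstream code often needs flat column names. The following sketch flattens one output record into dotted keys. The sample record is abbreviated and its values are made up for illustration, and the function name flatten is hypothetical.

```python
def flatten(record, prefix=""):
    """Turn nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Abbreviated record following the schema above; the values are invented.
sample = {
    "id": "42",
    "user_features": {
        "dict_tweet_by_topic": {"category1": 3, "category2": 1},
        "tweets_total": 4,
    },
    "centralities": {
        "pageRank": {"category1": 0.15, "category2": 0.15},
    },
}
flat = flatten(sample)
```

After flattening, a value such as the category1 tweet count is available as flat["user_features.dict_tweet_by_topic.category1"].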
