
Processing IT Recruitment Data in HDFS Cluster, Spark, Elasticsearch and Kibana, deployed by Docker compose


tienlonghungson/BigData-HDFS-Spark-Elasticsearch-Kibana

Latest commit: cdc6c76 · Jan 13, 2022 (91 commits)


Recruitment Insight

Brief Introduction

In this project, we aim to gain insight into the labor market using data crawled from a recruitment website. We focus on the demand for IT jobs on TopCV.vn.

Our system is designed as follows:

architecture

Crawled data is first uploaded to the HDFS cluster. The Spark cluster reads that data and extracts information about frameworks, platforms, design patterns, programming languages, knowledge areas and salaries. The extracted data is then saved back to the HDFS cluster (for storage) and to the Elasticsearch cluster (for visualization in Kibana).

Here is an example of a Kibana visualization of salary ranges:
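
The extraction step can be illustrated with a minimal pure-Python sketch. It is not the project's actual code: the keyword lists, the category names and the function `extract_keywords` are hypothetical, and the real job presumably applies similar logic to Spark data rather than to single strings.

```python
import re

# Hypothetical keyword lists, for illustration only; the project's real
# lists are defined in its own source code.
KEYWORDS = {
    "programming_languages": ["Python", "Java", "JavaScript", "C++"],
    "frameworks": ["Spring", "Django", "React"],
}


def _pattern(keyword):
    # Match the keyword as a whole token, case-insensitively, so that
    # "Java" is not reported for a posting that only mentions "JavaScript".
    return re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9+#])",
        re.IGNORECASE,
    )


def extract_keywords(text, keywords=KEYWORDS):
    """Return {category: [keywords found in text]}, keeping list order."""
    return {
        category: [kw for kw in words if _pattern(kw).search(text)]
        for category, words in keywords.items()
    }


if __name__ == "__main__":
    sample = "We need a JavaScript developer with React and C++ experience."
    print(extract_keywords(sample))
```

A function like this could be wrapped in a Spark UDF or applied per record; which approach the project uses is not stated here.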

sal_vis

See the report directory for the full report and slides.

Data Preparation

Workflow

We first create two directories in the HDFS cluster: /data/rawdata and /data/extracteddata. The crawled data is uploaded to /data/rawdata.
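
As a rough sketch, the directory setup and upload could be scripted as below. It assumes the standard `hdfs dfs` command-line tool is available where the script runs (for example inside a Hadoop container); the file name `jobs.json` and the helper names are hypothetical. By default the commands are only printed, not executed.

```python
import subprocess

RAW_DIR = "/data/rawdata"
EXTRACTED_DIR = "/data/extracteddata"


def hdfs_setup_commands(local_file):
    """Build the hdfs CLI calls that create both directories and upload one file."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", RAW_DIR],
        ["hdfs", "dfs", "-mkdir", "-p", EXTRACTED_DIR],
        # -f overwrites the target if it already exists
        ["hdfs", "dfs", "-put", "-f", local_file, RAW_DIR],
    ]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "jobs.json" is a placeholder name for the crawled data file.
    run_all(hdfs_setup_commands("jobs.json"))
```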

Run the script run.sh, which starts Docker Compose and uploads the src folder and the jar files to spark-master:

/bin/bash run.sh

Open a shell inside the spark-master container:

docker exec -it spark-master /bin/bash

Submit the Spark job:

spark/bin/spark-submit --master spark://spark-master:7077 --jars elasticsearch-hadoop-7.15.1.jar --driver-class-path elasticsearch-hadoop-7.15.1.jar src/main.py 
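
The elasticsearch-hadoop jar passed to spark-submit is what lets a Spark DataFrame be written to Elasticsearch. The sketch below shows one way such a write might be configured. It is an assumption, not the contents of src/main.py: the host name `elasticsearch`, the index name `recruitment` and both helper functions are hypothetical, while `es.nodes`, `es.port` and `es.resource` are standard connector settings.

```python
def es_write_options(host="elasticsearch", port=9200, index="recruitment"):
    """Return connector settings as strings, ready to pass to .options()."""
    return {
        "es.nodes": host,        # Elasticsearch host reachable from Spark
        "es.port": str(port),    # HTTP port of the Elasticsearch node
        "es.resource": index,    # target index for the documents
    }


def write_to_es(df, options):
    """Append a pyspark.sql.DataFrame to Elasticsearch via the connector."""
    (
        df.write.format("org.elasticsearch.spark.sql")
        .options(**options)
        .mode("append")
        .save()
    )


if __name__ == "__main__":
    print(es_write_options())
```

Inside a Spark job, `write_to_es(extracted_df, es_write_options())` would send the rows of a hypothetical `extracted_df` to the chosen index.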

Requirements:

  • At least 8 GB of RAM (with only 8 GB, Kibana needs to be left out)
  • At least 4.5 GB of RAM allocated to vmmem

Acknowledgement

This project could not have been completed without the help of our friend and advisor Quan Nguyen, and of our brother Xuan Nam, who gave us the idea.

The following sources are helpful:
