Popcorn DB

Popcorn DB (http://popcorn-db.net) is a personal project that aims to recreate an IMDb-like website from scratch, with a machine learning layer on top of it. It includes the following features:

A fast & scalable web crawler.

I used Apache Spark for parallel computing and InfluxDB to log the crawler's activity live. To make use of CPU time that would otherwise sit idle while waiting for network responses, I configured Spark to create 8 times as many executors as there are CPU cores on each machine of the cluster.

A blazingly fast, custom-built search engine with fuzzy search and autocompletion.

The average query time over 100K movies is 0.03 ms. This speed comes from indexing every possible n-gram of each movie title. Fuzzy search is done by building and exploring Levenshtein automata on the fly.
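The two ideas above can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not the code from main.cpp or levenshtein.hpp: the names NgramIndex, LevenshteinAutomaton and fuzzy_match are hypothetical, titles are lowercased before indexing, and the automaton is simulated by carrying one row of the edit-distance table as its state rather than being compiled into an explicit state machine.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical index: maps every substring (n-gram of any length) of a
// lowercased title to the ids of the titles that contain it.
class NgramIndex {
public:
    // Adds a title and returns its id (its position in insertion order).
    std::size_t add(const std::string& title) {
        const std::size_t id = titles_.size();
        titles_.push_back(title);
        const std::string t = lower(title);
        for (std::size_t start = 0; start < t.size(); ++start) {
            for (std::size_t len = 1; start + len <= t.size(); ++len) {
                std::vector<std::size_t>& ids = index_[t.substr(start, len)];
                // Ids arrive in increasing order, so checking the last entry
                // is enough to avoid duplicates.
                if (ids.empty() || ids.back() != id) ids.push_back(id);
            }
        }
        return id;
    }

    // Returns the titles containing `query` as a substring: one hash lookup.
    std::vector<std::string> search(const std::string& query) const {
        std::vector<std::string> out;
        const auto it = index_.find(lower(query));
        if (it == index_.end()) return out;
        for (std::size_t id : it->second) out.push_back(titles_[id]);
        return out;
    }

private:
    static std::string lower(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        return s;
    }

    std::vector<std::string> titles_;
    std::unordered_map<std::string, std::vector<std::size_t>> index_;
};

// Simulated Levenshtein automaton for a fixed pattern. A state is one row of
// the edit-distance table; step() consumes one input character.
class LevenshteinAutomaton {
public:
    LevenshteinAutomaton(std::string pattern, std::size_t max_edits)
        : pattern_(std::move(pattern)), max_edits_(max_edits) {}

    std::vector<std::size_t> start() const {
        std::vector<std::size_t> row(pattern_.size() + 1);
        for (std::size_t i = 0; i < row.size(); ++i) row[i] = i;
        return row;
    }

    std::vector<std::size_t> step(const std::vector<std::size_t>& state, char c) const {
        std::vector<std::size_t> next(state.size());
        next[0] = state[0] + 1;
        for (std::size_t i = 1; i < state.size(); ++i) {
            const std::size_t cost = (pattern_[i - 1] == c) ? 0 : 1;
            next[i] = std::min({next[i - 1] + 1, state[i] + 1, state[i - 1] + cost});
        }
        return next;
    }

    // True if the input consumed so far is within max_edits of the pattern.
    bool is_match(const std::vector<std::size_t>& state) const {
        return state.back() <= max_edits_;
    }

    // True if some continuation of the input could still match.
    bool can_match(const std::vector<std::size_t>& state) const {
        return *std::min_element(state.begin(), state.end()) <= max_edits_;
    }

private:
    std::string pattern_;
    std::size_t max_edits_;
};

// Runs `word` through the automaton, stopping early once no match is possible.
bool fuzzy_match(const LevenshteinAutomaton& a, const std::string& word) {
    std::vector<std::size_t> state = a.start();
    for (char c : word) {
        state = a.step(state, c);
        if (!a.can_match(state)) return false;
    }
    return a.is_match(state);
}
```

After adding a few titles with add(), search("matr") returns every title containing that substring, and fuzzy_match(LevenshteinAutomaton("matrix", 1), "matrx") accepts the misspelling. The trade-off in this sketch is memory: storing every substring costs space quadratic in the title length, which stays manageable only because movie titles are short.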

A movie genre & nationality predictor

I used a naive Bayes approach, as experimentation showed it to be the machine learning model best suited to this case.
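For readers unfamiliar with the technique, here is a minimal sketch of a naive Bayes classifier in C++. It is an illustration only: the README does not say which features the predictor uses, so this sketch assumes bag-of-words tokens (for example words from a synopsis), a multinomial model and add-one (Laplace) smoothing. The class name NaiveBayes and its methods are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical multinomial naive Bayes over string tokens.
class NaiveBayes {
public:
    // Records one training example: its tokens and its label (e.g. a genre).
    void train(const std::vector<std::string>& tokens, const std::string& label) {
        ++doc_count_[label];
        ++total_docs_;
        std::map<std::string, std::size_t>& counts = token_count_[label];
        std::size_t& total = tokens_in_label_[label];
        for (const std::string& t : tokens) {
            ++counts[t];
            ++total;
            vocabulary_.insert(t);
        }
    }

    // Returns the label with the highest log-posterior, or "" if untrained.
    std::string predict(const std::vector<std::string>& tokens) const {
        std::string best;
        double best_score = -std::numeric_limits<double>::infinity();
        for (const auto& entry : doc_count_) {
            const std::string& label = entry.first;
            // Log prior: fraction of training examples carrying this label.
            double score = std::log(static_cast<double>(entry.second) /
                                    static_cast<double>(total_docs_));
            const std::map<std::string, std::size_t>& counts = token_count_.at(label);
            const double denom =
                static_cast<double>(tokens_in_label_.at(label) + vocabulary_.size());
            for (const std::string& t : tokens) {
                const auto it = counts.find(t);
                const double c = (it == counts.end()) ? 0.0 : static_cast<double>(it->second);
                // Add-one smoothing keeps unseen tokens from zeroing the product.
                score += std::log((c + 1.0) / denom);
            }
            if (score > best_score) {
                best_score = score;
                best = label;
            }
        }
        return best;
    }

private:
    std::map<std::string, std::size_t> doc_count_;
    std::map<std::string, std::map<std::string, std::size_t>> token_count_;
    std::map<std::string, std::size_t> tokens_in_label_;
    std::set<std::string> vocabulary_;
    std::size_t total_docs_ = 0;
};
```

Scores are summed in log space so that long token lists do not underflow. The same structure would serve for nationality prediction by training a second instance with nationality labels.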

A web-server, socket-server and front-end

The search engine and machine learning layers are written in C++, so I decided to build the web server into the same C++ program. No need for Apache or nginx: less overhead = more speed.

Presentation Slides

https://www.hutworks.net/PopcornValentinMercierFinalProjectSlides.pdf

Run

Install https://github.com/uNetworking/uWebSockets and its dependencies
mkdir tmp && cd tmp && cmake .. && make && cd ..
./AllocineBackend
Visit http://localhost:2200
Feel free to daemonize the backend with an init.d service, or to proxy or change the port to 80.

You will need to re-crawl the movies: I could not upload the database and the images to this git repository because GitHub enforces a file/repo size limit. You may also want to disable the InfluxDB logging built into the crawler if you do not want to install InfluxDB.

