PopCorn DB (http://popcorn-db.net) is a personal project that aims to recreate an IMDb-like website from scratch, with a machine learning layer on top of it. It includes the following features:
For the crawler, I used Apache Spark for parallel computing and InfluxDB to log its activity live. To make use of the CPU time that would otherwise sit idle while waiting for network responses, I configured Spark to create 8 times as many executors as CPU cores on each machine of the cluster.
The average query time over 100K movies is 0.03 ms. This speed is obtained by indexing every possible n-gram of each movie title. Fuzzy search is done by building and exploring Levenshtein automata on the fly.
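The idea behind the n-gram index can be illustrated with a minimal sketch. This is not the project's actual code: the `NgramIndex` class, the `max_len` cap, and the choice of character n-grams are assumptions made for illustration. The `levenshtein` function is a plain dynamic-programming edit distance, swapped in here as a simpler stand-in for the Levenshtein automata mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical index: maps every character n-gram (substring) of a title,
// up to max_len characters, to the IDs of the movies whose title contains it.
// Assumes each movie is added exactly once.
class NgramIndex {
public:
    explicit NgramIndex(std::size_t max_len) : max_len_(max_len) {}

    void add(int movie_id, const std::string& title) {
        for (std::size_t start = 0; start < title.size(); ++start) {
            const std::size_t limit = std::min(max_len_, title.size() - start);
            for (std::size_t len = 1; len <= limit; ++len) {
                std::vector<int>& ids = index_[title.substr(start, len)];
                // Skip duplicates when an n-gram repeats inside one title.
                if (ids.empty() || ids.back() != movie_id) ids.push_back(movie_id);
            }
        }
    }

    // Exact lookup: a single hash-table probe on the query string.
    std::vector<int> find(const std::string& query) const {
        auto it = index_.find(query);
        return it == index_.end() ? std::vector<int>{} : it->second;
    }

private:
    std::size_t max_len_;
    std::unordered_map<std::string, std::vector<int>> index_;
};

// Classic dynamic-programming Levenshtein distance (two-row variant).
// Stand-in for the automaton-based fuzzy search; not the same technique.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

The trade-off in this sketch is memory for speed: storing every substring makes the index much larger than the titles themselves, but a query becomes a hash lookup rather than a scan over all movies.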
I used a naive Bayes network approach because, after experimentation, it appeared to be the machine learning model best suited to this case.
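For readers unfamiliar with naive Bayes, here is a minimal sketch of the general technique. The features and the prediction target are not described here, so everything in this example is an assumption: the `NaiveBayes` class, the use of movie tags (such as genres) as features, and the "liked" versus "not liked" classes are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical naive Bayes classifier: predicts "liked" vs "not liked"
// from a movie's tags (for example genres), with Laplace smoothing.
// Train it on at least one example before asking for predictions.
class NaiveBayes {
public:
    void train(const std::vector<std::string>& tags, bool liked) {
        ++movie_count_[liked];
        for (const std::string& tag : tags) {
            ++tag_count_[liked][tag];
            ++total_tags_[liked];
            vocabulary_.insert(tag);
        }
    }

    // log P(class) + sum of log P(tag | class), i.e. the log posterior
    // up to a normalizing constant shared by both classes.
    double log_score(const std::vector<std::string>& tags, bool liked) const {
        const double total_movies =
            static_cast<double>(movie_count_[0] + movie_count_[1]);
        double score = std::log((movie_count_[liked] + 1.0) / (total_movies + 2.0));
        const double denom =
            total_tags_[liked] + static_cast<double>(vocabulary_.size());
        for (const std::string& tag : tags) {
            auto it = tag_count_[liked].find(tag);
            const double count =
                (it == tag_count_[liked].end()) ? 0.0 : static_cast<double>(it->second);
            score += std::log((count + 1.0) / denom);  // Laplace smoothing
        }
        return score;
    }

    bool predict_liked(const std::vector<std::string>& tags) const {
        return log_score(tags, true) > log_score(tags, false);
    }

private:
    std::size_t movie_count_[2] = {0, 0};   // index 0: not liked, 1: liked
    std::size_t total_tags_[2] = {0, 0};
    std::map<std::string, std::size_t> tag_count_[2];
    std::set<std::string> vocabulary_;
};
```

The "naive" part is the assumption that tags are independent given the class, which is what lets the score be a simple sum of per-tag log probabilities.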
The search engine and machine learning layers are written in C++, so I decided to build the web server into the same C++ program. There is no need for Apache or nginx: less overhead means more speed.
https://www.hutworks.net/PopcornValentinMercierFinalProjectSlides.pdf
- Install https://github.com/uNetworking/uWebSockets and its dependencies
mkdir tmp && cd tmp && cmake .. && make && cd ..
./AllocineBackend
- Visit http://localhost:2200
- Feel free to daemonize the backend with an init.d service, or even proxy/change the port to 80.
You will need to re-crawl the movies, because I could not upload the database and the images to this git repository: GitHub enforces a file/repo size limit. You may also want to disable the InfluxDB logging built into the crawler if you do not want to install InfluxDB.