
Traveler Reviews/Blogs --> Big Data Real-time Analysis --> Recommendation [DEI-Capstone]


azzurolilc/Travel-Recommender-Cobra-Bobo


#####a.l.: Capstone for DEI

Track Project Progress -> Trello

Project Presentation -> Google Slides

##Schistometopum Thomense

wikipedia - Schistometopum Thomense is a species of amphibian in the family Dermophiidae, endemic to São Tomé and Ilhéu das Rolas. It is found in most soils on São Tomé, from tropical moist lowland forests to coastal coconut plantations. It is absent only from the driest northern areas of the island. It is typically around 30 cm (12 in) in length, and is often bright yellow. This species may be referred to as the São Tomé caecilian (with various spellings of the island's name), as the Aqua Ize caecilian, or as the island caecilian, or by the local name of cobra bobo.

If you managed to read the paragraph above, and all the long location names and all the jargon made sense to you, just so you know, I am VERY impressed! It's not relevant to the project AT ALL. So here is the project:

###Project Inspiration

My friend and I were planning a road trip. We had a specific, but also blurry, idea: a road trip to some State/National Parks between our locations (WA, OR, Northern CA), spending a few days in Sept or Oct to enjoy nature. We are both travellers, and the parks have to be ones neither of us has been to. How do we start planning the trip?

#####Method 1:

List all the givens as a query and... you guessed it... Google it! Google returned a lot of blogs and articles. It is not very efficient to find matching places at first sight -- we have to read through all the travel blogs, really dig into the lines, and put all the puzzle pieces together to come up with a plan.

#####Method 2:

Start again with Google Maps. Search for national and state parks in the area, choose a few that are within the desired travel parameters, then filter down to the final travel plan by searching each park individually.

#####As a non-native English speaker and a lazy reader ( argument: world is made a better place by lazy people? ), I came up with the idea to build a real-time fuzzy (and personalized) search engine that suggests travel destinations!

###Project Features

###Challenge

Words and categories are discrete, so semantic understanding would work better than traditional NLP.

Count Based: GloVe

Prediction Based: word2vec
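To make the count-based side concrete, here is a minimal sketch of the co-occurrence counting that count-based methods start from. It is not GloVe itself (GloVe goes on to fit vectors to these counts); the toy corpus, the function name `cooccurrence_counts`, and the window size are all assumptions made up for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbour) pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus, invented for illustration only.
corpus = [
    "quiet trail near the lake".split(),
    "quiet campsite near the lake".split(),
]
counts = cooccurrence_counts(corpus, window=2)

# "trail" and "campsite" never co-occur, but they share the same neighbours
# ("quiet", "near"), which is the signal a semantic model can pick up on.
print(counts[("near", "the")], counts[("trail", "campsite")])
```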

###Core Architecture

Data collection: data indexing and feature extraction.

Modeling: Neural Network

Realtime: Query and search algorithm.
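One way the query and search step could work, sketched under heavy assumptions: average the word vectors of the query, do the same for each destination description, and rank by cosine similarity. The 2-D vectors, the destination names, and the helper names (`embed`, `rank_destinations`) are hypothetical; a real model would supply learned vectors with many more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed(words, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def rank_destinations(query, destinations, vectors):
    """Return destination names ordered from most to least similar to the query."""
    q = embed(query.lower().split(), vectors)
    if q is None:
        return []
    scored = []
    for name, description in destinations.items():
        d = embed(description.lower().split(), vectors)
        if d is not None:
            scored.append((cosine(q, d), name))
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical hand-made vectors and destinations, for illustration only.
toy_vectors = {
    "forest": [1.0, 0.1], "hike": [0.9, 0.2], "trail": [0.8, 0.3],
    "beach": [0.1, 1.0], "surf": [0.2, 0.9], "ocean": [0.0, 1.0],
}
toy_destinations = {"Park A": "forest trail", "Park B": "beach surf"}

print(rank_destinations("hike", toy_destinations, toy_vectors))
```

Averaging is the simplest way to turn several word vectors into one query vector; it ignores word order, which is acceptable for short, keyword-like queries.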

Also see: IMG

###Data Source

Language Classifier Modeling: Common Crawl April 2016 - TripAdvisor, LonelyPlanet, Travbuddy

User Search: Query UI Log History

###NLP and Proximity Search

stemming - remove some capitalized words - word2vec
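A rough sketch of the first two steps, before anything reaches word2vec. The suffix-stripping stemmer here is deliberately naive (a real pipeline would more likely use a proper stemmer such as Porter), and treating every capitalized word that is not sentence-initial as a proper noun to drop is an assumption about what "some capitalized words" means.

```python
import re

# Naive suffix list, an assumption for illustration; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Tokenize, drop capitalized words after the first token, lowercase, stem."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    kept = []
    for i, token in enumerate(tokens):
        if i > 0 and token[0].isupper():
            continue  # assumed to be a proper noun such as a place name
        kept.append(naive_stem(token.lower()))
    return kept


print(preprocess("We loved hiking the trails near Portland"))
```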

###Why train a model for this task?

There are existing models available: GloVe. So why spend the time to work on a different one?

  1. NLP is fun! And getting to work with text/search/ML is more so.
  2. Optimized for Precision: Only concentrate on word meaning in the given field. Avoid word ambiguity, for a more accurate model. (of all results returned as positive, filter out the false positives that come from non-related documents)
  3. Optimized for Recall: (of all related documents, improve the retrieval rate)
  4. Because I can: with enough data from the travel sites, I can build my own model.
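For reference, precision and recall in points 2 and 3 can be computed as below. This is a generic sketch of the standard definitions, with made-up document ids; it says nothing about how the project actually measures them.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved vs. truly relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids: the search returned a-d, but only a, b, e are relevant.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(precision, recall)
```

Precision drops when unrelated documents slip into the results (`c`, `d`); recall drops when related documents are missed (`e`).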

#####List of Dictionaries - word2vec approach

UNK - Uncommon Words, not in dictionary
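A minimal sketch of how such a dictionary with an `UNK` bucket might be built: words seen fewer than `min_count` times are left out and later mapped to `UNK`. The threshold, the helper names, and the toy tokens are assumptions for illustration.

```python
from collections import Counter

UNK = "UNK"


def build_vocab(tokens, min_count=2):
    """Map each sufficiently frequent word to an integer id; id 0 is reserved for UNK."""
    counts = Counter(tokens)
    vocab = {UNK: 0}
    for word, n in counts.most_common():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token with its id, falling back to the UNK id."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


# Toy token stream, invented for illustration.
tokens = "lake trail lake forest trail lake geyser".split()
vocab = build_vocab(tokens, min_count=2)
print(vocab)
print(encode(["lake", "geyser", "trail"], vocab))
```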

#####Skip-gram Model

For this project, I choose to use 7 skip-gram.
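As a sketch of what a skip-gram model trains on: each word is paired with the words around it, and the network learns to predict the context word from the center word. The `window` parameter below is an assumption; whether the "7" above refers to a window size is not confirmed here, so the example uses a small window to keep the output readable.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence, invented for illustration.
pairs = skipgram_pairs(["hike", "the", "coast", "trail"], window=1)
for center, context in pairs:
    print(center, "->", context)
```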

###Evaluation

Search result and user log *Rating

###System Infrastructure

Data comes from Common Crawl on S3, and is cleaned, formatted and saved to CFS. Spark reads the data from CFS, processes it with MLlib word2vec, and saves the model to the web server.

User logs sink to Kafka and are processed with Spark Streaming. Play framework serves the query results to the web UI.

###Configuration and Instruction

TBD

###Resources

#####Neural Network & NLP

DL4J:word2vec

UNDERSTANDING CONVOLUTIONAL NEURAL NETWORKS FOR NLP

Stanford CS224d: Deep Learning for Natural Language Processing Lectures

Better Word Representations with Recursive Neural Networks for Morphology

IMPLEMENTING A CNN FOR TEXT CLASSIFICATION IN TENSORFLOW
