Problem Statement Zomato is an Indian restaurant aggregator and food delivery start-up founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato provides information, menus and user-reviews of restaurants, and also has food delivery options from partner restaurants in select cities.
India is quite famous for its diverse multi cuisine available in a large number of restaurants and hotel resorts, which is reminiscent of unity in diversity. Restaurant business in India is always evolving. More Indians are warming up to the idea of eating restaurant food whether by dining outside or getting food delivered. The growing number of restaurants in every state of India has been a motivation to inspect the data to get some insights, interesting facts and figures about the Indian food industry in each city. So, this project focuses on analysing the Zomato restaurant data for each city in India.
The Project focuses on Customers and Company, you have to analyze the sentiments of the reviews given by the customer in the data and made some useful conclusion in the form of Visualizations. Also, cluster the zomato restaurants into different segments. The data is vizualized as it becomes easy to analyse data at instant. The Analysis also solve some of the business cases that can directly help the customers finding the Best restaurant in their locality and for the company to grow up and work on the fields they are currently lagging in.
This could help in clustering the restaurants into segments. Also the data has valuable information around cuisine and costing which can be used in cost vs. benefit analysis
Data could be used for sentiment analysis. Also the metadata of reviewers can be used for identifying the critics in the industry.
Attribute Information Zomato Restaurant names and Metadata Name : Name of Restaurants
Links : URL Links of Restaurants
Cost : Per person estimated Cost of dining
Collection : Tagging of Restaurants w.r.t. Zomato categories
Cuisines : Cuisines served by Restaurants
Timings : Restaurant Timings
Zomato Restaurant reviews Restaurant : Name of the Restaurant
Reviewer : Name of the Reviewer
Review : Review Text
Rating : Rating Provided by Reviewer
MetaData : Reviewer Metadata - No. of Reviews and followers
Time: Date and Time of Review
Pictures : No. of pictures posted with review
- Descriptive Statistics
- Data Visualization
- Machine Learning -Supervised-Learning -Unsupervised-Learning
- Python
- Pandas
- Numpy
- Matplotlib
- Seaborn
- Scikit-learn
- NLP
- Used Sentiment Intensity Analyzer (vaderSentiment) and RoBerta Pretrained Model -Plotted the distribution of ratings to have an understanding of the proportion of good and bad reviews. -Made many visualizations which include, top 10/bottom 10 restaurants in terms of average rating. -Had a clear understanding of the cost summary of the restaurants. -Done pre-processing such as removing emojis, punctuations and only used Adjectives and verbs to reduce dimensionality.
-Did the pre-processing, Clubbed some cuisines together so that one-hot encoding would be possible. -Removed the unwanted variables and Normalized the data. -Clustered the data using K-means as well as Hierarchical clustering.
K-means clustering is a widely used unsupervised machine learning technique.Unsupervised algorithms make inferences from datasets using only input vectors and do not relate to known or tagged outcomes. A cluster is a collection of data points that have been grouped together due to particular commonalities. K, for the number of centroids required in the dataset. The centroid is an imaginary or real point that represents the cluster's center.By lowering the in-cluster sum of squares, each data point is assigned to one of the clusters.
To put it another way, the K-means algorithm finds k centroids and then assigns each data point to the closest cluster while keeping the centroids as small as possible. The 'means' in K-means refers to the data being averaged, or determining the centroid. K means is highly vulnerable to high dimensional data. As the dimensions increase it is not able to differentiate between the clusters. Hence, a dimensionality reduction technique called Principal Component Analysis was used.