naive-bayes-hw

Part I

Coding.

Remember the Bayes Dice app we built many weeks ago? Let's revisit that app, but with a twist.

You have 3 coins with the following probabilities. P(H|1) = 0.3, P(H|2) = 0.45, P(H|3) = 0.75.

Read P(H|1) = 0.3 as "the probability of heads, given that you are flipping coin 1, is 30%", and so on for the other coins.

Write a small app in object-oriented Python that lets you randomly select a coin (without looking) and then flip it repeatedly, about 10 times or so, until you are fairly certain which coin you selected.
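One possible shape for this app is sketched below. It is a starting point rather than the expected solution: the class names `Coin` and `CoinGuesser` and the helper `play` are made up for illustration, and the uniform prior assumes each coin is equally likely to be picked.

```python
import random


class Coin:
    """A biased coin with a fixed probability of landing heads."""

    def __init__(self, name, p_heads):
        self.name = name
        self.p_heads = p_heads

    def flip(self, rng):
        return "H" if rng.random() < self.p_heads else "T"


class CoinGuesser:
    """Keeps a posterior over which coin was drawn and updates it per flip."""

    def __init__(self, coins):
        self.coins = coins
        # Assumption: uniform prior, every coin is equally likely to be picked.
        self.posterior = {c.name: 1 / len(coins) for c in coins}

    def update(self, outcome):
        # Bayes' rule: posterior is proportional to prior times likelihood.
        for c in self.coins:
            likelihood = c.p_heads if outcome == "H" else 1 - c.p_heads
            self.posterior[c.name] *= likelihood
        total = sum(self.posterior.values())
        for name in self.posterior:
            self.posterior[name] /= total

    def best_guess(self):
        return max(self.posterior, key=self.posterior.get)


def play(n_flips=10, seed=None):
    """Pick a hidden coin, flip it n_flips times, and return the final belief."""
    rng = random.Random(seed)
    coins = [Coin(1, 0.30), Coin(2, 0.45), Coin(3, 0.75)]
    hidden = rng.choice(coins)
    guesser = CoinGuesser(coins)
    for _ in range(n_flips):
        guesser.update(hidden.flip(rng))
    return hidden.name, guesser.best_guess(), guesser.posterior


if __name__ == "__main__":
    hidden, guess, posterior = play(n_flips=10)
    print(f"hidden coin: {hidden}, best guess: {guess}")
    for name, p in posterior.items():
        print(f"  P(coin {name} | flips) = {p:.3f}")
```

Because coins 1 and 2 have similar probabilities of heads, 10 flips may not be enough to tell them apart. Raising `n_flips`, or stopping once one posterior passes a threshold you choose, is a reasonable variation.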

Part II

Questions.

In general, what makes the Naive Bayes Classifier so naive?

  • It is naive because it assumes all features are conditionally independent of one another given the class, which is rarely true of real data.

What is the difference between the Bernoulli, Gaussian and Multinomial Naive Bayes Classifiers?

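  • Bernoulli is for binary features (0 or 1), such as whether a word appears in a document.
  • Multinomial is for count features, such as how many times each word appears.
  • Gaussian is for continuous features that are assumed to be normally distributed within each class.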
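Written out, the three variants share the same decision rule and differ only in the per-feature likelihood they plug into it. The symbols below (p, θ, μ, σ indexed by feature i and class y) are notation chosen for this sketch, not taken from the assignment.

```latex
% Shared rule: class prior times a product of per-feature likelihoods.
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

% Bernoulli: each x_i is 0 or 1.
P(x_i \mid y) = p_{iy}^{x_i} (1 - p_{iy})^{1 - x_i}

% Multinomial: each x_i is a count.
P(x_1, \dots, x_n \mid y) \propto \prod_{i=1}^{n} \theta_{iy}^{x_i}

% Gaussian: each x_i is continuous.
P(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_{iy}^2}}
  \exp\!\left( -\frac{(x_i - \mu_{iy})^2}{2 \sigma_{iy}^2} \right)
```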

Can you use the Naive Bayes Classifier if your features are not independent?

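  • You can, with caution. Independence is the classifier's primary assumption, and when features are correlated the predicted probabilities become unreliable, although the predicted class is often still reasonable in practice.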

Part III

Models.

Take this data: https://github.com/gSchool/dsi-logistic-regression/blob/g79/data/grad.csv

Predict whether someone will get into grad school. Use the following models.

  • Logistic Regression
  • Random Forest
  • Naive Bayes (you will need to figure out what type works best for this data)

Which model performed the best?
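A comparison along these lines could be set up with scikit-learn as sketched below. To stay self-contained, the sketch runs on synthetic stand-in data, not on grad.csv: the feature names `gre` and `gpa`, the generated values, and the helper `compare_models` are all assumptions made for illustration. Gaussian Naive Bayes is used only because the stand-in features are continuous, so which variant suits the real data is still yours to work out.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy of each model."""
    models = {
        # Scaling helps logistic regression converge on raw-scale features.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "gaussian_nb": GaussianNB(),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic stand-in data (NOT the real grad.csv); column names are assumed.
rng = np.random.default_rng(0)
n = 400
gre = rng.normal(580, 115, n)
gpa = rng.normal(3.4, 0.4, n)
logit = 0.004 * (gre - 580) + 1.0 * (gpa - 3.4) - 0.8
admit = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([gre, gpa])
scores = compare_models(X, admit)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

To answer the question for real, replace the synthetic arrays with the feature columns and admission label loaded from grad.csv. Cross-validation is used here instead of a single split so that the comparison depends less on which rows happen to land in the test set.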

Part IV

Text Classification.

Remember this assignment?

https://github.com/data-science-ml/tweets-nlp-assignment/blob/master/nlp-assignment.md

Take the above tweets and turn them into a bag of words. Use a Naive Bayes classifier to figure out if a particular tweet is Neutral, Negative or Positive. Remember to split your data.

What is the accuracy of your model?
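To make the mechanics concrete, here is a minimal from-scratch sketch of a bag-of-words multinomial Naive Bayes classifier with Laplace smoothing. The class `BagOfWordsNB`, the whitespace tokenizer, and the tiny example tweets are all made up for illustration and are not the assignment's data. In practice a library implementation would likely be used instead.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


class BagOfWordsNB:
    """Multinomial Naive Bayes over word counts, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> bag of words
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for bag in self.word_counts.values() for w in bag}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # Work in log space so products of small numbers do not underflow.
            score = math.log(n_label / n_docs)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Hypothetical toy tweets, kept in separate train and test lists.
train_texts = [
    "i love this phone", "great service and great food",
    "i hate waiting in line", "terrible service never again",
    "the store opens at nine", "flight lands at noon",
]
train_labels = ["Positive", "Positive", "Negative", "Negative",
                "Neutral", "Neutral"]
test_texts = ["i love great food", "terrible waiting", "store opens at noon"]
test_labels = ["Positive", "Negative", "Neutral"]

model = BagOfWordsNB().fit(train_texts, train_labels)
print("held-out accuracy:", accuracy(model, test_texts, test_labels))
```

The toy set is far too small to say anything about real performance. The point is only the flow: build per-class word counts from the training split, then score held-out tweets and report accuracy.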

Compare this model to a KNN model (k = 3 neighbors) where each tweet is represented as a 300-dimensional vector.

Which model performs better?
