- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Complement Naive Bayes
- Bernoulli Naive Bayes
- Categorical Naive Bayes
- clf = GaussianNB()
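A minimal fit/predict sketch of the classifier above; the make_classification toy data stands in for a real feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# toy data standing in for real features and labels
X, y = make_classification(n_samples=100, n_features=4, random_state=42)

clf = GaussianNB()
clf.fit(X, y)              # fit one Gaussian per feature per class
print(clf.predict(X[:5]))  # predicted class labels for the first five samples
```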
- Classification
- Regression
- sklearn.svm.SVC(kernel='rbf', C=1.0, gamma='scale')
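A quick sketch of those SVC parameters in use, again on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=42)

# kernel='rbf' gives a non-linear boundary; larger C trades boundary
# smoothness for classifying more training points correctly
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X, y)
print(clf.score(X, y))  # mean accuracy on the training data
```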
- Classification
- Regression
- DecisionTreeClassifier: min_samples_split (int or float, default=2)
- k nearest neighbors: classic, simple, easy to understand
- AdaBoost & random forest: 'ensemble methods', meta-classifiers built from decision trees
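A sketch of both ensemble methods on toy data; each one builds many decision trees and combines their votes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
boost = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X, y)
print(forest.score(X, y), boost.score(X, y))  # training accuracy of each
```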
- do some research
- find sklearn documentation
- deploy it
- use it to make predictions
- evaluate its accuracy
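The whole workflow in one sketch, using k nearest neighbors as the algorithm being tried out (the names and toy data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)  # deploy it
clf.fit(X_train, y_train)
pred = clf.predict(X_test)                 # use it to make predictions
print(accuracy_score(y_test, pred))        # evaluate its accuracy
```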
- Patterns
- numerical - numerical values (numbers, e.g. salary)
- categorical - limited number of discrete values (category, e.g. star rating of a movie)
- time series - temporal value (date, timestamp)
- text - words
- Continuous supervised learning
- Minimize sum of the squared errors (SSE)
- Ordinary Least Squares
- Stochastic Gradient Descent
- r2 (R-squared) (0 < r2 < 1): how much of the change in the output (y) is explained by the change in the input (x)
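A toy regression sketch tying these together; the salary-vs-experience data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.uniform(0, 20, size=(100, 1))                    # years of experience
y = 30000 + 2000 * X.ravel() + rng.normal(0, 5000, 100)  # noisy salary

reg = LinearRegression().fit(X, y)  # ordinary least squares
print(reg.coef_, reg.intercept_)    # slope and intercept of the best-fit line
print(reg.score(X, y))              # r2: share of the variation in y explained by x
```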
Comparing Classification & Regression
| Property | Supervised classification | Regression |
|---|---|---|
| Output type | Discrete (class labels) | Continuous (number) |
| What are you trying to find? | Decision boundary | "Best fit line" |
| Evaluation | Accuracy | Sum of squared errors or r2 ("R-squared") |
- Ignore:
- Sensor malfunctions
- Data entry errors
- Pay attention:
- Freak events (e.g. fraud)
- Removal strategy:
- Train
- Remove points with largest residual errors (10%)
- Retrain
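A sketch of that train / remove / retrain loop, assuming a simple linear regression and a few injected outliers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 100)
y[:5] += 20  # a few injected outliers

reg = LinearRegression().fit(X, y)               # 1. train
residuals = np.abs(y - reg.predict(X))
keep = residuals.argsort()[: int(0.9 * len(y))]  # 2. drop the worst 10%
reg = LinearRegression().fit(X[keep], y[keep])   # 3. retrain
```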
- K-means
- n_clusters: number of clusters (default: 8)
- max_iter: maximum number of iterations for a single run (default: 300)
- n_init: number of runs with different centroid seeds; the best result is kept (default: 10)
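Those parameters in a minimal sketch on toy blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3,    # number of clusters
            max_iter=300,    # cap on iterations per run
            n_init=10)       # restarts with different centroid seeds
labels = km.fit_predict(X)   # cluster assignment for each point
```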
rescale: x' = (x - x_min) / (x_max - x_min), which maps each feature into [0, 1]
Use MinMaxScaler() when the features span dramatically different ranges.
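A small sketch; the salary/years numbers are made up to show two very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# column 0: salary, column 1: years on the job
X = np.array([[100000., 2.], [50000., 10.], [75000., 6.]])

scaler = MinMaxScaler()         # applies (x - x_min) / (x_max - x_min)
print(scaler.fit_transform(X))  # every column now lies in [0, 1]
```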
These algorithms would be affected by feature rescaling.
- SVM with RBF kernel
- K-means clustering
These algorithms would NOT be affected by feature rescaling.
- Decision trees
- Linear regression
Stopwords: high-frequency, low-information words (e.g. 'the', 'and'), usually removed
- Term frequency: bag of words
- Inverse document frequency: weights each word by how rarely it occurs across the corpus (words common to many documents get low weight)
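Both ideas in one sketch; note that get_feature_names_out assumes a recent sklearn (>= 1.0):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# stop_words='english' drops low-information words like "the";
# the idf part downweights words that appear in many documents
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # surviving vocabulary
```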
- Use human intuition
- Code up the new feature
- Visualize
- Repeat
Two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest.
- SelectPercentile selects the X% of features that are most powerful (where X is a parameter)
- SelectKBest selects the K features that are most powerful (where K is a parameter)
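Both selectors in a sketch, with f_classif as the scoring function:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=42)

X_k = SelectKBest(f_classif, k=5).fit_transform(X, y)                 # keep the 5 best
X_p = SelectPercentile(f_classif, percentile=25).fit_transform(X, y)  # keep the top 25%
print(X_k.shape, X_p.shape)  # (100, 5) and (100, 5)
```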
from sklearn.linear_model import Lasso
lasso = Lasso().fit(features, labels)
lasso.coef_  # features whose coefficient is driven to zero are effectively discarded
Maximal Variance
Retains maximum amount of information in original data
- Systematized way to transform input features into principal components (PC)
- Use principal components as new features
- PCs are directions in data that maximize variance (minimize information loss) when you project/compress down onto them
- More variance of data along a PC, higher that PC is ranked
- Most variance/most information -> first PC
- Second-most variance (without overlapping w/ first PC) -> second PC
- Max no. of PCs = no. of input features
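A minimal PCA sketch on the iris data (4 input features, so at most 4 PCs):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 4 original features

pca = PCA(n_components=2)             # keep only the top 2 PCs
X_pc = pca.fit_transform(X)           # project the data onto them
print(pca.explained_variance_ratio_)  # variance captured; first PC is highest
```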
- Give estimate of performance on an independent dataset
- Serve as check on overfitting
train_test_split(features, labels, test_size=0.3, random_state=42)
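In context, with imports and the returned splits unpacked (iris data as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features, labels = iris.data, iris.target

# hold out 30% of the data as an independent test set
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)
print(len(features_train), len(features_test))  # 105 / 45
```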
Confusion Matrix (Row is true labels, column is predicted labels)
- Recall: True Positive / (True Positive + False Negative) (check the rows) (recall equals sensitivity)
- Precision: True Positive / (True Positive + False Positive) (check the columns)
- My identifier doesn't have great precision, but it does have good recall. That means nearly every time a POI shows up in my test set, I am able to identify him or her. The cost of this is that I sometimes get some false positives, where non-POIs get flagged.
- My identifier doesn't have great recall, but it does have good precision. That means whenever a POI gets flagged in my test set, I know with a lot of confidence that it's very likely to be a real POI and not a false alarm. On the other hand, the price I pay for this is that I sometimes miss real POIs, since I'm effectively reluctant to pull the trigger on edge cases.
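A tiny worked example of these metrics (hand-made labels for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows: true labels, columns: predicted
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
```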
Receiver operating characteristic (ROC) curve: plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. (Check the rows)
- The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as probability of false alarm and can be calculated as (1 − specificity).
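A minimal sketch of computing the curve points from classifier scores; the scores here are made up, but in practice they would come from predict_proba or decision_function:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # e.g. predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # one (FPR, TPR) pair per threshold
print(fpr, tpr)
print(roc_auc_score(y_true, y_scores))  # area under the ROC curve
```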