Merge Macbook Changes #5

Open · wants to merge 1 commit into base: master

44 changes: 41 additions & 3 deletions Algorithm/CRF/CRF.md
@@ -1,7 +1,45 @@
# Condiitonal Random Field
# Conditional Random Field

## Probabilistic Undirected Graphical Model (aka. Markov Random Field)
> Can be considered an extension of [MEM](../MEM/MEM.md)

## Overview

### Quick View

| Category            | Usage          | Mathematics | Application Field |
| ------------------- | -------------- | ----------- | ----------------- |
| Supervised Learning | Classification | Entropy | NLP |

## Background - From MEM to CRF

![](https://i.stack.imgur.com/khcnl.png)

### Conditional Maximum Entropy Distribution

## Concept

### [Undirected Graph Model](../../Notes/GraphicalModel.md#Undirected-Graph-Model)

## Viterbi Algorithm
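
A minimal NumPy sketch of Viterbi decoding for a linear-chain model such as a CRF. The inputs `emissions` (per-position tag scores) and `transitions` (tag-to-tag scores) are illustrative names, not tied to any particular library:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best-scoring tag sequence for a linear-chain model.

    emissions   -- (T, K) unary score of each tag at each position
    transitions -- (K, K) score of moving from tag i to tag j
    """
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag at t = 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = score of ending at tag j at time t via tag i at time t-1
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]           # best final tag
    for t in range(T - 1, 0, -1):          # follow the back-pointers
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```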

## Links

* [Wiki - Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field)
* [**An Introduction to Conditional Random Fields**](https://www.research.ed.ac.uk/portal/files/10482724/crftut_fnt.pdf)

### Wikipedia

* [Graphical model](https://en.wikipedia.org/wiki/Graphical_model)
* [Clique (graph theory)](https://en.wikipedia.org/wiki/Clique_(graph_theory))
* [Markov random field](https://en.wikipedia.org/wiki/Markov_random_field)
* [Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field)

### Tools

* [kmkurn/pytorch-crf: (Linear-chain) Conditional random field in PyTorch.](https://github.com/kmkurn/pytorch-crf)
* [pytorch-crf — pytorch-crf 0.7.2 documentation](https://pytorch-crf.readthedocs.io/en/stable/)
* [CRF++](https://taku910.github.io/crfpp/)
* [github](https://github.com/taku910/crfpp)
* [TensorFlow CRF](https://www.tensorflow.org/api_docs/python/tf/contrib/crf)
* [github](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf)
* [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/)
* [github](https://github.com/TeamHG-Memex/sklearn-crfsuite/)
Empty file.
2 changes: 2 additions & 0 deletions Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py
@@ -0,0 +1,2 @@
from hmmlearn import hmm

8 changes: 8 additions & 0 deletions Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py
@@ -0,0 +1,8 @@
import numpy as np

def log_normalize(vector):
    # Normalize a vector of unnormalized probabilities in log space: log(v_i / sum(v)).
    return np.log(vector) - np.log(np.sum(vector))

def log_sum(vector):
    # log-sum-exp of a vector of log values, shifted by the max for numerical stability.
    max_val = np.max(vector)
    return max_val + np.log(np.sum(np.exp(vector - max_val)))


8 changes: 7 additions & 1 deletion Algorithm/LogisticRegression/LogisticRegression.md
@@ -38,7 +38,7 @@ For each piece of data in the dataset:

## Multiple Classes

### [Multinomial](../MEM/MEM.md) - Softmax Regression (SMR)
### Multinomial - Softmax Regression (SMR)

> Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive)
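
A minimal NumPy sketch of the softmax mapping this describes, turning the $k$ class scores of one sample into mutually exclusive class probabilities (the `scores` values are arbitrary):

```python
import numpy as np

def softmax(z):
    # Shift by the max before exponentiating for numerical stability.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # outputs of k = 3 linear models for one sample
probs = softmax(scores)
print(probs, probs.sum())            # probabilities sum to 1; argmax picks the class
```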

@@ -57,6 +57,10 @@

### Book

Dive into Deep Learning

* [Ch3.4. Softmax Regression](http://d2l.ai/chapter_linear-networks/softmax-regression.html)

Machine Learning in Action

* Ch5 Logistic Regression
@@ -93,3 +97,5 @@ Multinomial (softmax)

* [2 Ways to Implement Multinomial Logistic Regression in Python](http://dataaspirant.com/2017/05/15/implement-multinomial-logistic-regression-python/) - use scikit learn
* [Machine Learning and Data Science: Multinomial (Multiclass) Logistic Regression](https://www.pugetsystems.com/labs/hpc/Machine-Learning-and-Data-Science-Multinomial-Multiclass-Logistic-Regression-1007/)
* [mlxtend - Softmax Regression](https://rasbt.github.io/mlxtend/user_guide/classifier/SoftmaxRegression/)
* [jupyter notebook](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/softmax-regression.ipynb)
85 changes: 85 additions & 0 deletions Algorithm/MEM/MEM.md
@@ -0,0 +1,85 @@
# Maximum Entropy Model

Also known as the Maximum Entropy Classifier / [Multinomial Logistic Regression - i.e. Softmax](../LogisticRegression/LogisticRegression.md#Multinomial---Softmax-Regression-(SMR)).

> Can be considered the mother of other algorithms, e.g.
>
> [Conditional Random Field](../CRF/CRF.md)

## Brief Description

### Quick View

Category|Usage|Mathematics|Application Field
--------|-----|-----------|-----------------
Supervised Learning|Classification|Entropy|Many

## Concept

### The MEM Model

#### Background

Consider a machine learning problem:

* $x = (x_1, x_2, \dots, x_m)$ is the input feature vector
* $y \in \{1, 2, \dots, k\}$ => a $k$-class classification problem

Given $k$ linear models, each of dimension $m$:

$$
\phi_i(x) = w_{i1}x_1 + w_{i2}x_2 + \cdots + w_{im}x_m,~~~1\leq i \leq k
$$

Prediction "class" $\hat{y}$ is the maximum "score" for each linear model output.

$$
\hat{y} = \arg\max_{1\leq i \leq k} \phi_i(x)
$$
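
A small NumPy sketch of this prediction rule; the weight matrix `W` and input `x` are illustrative values, not taken from any dataset:

```python
import numpy as np

# Hypothetical weights: k = 3 classes over m = 4 features (row i holds w_{i1}, ..., w_{im}).
W = np.array([[ 0.2, -0.1,  0.5,  0.0],
              [ 0.1,  0.3, -0.2,  0.4],
              [-0.3,  0.2,  0.1,  0.1]])
x = np.array([1.0, 2.0, 0.5, -1.0])

phi = W @ x                   # phi_i(x) for each of the k linear models
y_hat = int(np.argmax(phi))   # predicted class = index of the highest score
print(phi, y_hat)
```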

TBD





### Training the Model

* GIS Algorithm
* IIS Algorithm
* Gradient Descent
* [Quasi-Newton Method](https://en.wikipedia.org/wiki/Quasi-Newton_method) (擬牛頓法) - L-BFGS Algorithm

#### GIS Algorithm

> GIS stands for Generalized Iterative Scaling
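
A rough NumPy sketch of the GIS update for a conditional maximum entropy model. The feature layout `F[i, y, j]` (feature $j$ evaluated on example $i$ with candidate class $y$) is an illustrative choice, features are assumed non-negative, and the classic requirement that the per-example feature sums be constant (usually met with a slack feature) is glossed over, so treat this as a sketch rather than a reference implementation:

```python
import numpy as np

def gis_train(F, y_true, n_iter=100):
    """GIS for a conditional MaxEnt model.

    F      -- array (N, K, J): value of feature j on example i with candidate class y
    y_true -- array (N,): observed class of each example
    """
    N, K, J = F.shape
    C = F.sum(axis=2).max()                        # GIS constant: max total feature mass
    emp = F[np.arange(N), y_true, :].mean(axis=0)  # empirical feature expectations
    lam = np.zeros(J)
    for _ in range(n_iter):
        scores = F @ lam                           # (N, K) unnormalized log-scores
        scores -= scores.max(axis=1, keepdims=True)
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)          # p(y | x_i)
        model = np.einsum('nk,nkj->j', p, F) / N   # model feature expectations
        lam += np.log((emp + 1e-12) / (model + 1e-12)) / C
    return lam
```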

#### IIS Algorithm

> IIS stands for Improved Iterative Scaling, an improvement over [GIS](#GIS-Algorithm)

### Solving Overfitting

* Feature Selection: throw out rare features
* Feature Induction: pick useful features (improves performance)
* Smoothing

### Feature Selection

### Feature Induction

### Smoothing

## Application

> MEM is a classification model. It can be applied to sequence labeling problems, but it is not well suited to them.
> In POS tagging, for example, a per-token classifier may not take the global context of the sentence into account.

### POS Tagging

## Resources

### Wikipedia

* [Principle of maximum entropy - Maximum entropy models](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy#Maximum_entropy_models)
* [Multinomial logistic regression (Maximum entropy classifier)](https://en.wikipedia.org/wiki/Maximum_entropy_classifier)
35 changes: 31 additions & 4 deletions Algorithm/NaiveBayes/NaiveBayes.md
@@ -2,7 +2,9 @@

## Brief Description

Naive bayes are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
Naive Bayes is a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of *conditional independence* between every pair of features given the value of the class variable.

> But in the real world, most things are not conditionally independent, e.g. context in NLP.

### Quick View

@@ -21,6 +23,10 @@ Supervised Learning|Classification|Bayes' Theorem|

## Concept

### Bayes Decision

Posterior prob. = (Likelihood * Prior prob.) / Evidence
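
In symbols, for a class $c$ and feature vector $x$ (a restatement of the line above):

$$
\underbrace{P(c \mid x)}_{\text{posterior}} = \frac{\overbrace{P(x \mid c)}^{\text{likelihood}} \cdot \overbrace{P(c)}^{\text{prior}}}{\underbrace{P(x)}_{\text{evidence}}}
$$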

### Bayes' Theorem

$$
@@ -35,10 +41,30 @@

### Real-world conditions

* We predict label by multiplying them. But if any of these probability is 0, then we will get 0 when we multiply them. To lessen the impact of this, we'll initialize all of our occurence counts to 1, and initialize the denominators to 2. (for binary classifier)
* Another problem is **Underflow**: doing too many multiplications of small numbers. (In programming, multiply many small numbers will eventually rounds off to 0)
* We predict the label by multiplying probabilities. But if any of these [probabilities is 0](#Zero-Probability-=>-Smoothing), the whole product becomes 0. To lessen the impact of this, we'll initialize all of our occurrence counts to 1 and initialize the denominators to 2 (for a binary classifier).
* Another problem is **underflow**: multiplying many small numbers together eventually rounds off to 0, which is called **floating-point underflow**.
* Solution 1: Take the natural logarithm of this product (see the sketch below)
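
A tiny NumPy sketch of the underflow problem and the log-space fix mentioned above (the probabilities are arbitrary illustrative values):

```python
import numpy as np

probs = np.full(1000, 1e-5)      # 1000 small conditional probabilities

print(np.prod(probs))            # 0.0 -- the product underflows
print(np.sum(np.log(probs)))     # about -11512.9 -- still comparable between classes
```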

#### Zero Probability => Smoothing

> original: $P(w_k|c_j) = \displaystyle\frac{n_k}{n}$

m-estimation: $P(w_k|c_j) = \displaystyle\frac{n_k + mp}{n + m}$

> additional m "virtual samples" distributed according to p

## Application

### Document Classification/Categorization

Smoothing using Laplace smoothing (the m-estimate with $mp = 1$ and $m = |\operatorname{Vocabulary}|$)

$$
P(w_k|c_j) = \frac{n_k + 1}{n + |\operatorname{Vocabulary}|}
$$
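
A small NumPy sketch of this smoothed estimate; `word_counts` is an illustrative vector of word counts $n_k$ for one class:

```python
import numpy as np

def laplace_word_probs(word_counts, vocab_size):
    # P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|), with n = total count in the class.
    n = word_counts.sum()
    return (word_counts + 1) / (n + vocab_size)

counts = np.array([3, 0, 1, 0])                  # toy counts; unseen words still get mass
print(laplace_word_probs(counts, len(counts)))   # [0.5, 0.125, 0.25, 0.125]
```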

### Word Sense Disambiguation

## TODO

* Figure out why the log mode in the predictOne function has lower accuracy when using + than the original mode using *. ([Line 66](NaiveBayes_Nursery/NaiveBayes_Nursery_sklearn.py))
@@ -55,5 +81,6 @@

## Wikipedia

* [Additive smoothing (Laplace smoothing)](https://en.wikipedia.org/wiki/Additive_smoothing)
* [Bayesian Machine Learning](http://fastml.com/bayesian-machine-learning/)
* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
30 changes: 15 additions & 15 deletions Algorithm/PCA/PCA.md
@@ -10,9 +10,9 @@ A method for doing dimensionality reduction by transforming the feature space to

### Quick View

Category|Usage|Methematics|Application Field
--------|-----|-----------|-----------------
Unsupervised Learning|Dimensionality Reduction|Orthogonal, Covariance Matrix, Eigenvalue Analysis|
| Category | Usage | Mathematics | Application Field |
| --------------------- | ------------------------ | ------------------------------------------------------------------------ | ----------------- |
| Unsupervised Learning | Dimensionality Reduction | Orthogonal, Covariance Matrix, Eigenvalue Analysis, Lagrange Multipliers | |

## Concepts

@@ -22,13 +22,13 @@ Steps

* Take the first principal component to be in the direction of the largest variability of the data
* The second principal component will be in the direction orthogonal to the first principal component
> (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix)
> (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix)
* Once we have the eigenvectors of the covariance matrix, we can take the top N eigenvectors => N most important feature
* Multiply the data by the top N eigenvectors to transform our data into the new space

Pseudocode

```
```txt
Remove the mean
Compute the covariance matrix
Find the eigenvalues and eigenvectors of the covariance matrix
@@ -42,11 +42,11 @@ Transform the data into the new space created by the top N eigenvectors
Variables

* m x n matrix: $X$
* In practice, column vectors of $X$ are positively correlated
* the hypothetical factors that account for the score should be uncorrelated
* In practice, column vectors of $X$ are positively correlated
* the hypothetical factors that account for the score should be uncorrelated
* orthogonal vectors: $\vec{y}_1, \vec{y}_2, \dots, \vec{y}_r$
* We require that the vectors span $R(X)$
* and hence the number of vectors, $r$, should be euqal to the rank of $X$
* We require that the vectors span $R(X)$
* and hence the number of vectors, $r$, should be equal to the rank of $X$

The covariance matrix is
$$
@@ -88,11 +88,11 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal.
## Reference

* Linear Algebra with Applications
* Ch 5 Orthogonality
* Ch 6 Eigenvalues
* Ch 6.5 Application 4 - PCA
* Ch 7.5 Orthogonal Transformations
* Ch 7.6 The Eigenvalue Problem
* Ch 5 Orthogonality
* Ch 6 Eigenvalues
* Ch 6.5 Application 4 - PCA
* Ch 7.5 Orthogonal Transformations
* Ch 7.6 The Eigenvalue Problem

## Links

@@ -101,7 +101,7 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal.
### Tutorial

* [**Siraj Raval - Dimensionality Reduction**](https://www.youtube.com/watch?v=jPmV3j1dAv4)
* [Github](https://github.com/llSourcell/Dimensionality_Reduction)
* [Github](https://github.com/llSourcell/Dimensionality_Reduction)

### Scikit Learn

46 changes: 27 additions & 19 deletions Algorithm/SVM/SVM.md
@@ -10,9 +10,9 @@ Support Vector Machines (SVM) are learning systems that use a hypothesis space o

### Quick View

Category|Usage|Methematics|Application Field
--------|-----|-----------|-----------------
Supervised Learning|Classification (Main), Regression, Outliers Detection Clustering (Unsupervised)|Convex Optimization, Constrained Optimization, Lagrange Multipliers|Numerous
| Category | Usage | Mathematics | Application Field |
| ------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------- | ----------------- |
| Supervised Learning | Classification (Main), Regression, Outliers Detection, Clustering (Unsupervised) | Convex Optimization, Constrained Optimization, Lagrange Multipliers | Numerous |

* Support Vector Machine is suited for extreme cases (small sample sets)
* SVM finds a hyper-plane that separates its training data in such a way that the distance between the hyper-plane and the closest points from each class is maximized
@@ -39,12 +39,12 @@ Disadvantage
* Poor performance when features >> samples
* SVMs do not provide probability estimates

SVM vs. Perceptron|SVM|Perceptron / NN
------------------|---|----------
**Solving Problem**|Optimization|Iteration
**Optimal**|Global (∵ convex)|Local
**Non-linear Seprable**|Higher dimension|Stack multi-layer model
**Performance**|Better with prior knowledge|Skip feature engineering step
| SVM vs. Perceptron | SVM | Perceptron / NN |
| ----------------------- | --------------------------- | ----------------------------- |
| **Solving Problem** | Optimization | Iteration |
| **Optimal** | Global (∵ convex) | Local |
| **Non-linear Separable** | Higher dimension | Stack multi-layer model |
| **Performance** | Better with prior knowledge | Skip feature engineering step |

## Terminology

@@ -110,22 +110,30 @@
### Kernel Function

* Use a kernel trick to reduce the computational cost
* Kernel Function: Transform a non-linear space into a linear space
* Popular kernel types
* Linear Kernel
* Kernel Function: Transform a non-linear space into a linear space
* Popular kernel types
* Linear Kernel

$K(x, y) = x \times y$
$\kappa(x_i, x_j) = x_i^T x_j$

* Polynomial Kernel
> $K(x, y) = x \times y$

* Polynomial Kernel

$\kappa(x_i, x_j) = (x_i^T x_j)^d$

When $d = 1$ it reduces to the linear kernel

> $K(x, y) = (x \times y + 1)^d$, $d \geq 0$

$K(x, y) = (x \times y + 1)^d$
* Radial Basis Function (RBF) Kernel

* Radial Basis Function (RBF) Kernel
$\kappa(x_i, x_j) = \exp(-\frac{||x_i - x_j||^2}{2\sigma^2})$, where $\sigma > 0$ is the width of the RBF kernel

$K(x, y) = e^{-\gamma ||x-y||^2}$
> $K(x, y) = e^{-\gamma ||x-y||^2}$

* Sigmoid Kernel
* ...
* Sigmoid Kernel
* ...
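
A small NumPy sketch of the linear, polynomial, and RBF kernels listed above, following the $\kappa(x_i, x_j)$ form (the sample vectors and parameter values are arbitrary):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, d=2):
    return (xi @ xj) ** d                  # d = 1 reduces to the linear kernel

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

x_i, x_j = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x_i, x_j), polynomial_kernel(x_i, x_j), rbf_kernel(x_i, x_j))
```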

### Tune Parameter
