Merge Macbook Changes #5

Open · wants to merge 1 commit into base: master

44 changes: 41 additions & 3 deletions Algorithm/CRF/CRF.md
@@ -1,7 +1,45 @@
# Condiitonal Random Field
# Conditional Random Field

## Probabilistic Undirected Graphical Model (aka. Markov Random Field)
> Can be considered an extension of [MEM](../MEM/MEM.md)

## Overview

### Quick View

| Category            | Usage          | Mathematics | Application Field |
| ------------------- | -------------- | ----------- | ----------------- |
| Supervised Learning | Classification | Entropy | NLP |

## Background - From MEM to CRF

![](https://i.stack.imgur.com/khcnl.png)

### Conditional Maximum Entropy Distribution

## Concept

### [Undirected Graph Model](../../Notes/GraphicalModel.md#Undirected-Graph-Model)

## Viterbi Algorithm
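
A minimal NumPy sketch of Viterbi decoding for a linear-chain model such as a CRF. The inputs `emissions` (per-position tag scores) and `transitions` (tag-to-tag scores) are illustrative names, not tied to any particular library:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best-scoring tag sequence for a linear-chain model.

    emissions   -- (T, K) unary score of each tag at each position
    transitions -- (K, K) score of moving from tag i to tag j
    """
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag at t = 0
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j] = score of ending at tag j at time t via tag i at time t-1
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]           # best final tag
    for t in range(T - 1, 0, -1):          # follow the back-pointers
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```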

## Links

* [Wiki - Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field)
* [**An Introduction to Conditional Random Fields**](https://www.research.ed.ac.uk/portal/files/10482724/crftut_fnt.pdf)

### Wikipedia

* [Graphical model](https://en.wikipedia.org/wiki/Graphical_model)
* [Clique (graph theory)](https://en.wikipedia.org/wiki/Clique_(graph_theory))
* [Markov random field](https://en.wikipedia.org/wiki/Markov_random_field)
* [Conditional random field](https://en.wikipedia.org/wiki/Conditional_random_field)

### Tools

* [kmkurn/pytorch-crf: (Linear-chain) Conditional random field in PyTorch.](https://github.com/kmkurn/pytorch-crf)
* [pytorch-crf — pytorch-crf 0.7.2 documentation](https://pytorch-crf.readthedocs.io/en/stable/)
* [CRF++](https://taku910.github.io/crfpp/)
* [github](https://github.com/taku910/crfpp)
* [TensorFlow CRF](https://www.tensorflow.org/api_docs/python/tf/contrib/crf)
* [github](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf)
* [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/)
* [github](https://github.com/TeamHG-Memex/sklearn-crfsuite/)
Empty file.
2 changes: 2 additions & 0 deletions Algorithm/HMM/HMM_Text_Segmentation/HMMLearn.py
@@ -0,0 +1,2 @@
from hmmlearn import hmm

8 changes: 8 additions & 0 deletions Algorithm/HMM/HMM_Text_Segmentation/HMM_FromScratch.py
@@ -0,0 +1,8 @@
import numpy as np

def log_normalize(vector):
    # Normalize a vector of unnormalized probabilities in log space: log(v_i / sum(v)).
    return np.log(vector) - np.log(np.sum(vector))

def log_sum(vector):
    # log-sum-exp of a vector of log values, shifted by the max for numerical stability.
    max_val = np.max(vector)
    return max_val + np.log(np.sum(np.exp(vector - max_val)))


8 changes: 7 additions & 1 deletion Algorithm/LogisticRegression/LogisticRegression.md
@@ -38,7 +38,7 @@ For each piece of data in the dataset:

## Multiple Classes

### [Multinomial](../MEM/MEM.md) - Softmax Regression (SMR)
### Multinomial - Softmax Regression (SMR)

> Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive)
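
A minimal NumPy sketch of the softmax mapping this describes, turning the $k$ class scores of one sample into mutually exclusive class probabilities (the `scores` values are arbitrary):

```python
import numpy as np

def softmax(z):
    # Shift by the max before exponentiating for numerical stability.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # outputs of k = 3 linear models for one sample
probs = softmax(scores)
print(probs, probs.sum())            # probabilities sum to 1; argmax picks the class
```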

@@ -57,6 +57,10 @@

### Book

Dive into Deep Learning

* [Ch3.4. Softmax Regression](http://d2l.ai/chapter_linear-networks/softmax-regression.html)

Machine Learning in Action

* Ch5 Logistic Regression
@@ -93,3 +97,5 @@ Multinomial (softmax)

* [2 Ways to Implement Multinomial Logistic Regression in Python](http://dataaspirant.com/2017/05/15/implement-multinomial-logistic-regression-python/) - use scikit learn
* [Machine Learning and Data Science: Multinomial (Multiclass) Logistic Regression](https://www.pugetsystems.com/labs/hpc/Machine-Learning-and-Data-Science-Multinomial-Multiclass-Logistic-Regression-1007/)
* [mlxtend - Softmax Regression](https://rasbt.github.io/mlxtend/user_guide/classifier/SoftmaxRegression/)
* [jupyter notebook](https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/softmax-regression.ipynb)
85 changes: 85 additions & 0 deletions Algorithm/MEM/MEM.md
@@ -0,0 +1,85 @@
# Maximum Entropy Model

Also known as the Maximum Entropy Classifier / [Multinomial Logistic Regression - i.e. Softmax](../LogisticRegression/LogisticRegression.md#Multinomial---Softmax-Regression-(SMR)).

> Can be considered the mother of other algorithms, e.g.
>
> [Conditional Random Field](../CRF/CRF.md)

## Brief Description

### Quick View

Category|Usage|Mathematics|Application Field
--------|-----|-----------|-----------------
Supervised Learning|Classification|Entropy|Many

## Concept

### The MEM Model

#### Background

Consider a machine learning problem:

* $x = (x_1, x_2, \dots, x_m)$ is the input feature vector
* $y \in \{1, 2, \dots, k\}$ => a $k$-class classification problem

Given $k$ linear models, each of dimension $m$:

$$
\phi_i(x) = w_{i1}x_1 + w_{i2}x_2 + \cdots + w_{im}x_m,~~~1\leq i \leq k
$$

Prediction "class" $\hat{y}$ is the maximum "score" for each linear model output.

$$
\hat{y} = \arg\max_{1\leq i \leq k} \phi_i(x)
$$
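
A small NumPy sketch of this prediction rule; the weight matrix `W` and input `x` are illustrative values, not taken from any dataset:

```python
import numpy as np

# Hypothetical weights: k = 3 classes over m = 4 features (row i holds w_{i1}, ..., w_{im}).
W = np.array([[ 0.2, -0.1,  0.5,  0.0],
              [ 0.1,  0.3, -0.2,  0.4],
              [-0.3,  0.2,  0.1,  0.1]])
x = np.array([1.0, 2.0, 0.5, -1.0])

phi = W @ x                   # phi_i(x) for each of the k linear models
y_hat = int(np.argmax(phi))   # predicted class = index of the highest score
print(phi, y_hat)
```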

TBD





### Training the Model

* GIS Algorithm
* IIS Algorithm
* Gradient Descent
* [Quasi-Newton Method](https://en.wikipedia.org/wiki/Quasi-Newton_method) (擬牛頓法) - L-BFGS Algorithm

#### GIS Algorithm

> GIS stands for Generalized Iterative Scaling
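
A rough NumPy sketch of the GIS update for a conditional maximum entropy model. The feature layout `F[i, y, j]` (feature $j$ evaluated on example $i$ with candidate class $y$) is an illustrative choice, features are assumed non-negative, and the classic requirement that the per-example feature sums be constant (usually met with a slack feature) is glossed over, so treat this as a sketch rather than a reference implementation:

```python
import numpy as np

def gis_train(F, y_true, n_iter=100):
    """GIS for a conditional MaxEnt model.

    F      -- array (N, K, J): value of feature j on example i with candidate class y
    y_true -- array (N,): observed class of each example
    """
    N, K, J = F.shape
    C = F.sum(axis=2).max()                        # GIS constant: max total feature mass
    emp = F[np.arange(N), y_true, :].mean(axis=0)  # empirical feature expectations
    lam = np.zeros(J)
    for _ in range(n_iter):
        scores = F @ lam                           # (N, K) unnormalized log-scores
        scores -= scores.max(axis=1, keepdims=True)
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)          # p(y | x_i)
        model = np.einsum('nk,nkj->j', p, F) / N   # model feature expectations
        lam += np.log((emp + 1e-12) / (model + 1e-12)) / C
    return lam
```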

#### IIS Algorithm

> IIS stands for Improved Iterative Scaling, an improvement over [GIS](#GIS-Algorithm)

### Solving Overfitting

* Feature Selection: throw out rare features
* Feature Induction: pick useful features (improves performance)
* Smoothing

### Feature Selection

### Feature Induction

### Smoothing

## Application

> MEM is a classification model. It can be applied to sequence labeling problems, but it is not well suited to them.
> In POS tagging, for example, a per-token classifier may not take the global context of the sentence into account.

### POS Tagging

## Resources

### Wikipedia

* [Principle of maximum entropy - Maximum entropy models](https://en.wikipedia.org/wiki/Principle_of_maximum_entropy#Maximum_entropy_models)
* [Multinomial logistic regression (Maximum entropy classifier)](https://en.wikipedia.org/wiki/Maximum_entropy_classifier)
35 changes: 31 additions & 4 deletions Algorithm/NaiveBayes/NaiveBayes.md
@@ -2,7 +2,9 @@

## Brief Description

Naive bayes are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
Naive Bayes is a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of *conditional independence* between every pair of features given the value of the class variable.

> But in the real world, most things are not conditionally independent, e.g. context in NLP.

### Quick View

@@ -21,6 +23,10 @@ Supervised Learning|Classification|Bayes' Theorem|

## Concept

### Bayes Decision

Posterior prob. = (Likelihood * Prior prob.) / Evidence
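
In symbols, for a class $c$ and feature vector $x$ (a restatement of the line above):

$$
\underbrace{P(c \mid x)}_{\text{posterior}} = \frac{\overbrace{P(x \mid c)}^{\text{likelihood}} \cdot \overbrace{P(c)}^{\text{prior}}}{\underbrace{P(x)}_{\text{evidence}}}
$$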

### Bayes' Theorem

$$
@@ -35,10 +41,30 @@

### Real-world conditions

* We predict label by multiplying them. But if any of these probability is 0, then we will get 0 when we multiply them. To lessen the impact of this, we'll initialize all of our occurence counts to 1, and initialize the denominators to 2. (for binary classifier)
* Another problem is **Underflow**: doing too many multiplications of small numbers. (In programming, multiply many small numbers will eventually rounds off to 0)
* We predict the label by multiplying probabilities. But if any of these [probabilities is 0](#Zero-Probability-=>-Smoothing), the whole product becomes 0. To lessen the impact of this, we'll initialize all of our occurrence counts to 1 and initialize the denominators to 2 (for a binary classifier).
* Another problem is **underflow**: multiplying many small numbers together eventually rounds off to 0, which is called **floating-point underflow**.
* Solution 1: Take the natural logarithm of this product (see the sketch below)
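
A tiny NumPy sketch of the underflow problem and the log-space fix mentioned above (the probabilities are arbitrary illustrative values):

```python
import numpy as np

probs = np.full(1000, 1e-5)      # 1000 small conditional probabilities

print(np.prod(probs))            # 0.0 -- the product underflows
print(np.sum(np.log(probs)))     # about -11512.9 -- still comparable between classes
```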

#### Zero Probability => Smoothing

> original: $P(w_k|c_j) = \displaystyle\frac{n_k}{n}$

m-estimation: $P(w_k|c_j) = \displaystyle\frac{n_k + mp}{n + m}$

> additional m "virtual samples" distributed according to p

## Application

### Document Classification/Categorization

Smoothing using Laplace smoothing (the m-estimate with $mp = 1$ and $m = |\operatorname{Vocabulary}|$)

$$
P(w_k|c_j) = \frac{n_k + 1}{n + |\operatorname{Vocabulary}|}
$$
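
A small NumPy sketch of this smoothed estimate; `word_counts` is an illustrative vector of word counts $n_k$ for one class:

```python
import numpy as np

def laplace_word_probs(word_counts, vocab_size):
    # P(w_k | c_j) = (n_k + 1) / (n + |Vocabulary|), with n = total count in the class.
    n = word_counts.sum()
    return (word_counts + 1) / (n + vocab_size)

counts = np.array([3, 0, 1, 0])                  # toy counts; unseen words still get mass
print(laplace_word_probs(counts, len(counts)))   # [0.5, 0.125, 0.25, 0.125]
```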

### Word Sense Disambiguation

## TODO

* Figure out why the log mode in the predictOne function has lower accuracy when using + than the original mode using *. ([Line 66](NaiveBayes_Nursery/NaiveBayes_Nursery_sklearn.py))
@@ -55,5 +81,6 @@

## Wikipedia

* [Additive smoothing (Laplace smoothing)](https://en.wikipedia.org/wiki/Additive_smoothing)
* [Bayesian Machine Learning](http://fastml.com/bayesian-machine-learning/)
* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
30 changes: 15 additions & 15 deletions Algorithm/PCA/PCA.md
@@ -10,9 +10,9 @@ A method for doing dimensionality reduction by transforming the feature space to

### Quick View

Category|Usage|Methematics|Application Field
--------|-----|-----------|-----------------
Unsupervised Learning|Dimensionality Reduction|Orthogonal, Covariance Matrix, Eigenvalue Analysis|
| Category | Usage | Mathematics | Application Field |
| --------------------- | ------------------------ | ------------------------------------------------------------------------ | ----------------- |
| Unsupervised Learning | Dimensionality Reduction | Orthogonal, Covariance Matrix, Eigenvalue Analysis, Lagrange Multipliers | |

## Concepts

@@ -22,13 +22,13 @@ Steps

* Take the first principal component to be in the direction of the largest variability of the data
* The second principal component will be in the direction orthogonal to the first principal component
> (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix)
> (We can get these values by taking the covariance matrix of the dataset and doing eigenvalue analysis on the covariance matrix)
* Once we have the eigenvectors of the covariance matrix, we can take the top N eigenvectors => N most important feature
* Multiply the data by the top N eigenvectors to transform our data into the new space

Pseudocode

```
```txt
Remove the mean
Compute the covariance matrix
Find the eigenvalues and eigenvectors of the covariance matrix
@@ -42,11 +42,11 @@ Transform the data into the new space created by the top N eigenvectors
Variables

* m x n matrix: $X$
* In practice, column vectors of $X$ are positively correlated
* the hypothetical factors that account for the score should be uncorrelated
* In practice, column vectors of $X$ are positively correlated
* the hypothetical factors that account for the score should be uncorrelated
* orthogonal vectors: $\vec{y}_1, \vec{y}_2, \dots, \vec{y}_r$
* We require that the vectors span $R(X)$
* and hence the number of vectors, $r$, should be euqal to the rank of $X$
* We require that the vectors span $R(X)$
* and hence the number of vectors, $r$, should be equal to the rank of $X$

The covariance matrix is
$$
@@ -88,11 +88,11 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal.
## Reference

* Linear Algebra with Applications
* Ch 5 Orthogonality
* Ch 6 Eigenvalues
* Ch 6.5 Application 4 - PCA
* Ch 7.5 Orthogonal Transformations
* Ch 7.6 The Eigenvalue Problem
* Ch 5 Orthogonality
* Ch 6 Eigenvalues
* Ch 6.5 Application 4 - PCA
* Ch 7.5 Orthogonal Transformations
* Ch 7.6 The Eigenvalue Problem

## Links

@@ -101,7 +101,7 @@ it follows that $\vec{y_1}$ and $\vec{y_2}$ are orthogonal.
### Tutorial

* [**Siraj Raval - Dimensionality Reduction**](https://www.youtube.com/watch?v=jPmV3j1dAv4)
* [Github](https://github.com/llSourcell/Dimensionality_Reduction)
* [Github](https://github.com/llSourcell/Dimensionality_Reduction)

### Scikit Learn

46 changes: 27 additions & 19 deletions Algorithm/SVM/SVM.md
@@ -10,9 +10,9 @@ Support Vector Machines (SVM) are learning systems that use a hypothesis space o

### Quick View

Category|Usage|Methematics|Application Field
--------|-----|-----------|-----------------
Supervised Learning|Classification (Main), Regression, Outliers Detection Clustering (Unsupervised)|Convex Optimization, Constrained Optimization, Lagrange Multipliers|Numerous
| Category | Usage | Mathematics | Application Field |
| ------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------- | ----------------- |
| Supervised Learning | Classification (Main), Regression, Outliers Detection, Clustering (Unsupervised) | Convex Optimization, Constrained Optimization, Lagrange Multipliers | Numerous |

* Support Vector Machine is suited for extreme cases (small sample sets)
* SVM finds a hyper-plane that separates its training data in such a way that the distance between the hyper-plane and the closest points from each class is maximized
@@ -39,12 +39,12 @@ Disadvantage
* Poor performance when features >> samples
* SVMs do not provide probability estimates

SVM vs. Perceptron|SVM|Perceptron / NN
------------------|---|----------
**Solving Problem**|Optimization|Iteration
**Optimal**|Global (∵ convex)|Local
**Non-linear Seprable**|Higher dimension|Stack multi-layer model
**Performance**|Better with prior knowledge|Skip feature engineering step
| SVM vs. Perceptron | SVM | Perceptron / NN |
| ----------------------- | --------------------------- | ----------------------------- |
| **Solving Problem** | Optimization | Iteration |
| **Optimal** | Global (∵ convex) | Local |
| **Non-linear Separable** | Higher dimension | Stack multi-layer model |
| **Performance** | Better with prior knowledge | Skip feature engineering step |

## Terminology

@@ -110,22 +110,30 @@
### Kernel Function

* Use a kernel trick to reduce the computational cost
* Kernel Function: Transform a non-linear space into a linear space
* Popular kernel types
* Linear Kernel
* Kernel Function: Transform a non-linear space into a linear space
* Popular kernel types
* Linear Kernel

$K(x, y) = x \times y$
$\kappa(x_i, x_j) = x_i^T x_j$

* Polynomial Kernel
> $K(x, y) = x \times y$

* Polynomial Kernel

$\kappa(x_i, x_j) = (x_i^T x_j)^d$

When $d = 1$ it reduces to the linear kernel

> $K(x, y) = (x \times y + 1)^d$, $d \geq 0$

$K(x, y) = (x \times y + 1)^d$
* Radial Basis Function (RBF) Kernel

* Radial Basis Function (RBF) Kernel
$\kappa(x_i, x_j) = \exp(-\frac{||x_i - x_j||^2}{2\sigma^2})$, where $\sigma > 0$ is the width of the RBF kernel

$K(x, y) = e^{-\gamma ||x-y||^2}$
> $K(x, y) = e^{-\gamma ||x-y||^2}$

* Sigmoid Kernel
* ...
* Sigmoid Kernel
* ...
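
A small NumPy sketch of the linear, polynomial, and RBF kernels listed above, following the $\kappa(x_i, x_j)$ form (the sample vectors and parameter values are arbitrary):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, d=2):
    return (xi @ xj) ** d                  # d = 1 reduces to the linear kernel

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

x_i, x_j = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x_i, x_j), polynomial_kernel(x_i, x_j), rbf_kernel(x_i, x_j))
```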

### Tune Parameter
