ML.txt
ML-Algorithms
1. Regression: dealing with linear relationships
   Linear Regression
   --> Robust Regression
2. Multiple Regression, Feature importance
   2a. OLS
   2b. Gradient Descent (GD)
       --> Batch GD
       --> Stochastic GD
           ==> SGDRegressor
       --> Mini-Batch GD
3. Regularized Regression:
   3a. Ridge
   3b. Lasso
   3c. Elastic Net
4. Polynomial Regression
5. Performance evaluation
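The regression variants above can be sketched with scikit-learn. This is a minimal illustration on synthetic data — the dataset, alpha values, and degree are arbitrary choices, not recommendations:

```python
import numpy as np
from sklearn.linear_model import (ElasticNet, Lasso, LinearRegression,
                                  Ridge, SGDRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.3, size=200)

# OLS vs. stochastic gradient descent (SGDRegressor is a linear model trained by SGD;
# scaling the features first helps SGD converge)
ols = LinearRegression().fit(X, y)
sgd = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0)).fit(X, y)

# Regularized variants: Ridge (L2 penalty), Lasso (L1), Elastic Net (a mix of both)
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)

# Polynomial regression: expand features, then fit an ordinary linear model
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# Performance evaluation: MSE on the training data (the quadratic fit should win here)
print("OLS MSE: ", mean_squared_error(y, ols.predict(X)))
print("Poly MSE:", mean_squared_error(y, poly.predict(X)))
```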
2. Classification: dealing with non-linear relationships
1. Logistic Regression (sigmoid curve): stochastic gradient descent
   Estimate coefficients
   Performance measure: Stratified K-fold
   Confusion matrix
   Precision
   Recall
   F1 Score
   Precision/Recall trade-off
   ROC
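The metrics listed above can be computed from out-of-fold predictions under stratified K-fold. A minimal sketch on a synthetic imbalanced problem (the dataset and fold count are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Mildly imbalanced binary problem: ~70% class 0, ~30% class 1
X, y = make_classification(n_samples=500, n_features=10, weights=[0.7, 0.3],
                           random_state=0)

clf = LogisticRegression(max_iter=1000)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: every sample is predicted by a model that never saw it
y_pred = cross_val_predict(clf, X, y, cv=skf)
y_score = cross_val_predict(clf, X, y, cv=skf, method="predict_proba")[:, 1]

print("Confusion matrix:\n", confusion_matrix(y, y_pred))
print("Precision:", precision_score(y, y_pred))
print("Recall:   ", recall_score(y, y_pred))
print("F1 score: ", f1_score(y, y_pred))
print("ROC AUC:  ", roc_auc_score(y, y_score))
```

Raising the decision threshold on `y_score` trades recall for precision — that is the precision/recall trade-off in the notes.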
2. SVM:
   Linear SVM Classification
   Polynomial Kernel
   Radial Basis Function / Gaussian Kernel
   Support Vector Regression
   Grid Search
   Hyperparameter tuning
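Grid search over an RBF-kernel SVM's two main hyperparameters can be sketched as follows (the moons dataset and the candidate grid are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A classic non-linearly-separable toy dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# SVMs are distance-based, so scale the features before the kernel
pipe = Pipeline([("scale", StandardScaler()),
                 ("svm", SVC(kernel="rbf"))])

# C controls margin softness, gamma controls the RBF kernel width;
# GridSearchCV tries every combination with 5-fold cross-validation
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10], "svm__gamma": [0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print("Best params:     ", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)
```

Swapping `kernel="poly"` (with a `degree` parameter) gives the polynomial-kernel variant, and `sklearn.svm.SVR` is the regression counterpart.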
3. Decision Tree:
   1. Classification: Graphviz (ibmhr)
      Bagging Classifier (bootstrap - reduces variance)
      ID3, C4.5, C5.0, CART, CHAID
      Gini
      Entropy
      Information Gain
   2. Regression: Regularization
      HR attrition prediction
2. Random Forest
3. AdaBoost
   Feature importance revisited
   SGDClassifier scored via cross_val_score (both evaluations give the same result)
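The tree-and-ensemble progression above (single tree, bagging, random forest, AdaBoost, feature importances) can be sketched on synthetic data. scikit-learn's trees implement CART; all sizes and depths here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)

# A single CART tree; criterion can be "gini" or "entropy" (information gain)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
print("tree CV accuracy:   ", cross_val_score(tree, X, y, cv=5).mean())

# Bagging: bootstrap-resampled trees averaged together to reduce variance
bag = BaggingClassifier(n_estimators=50, random_state=0)
print("bagging CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())

# Random Forest and AdaBoost, with per-feature importances revisited
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
for name, model in [("forest  ", forest), ("adaboost", boost)]:
    top = sorted(enumerate(model.feature_importances_), key=lambda t: -t[1])[:3]
    print(name, "top feature indices:", [i for i, _ in top])
```

For the Graphviz visualisation mentioned in the notes, `sklearn.tree.export_graphviz(tree)` emits DOT source for a fitted tree.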
Data Pre-Processing
1. Standardization / Mean removal / Variance scaling
   . Min-Max, or scaling features to a range
   . Normalization
   . Binarization
   . Encoding categorical features
     => LabelEncoder
     => One-Hot / One-of-K Encoding
2. Bias-Variance Trade-off
   . Validation Curve
   . Learning Curve
3. Cross-Validation - Hold-out CV, K-fold CV, Stratified K-fold
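Each of the pre-processing transforms above has a direct scikit-learn counterpart. A minimal sketch on a tiny hand-made array (the values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import (Binarizer, LabelEncoder, MinMaxScaler,
                                   Normalizer, OneHotEncoder, StandardScaler)

X = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [5.0,  2.0]])

# Standardization: zero mean, unit variance per column
print(StandardScaler().fit_transform(X))

# Min-Max: rescale each column to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Normalization: scale each *row* to unit norm (note: rows, not columns)
print(Normalizer().fit_transform(X))

# Binarization: threshold values to 0/1
print(Binarizer(threshold=0.0).fit_transform(X))

# Categorical encodings: integer codes vs. one-of-K indicator columns
colors = np.array(["red", "green", "blue", "green"])
print(LabelEncoder().fit_transform(colors))          # codes follow sorted class order
print(OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray())
```

`LabelEncoder` is meant for targets; for input features, `OneHotEncoder` (or `OrdinalEncoder`) is the usual choice, since integer codes impose a spurious ordering.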
Imbalanced Datasets: two resampling approaches
1. Using resampling techniques to balance the data:
   1. Under-sampling: balances the dataset by reducing the size of the abundant class.
      --> Because this technique removes data, it should never be used when data is scarce
      --> It can be used effectively when data is abundant
   2. Over-sampling: balances the dataset by increasing the number of rare-class samples,
      using repetition, bootstrapping, etc.
      --> This method is preferred when the dataset is not huge
Statistical resampling is a technique used to balance all the classes present in a data
sample, to improve accuracy and to quantify the uncertainty of a population parameter.
Major takeaway points:
> There is no absolute advantage of one resampling method over another
> A few rules of thumb for over- and under-sampling:
  >> Consider testing under-sampling when you have a lot of data
  >> Consider testing over-sampling when you don't have a lot of data
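Both resampling directions can be sketched with `sklearn.utils.resample` (a dedicated library such as imbalanced-learn offers richer options; this is the bare-bones version on synthetic data):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)           # 95% majority class, 5% minority

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Under-sampling: shrink the abundant class down to the rare class size (data loss!)
X_maj_down, y_maj_down = resample(X_maj, y_maj, replace=False,
                                  n_samples=len(y_min), random_state=0)
X_under = np.vstack([X_maj_down, X_min])
y_under = np.concatenate([y_maj_down, y_min])

# Over-sampling: bootstrap (sample with replacement) the rare class up to majority size
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)
X_over = np.vstack([X_maj, X_min_up])
y_over = np.concatenate([y_maj, y_min_up])

print("under-sampled class counts:", np.bincount(y_under))   # balanced, small
print("over-sampled class counts: ", np.bincount(y_over))    # balanced, large
```

Resample only the training split — resampling before the train/test split leaks duplicated minority samples into the test set.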
#### Blogs:
1.https://sebastianraschka.com/
2.https://explained.ai/
3.https://ruder.io/
4.https://distill.io/
5.https://iamtrask.github.io/
6.https://cs.stanford.edu/people/karpathy/
7.https://colah.github.io/posts/2014-10-Visualizing-MNIST/
8.https://machinelearningmastery.com/
9.https://www.analyticsvidhya.com/myfeed/
10.https://actsusanli.medium.com/
11.http://www.becominghuman.org/
12.https://www.datadriveninvestor.com/
13.https://towardsdatascience.com/
14.https://medium.com/
setup.sh
mkdir -p ~/.streamlit/
cat > ~/.streamlit/config.toml <<EOF
[server]
port = $PORT
enableCORS = false
headless = true
EOF
Procfile
web: sh setup.sh && streamlit run app.py