<h1 id="example-code-and-data-for-practical-data-science-with-r-by-nina-zumel-and-john-mount-manning-2014.">Example code and data for "Practical Data Science with R" by Nina Zumel and John Mount, Manning 2014.</h1>
<ul>
<li>The book: <a href="http://www.manning.com/zumel/">"Practical Data Science with R" by Nina Zumel and John Mount, Manning 2014</a> (book copyright Manning Publications Co., all rights reserved)</li>
<li>The support site: <a href="https://github.com/WinVector/zmPDSwR">GitHub WinVector/zmPDSwR</a></li>
</ul>
<h2 id="the-code-and-data-in-this-directory-supports-examples-from">The code and data in this directory support examples from:</h2>
<ul>
<li>Chapter 5: Choosing and Evaluating Models</li>
<li>Chapter 9: Exploring Advanced Methods</li>
</ul>
<p>Data from http://archive.ics.uci.edu/ml/datasets/Spambase (retrieved 4-26-2013). The data file is: http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data</p>
<pre class="sourceCode bash"><code class="sourceCode bash">$ <span class="kw">shasum</span> spambase.data
<span class="kw">e28aa2a7d4592b4f5f71347912c1b4b759336b58</span> spambase.data</code></pre>
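<p>The same check can be done from within R. The following is a minimal sketch, assuming the third-party <code>digest</code> package is installed; the helper name <code>checkSha1</code> is hypothetical and not part of the book's code. To stay self-contained it is demonstrated on a small temporary file; for the real data you would pass <code>'spambase.data'</code> and the hash shown above.</p>
<pre class="sourceCode R"><code class="sourceCode r">library('digest')  # assumption: install.packages('digest') has been run

# Hypothetical helper: TRUE if the file's SHA-1 matches the expected hex string.
checkSha1 <- function(path, expected) {
  # file=TRUE hashes the file contents rather than the path string
  actual <- digest(path, algo='sha1', file=TRUE)
  identical(tolower(actual), tolower(expected))
}

# Demonstration on a temporary file holding the three bytes "abc"
# (a standard SHA-1 test vector).
tmp <- tempfile()
writeBin(charToRaw('abc'), tmp)
abcOk <- checkSha1(tmp, 'a9993e364706816aba3e25717850c26c9cd0d89d')
print(abcOk)

# For the downloaded data file (not run here):
# checkSha1('spambase.data', 'e28aa2a7d4592b4f5f71347912c1b4b759336b58')</code></pre>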
<p>Data preparation steps:</p>
<pre class="sourceCode R"><code class="sourceCode r">spamD <-<span class="st"> </span><span class="kw">read.table</span>(<span class="st">'http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'</span>,<span class="dt">sep=</span><span class="st">','</span>,<span class="dt">header=</span>F)
spamCols <-<span class="st"> </span><span class="kw">c</span>(
<span class="st">'word.freq.make'</span>, <span class="st">'word.freq.address'</span>, <span class="st">'word.freq.all'</span>,
<span class="st">'word.freq.3d'</span>, <span class="st">'word.freq.our'</span>, <span class="st">'word.freq.over'</span>, <span class="st">'word.freq.remove'</span>,
<span class="st">'word.freq.internet'</span>, <span class="st">'word.freq.order'</span>, <span class="st">'word.freq.mail'</span>,
<span class="st">'word.freq.receive'</span>, <span class="st">'word.freq.will'</span>, <span class="st">'word.freq.people'</span>,
<span class="st">'word.freq.report'</span>, <span class="st">'word.freq.addresses'</span>, <span class="st">'word.freq.free'</span>,
<span class="st">'word.freq.business'</span>, <span class="st">'word.freq.email'</span>, <span class="st">'word.freq.you'</span>,
<span class="st">'word.freq.credit'</span>, <span class="st">'word.freq.your'</span>, <span class="st">'word.freq.font'</span>,
<span class="st">'word.freq.000'</span>, <span class="st">'word.freq.money'</span>, <span class="st">'word.freq.hp'</span>, <span class="st">'word.freq.hpl'</span>,
<span class="st">'word.freq.george'</span>, <span class="st">'word.freq.650'</span>, <span class="st">'word.freq.lab'</span>,
<span class="st">'word.freq.labs'</span>, <span class="st">'word.freq.telnet'</span>, <span class="st">'word.freq.857'</span>,
<span class="st">'word.freq.data'</span>, <span class="st">'word.freq.415'</span>, <span class="st">'word.freq.85'</span>,
<span class="st">'word.freq.technology'</span>, <span class="st">'word.freq.1999'</span>, <span class="st">'word.freq.parts'</span>,
<span class="st">'word.freq.pm'</span>, <span class="st">'word.freq.direct'</span>, <span class="st">'word.freq.cs'</span>,
<span class="st">'word.freq.meeting'</span>, <span class="st">'word.freq.original'</span>, <span class="st">'word.freq.project'</span>,
<span class="st">'word.freq.re'</span>, <span class="st">'word.freq.edu'</span>, <span class="st">'word.freq.table'</span>,
<span class="st">'word.freq.conference'</span>, <span class="st">'char.freq.semi'</span>, <span class="st">'char.freq.lparen'</span>,
<span class="st">'char.freq.lbrack'</span>, <span class="st">'char.freq.bang'</span>, <span class="st">'char.freq.dollar'</span>,
<span class="st">'char.freq.hash'</span>, <span class="st">'capital.run.length.average'</span>,
<span class="st">'capital.run.length.longest'</span>, <span class="st">'capital.run.length.total'</span>,
<span class="st">'spam'</span>
)
<span class="kw">colnames</span>(spamD) <-<span class="st"> </span>spamCols
spamD$spam <-<span class="st"> </span><span class="kw">as.factor</span>(<span class="kw">ifelse</span>(spamD$spam><span class="fl">0.5</span>,<span class="st">'spam'</span>,<span class="st">'non-spam'</span>))
<span class="kw">set.seed</span>(<span class="dv">2350290</span>)
spamD$rgroup <-<span class="st"> </span><span class="kw">floor</span>(<span class="dv">100</span>*<span class="kw">runif</span>(<span class="kw">dim</span>(spamD)[[<span class="dv">1</span>]]))
<span class="co">#write.table(spamD,file='spamD.tsv',quote=F,sep='\t',row.names=F)</span></code></pre>
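<p>The <code>rgroup</code> column gives every row a pseudo-random integer from 0 to 99, so a reproducible train/test split is just a threshold on that column. The sketch below illustrates the idea on a synthetic data frame; the name <code>d</code> and its single column <code>x</code> are made up for illustration, and the threshold of 10 is the one used in the analysis steps.</p>
<pre class="sourceCode R"><code class="sourceCode r">set.seed(2350290)                       # fixed seed so the split is repeatable
d <- data.frame(x=seq_len(1000))        # synthetic stand-in for spamD
d$rgroup <- floor(100*runif(nrow(d)))   # integers in 0..99, roughly uniform

dTrain <- subset(d, rgroup >= 10)       # about 90% of the rows
dTest <- subset(d, rgroup < 10)         # about 10% of the rows

print(range(d$rgroup))
print(nrow(dTest)/nrow(d))              # near 0.1, but not exactly 0.1</code></pre>
<p>Because the group is stored with the data, a different split (for example a 20% hold-out) only needs a different threshold, not a new random draw.</p>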
<p>Analysis steps (first download https://raw.github.com/WinVector/zmPDSwR/master/Spambase/spamD.tsv):</p>
<pre class="sourceCode R"><code class="sourceCode r">spamD <-<span class="st"> </span><span class="kw">read.table</span>(<span class="st">'spamD.tsv'</span>,<span class="dt">header=</span>T,<span class="dt">sep=</span><span class="st">'</span><span class="ch">\t</span><span class="st">'</span>)
spamTrain <-<span class="st"> </span><span class="kw">subset</span>(spamD,spamD$rgroup>=<span class="dv">10</span>)
spamTest <-<span class="st"> </span><span class="kw">subset</span>(spamD,spamD$rgroup<<span class="dv">10</span>)
spamVars <-<span class="st"> </span><span class="kw">setdiff</span>(<span class="kw">colnames</span>(spamD),<span class="kw">list</span>(<span class="st">'rgroup'</span>,<span class="st">'spam'</span>))
spamFormula <-<span class="st"> </span><span class="kw">as.formula</span>(<span class="kw">paste</span>(<span class="st">'spam=="spam"'</span>,
<span class="kw">paste</span>(spamVars,<span class="dt">collapse=</span><span class="st">' + '</span>),<span class="dt">sep=</span><span class="st">' ~ '</span>))
spamModel <-<span class="st"> </span><span class="kw">glm</span>(spamFormula,<span class="dt">family=</span><span class="kw">binomial</span>(<span class="dt">link=</span><span class="st">'logit'</span>),
<span class="dt">data=</span>spamTrain)
spamTrain$pred <-<span class="st"> </span><span class="kw">predict</span>(spamModel,<span class="dt">newdata=</span>spamTrain,<span class="dt">type=</span><span class="st">'response'</span>)
spamTest$pred <-<span class="st"> </span><span class="kw">predict</span>(spamModel,<span class="dt">newdata=</span>spamTest,<span class="dt">type=</span><span class="st">'response'</span>)
trainSpamTable <-<span class="st"> </span><span class="kw">table</span>(<span class="dt">truth=</span>spamTrain$spam,
<span class="dt">prediction=</span>spamTrain$pred><span class="fl">0.5</span>)
testSpamTable <-<span class="st"> </span><span class="kw">table</span>(<span class="dt">truth=</span>spamTest$spam,
<span class="dt">prediction=</span>spamTest$pred><span class="fl">0.5</span>)
<span class="co"># sort(sample(1:(dim(spamTest)[[1]]),size=4,replace=F))</span>
<span class="co"># [1] 7 35 224 327</span>
sample <-<span class="st"> </span>spamTest[<span class="kw">c</span>(<span class="dv">7</span>,<span class="dv">35</span>,<span class="dv">224</span>,<span class="dv">327</span>),<span class="kw">c</span>(<span class="st">'spam'</span>,<span class="st">'pred'</span>)]
cM <-<span class="st"> </span><span class="kw">table</span>(<span class="dt">truth=</span>spamTest$spam,<span class="dt">prediction=</span>spamTest$pred><span class="fl">0.5</span>)
(cM[<span class="dv">1</span>,<span class="dv">1</span>]+cM[<span class="dv">2</span>,<span class="dv">2</span>])/<span class="kw">sum</span>(cM) <span class="co"># accuracy</span>
cM[<span class="dv">2</span>,<span class="dv">2</span>]/(cM[<span class="dv">2</span>,<span class="dv">2</span>]+cM[<span class="dv">1</span>,<span class="dv">2</span>]) <span class="co"># precision</span>
cM[<span class="dv">2</span>,<span class="dv">2</span>]/(cM[<span class="dv">2</span>,<span class="dv">2</span>]+cM[<span class="dv">2</span>,<span class="dv">1</span>]) <span class="co"># recall</span>
cM[<span class="dv">1</span>,<span class="dv">1</span>]/(cM[<span class="dv">1</span>,<span class="dv">1</span>]+cM[<span class="dv">1</span>,<span class="dv">2</span>]) <span class="co"># specificity</span>
t <-<span class="st"> </span><span class="kw">as.table</span>(<span class="kw">matrix</span>(<span class="dt">data=</span><span class="kw">c</span>(<span class="dv">288-1</span>,<span class="dv">17</span>,<span class="dv">1</span>,<span class="dv">13882-17</span>),<span class="dt">nrow=</span><span class="dv">2</span>,<span class="dt">ncol=</span><span class="dv">2</span>))
<span class="kw">rownames</span>(t) <-<span class="st"> </span><span class="kw">rownames</span>(cM)
<span class="kw">colnames</span>(t) <-<span class="st"> </span><span class="kw">colnames</span>(cM)
(t[<span class="dv">1</span>,<span class="dv">1</span>]+t[<span class="dv">2</span>,<span class="dv">2</span>])/<span class="kw">sum</span>(t) <span class="co"># accuracy</span>
t[<span class="dv">2</span>,<span class="dv">2</span>]/(t[<span class="dv">2</span>,<span class="dv">2</span>]+t[<span class="dv">1</span>,<span class="dv">2</span>]) <span class="co"># precision</span>
t[<span class="dv">2</span>,<span class="dv">2</span>]/(t[<span class="dv">2</span>,<span class="dv">2</span>]+t[<span class="dv">2</span>,<span class="dv">1</span>]) <span class="co"># recall</span>
t[<span class="dv">1</span>,<span class="dv">1</span>]/(t[<span class="dv">1</span>,<span class="dv">1</span>]+t[<span class="dv">1</span>,<span class="dv">2</span>]) <span class="co"># specificity</span>
<span class="co"># ROC curve</span>
<span class="kw">library</span>(<span class="st">'ROCR'</span>)
eval <-<span class="st"> </span><span class="kw">prediction</span>(spamTest$pred,spamTest$spam)
<span class="kw">plot</span>(<span class="kw">performance</span>(eval,<span class="st">"tpr"</span>,<span class="st">"fpr"</span>))
<span class="co"># AUC</span>
<span class="kw">attributes</span>(<span class="kw">performance</span>(eval,<span class="st">'auc'</span>))$y.values[[<span class="dv">1</span>]]</code></pre>
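<p>The four ratios computed from each confusion matrix above are, in order, accuracy, precision, recall, and specificity. As a sketch, they can be wrapped in one function; the name <code>confusionMetrics</code> is hypothetical, and the function assumes the layout used above (rows are truth, columns are prediction, negative class first). The counts are the same ones used to build the table <code>t</code>, rebuilt here so the block runs on its own.</p>
<pre class="sourceCode R"><code class="sourceCode r"># Hypothetical helper: summary ratios for a 2x2 confusion matrix with
# rows = truth (negative, positive) and columns = prediction (negative, positive).
confusionMetrics <- function(cm) {
  tn <- cm[1,1]; fp <- cm[1,2]
  fn <- cm[2,1]; tp <- cm[2,2]
  c(accuracy    = (tn+tp)/sum(cm),
    precision   = tp/(tp+fp),
    recall      = tp/(tp+fn),
    specificity = tn/(tn+fp))
}

# matrix() fills column by column, so the counts land as
# [1,1]=287, [2,1]=17, [1,2]=1, [2,2]=13865.
counts <- matrix(c(288-1, 17, 1, 13882-17), nrow=2, ncol=2,
                 dimnames=list(truth=c('non-spam','spam'),
                               prediction=c('FALSE','TRUE')))
m <- confusionMetrics(counts)
print(round(m, 4))</code></pre>
<p>Naming the cells once (<code>tn</code>, <code>fp</code>, <code>fn</code>, <code>tp</code>) makes it harder to swap an index by accident than repeating the <code>[i,j]</code> subscripts in each ratio.</p>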
<p>Saturated model / Bayes rate estimate</p>
<pre class="sourceCode R"><code class="sourceCode r">quantized <-<span class="st"> </span><span class="kw">subset</span>(spamD,T,<span class="dt">select=</span>spamVars)
quantized <-<span class="st"> </span><span class="kw">as.data.frame</span>(<span class="kw">lapply</span>(quantized,
function(col) { <span class="kw">ecdf</span>(col)(col) }))
quantized$groupId <-<span class="st"> </span><span class="kw">sapply</span>(<span class="dv">1</span>:<span class="kw">dim</span>(quantized)[[<span class="dv">1</span>]],
function(row) <span class="kw">paste</span>(<span class="kw">floor</span>(<span class="dv">5</span>*quantized[row,spamVars]),<span class="dt">collapse=</span><span class="st">' '</span>))
quantized$spam <-<span class="st"> </span>spamD$spam
satTable <-<span class="st"> </span><span class="kw">table</span>(quantized$groupId,quantized$spam)
quantized$satPred <-<span class="st"> </span><span class="kw">sapply</span>(<span class="dv">1</span>:(<span class="kw">dim</span>(quantized)[[<span class="dv">1</span>]]),function(rowNum) {
row <-<span class="st"> </span>satTable[quantized[rowNum,<span class="st">'groupId'</span>],]
row[<span class="st">'spam'</span>]>row[<span class="st">'non-spam'</span>]
})
quantized$groupCount <-<span class="st"> </span><span class="kw">table</span>(quantized$groupId)[quantized$groupId]
<span class="kw">table</span>(quantized$spam,quantized$satPred><span class="fl">0.5</span>)
repeated <-<span class="st"> </span><span class="kw">subset</span>(quantized,groupCount><span class="dv">1</span>)
cRepeated <-<span class="st"> </span><span class="kw">table</span>(repeated$spam,repeated$satPred><span class="fl">0.5</span>)
<span class="kw">print</span>(cRepeated)
<span class="co">#            FALSE TRUE</span>
<span class="co">#   non-spam  1022    2</span>
<span class="co">#   spam        31  848</span>
<span class="kw">print</span>((cRepeated[<span class="dv">1</span>,<span class="dv">1</span>]+cRepeated[<span class="dv">2</span>,<span class="dv">2</span>])/<span class="kw">sum</span>(cRepeated))
<span class="co"># [1] 0.982659</span></code></pre>
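<p>The saturated-model code above is dense, so here is the same idea as a small sketch on made-up data: turn each variable into its empirical quantile, coarsen the quantiles into bins, paste the bin ids into one key per row, and predict the majority outcome among rows sharing a key. The data frame <code>toy</code> and all its values are invented for illustration, and <code>do.call(paste, ...)</code> is a vectorized stand-in for the row-by-row <code>paste</code> used above, not the book's code.</p>
<pre class="sourceCode R"><code class="sourceCode r"># Toy data: two numeric variables and a two-level outcome (all values made up).
toy <- data.frame(
  x1 = c(0.1, 0.1, 0.1, 0.9, 0.9, 0.9),
  x2 = c(1, 1, 1, 7, 7, 7),
  y  = factor(c('non-spam','non-spam','spam','spam','spam','non-spam'),
              levels=c('non-spam','spam')))
vars <- c('x1','x2')

# Step 1: replace each value by its empirical quantile, a number in (0,1].
q <- as.data.frame(lapply(toy[,vars], function(col) ecdf(col)(col)))

# Step 2: coarsen to integer bins and paste the bin ids into one key per row.
bins <- unname(lapply(q, function(col) floor(5*col)))
key <- do.call(paste, bins)

# Step 3: count outcomes per key and predict the majority class for each row.
tab <- table(key, toy$y)
satPred <- unname(tab[key,'spam'] > tab[key,'non-spam'])

print(table(truth=toy$y, prediction=satPred))
satAccuracy <- mean(satPred == (toy$y == 'spam'))
print(satAccuracy)</code></pre>
<p>Rows that share a key but disagree on the outcome cannot all be classified correctly by any model that sees only these binned variables, which is why the agreement rate on repeated keys serves as a rough ceiling on achievable accuracy.</p>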
<h2 id="license-for-additional-documentation-notes-code-and-example-data">License for additional documentation, notes, code, and example data:</h2>
<p><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="http://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</p>
<p>No guarantee, indemnification or claim of fitness is made regarding any of these items.</p>
<p>No claim of license on works of others or derived data.</p>