Buzz/buzzm.html

<h1 id="markdown-version-of-buzz-data-analysis">Markdown version of Buzz data analysis</h1>
<p>by: Nina Zumel and John Mount Win-Vector LLC 11-8-2013</p>
<p>To run this example you need a system with R installed (see <a href="http://cran.r-project.org">http://cran.r-project.org</a>), and data from <a href="https://github.com/WinVector/zmPDSwR/tree/master/Buzz">https://github.com/WinVector/zmPDSwR/tree/master/Buzz</a>.</p>
<p>We are not perfoming any new analysis here, just supplying a direct application of Random Forests on the data.</p>
<p>Data from: <a href="http://ama.liglab.fr/datasets/buzz/">http://ama.liglab.fr/datasets/buzz/</a> Using: <a href="http://ama.liglab.fr/datasets/buzz/classification/TomsHardware/Relative_labeling/sigma=500/TomsHardware-Relative-Sigma-500.data">TomsHardware-Relative-Sigma-500.data</a></p>
<p>(described in <a href="http://ama.liglab.fr/datasets/buzz/classification/TomsHardware/Relative_labeling/sigma=500/TomsHardware-Relative-Sigma-500.names">TomsHardware-Relative-Sigma-500.names</a> )</p>
<p>Crypto hashes: shasum TomsHardware-<em>.txt </em> 5a1cc7863a9da8d6e8380e1446f25eec2032bd91 TomsHardware-Absolute-Sigma-500.data.txt * 86f2c0f4fba4fb42fe4ee45b48078ab51dba227e TomsHardware-Absolute-Sigma-500.names.txt * c239182c786baf678b55f559b3d0223da91e869c TomsHardware-Relative-Sigma-500.data.txt * ec890723f91ae1dc87371e32943517bcfcd9e16a TomsHardware-Relative-Sigma-500.names.txt</p>
<p>To run this example you need a system with R installed (see <a href="http://cran.r-project.org">cran</a>), Latex (see <a href="http://tug.org">tug</a>) and data from <a href="https://github.com/WinVector/zmPDSwR/tree/master/Buzz">zmPDSwR</a>.</p>
<p>To run this example: * Download buzzm.Rmd and TomsHardware-Relative-Sigma-500.data.txt from the github URL. * Start a copy of R, use setwd() to move to the directory you have stored the files. * Make sure knitr is loaded into R ( install.packages('knitr') and library(knitr) ). * In R run: (produces <a href="https://github.com/WinVector/zmPDSwR/blob/master/Buzz/buzzm.md">buzzm.md</a> from buzzm.Rmd).</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">knit</span>(<span class="st">&#39;buzzm.Rmd&#39;</span>)</code></pre>
<!-- set up caching and knitr chunk dependency calculation -->
<!-- note: you will want to do clean re-runs once in a while to make sure -->
<!-- you are not seeing stale cache results. -->
<p><code>{setup,tidy=F,cache=F,eval=T,echo=F,results='hide'} opts_chunk$set(autodep=T) dep_auto()</code></p>
<p>Now you can run the following data prep steps:</p>
<pre class="sourceCode r"><code class="sourceCode r">infile &lt;-<span class="st"> &quot;TomsHardware-Relative-Sigma-500.data.txt&quot;</span>
<span class="kw">paste</span>(<span class="st">&#39;checked at&#39;</span>,<span class="kw">date</span>())</code></pre>
<pre><code>## [1] &quot;checked at Fri Nov  8 11:21:37 2013&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">system</span>(<span class="kw">paste</span>(<span class="st">&#39;shasum&#39;</span>,infile),<span class="dt">intern=</span>T)  <span class="co"># write down file hash</span></code></pre>
<pre><code>## [1] &quot;c239182c786baf678b55f559b3d0223da91e869c  TomsHardware-Relative-Sigma-500.data.txt&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">buzzdata &lt;-<span class="st"> </span><span class="kw">read.table</span>(infile, <span class="dt">header=</span>F, <span class="dt">sep=</span><span class="st">&quot;,&quot;</span>)

makevars &lt;-<span class="st"> </span>function(colname, <span class="dt">ndays=</span><span class="dv">7</span>) {
  <span class="kw">paste</span>(colname, <span class="dv">0</span>:ndays, <span class="dt">sep=</span><span class="st">&#39;&#39;</span>)
}

varnames &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;num.new.disc&quot;</span>,
             <span class="st">&quot;burstiness&quot;</span>,
             <span class="st">&quot;number.total.disc&quot;</span>,
             <span class="st">&quot;auth.increase&quot;</span>,
             <span class="st">&quot;atomic.containers&quot;</span>, <span class="co"># not documented</span>
             <span class="st">&quot;num.displays&quot;</span>, <span class="co"># number of times topic displayed to user (measure of interest)</span>
             <span class="st">&quot;contribution.sparseness&quot;</span>, <span class="co"># not documented</span>
             <span class="st">&quot;avg.auths.per.disc&quot;</span>,
             <span class="st">&quot;num.authors.topic&quot;</span>, <span class="co"># total authors on the topic</span>
             <span class="st">&quot;avg.disc.length&quot;</span>,
             <span class="st">&quot;attention.level.author&quot;</span>,
             <span class="st">&quot;attention.level.contrib&quot;</span>
)

colnames &lt;-<span class="st"> </span><span class="kw">as.vector</span>(<span class="kw">sapply</span>(varnames, <span class="dt">FUN=</span>makevars))
colnames &lt;-<span class="st">  </span><span class="kw">c</span>(colnames, <span class="st">&quot;buzz&quot;</span>)
<span class="kw">colnames</span>(buzzdata) &lt;-<span class="st"> </span>colnames

<span class="co"># Split into training and test</span>
<span class="kw">set.seed</span>(2362690L)
rgroup &lt;-<span class="st"> </span><span class="kw">runif</span>(<span class="kw">dim</span>(buzzdata)[<span class="dv">1</span>])
buzztrain &lt;-<span class="st"> </span>buzzdata[rgroup &gt;<span class="st"> </span><span class="fl">0.1</span>,]
buzztest &lt;-<span class="st"> </span>buzzdata[rgroup &lt;=<span class="fl">0.1</span>,]</code></pre>
<p>This currently returns a training set with 7114 rows and a test set with 791 rows, which is the same as when this document was prepared.</p>
<p>Notice we have exploded the basic column names into the following:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">print</span>(colnames)</code></pre>
<pre><code>##  [1] &quot;num.new.disc0&quot;            &quot;num.new.disc1&quot;           
##  [3] &quot;num.new.disc2&quot;            &quot;num.new.disc3&quot;           
##  [5] &quot;num.new.disc4&quot;            &quot;num.new.disc5&quot;           
##  [7] &quot;num.new.disc6&quot;            &quot;num.new.disc7&quot;           
##  [9] &quot;burstiness0&quot;              &quot;burstiness1&quot;             
## [11] &quot;burstiness2&quot;              &quot;burstiness3&quot;             
## [13] &quot;burstiness4&quot;              &quot;burstiness5&quot;             
## [15] &quot;burstiness6&quot;              &quot;burstiness7&quot;             
## [17] &quot;number.total.disc0&quot;       &quot;number.total.disc1&quot;      
## [19] &quot;number.total.disc2&quot;       &quot;number.total.disc3&quot;      
## [21] &quot;number.total.disc4&quot;       &quot;number.total.disc5&quot;      
## [23] &quot;number.total.disc6&quot;       &quot;number.total.disc7&quot;      
## [25] &quot;auth.increase0&quot;           &quot;auth.increase1&quot;          
## [27] &quot;auth.increase2&quot;           &quot;auth.increase3&quot;          
## [29] &quot;auth.increase4&quot;           &quot;auth.increase5&quot;          
## [31] &quot;auth.increase6&quot;           &quot;auth.increase7&quot;          
## [33] &quot;atomic.containers0&quot;       &quot;atomic.containers1&quot;      
## [35] &quot;atomic.containers2&quot;       &quot;atomic.containers3&quot;      
## [37] &quot;atomic.containers4&quot;       &quot;atomic.containers5&quot;      
## [39] &quot;atomic.containers6&quot;       &quot;atomic.containers7&quot;      
## [41] &quot;num.displays0&quot;            &quot;num.displays1&quot;           
## [43] &quot;num.displays2&quot;            &quot;num.displays3&quot;           
## [45] &quot;num.displays4&quot;            &quot;num.displays5&quot;           
## [47] &quot;num.displays6&quot;            &quot;num.displays7&quot;           
## [49] &quot;contribution.sparseness0&quot; &quot;contribution.sparseness1&quot;
## [51] &quot;contribution.sparseness2&quot; &quot;contribution.sparseness3&quot;
## [53] &quot;contribution.sparseness4&quot; &quot;contribution.sparseness5&quot;
## [55] &quot;contribution.sparseness6&quot; &quot;contribution.sparseness7&quot;
## [57] &quot;avg.auths.per.disc0&quot;      &quot;avg.auths.per.disc1&quot;     
## [59] &quot;avg.auths.per.disc2&quot;      &quot;avg.auths.per.disc3&quot;     
## [61] &quot;avg.auths.per.disc4&quot;      &quot;avg.auths.per.disc5&quot;     
## [63] &quot;avg.auths.per.disc6&quot;      &quot;avg.auths.per.disc7&quot;     
## [65] &quot;num.authors.topic0&quot;       &quot;num.authors.topic1&quot;      
## [67] &quot;num.authors.topic2&quot;       &quot;num.authors.topic3&quot;      
## [69] &quot;num.authors.topic4&quot;       &quot;num.authors.topic5&quot;      
## [71] &quot;num.authors.topic6&quot;       &quot;num.authors.topic7&quot;      
## [73] &quot;avg.disc.length0&quot;         &quot;avg.disc.length1&quot;        
## [75] &quot;avg.disc.length2&quot;         &quot;avg.disc.length3&quot;        
## [77] &quot;avg.disc.length4&quot;         &quot;avg.disc.length5&quot;        
## [79] &quot;avg.disc.length6&quot;         &quot;avg.disc.length7&quot;        
## [81] &quot;attention.level.author0&quot;  &quot;attention.level.author1&quot; 
## [83] &quot;attention.level.author2&quot;  &quot;attention.level.author3&quot; 
## [85] &quot;attention.level.author4&quot;  &quot;attention.level.author5&quot; 
## [87] &quot;attention.level.author6&quot;  &quot;attention.level.author7&quot; 
## [89] &quot;attention.level.contrib0&quot; &quot;attention.level.contrib1&quot;
## [91] &quot;attention.level.contrib2&quot; &quot;attention.level.contrib3&quot;
## [93] &quot;attention.level.contrib4&quot; &quot;attention.level.contrib5&quot;
## [95] &quot;attention.level.contrib6&quot; &quot;attention.level.contrib7&quot;
## [97] &quot;buzz&quot;</code></pre>
<p>We are now ready to create a simple model predicting &quot;buzz&quot; as function of the other columns.</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># build a model</span>
<span class="co"># let&#39;s use all the input variables</span>
nlist =<span class="st"> </span>varnames
varslist =<span class="st"> </span><span class="kw">as.vector</span>(<span class="kw">sapply</span>(nlist, <span class="dt">FUN=</span>makevars))

<span class="co"># these were defined previously,  in Chapter 9</span>
loglikelihood &lt;-<span class="st"> </span>function(y, py) {
  pysmooth &lt;-<span class="st"> </span><span class="kw">ifelse</span>(py==<span class="dv">0</span>, <span class="fl">1e-12</span>,
                     <span class="kw">ifelse</span>(py==<span class="dv">1</span>, <span class="dv">1</span><span class="fl">-1e-12</span>, py))
  <span class="kw">sum</span>(y *<span class="st"> </span><span class="kw">log</span>(pysmooth) +<span class="st"> </span>(<span class="dv">1</span>-y)*<span class="kw">log</span>(<span class="dv">1</span> -<span class="st"> </span>pysmooth))
}
accuracyMeasures &lt;-<span class="st"> </span>function(pred, truth, <span class="dt">threshold=</span><span class="fl">0.5</span>, <span class="dt">name=</span><span class="st">&quot;model&quot;</span>) {
  dev.norm &lt;-<span class="st"> </span>-<span class="dv">2</span>*<span class="kw">loglikelihood</span>(<span class="kw">as.numeric</span>(truth), pred)/<span class="kw">length</span>(pred)
  ctable =<span class="st"> </span><span class="kw">table</span>(<span class="dt">truth=</span>truth,
                 <span class="dt">pred=</span>pred)
  accuracy &lt;-<span class="st"> </span><span class="kw">sum</span>(<span class="kw">diag</span>(ctable))/<span class="kw">sum</span>(ctable)
  precision &lt;-<span class="st"> </span>ctable[<span class="dv">2</span>,<span class="dv">2</span>]/<span class="kw">sum</span>(ctable[,<span class="dv">2</span>])
  recall &lt;-<span class="st"> </span>ctable[<span class="dv">2</span>,<span class="dv">2</span>]/<span class="kw">sum</span>(ctable[<span class="dv">2</span>,])
  f1 &lt;-<span class="st"> </span>precision*recall
  <span class="kw">print</span>(<span class="kw">paste</span>(<span class="st">&quot;precision=&quot;</span>, precision, <span class="st">&quot;; recall=&quot;</span> , recall))
  <span class="kw">print</span>(ctable)
  <span class="kw">data.frame</span>(<span class="dt">model=</span>name, <span class="dt">accuracy=</span>accuracy, <span class="dt">f1=</span>f1, dev.norm)
}


<span class="kw">library</span>(randomForest)</code></pre>
<pre><code>## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">bzFormula &lt;-<span class="st"> </span><span class="kw">paste</span>(<span class="st">&#39;as.factor(buzz) ~ &#39;</span>,<span class="kw">paste</span>(varslist,<span class="dt">collapse=</span><span class="st">&#39; + &#39;</span>))
fmodel &lt;-<span class="st"> </span><span class="kw">randomForest</span>(<span class="kw">as.formula</span>(bzFormula),
                      <span class="dt">data=</span>buzztrain,
                      <span class="dt">mtry=</span><span class="kw">floor</span>(<span class="kw">sqrt</span>(<span class="kw">length</span>(varslist))),
                      <span class="dt">ntree=</span><span class="dv">101</span>,
                      <span class="dt">importance=</span>T)

<span class="kw">print</span>(<span class="st">&#39;training&#39;</span>)</code></pre>
<pre><code>## [1] &quot;training&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">rtrain &lt;-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">truth=</span>buzztrain$buzz, <span class="dt">pred=</span><span class="kw">predict</span>(fmodel, <span class="dt">newdata=</span>buzztrain))
<span class="kw">print</span>(<span class="kw">accuracyMeasures</span>(rtrain$pred, rtrain$truth))</code></pre>
<pre><code>## [1] &quot;precision= 1 ; recall= 0.999360613810742&quot;
##      pred
## truth    0    1
##     0 5550    0
##     1    1 1563
##   model accuracy     f1 dev.norm
## 1 model   0.9999 0.9994 0.007768</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">
<span class="kw">print</span>(<span class="st">&#39;test&#39;</span>)</code></pre>
<pre><code>## [1] &quot;test&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">rtest &lt;-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">truth=</span>buzztest$buzz, <span class="dt">pred=</span><span class="kw">predict</span>(fmodel, <span class="dt">newdata=</span>buzztest))
<span class="kw">print</span>(<span class="kw">accuracyMeasures</span>(rtest$pred, rtest$truth))</code></pre>
<pre><code>## [1] &quot;precision= 0.831460674157303 ; recall= 0.836158192090395&quot;
##      pred
## truth   0   1
##     0 584  30
##     1  29 148
##   model accuracy     f1 dev.norm
## 1 model   0.9254 0.6952    4.122</code></pre>
<p>Notice the extreme fall-off from training to test performance, the random forest over fit. In fact the random forest fit all the data if it sees it during training:</p>
<pre class="sourceCode r"><code class="sourceCode r">fmodel &lt;-<span class="st"> </span><span class="kw">randomForest</span>(<span class="kw">as.formula</span>(bzFormula),
                      <span class="dt">data=</span>buzzdata,
                      <span class="dt">mtry=</span><span class="kw">floor</span>(<span class="kw">sqrt</span>(<span class="kw">length</span>(varslist))),
                      <span class="dt">ntree=</span><span class="dv">101</span>,
                      <span class="dt">importance=</span>T)
<span class="kw">print</span>(<span class="st">&#39;all data&#39;</span>)</code></pre>
<pre><code>## [1] &quot;all data&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">rall &lt;-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">truth=</span>buzztrain$buzz, <span class="dt">pred=</span><span class="kw">predict</span>(fmodel, <span class="dt">newdata=</span>buzztrain))
<span class="kw">print</span>(<span class="kw">accuracyMeasures</span>(rall$pred, rall$truth))</code></pre>
<pre><code>## [1] &quot;precision= 1 ; recall= 0.999360613810742&quot;
##      pred
## truth    0    1
##     0 5550    0
##     1    1 1563
##   model accuracy     f1 dev.norm
## 1 model   0.9999 0.9994 0.007768</code></pre>
<p>To try and control the over-fitting we build a new model with the tree complexity limited to 100 nodes and the node size to at least 20. This is not necessarily a better model (in fact it scores slightly poorer on test), but it is one where the training procedure didn't have enough freedom to memorize the training data (and therefore maybe had visibility into some trade-offs.</p>
<pre class="sourceCode r"><code class="sourceCode r">fmodel &lt;-<span class="st"> </span><span class="kw">randomForest</span>(<span class="kw">as.formula</span>(bzFormula),
                      <span class="dt">data=</span>buzztrain,
                      <span class="dt">mtry=</span><span class="kw">floor</span>(<span class="kw">sqrt</span>(<span class="kw">length</span>(varslist))),
                      <span class="dt">ntree=</span><span class="dv">101</span>,
                      <span class="dt">maxnodes=</span><span class="dv">100</span>,
                      <span class="dt">nodesize=</span><span class="dv">20</span>,
                      <span class="dt">importance=</span>T)

<span class="kw">print</span>(<span class="st">&#39;training&#39;</span>)</code></pre>
<pre><code>## [1] &quot;training&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">rtrain &lt;-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">truth=</span>buzztrain$buzz, <span class="dt">pred=</span><span class="kw">predict</span>(fmodel, <span class="dt">newdata=</span>buzztrain))
<span class="kw">print</span>(<span class="kw">accuracyMeasures</span>(rtrain$pred, rtrain$truth))</code></pre>
<pre><code>## [1] &quot;precision= 0.864364981504316 ; recall= 0.896419437340154&quot;
##      pred
## truth    0    1
##     0 5330  220
##     1  162 1402
##   model accuracy     f1 dev.norm
## 1 model   0.9463 0.7748    2.967</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">
<span class="kw">print</span>(<span class="st">&#39;test&#39;</span>)</code></pre>
<pre><code>## [1] &quot;test&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r">rtest &lt;-<span class="st"> </span><span class="kw">data.frame</span>(<span class="dt">truth=</span>buzztest$buzz, <span class="dt">pred=</span><span class="kw">predict</span>(fmodel, <span class="dt">newdata=</span>buzztest))
<span class="kw">print</span>(<span class="kw">accuracyMeasures</span>(rtest$pred, rtest$truth))</code></pre>
<pre><code>## [1] &quot;precision= 0.809782608695652 ; recall= 0.84180790960452&quot;
##      pred
## truth   0   1
##     0 579  35
##     1  28 149
##   model accuracy     f1 dev.norm
## 1 model   0.9204 0.6817    4.401</code></pre>
<p>And we can also make plots.</p>
<p>Training performance: <!-- Trick: since this block is cached the side-effect (saving a new copy --> <!-- of the PDF will not happen unless the block is re-run. --></p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(ggplot2)
<span class="kw">ggplot</span>(rtrain, <span class="kw">aes</span>(<span class="dt">x=</span>pred, <span class="dt">color=</span>(truth==<span class="dv">1</span>),<span class="dt">linetype=</span>(truth==<span class="dv">1</span>))) +<span class="st"> </span>
<span class="st">   </span><span class="kw">geom_density</span>(<span class="dt">adjust=</span><span class="fl">0.1</span>,)</code></pre>
<div class="figure">
<img src="figure/plottrain.png" alt="plot of chunk plottrain" /><p class="caption">plot of chunk plottrain</p>
</div>
<p>Test performance: <!-- Trick: since this block is cached the side-effect (saving a new copy --> <!-- of the PDF will not happen unless the block is re-run. --></p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">ggplot</span>(rtest, <span class="kw">aes</span>(<span class="dt">x=</span>pred, <span class="dt">color=</span>(truth==<span class="dv">1</span>),<span class="dt">linetype=</span>(truth==<span class="dv">1</span>))) +<span class="st"> </span>
<span class="st">   </span><span class="kw">geom_density</span>(<span class="dt">adjust=</span><span class="fl">0.1</span>)</code></pre>
<div class="figure">
<img src="figure/plottest.png" alt="plot of chunk plottest" /><p class="caption">plot of chunk plottest</p>
</div>
<p>Note the classifier scores are concentrated near zero and one (meaning the printed confusion matrices pretty much capture the whole story and the density plots or any sort of ROC plot doesn't add much value in this case).</p>
<p>Save prepared R environment. <!-- Another way to conditionally save, check for file. --> <!-- message=F is letting message() calls get routed to console instead --> <!-- of the document. --></p>
<pre class="sourceCode r"><code class="sourceCode r">fname &lt;-<span class="st"> &#39;thRS500.Rdata&#39;</span>
if(!<span class="kw">file.exists</span>(fname)) {
   <span class="kw">save</span>(<span class="dt">list=</span><span class="kw">ls</span>(),<span class="dt">file=</span>fname)
   <span class="kw">message</span>(<span class="kw">paste</span>(<span class="st">&#39;saved&#39;</span>,fname))  <span class="co"># message to running R console</span>
   <span class="kw">print</span>(<span class="kw">paste</span>(<span class="st">&#39;saved&#39;</span>,fname))    <span class="co"># print to document</span>
} else {
   <span class="kw">message</span>(<span class="kw">paste</span>(<span class="st">&#39;skipped saving&#39;</span>,fname)) <span class="co"># message to running R console</span>
   <span class="kw">print</span>(<span class="kw">paste</span>(<span class="st">&#39;skipped saving&#39;</span>,fname))   <span class="co"># print to document</span>
}</code></pre>
<pre><code>## [1] &quot;skipped saving thRS500.Rdata&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">paste</span>(<span class="st">&#39;checked at&#39;</span>,<span class="kw">date</span>())</code></pre>
<pre><code>## [1] &quot;checked at Fri Nov  8 11:23:56 2013&quot;</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">system</span>(<span class="kw">paste</span>(<span class="st">&#39;shasum&#39;</span>,fname),<span class="dt">intern=</span>T)  <span class="co"># write down file hash</span></code></pre>
<pre><code>## [1] &quot;304895b8b5860ac5c995e10bd3b8c995820d60a0  thRS500.Rdata&quot;</code></pre>