This repository has been archived by the owner on Sep 18, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 370
/
Copy pathbit005_hw07-hw05-comparing-dataframes.html
645 lines (621 loc) · 38 KB
/
bit005_hw07-hw05-comparing-dataframes.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Dean Attali & Jenny Bryan" />
<meta name="date" content="2014-12-15" />
<title>Comparing two almost-identical data.frames</title>
<script src="libs/jquery-1.11.0/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link href="libs/bootstrap-2.3.2/css/united.min.css" rel="stylesheet" />
<link href="libs/bootstrap-2.3.2/css/bootstrap-responsive.min.css" rel="stylesheet" />
<script src="libs/bootstrap-2.3.2/js/bootstrap.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
href="libs/highlight/default.css"
type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
window.setTimeout(function() {
hljs.initHighlighting();
}, 0);
}
</script>
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
</style>
<div class="container-fluid main-container">
<header>
<div class="nav">
<a class="nav-logo" href="index.html">
<img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
</a>
<ul>
<li class="home"><a href="index.html">Home</a></li>
<li class="faq"><a href="faq.html">FAQ</a></li>
<li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
<li class="topics"><a href="topics.html">Topics</a></li>
<li class="people"><a href="people.html">People</a></li>
</ul>
</div>
</header>
<div id="header">
<h1 class="title">Comparing two almost-identical data.frames</h1>
<h4 class="author"><em>Dean Attali & Jenny Bryan</em></h4>
<h4 class="date"><em>December 15, 2014</em></h4>
</div>
<div id="overview" class="section level2">
<h2>Overview</h2>
<p>In <a href="http://stat545-ubc.github.io/hw07_data-wrangling-grand-finale.html">hw07</a>, many accepted the challenge to split the <a href="https://github.com/jennybc/gapminder">Gapminder data</a> by country, write a country-specific delimited file, then reverse the process. Note this homework was all about data wrangling. Several students used <code>identical()</code> and <code>all.equal()</code> to compare before vs after and learned that they were not <em>exactly</em> the same. But they look the same to human eyes, so what’s up?</p>
<p>In this document, we walk through that example to show a workflow for determining <em>how</em> two objects differ.</p>
</div>
<div id="load-packages-and-data" class="section level2">
<h2>Load packages and data</h2>
<p>We’re going to use <code>plyr</code> and <code>dplyr</code> for a lot of data manipulation, and <code>gapminder</code> for data. <code>gapminder</code> is available on <a href="https://github.com/jennybc/gapminder">GitHub</a> and can be installed as follows:</p>
<pre><code>install.packages("devtools")
devtools::install_github("jennybc/gapminder")</code></pre>
<p>Let’s load the packages:</p>
<pre class="r"><code>library(gapminder)
library(plyr)
library(dplyr)
library(knitr)</code></pre>
<p>Just a quick sanity check to ensure we have gapminder properly loaded.</p>
<pre class="r"><code>tbl_df(gapminder)</code></pre>
<pre><code>## Source: local data frame [1,704 x 6]
##
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## .. ... ... ... ... ... ...</code></pre>
</div>
<div id="split-gapminder-to-one-file-per-country" class="section level2">
<h2>Split gapminder to one file per country</h2>
<p>First, it’s a good idea to create a separate directory for all these files we’re about to create.</p>
<pre class="r"><code>outdir <- file.path("tmp-gapminder")
dir.create(outdir, recursive = TRUE, showWarnings = FALSE)</code></pre>
<p>Now we use <code>d_ply()</code> to split the data by country and write the data for each country to its own file.</p>
<pre class="r"><code>d_ply(gapminder, ~ country, function(x) {
filename <- file.path(outdir, paste0(x$country[1], ".tsv"))
write.table(x, file = filename, quote = FALSE, sep = "\t" ,row.names = FALSE)
})</code></pre>
<p>Why do we use <code>d_ply()</code>? Because a data.frame goes in (<code>d</code>) and nothing comes out (<code>_</code>), ergo <code>d_ply()</code>. Remember to use this over <code>ddply()</code> or <code>dlply()</code> when there’s no return value, i.e. when you’re submitting the command for a side effect such as writing a file.</p>
<p>Also note the use of <code>file.path()</code> and <code>paste0()</code>, which is shortcut for <code>paste(..., sep = "")</code>. The use of <code>file.path()</code> to build paths will make you code more portable across operating systems.</p>
</div>
<div id="read-the-files-to-re-create-gapminder" class="section level2">
<h2>Read the files to re-create gapminder</h2>
<p>Let’s read the files back in and try to recreate the <code>gapminder</code> data.frame.</p>
<p>We know that the data files are <code>.tsv</code>, so we use a regular expression pattern to find all the files.</p>
<pre class="r"><code>fileRegex <- "^.*\\.tsv$"
out_files <- list.files(outdir, pattern = fileRegex, full.names = TRUE)
out_files %>% head</code></pre>
<pre><code>## [1] "tmp-gapminder/Afghanistan.tsv" "tmp-gapminder/Albania.tsv"
## [3] "tmp-gapminder/Algeria.tsv" "tmp-gapminder/Angola.tsv"
## [5] "tmp-gapminder/Argentina.tsv" "tmp-gapminder/Australia.tsv"</code></pre>
<p>OK, looks like we have a list with all our files, so let’s read them into <code>dejavu</code> and see if it matches the original <code>gapminder</code>!</p>
<pre class="r"><code>dejavu <- ldply(out_files, read.delim)
all.equal(gapminder, dejavu)</code></pre>
<pre><code>## [1] "Component \"country\": Attributes: < Component \"levels\": 2 string mismatches >"
## [2] "Component \"country\": 24 string mismatches"
## [3] "Component \"continent\": Attributes: < Component \"levels\": 4 string mismatches >"
## [4] "Component \"lifeExp\": Mean relative difference: 0.09774601"
## [5] "Component \"pop\": Mean relative difference: 0.1702937"
## [6] "Component \"gdpPercap\": Mean relative difference: 0.1890264"</code></pre>
<p>Bummer.</p>
<p>It looks like we have non-equalness in every variable except “year”.</p>
<p>Every time you run into such a problem, it will be a unique problem, and you must <em>read the message</em> for hints on where the differences lie.</p>
</div>
<div id="digging-into-the-non-identical-ness" class="section level2">
<h2>Digging into the non-identical-ness</h2>
<p>I find that the last three messages, about numerical differences, are the most self-explanatory, so I tackle that first. I’ll look at <code>gdpPercap</code> first since it has the largest difference. See the appendix for some background on the pitfalls of checking for equality of floating point numbers and on <code>all.equal()</code> versus <code>identical()</code>.</p>
<div id="gdp-per-capita" class="section level4">
<h4>GDP per capita</h4>
<p>How many observations have a GDP discrepancy? The threshold is somewhat arbitrary but also informed by reading the message re: mean relative difference.</p>
<pre class="r"><code>check_gdp_diff <- abs(gapminder$gdpPercap - dejavu$gdpPercap) > 0.01
sum(check_gdp_diff)</code></pre>
<pre><code>## [1] 24</code></pre>
<p>Hmm.. 24 observations have <code>gdpPercap</code> that is different by more than 0.01 between the two sources. Let’s see which observations these are.</p>
<pre class="r"><code>which(check_gdp_diff)</code></pre>
<pre><code>## [1] 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629
## [18] 630 631 632 633 634 635 636</code></pre>
<p>They seem to all be in a row, suggesting it’s a block of data from two countries. Let’s take a closer look at what the data from the two data.frames looks like at these observations.</p>
<pre class="r"><code>data.frame(
with(gapminder, data.frame(
continent, country, year, gdpPercap)),
with(dejavu, data.frame(
continent, country, year, gdpPercap))
) %>%
filter(check_gdp_diff) %>%
kable</code></pre>
<table>
<thead>
<tr class="header">
<th align="left">continent</th>
<th align="left">country</th>
<th align="right">year</th>
<th align="right">gdpPercap</th>
<th align="left">continent.1</th>
<th align="left">country.1</th>
<th align="right">year.1</th>
<th align="right">gdpPercap.1</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1952</td>
<td align="right">510.1965</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1952</td>
<td align="right">299.8503</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1957</td>
<td align="right">576.2670</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1957</td>
<td align="right">431.7905</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1962</td>
<td align="right">686.3737</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1962</td>
<td align="right">522.0344</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1967</td>
<td align="right">708.7595</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1967</td>
<td align="right">715.5806</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1972</td>
<td align="right">741.6662</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1972</td>
<td align="right">820.2246</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1977</td>
<td align="right">874.6859</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1977</td>
<td align="right">764.7260</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1982</td>
<td align="right">857.2504</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1982</td>
<td align="right">838.1240</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1987</td>
<td align="right">805.5725</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1987</td>
<td align="right">736.4154</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1992</td>
<td align="right">794.3484</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1992</td>
<td align="right">745.5399</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1997</td>
<td align="right">869.4498</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1997</td>
<td align="right">796.6645</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">2002</td>
<td align="right">945.5836</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">2002</td>
<td align="right">575.7047</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">2007</td>
<td align="right">942.6542</td>
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">2007</td>
<td align="right">579.2317</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1952</td>
<td align="right">299.8503</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1952</td>
<td align="right">510.1965</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1957</td>
<td align="right">431.7905</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1957</td>
<td align="right">576.2670</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1962</td>
<td align="right">522.0344</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1962</td>
<td align="right">686.3737</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1967</td>
<td align="right">715.5806</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1967</td>
<td align="right">708.7595</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1972</td>
<td align="right">820.2246</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1972</td>
<td align="right">741.6662</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1977</td>
<td align="right">764.7260</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1977</td>
<td align="right">874.6859</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1982</td>
<td align="right">838.1240</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1982</td>
<td align="right">857.2504</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1987</td>
<td align="right">736.4154</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1987</td>
<td align="right">805.5725</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1992</td>
<td align="right">745.5399</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1992</td>
<td align="right">794.3484</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">1997</td>
<td align="right">796.6645</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">1997</td>
<td align="right">869.4498</td>
</tr>
<tr class="odd">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">2002</td>
<td align="right">575.7047</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">2002</td>
<td align="right">945.5836</td>
</tr>
<tr class="even">
<td align="left">Africa</td>
<td align="left">Guinea-Bissau</td>
<td align="right">2007</td>
<td align="right">579.2317</td>
<td align="left">Africa</td>
<td align="left">Guinea</td>
<td align="right">2007</td>
<td align="right">942.6542</td>
</tr>
</tbody>
</table>
<p>Looks like the rows for Guinea and Guinea-Bissau have swapped their order. In <code>gapminder</code>, it’s Guinea then Guinea-Bissau. In <code>dejavu</code> it’s Guinea-Bissau then Guinea.</p>
<p>Let’s see what R thinks the order of these two words is:</p>
<pre class="r"><code>sort(c("Guinea-Bissau", "Guinea"))</code></pre>
<pre><code>## [1] "Guinea" "Guinea-Bissau"</code></pre>
<p>Puzzling … this agrees with the <code>gapminder</code> order. So why is <code>dejavu</code> different?</p>
<p>Inspect the order of the two countries in our vector of file names.</p>
<pre class="r"><code>grep("Guinea", out_files, value = TRUE)</code></pre>
<pre><code>## [1] "tmp-gapminder/Equatorial Guinea.tsv"
## [2] "tmp-gapminder/Guinea-Bissau.tsv"
## [3] "tmp-gapminder/Guinea.tsv"</code></pre>
<p>So here is the explanation of the country order in <code>dejavu</code> – it’s exactly the order from the filename vector. But why does Guinea-Bissau come before Guinea in this vector?</p>
<p>We obtained the filenames using <code>list.files()</code>. The documentation for <code>list.files()</code> says under the <strong>Value</strong> section that “The files are sorted in alphabetical order”.</p>
<p>OK… so the files are supposed to be sorted alphabetically, yet that’s not what we’re seeing - we just saw earlier than R knows that Guinea comes before Guinea-Bissau!</p>
<p>Or at least that’s what I initially thought. If you take a closer look at the files, you’ll realize what’s going on here is that R sorts the files based on the full file path, including its extension. So it’s comparing tmp-gapminder/Guinea-Bissau.tsv to tmp-gapminder/Guinea-Bissau.tsv.</p>
<p>Now do you see the problem? Both files are identical up to the end of “Guinea”, at which point one file has a <code>-Bissau.tsv</code> and the other has a <code>.tsv</code>. So R was behaving correctly all along, just not the way we wanted it to. We want to sort by the country name, not the full file path. Just to make sure our reasoning is correct, let’s confirm that a dash does indeed come before a period</p>
<pre class="r"><code>sort(c(".", "-"))</code></pre>
<pre><code>## [1] "-" "."</code></pre>
<p>Yes. Looks like we understand why we got the problem.</p>
<p>Now that we understand why <code>dejavu</code> has a different order, we just need to think of a way to fix it. The approach I will use it to sort the files based on the country name, before reading the files.<br />The way I do it is by extracting the country name from each file and sorting the file paths based on the country order. I slightly modified the <code>fileRegex</code> from above by just adding parentheses <code>()</code> around the <code>.*</code> so that we can backreference the country name (we learned about regex in hw08).</p>
<pre class="r"><code>fileRegex <- "^(.*)\\.tsv$"
out_files <- out_files[order(sub(fileRegex, "\\1", basename(out_files)))]</code></pre>
<p>If this line is confusing to you, just break it up into each piece and think for a minute what each step is doing. Reordering vectors using a similar approach can be very useful, so do take a minute to take that in.</p>
<p>Now let’s ensure that the order of the Guineas is correct in the new vector.</p>
<pre class="r"><code>grep("Guinea", out_files, value = TRUE)</code></pre>
<pre><code>## [1] "tmp-gapminder/Equatorial Guinea.tsv"
## [2] "tmp-gapminder/Guinea.tsv"
## [3] "tmp-gapminder/Guinea-Bissau.tsv"</code></pre>
<p>Success!<br />So let’s construct <code>dejavu</code> again.</p>
<pre class="r"><code>dejavu <- ldply(out_files, read.delim)
all.equal(gapminder, dejavu)</code></pre>
<pre><code>## [1] "Component \"continent\": Attributes: < Component \"levels\": 4 string mismatches >"</code></pre>
<p>Sweet, that got rid of most of the mismatches. Only one remains - and the error is not too cryptic and tells us that there are 4 mismatches in the levels of the continent variable. What odes this mean? Well, I’m not sure exactly, so one powerful tool that we always have to inspect any object is <code>str()</code>. Let’s try to see if we see a difference between <code>gapminder</code> and <code>dejavu</code> continent variables.</p>
<pre class="r"><code>str(gapminder$continent)</code></pre>
<pre><code>## Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...</code></pre>
<pre class="r"><code>str(dejavu$continent)</code></pre>
<pre><code>## Factor w/ 5 levels "Asia","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...</code></pre>
<p>Okay, we definitely see a difference. But this doesn’t show us enough information. We know that the message said something about the levels, so let’s look at those more closely.</p>
<pre class="r"><code>data.frame(gapminder = levels(gapminder$continent),
dejavu = levels(dejavu$continent))</code></pre>
<pre><code>## gapminder dejavu
## 1 Africa Asia
## 2 Americas Europe
## 3 Asia Africa
## 4 Europe Americas
## 5 Oceania Oceania</code></pre>
<p>Interesting! <code>gapminder</code> levels seem to be alphabetical, while <code>dejavu</code> levels are not. Now we just have to figure out what order the <code>dejavu</code> continent levels are following.<br />The first and most logical guess is that the continent levels are ordered according to the order in which they are seen in the data. An easy way to check what order they appear in the data is to look what the unique continents are.</p>
<pre class="r"><code>unique(gapminder$continent)</code></pre>
<pre><code>## [1] Asia Europe Africa Americas Oceania
## Levels: Africa Americas Asia Europe Oceania</code></pre>
<p>Uh huh, we’re on to something here! This output tells us that Asia is the first continent that appears in the data (meaning that the first row is an Asian country), and Europe is next (meaning that the first non-Asian row is European). Notice how the order of the continents does not match the order of the levels in the case of <code>gapminder</code>.</p>
<p><em>Note: It’s important to understand the difference between looking at the unique continents as they appear in the data vs looking at the continent levels, which are also unique, but can have a different ordering than the order in which they appear in the data.</em></p>
<p>If we run the same command on <code>dejavu</code>, we expect to see the same order of continents in the data since the data.frame is the same, but we know that the order of the levels is different.</p>
<pre class="r"><code>unique(dejavu$continent)</code></pre>
<pre><code>## [1] Asia Europe Africa Americas Oceania
## Levels: Asia Europe Africa Americas Oceania</code></pre>
<p>Yup, just as we suspected! See how in the case of <code>dejavu</code>, the order in which the continents are seen in the data is the same as the levels order. This is good, we’re learning.</p>
<p>So now we know that in order to make our two data.frames the same, we just need to reorder the continent levels of one of them. Alphabetical vs chronological doesn’t really make a difference, since both are not terribly useful for visualization. Remember that we often want to reorder factors for plotting, as we learned in hw05. Let’s choose to reorder the <code>dejavu</code> continent levels to be alphabetical. Since that is the default when creating a new factor, we just need to convert continent to character and back again.</p>
<pre class="r"><code>dejavu <- dejavu %>%
mutate(continent = factor(continent %>% as.character))
all.equal(gapminder, dejavu)</code></pre>
<pre><code>## [1] TRUE</code></pre>
<p>Success!<br />We were able to find all the differences between the original dataset and the written-to-files dataset. As you can see, there isn’t a clear step-by-step guide that will you can always follow. This is mostly problem-solving (sometimes combined with bashing your head against the wall) and you need to use the tools we have in R to solve the question of “what’s different?”. Some common tools are to use the output from <code>all.equal()</code>, <code>str)_</code>, <code>levels()</code>, <code>unique()</code>, or simply just looking at the raw data. Hopefully this can make you more comfortable in tackling a similar problem with different data in the future.</p>
</div>
</div>
<div id="bonus-eliminating-the-problem-at-the-source" class="section level2">
<h2>Bonus: eliminating the problem at the source</h2>
<p>How could we have eliminated all this faffing around? By deliberately sorting <code>dejavu</code> on <code>country</code> and only converting <code>continent</code> to factor after the aggregation was done. Watch.</p>
<pre class="r"><code>fileRegex <- "^.*\\.tsv$"
out_files <- list.files(outdir, pattern = fileRegex, full.names = TRUE)
dejavu <- ldply(out_files, read.delim, stringsAsFactors = FALSE)
all.equal(gapminder, dejavu)</code></pre>
<pre><code>## [1] "Component \"country\": 'current' is not a factor"
## [2] "Component \"continent\": 'current' is not a factor"
## [3] "Component \"lifeExp\": Mean relative difference: 0.09774601"
## [4] "Component \"pop\": Mean relative difference: 0.1702937"
## [5] "Component \"gdpPercap\": Mean relative difference: 0.1890264"</code></pre>
<pre class="r"><code>dejaNEW <- dejavu %>%
arrange(country) %>%
mutate_each(funs(factor), country, continent)
all.equal(gapminder, dejaNEW)</code></pre>
<pre><code>## [1] TRUE</code></pre>
<p>Lesson #1: never rely on row order “just because”. Always check or force things to be the way you need them to be.</p>
<p>Lesson #2: it is often easiest to keep things as character for data munging THEN convert to factor. Don’t be passive about factor level order.</p>
</div>
<div id="bonus-getting-the-data.frames-to-be-identical" class="section level2">
<h2>Bonus: getting the data.frames to be IDENTICAL</h2>
<p>As mentioned before, <code>identical()</code> checks for exact identity, and often times will return FALSE even when two objects seem completely similar to you. I wanted to see if we can get <code>dejavu</code> to be <code>identical()</code> to <code>gapminder</code>, so let’s go!</p>
<pre class="r"><code>dejavu <- dejaNEW # reset dejavu to it's fixed version
identical(gapminder, dejavu)</code></pre>
<pre><code>## [1] FALSE</code></pre>
<p>So we know that <code>all.equal()</code> thinks they are equal, but <code>identical()</code> doesn’t. We can’t use the output from <code>all.equal()</code> to direct us to the possible mismatches anymore - we’re on our own now. Scary!</p>
<p>Let’s see if we can see any obvious differences… Using <code>str</code> again:</p>
<pre class="r"><code>str(gapminder)</code></pre>
<pre><code>## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : num 1952 1957 1962 1967 1972 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ gdpPercap: num 779 821 853 836 740 ...</code></pre>
<pre class="r"><code>str(dejavu)</code></pre>
<pre><code>## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ gdpPercap: num 779 821 853 836 740 ...</code></pre>
<p>Oh, I do see a difference! The “year” variable is numeric in <code>gapminder</code> and <code>int</code> in <code>dejavu</code>. I’m not purposely going in circles here - this is how you troubleshoot these problems in the real world: incrementally, and with mistakes along the way. So let’s change <code>dejavu</code> to match <code>gapminder</code>:</p>
<pre class="r"><code>dejavu <- dejavu %>%
mutate(year = as.numeric(year))
identical(gapminder,dejavu)</code></pre>
<pre><code>## [1] FALSE</code></pre>
<p>Alright, still not identical. No more easy answers. Next I’ll have to try to zoom in on every single variable and see which ones are causing the non-identical-ness.</p>
<pre class="r"><code>identicalVars <-
sapply(colnames(gapminder), function(x){
identical(gapminder[[x]], dejavu[[x]])
})
(nonIdenticalVars <- names(identicalVars[!identicalVars]))</code></pre>
<pre><code>## [1] "pop" "gdpPercap"</code></pre>
<p>It looks like the pop and gdpPercap columns are the ones that are causing non-identity of the two data,frames. Those two columns are numeric - could it be that the floating-point number issue described above is the cause of this?</p>
<pre class="r"><code>which(gapminder$pop != dejavu$pop)</code></pre>
<pre><code>## [1] 289</code></pre>
<pre class="r"><code>which(gapminder$gdpPercap != dejavu$gdpPercap)</code></pre>
<pre><code>## [1] 289</code></pre>
<p>OK, looks like there is one row in which both the population and gdpPercap aren’t the same in both datasets. Is the difference tiny as we suspect?</p>
<pre class="r"><code>idxDiff <- which(gapminder$pop != dejavu$pop)
gapminder$pop[idxDiff] - dejavu$pop[idxDiff]</code></pre>
<pre><code>## [1] -4.768372e-07</code></pre>
<pre class="r"><code>gapminder$gdpPercap[idxDiff] - dejavu$gdpPercap[idxDiff]</code></pre>
<pre><code>## [1] -1.136868e-13</code></pre>
<p>Yup, both are tiny differences that could be attributable to computer storage. The difference these numbers is so small that unless we’re dealing with astrophysics computations, I doubt this matters. I would normally just leave our data in its current state and come to terms with the fact that it’s not “identical” to <code>gapminder</code>, but if it bothers you too much, there is a cheat we can apply to fix it. There might be better ways, but this is what I can think of immediately. It is a big hack so don’t repeat this at home!</p>
<pre class="r"><code>dejavu[idxDiff, nonIdenticalVars] <- gapminder[idxDiff, nonIdenticalVars]
identical(gapminder, dejavu)</code></pre>
<pre><code>## [1] TRUE</code></pre>
<p>There, I hope you’re happy now :) The last step is more for pleasing your OCD than anything else, but it’s nice to know that if we force these two numbers to be the same, <code>identical()</code> is finally TRUE!</p>
</div>
<div id="cleanup" class="section level2">
<h2>Cleanup</h2>
<p>Don’t forget to clean up after yourself.</p>
<pre class="r"><code>unlink(outdir, recursive = TRUE)</code></pre>
</div>
<div id="extras" class="section level2">
<h2>Extras</h2>
<div id="ensuring-your-filenames-are-valid" class="section level4">
<h4>Ensuring your filenames are valid</h4>
<p>One potential problem you might run into when creating files with filenames based on data is that not all filenames are legal or desirable.</p>
<p>For example, we’ve discussed many times in class how having spaces in a filename is a bad idea, and some countries do contain a space, so you might want to remove spaces from filenames. One possibility is to simply strip spaces, but another solution would be to replace spaces with underscores. (Of course, then you can ask “but what about a country name with an underscore?” Well, there are always ways around everything, but I’ll ignore that case for now.)</p>
<p>To do this, all you have to do is slightly modify the code that writes files:</p>
<pre><code>d_ply(gapminder, ~ country, function(x) {
country <- gsub(" ", "_", x$country[1])
filename <- file.path(outdir, paste0(country, ".tsv"))
write.table(x, file = filename, quote = FALSE, sep = "\t" ,row.names = FALSE)
})</code></pre>
<p>You can get even fancier and more advanced if you need. On Windows, filenames cannot contain colons or quotes, so you can replace all those characters in country names as well, but we don’t have to worry about that.</p>
</div>
<div id="floating-point-numbers" class="section level4">
<h4>Floating point numbers</h4>
<p>This is somewhat technical, so don’t worry if you don’t understand it, but the main take home message is this: R (and any other programming language) can be a little inaccurate when representing certain numbers.<br /><strong>Explanation</strong>: Because of the way computers work and store information, only integers and fractions whose denominator is a power of 2 can be represented exactly. Any other number has to be approximated and rounded off, though you generally don’t see this behavior because the rounding is performed at a very insignificant digit. As a result, two fractions that should be equal might not be equal in R, simply because different algorithms are used to compute them, so they may be rounded a little bit differently.</p>
<p>For example, anyone can tell you that 1 - 0.9 = 0.1, right? Let’s see what R says</p>
<pre class="r"><code>1 - 0.9</code></pre>
<pre><code>## [1] 0.1</code></pre>
<p>No problem, what was I making such a big fuss about?? Let’s look again…</p>
<pre class="r"><code>1 - 0.9 == 0.1</code></pre>
<pre><code>## [1] FALSE</code></pre>
<p>That’s the problem I was talking about. Another way to see this more clearly:</p>
<pre class="r"><code>1 - 0.9 - 0.1</code></pre>
<pre><code>## [1] -2.775558e-17</code></pre>
<p>As you can see, the difference is tiny. Most people can live their lives perfectly happy never knowing about this seemingly horrible shortcoming of computers, but it’s good to keep in mind.</p>
<p>You can Google for this to learn more, or <a href="http://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal">read this StackOverflow thread</a> to see a nice detailed discussion that is tailored to R. It is important to remember that this is NOT an R-specific problem though, so don’t hate on R because of this. (Though, granted, there are <a href="http://www.burns-stat.com/pages/Tutor/R_inferno.pdf">many other reasons to hate on R</a> if you so wish.)</p>
</div>
<div id="difference-between-all.equal-and-identical" class="section level4">
<h4>Difference between <code>all.equal</code> and <code>identical</code></h4>
<p>According to the <code>identical()</code> documentation: “The safe and reliable way to test two objects for being exactly equal”. Notice the stress on “exactly equal”.<br />According to the <code>all.equal()</code> documentation: “a utility to compare R objects x and y testing ‘near equality’”. What exactly do they mean by “near equality”? Unfortunately the documentation doesn’t tell us that exactly, but generally <code>all.equal()</code> will return <code>TRUE</code> if the two objects are equal enough for (almost) all intents and purposes, while <code>identical()</code> is a perfectionist with a gold medal in OCD that will only return <code>TRUE</code> if everything about the two objects is absolutely exactly the same.</p>
<p>For example, <code>all.equal</code> doesn’t care about the exact storage type of numerics if they pretty much represent the same thing, whereas <code>identical()</code> does.</p>
<pre class="r"><code>all.equal(as.integer(5), as.double(5))</code></pre>
<pre><code>## [1] TRUE</code></pre>
<pre class="r"><code>identical(as.integer(5), as.double(5))</code></pre>
<pre><code>## [1] FALSE</code></pre>
<p>Another example is that <code>all.equal()</code> is aware of the floating-point numbers problem described above, so it is a little lenient if two numbers are not exactly identical. As shown above, 1 - 0.9 - 0.1 is represented in R as -2.775557610^{-17}. But <code>all.equal()</code> realizes this is an insignificant difference and that for most everyday uses, that number is as good as <code>0</code>.</p>
<pre class="r"><code>all.equal(1 - 0.9 - 0.1, 0)</code></pre>
<pre><code>## [1] TRUE</code></pre>
<pre class="r"><code>identical(1 - 0.9 - 0.1, 0)</code></pre>
<pre><code>## [1] FALSE</code></pre>
<p>You can actually change the error tolerance of <code>all.equal()</code>, which is 1.490116110^{-8} by default (varies by computer). This means that any two numbers differing by less than 1.490116110^{-8} will be reported as the same.</p>
<pre class="r"><code>all.equal(1 - 0.9 - 0.1, 0, tolerance = 1e-15)</code></pre>
<pre><code>## [1] TRUE</code></pre>
<pre class="r"><code>all.equal(1 - 0.9 - 0.1, 0, tolerance = 1e-20)</code></pre>
<pre><code>## [1] "Mean relative difference: 1"</code></pre>
<p>We’re getting side-tracked, so if you really want to understand the checks that <code>all.equal()</code> performs or what the output means, you can look at the source code of all the different <code>all.equal()</code> methods: there is an <code>all.equal.numeric()</code> for numeric data, <code>all.equal.character()</code> for character data, etc. Just use tab completion to find out what the different <code>all.equal.*</code> methods are. In case you didn’t know, if you type the name of a function without parentheses, it will show you the function code (since a function is really just a variable) instead of running that function. So if you want to see the code of <code>all.equal.numeric()</code>, simply run <code>all.equal.numeric()</code>.</p>
<hr />
</div>
</div>
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>