---
title: "Data Preparation and Management with Stata"
author: "Jim Tyson"
date: "24/01/2020"
output:
  html_document:
    df_print: paged
    toc: yes
  word_document:
    toc: yes
  pdf_document:
    toc: yes
bibliography: references.bib
---
# To be added
- Long to wide
- Add frames for multiple datasets
# Some conventions
Throughout I give examples like:
use "path\filename"
where `path` is any legal path for your operating system. In Windows this might be something like:
"c:\user\jrt\stata\accessproject\data\"
and `filename` is the name of a file on your system, for example:
rawdata.dta
In some code examples, I use a dummy name for a variable: `myvar`. If more than one is needed I use `myvarone, myvartwo...myvarn`.
# Motivation
## Personal motivation
When I first learned Stata, from great teachers, every learning opportunity concentrated on performing and interpreting some statistical test or creating and interpreting a model. And we created a few graphs.
When I left the classroom and began using Stata 'in anger' (as one so often does), I discovered that if my data were not presented to me in a nice, clean, consistent, properly classified and well labeled form, I would need to spend a long time, sometimes hours, diagnosing what was causing me trouble and working out how to fix it.
This is my contribution to saving other data analysts from that pain.
## General motivation for managing your data and variables
- the analysis is only as good as the data;
- good data management saves time and reduces stress;
- well managed data is easier to share.
In what follows I assume that you will *try to script everything* but will, like me, fall short. If you work by writing scripts in **do files** then your work is easy to adapt, easy to correct and easy to reproduce [^1].
[^1]: [If you are not familiar with do files, you will need to read this help page before going any further](https://www.stata.com/manuals13/gsw13.pdf)
However, the Stata 'do file language' is not a conventional programming language and you cannot always check and then conditionally manipulate data in the way that you might using Perl, Python or R. But the best is the enemy of the good.
# A project folder structure
Most projects benefit from having a consistent folder structure. I often have a structure like this:
1. data if this is stored locally;
2. scripts[^2]: initiation scripts - the data reading, housekeeping and cleaning scripts; analysis scripts - visualising, estimating, modelling;
3. logs;
4. reports - if possible dynamic markdown documents;
5. images.
[^2]: The most useful ancillary skill sets that I would recommend to any Stata user are (in no particular order):
• [SQL](https://www.tutorialspoint.com/sql/index.htm)
• [Using GNU make](https://makefiletutorial.com/#getting-started)
• [Simple bash data manipulation](https://ryanstutorials.net/bash-scripting-tutorial/) (see also [this manual](https://www.ucl.ac.uk/~ccaajim/unix-further-manual.docx)).
**The raw data files are read only.**
The raw data can be in a number of formats. If possible a plain text delimited format - such as CSV - is preferred. A plain text format avoids a number of issues that can make data cleaning and set up unnecessarily complicated:
1. proprietary formats may not be readily readable in Stata;
2. some data types require careful manipulation when transferred between file formats - especially dates;
3. if your data is plain text, you can use almost any tool in preparation;
4. you can share data even if your collaborators don't use Stata.
Cleaned data sets should only be stored temporarily - arguably not at all. The raw data file should be treated as read only. Your initiation script might clean and transform your data, producing a data set fit for analysis. Ideally this data set should not be saved - you call the analysis script over the data set in memory after its transformation. You can call a do file from **within** another do file using a command like
do myscriptfilename
but of course change the path and file name to match your set-up.
# Session logs
For your own debugging, code review and record keeping purposes, you should also log your sessions. You can include a `log using` command at the beginning of your scripts. The log of your session can be in either SMCL (a Stata-specific markup language), which is viewed with the Stata `view` command, or plain text, which can be viewed with any text editing package, including the Stata do file editor.
There is a second kind of log file: the command log. This log contains **only** the commands issued in a session and no results. It is always in plain text format and this makes it useful for quick conversion to a rough and ready **do** script.
Log files should ideally have their own folder and a consistent file name-scheme. If you wish to create logs that are automatically named with the date you can use a command like this:
log using "$S_DATE.log", append
or
cmdlog using "commands $S_DATE.log", append
I have used the **append** option. This will generate a warning if the file mentioned does not yet exist, but the log will still be created.
Include lines to close logs in your scripts.
log close
cmdlog close
# A quick note on `preserve` and `restore`
A good alternative to creating and saving transformed data files locally is to use Stata's `preserve` command before doing any data transformation and analysis (this snapshots the pre-transformation state of the data) and use `restore` after analysis.
Some operations overwrite data in memory. Since we aim to always preserve original data, we do not want to overwrite data in memory and then write to file. Nor do we want a proliferation of modified data files.
If we wish to make some **temporary** change to a data set, then we should use **`preserve`** and **`restore`**. `Preserve` takes a snapshot of the data set in its current state. We can then process the data in anyway we like and later use `restore` to return to the state in the snapshot.
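A minimal sketch of the pattern, using the exam-score data from the examples below:

    preserve
    drop if missing(maths)    // a temporary transformation
    summarize maths           // analysis on the modified data
    restore                   // the snapshot is back in memory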
In some circumstances you may decide to create a file and use it temporarily. It is strongly recommended that you either:
1. delete the file after your analysis is complete (using the Stata `erase` command) and recreate it for any further analysis; or
2. use a **make** utility[^3] for conditional execution of scripts.
[^3]: This is not the place to describe the use of makefiles, but the topic is eminently googleable. A good reference for general use is @baker2020.
This conservative approach towards the raw data, and reluctance to create permanent files of processed data, is critical to the goal of reproducible research practices.
# A quick note on `capture` and `quietly`
When a Stata command terminates or a process finishes, Stata (usually invisibly) creates a **return code** which indicates whether the process was a success or if errors of some kind were detected.
In Stata scripts you will often see a simple line like
tostring id, force replace
which would force Stata to represent all the values in **id** as strings. If this were in a script, and **id** already contained all string values, we would not want the script to fail at that point. To avoid this happening we can **capture** the **return code** of a command and let the execution of the script continue by preceding the operation with `capture`
capture tostring id, force replace
This executes ignoring errors and swallowing any error output. The **return code** gives an indication of what happens when a command tries to execute. It is stored in a macro **\_rc** and can be displayed
disp _rc
Usually a return code of 0 indicates success and anything else is some kind of error or anomaly.
As a further example, when we open a log file, we usually don't want our script to fail because the log is already open.
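A common idiom for this (a short sketch): close any log that might be open, swallowing the error if none is, then start the new log.

    capture log close                  // no error even if no log is open
    log using "session.log", replace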
Return codes from `capture` can also be used to create complex branching behaviour in scripts - but that is out of scope here.
The `quietly` keyword prevents the command following sending output to the console. We very often just accept Stata's default output when we run commands, but this can lead to the display of a lot of screen-lines that we don't really want to see. Using `quietly` either on individual lines or with `{}` for scope, allows us to run a process without viewing all the output and then perhaps using `disp` to show selective results. Thus:
quietly summarize maths, detail
disp r(kurtosis)
or
    quietly {
        regress history maths
        predict resid, resid
    }
    summarize resid
Sometimes you want to see all of the output in the console and sometimes you don't. It's a good idea to get more selective with `quietly`.
## Set graphics off
When you use do files to automate your analysis, you will soon start producing your visualisations in scripts and exporting them to disk, like this:
    scatter english history, msymbol(oh) title("History and English") ///
        caption("A plot of the regression of History against English") ytitle("History") xtitle("English")
    graph export scattereh.png
If you produce these graphs from your do file, `quietly` will not suppress the creation of the graph window displaying the plot. Since you do not want to see each plot as it is produced, you can use
set graphics off
before the section of your script that produces the visualisations. You will need to `set graphics on` if you wish to begin producing graph windows again.
# Reading your data
If you have a single source of data then you can open it with
use "path\filename"
if it is Stata data, or with one of the several `import` command variations if it is not Stata data. For a CSV file the command would be:
import delimited c:\filename.csv, delimiter(comma) clear
(Where we assume that the data file is in the root of your C drive.)
For an Excel file you might have
import excel c:\filename.xls, sheet("results") firstrow case(lower)
In this example, the data are in the workbook sheet **results**, variable names are in the first row and all variable names are forced to lower case.
If the data is in more than one file, then you will need to open one file and then merge the second. The two files need to share a unique identifying column or a combination of columns that uniquely identifies a case. Suppose for example that you have a file **results** that looks like this:
| surname | sex | class | maths | english | history |
|-----------|-----|-------|-------|---------|---------|
| ADAMS | 2 | 1 | 55 | 63 | 65 |
| ALI | 2 | 1 | 52 | 46 | 35 |
| BAGAL | 1 | 3 | 51 | 58 | 55 |
| BENJAMIN | 1 | 2 | 59 | 70 | 68 |
| BLAKEMORE | 2 | 2 | 56 | 38 | 40 |
| BUCHAN | 1 | 3 | 45 | 62 | 59 |
| CHULANI | 1 | 3 | 63 | 69 | 69 |
and assume that **surname** is unique. We could merge this with a second file **gradedresults** with a structure like
| surname | grade |
|-----------|-------|
| ADAMS | a |
| ALI | c |
| BAGAL | b |
| BENJAMIN | a |
| BLAKEMORE | c |
| BUCHAN | b |
| CHULANI | a |
with the Stata command:
merge 1:1 surname using gradedresults
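After merging, it is worth checking the **_merge** variable that `merge` creates, which records how each case matched:

    tabulate _merge    // 3 = matched; 1 = in master only; 2 = in using only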
### Remote files
Increasingly analysts keep data on the net. If data is kept on a web server, then you will not usually have to worry about backing up data and you will be less likely to accidentally delete or overwrite data.
Stata provides a useful method for using data stored on a web server: `webuse`.
#### Setting the URL and reading the file
Before using a Stata data file stored on a web server you must set the URL of the server. This is done as follows:
webuse set http://somedomain/somepath/folder
Now you read the file
webuse mydata
The variables in the data set are read into memory in the normal way.
#### Reading non-Stata files over the web
You can read a csv file with variable names in the first row:
import delimited "https://www.ucl.ac.uk/~ccaajim/results.csv", varnames(1) clear
and the same for an Excel file:
import excel "https://www.ucl.ac.uk/~ccaajim/results.xlsx", firstrow clear
### Dirty or untidy?
Not to push an analogy too far, we need to distinguish two things:
#### Tidy data
In tidy data:
1. Each variable forms a column;
2. Each observation forms a row;
3. Each type of observational unit forms a table.
Many data analysts promote the use of tidy data. You can read about the layout and the motivation for it at [this CRAN page](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (short link: <https://tinyurl.com/tidydef>).
#### Clean data
By clean data, for current purposes we mean:
1. Variables have the correct computational type;
2. Values for variables are all within the correct range;
3. No mandatory items have missing values;
4. All uniqueness constraints hold;
5. Cross field constraints hold;
6. Values recorded are, within accepted limits, accurate;
7. Data is complete within accepted limits and missing data is suitably coded;
8. Data are internally consistent.
This is a more complex matter than making data tidy.
In general having untidy data will make your analytical job more difficult and your processes less clear, while having dirty data will compromise the quality of your analysis - however carefully carried out.
## Reshaping
Data may be either **wide** or **long** in shape. In wide format, if there is more than one measure for each subject or case, the measures are in separate columns. In long format, each observation is a row (and so a case may have data in more than one row).
Here is some of the data on student exam performance in wide format:
| surname | sex | class | maths | english | history |
|-----------|-----|-------|-------|---------|---------|
| ADAMS | 2 | 1 | 55 | 63 | 65 |
| ALI | 2 | 1 | 52 | 46 | 35 |
| BAGAL | 1 | 3 | 51 | 58 | 55 |
| BENJAMIN | 1 | 2 | 59 | 70 | 68 |
| BLAKEMORE | 2 | 2 | 56 | 38 | 40 |
| BUCHAN | 1 | 3 | 45 | 62 | 59 |
| CHULANI | 1 | 3 | 63 | 69 | 69 |
There is only one row per student and each row contains three examination scores.
Here is some of the same data in long format:
| surname | sex | class | subject | exam |
|-----------|-----|-------|---------|------|
| ADAMS | 2 | 1 | maths | 55 |
| ADAMS | 2 | 1 | english | 63 |
| ADAMS | 2 | 1 | history | 65 |
| ALI | 2 | 1 | maths | 52 |
| ALI | 2 | 1 | english | 46 |
| ALI | 2 | 1 | history | 35 |
Sometimes the software you use will require that the data be in one or other format.
### Wide to long
To reshape your data from wide to long, `reshape` needs a common 'stub' in the names of the original separate columns, so I rename the subject exam scores:
rename (maths english history)(mathsexam englishexam historyexam)
And then use the stub `@exam` in a `reshape long` command like this:
reshape long @exam, i(id) j(subject) string
This means **reshape the data to long format using the variable names currently ending with 'exam' as the values for the new variable 'subject', with 'id' as the case identifier, and inserting the exam score into the new exam column**.
Stata reports a summary of the reshape:

| Data | Wide | Long |
|-----------------|--------------------------------------|---------|
| No of rows | 30 | 90 |
| No of variables | 7 | 6 |
| j variable name | | subject |
| x_ij variables | mathsexam, englishexam, historyexam | exam |
    . list in 1/5

         +-----------------------------------------------+
         | id   subject    surname       sex  class  exam |
         |-----------------------------------------------|
      1. |  1   english    ALI             2      1    35 |
      2. |  1   history    ALI             2      1    46 |
      3. |  1   maths      ALI             2      1    52 |
      4. |  2   english    BLAKEMORE       2      1    40 |
      5. |  2   history    BLAKEMORE       2      1    38 |
         +-----------------------------------------------+
### Long to wide
# Check the data
A useful technique for a general check of the data that you can eyeball in the console involves a special form of the `describe` command.
The procedure **overwrites** the data in memory and in your script you should precede this operation by a `preserve` command and follow it by `restore`.
    preserve
    describe, replace clear
    list
    restore
This replaces the data in memory with a summary report of information about the data set.
Consider this data:
| surname | sex | class | maths | english | history |
|-----------|-----|-------|-------|---------|---------|
| ADAMS | 2 | 1 | 55 | 63 | 65 |
| ALI | 2 | 1 | 52 | 46 | 35 |
| BAGAL | 1 | 3 | 51 | 58 | 55 |
| BENJAMIN | 1 | 2 | 59 | 70 | 68 |
| BLAKEMORE | 2 | 2 | 56 | 38 | 40 |
The data contains six variables. If we use `describe` we get:
    Contains data
      obs:    30
     vars:     6
     size:   480

                    storage   display    value
    variable name     type    format     label      variable label
    ---------------------------------------------------------------
    surname           str10   %10s
    sex               byte    %8.0g
    class             byte    %8.0g
    maths             int     %8.0g
    english           byte    %8.0g
    history           byte    %8.0g
    Sorted by:
    Note: Dataset has changed since last saved.
You can quickly check the variable names, types, labels and so on in this way. To get a summary of your variables use `list` after this describe operation.
Remember that this operation **overwrites the dataset in memory**. You no longer have your original data in the workspace and to use your data you must re-load the file, or you must have preserved it and restored it.
## Data type and variable format
The basic data types in Stata are:
| Type | Description |
|--------|-----------------------------------------------------------|
| byte | smallish numbers |
| int | largeish integers |
| long | very large integers |
| float | numbers including fractional values (approximations) |
| double | longer numbers including fractional values (approximations) |
| strN | strings of length N |
| strL | very long strings (up to 2,000,000,000 characters) |
Each of these has a default display format. Stata display formats begin with `%`, followed by a number indicating the width, a stop (i.e. `.`), then the decimal precision, and usually `g` for general format or `f` for fixed format.
For numbers in general format, as many decimal places are displayed as fit within the width. In fixed format, the number of decimal places displayed is fixed by the number following `.`, however wide the variable.
| Type | Default format |
|--------|--------|
| byte | %8.0g |
| int | %8.0g |
| long | %12.0g |
| float | %9.0g |
| double | %10.0g |
| strN | %Ns |
| strL | %9s |
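You can change a variable's display format, without changing its storage type, with the `format` command. A small sketch using the exam-score variable **maths** from the running example:

    format maths %6.2f    // show maths with two fixed decimal places
    describe maths        // confirm the new display format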
## Type checking
A **type** or **data type** is how the computer system or programming language knows how to treat the data. So, a different set of computations is possible with integers than with text data. Ensuring that data is recognised as the correct type helps make for safe coding.
The division between numeric data and string data (sometimes called text or character data) is one of the most obvious and fundamental. Consider that although a telephone number is a string of characters that may look like a number, we should not take the square root of someone's telephone number or calculate 'average telephone number' from our data.
Stata has two non-numeric data types: short string and long string. The short string (probably the most common string type) is limited to 2,045 characters. The long string can hold up to 2 billion characters.
In variable definition, Stata uses the prefix **str#** (where **\#** is a number representing length) for short strings and **strL** for long strings.
Stata has five different numeric types each suitable for storing data elements of different size. They are
| Storage type | Minimum | Maximum | Bytes |
|--------------|-------------------------:|------------------------:|------:|
| byte | -127 | 100 | 1 |
| int | -32,767 | 32,740 | 2 |
| long | -2,147,483,647 | 2,147,483,620 | 4 |
| float | -1.70141173319 × 10^38 | 1.70141173319 × 10^38 | 4 |
| double | -8.9884656743 × 10^307 | 8.9884656743 × 10^307 | 8 |
If you have a large data set - large enough that you experience difficulty in loading data - you should think about the storage types of your numeric data to see whether they could be stored more efficiently. There is a Stata command `compress` which will attempt to recast your storage types for a data set in memory to use space more efficiently.
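For example, a quick before-and-after check:

    describe, short    // note the size of the data set
    compress           // recast variables to smaller types where the values allow
    describe, short    // the size should be the same or smaller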
You should also be aware that fractional values stored as **float** can cause confusion. A decimal fraction such as **7.1** has no exact binary representation, so stored as a float it becomes something like **7.0999999**. If you then compare the variable against a literal with a relational operator like this:
gen flag = 1 if myvar == 7.1
with **myvar** having type float, you risk a wrong result, because Stata evaluates the literal 7.1 in double precision, which does not exactly equal the float stored in **myvar**.
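The standard remedy is Stata's `float()` function, which rounds the literal to float precision so that both sides of the comparison agree:

    gen flag = 1 if myvar == float(7.1)    // safe comparison with a float variable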
Another basic command for type checking in Stata is `ds` with the option `detail`. This will return a list of variables in your data and their current storage types. The output is substantially the same as that of `describe`. For example:
    . ds, detail

                    storage   display    value
    variable name     type    format     label      variable label
    ---------------------------------------------------------------
    surname           str10   %10s
    sex               byte    %8.0g
    class             byte    %8.0g
    maths             int     %8.0g
    english           byte    %8.0g
    history           byte    %8.0g
There is a useful option on `ds`, `has`. This option allows you to specify **which type** of variables you wish to list in the results, so:
    . ds, detail has(type byte)

                    storage   display    value
    variable name     type    format     label      variable label
    ---------------------------------------------------------------
    sex               byte    %8.0g
    class             byte    %8.0g
    english           byte    %8.0g
    history           byte    %8.0g
(You should confirm that you understand why **maths** is not included in this list).
You can also use `ds, has(format %8.0g)` (or indeed any format type) to check for variables with a specific format. You can now visually examine the type and format information for your variables for correctness.
### Coerce to string or numeric
#### Destring and back again
Data may not always be read into Stata correctly. If, after checking, you have a variable which should be numeric but is stored as a string, you have to convert it. The command to convert from string to numeric has this format:
destring originalstringdata, replace
The converse is achieved with `tostring`.
tostring originalnumericdata, generate(newstringdata)
In each case you can either **replace** the original values in place, or **generate** a new variable to hold the new values.
### Coercing to integer
If you have a variable of type float and you wish to convert it to integer values, then assuming that rounding is appropriate you should generate a new variable of type integer with the following code:
gen int myvari = round(myvar)
where I have chosen `myvari` as the new variable name. As well as `round()`, you can use the functions `floor()` and `ceil()`.
## Range checking
The strategy for range checking is to create a semaphore or alert variable if a value is not in range for any case.
Consider the continuous variables in this example data
| surname | sex | class | maths | english | history |
|-----------|-----|-------|-------|---------|---------|
| ADAMS | 2 | 1 | 55 | 63 | 65 |
| ALI | 2 | 1 | 52 | 46 | 35 |
| BAGAL | 1 | 3 | 51 | 58 | 55 |
| BENJAMIN | 1 | 2 | 59 | 70 | 68 |
| BLAKEMORE | 2 | 2 | 56 | 38 | 40 |
Suppose that we know that the scores **maths**, **english** and **history** are examination scores out of 100. If any score is less than zero or more than 100, we will consider it out of range and we will have to find a strategy to deal with it.
So we will write a do file to generate a new variable to act as our semaphore - `outofrange` - with a value zero for all cases. Then create a list of relevant variables (here the three continuous variables **maths, english** and **history**) and use a conditional replace statement to replace the zeros with 1 for values out of range.
We will exploit the inbuilt Stata function **inrange()** which has this syntax:
`inrange(myvar, minimum, maximum)`
This function returns true (1) when `myvar` lies between the minimum and maximum values, inclusive. So the code might look like this:
    capture drop outofrange
    local continuous maths english history
    gen outofrange = 0
    foreach var of local continuous {
        replace outofrange = 1 if !inrange(`var',0,100)
    }
We use `capture` in the `drop` statement to ignore any error if `outofrange` does not yet exist (that is, for example, if we are running this routine for the first time).
We can now use
list if outofrange == 1
to examine any cases with values out of range.
The question remains what you should **do** with these cases, but that is a theoretical matter. You must decide.
## Factor variables
### Labelling values (and variables)
Consider a typical factor variable like ethnicity. The **Office for National Statistics** in the United Kingdom has a list of eighteen options for respondent self-identification of ethnic group. Let's look only at the **White** sub-category. Here respondents have the choices:
1. English/Welsh/Scottish/Northern Irish/British
2. Irish
3. Gypsy or Irish Traveller
4. Any other White background, please describe
For the purposes of entering, storing and manipulating data it makes good sense to use the numeric code assigned to each of these groups. However, in reporting (and often just eyeballing the data) it is good practice to provide 'human friendly' labels for the values.
This is done with the Stata `label` command.
There are two steps involved in labeling values for a factor variable. First define a set (we call it *ethnicityl* here) of label-code equivalences. Like this:
label define ethnicityl 1 "English/Welsh/Scottish/Northern Irish/British" 2 "Irish" 3 "Gypsy or Irish Traveller" 4 "Any other White background"
Here **ethnicityl** is a handle for this label definition and it is followed by pairs of code and label.
You can check what label definitions exist in your data with `label dir`.
The next step is to apply **ethnicityl** to the values of some variable in our data:
label values myvar ethnicityl
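To verify the result, you can inspect the definitions in memory and their code-label pairs:

    label dir               // names of all label definitions in memory
    label list ethnicityl   // the code-label pairs in ethnicityl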
### Defining and managing factors
Stata (unlike for example R or SPSS) does not require that you **define** a variable as a factor or categorical variable. Rather it is when you run analysis that you will need to specify whether some variable should be treated as categorical and its levels transformed (on the fly) to indicators ('dummy variables').
To check the range of factor variables you can use `inlist()`. For example
capture drop foutofrange
gen foutofrange = 0
replace foutofrange = 1 if !inlist(myfactorvar,1,2)
As before we can now list to see the results:
list if foutofrange == 1
### Generating dummy binary variables for factors
Using a factor variable with the `i.` prefix, for example
regress maths i.stream
avoids the necessity of creating dummy binary variables. A dummy binary variable recodes each level of a categorical variable into a new 0/1 variable. For example if we have a variable eyecolour with levels
    eyecolour
    blue
    brown
    green
    other
We can recode this as four variables, each taking a value 0 (false) or 1 (true)
| ecblue | ecbrown | ecgreen | ecother |
|:------:|:-------:|:-------:|:-------:|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
This can be done with `gen`, `replace` and `if`. However, a much simpler way to achieve this dummy coding is with `tabulate`:
tabulate eyecolour, generate(ec)
which produces
| eyecolour | Freq. | Percent | Cum. |
|:----------|------:|--------:|-------:|
| blue | 6 | 20.00 | 20.00 |
| brown | 5 | 16.67 | 36.67 |
| green | 6 | 20.00 | 56.67 |
| other | 13 | 43.33 | 100.00 |
| Total | 30 | 100.00 | |
and creates four new variables, for example:
| ec1 | ec2 | ec3 | ec4 |
|:---:|:---:|:---:|:---:|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 1 |
By default, Stata will use the original level names to create labels for these new variables of the form `eyecolour==blue` etc.
# Missing values
Stata can represent up to twenty seven distinct levels of missing values. These are represented as **.**, and **.a**...**.z**.
The default treatment of cases with missing values is to drop them from analysis (listwise deletion).
If your data has codes for missing values before you import it, then you can use `replace` to recode the data so that Stata recognises the missing values. Suppose for example that your missing codes are
| Code | Value |
|------|------------------------------|
| -99 | Missing completely at random |
| -98 | Missing at random |
| -97 | Missing not at random |
You can use a command like:
    replace myvar = .a if myvar == -99
    replace myvar = .b if myvar == -98
    replace myvar = .c if myvar == -97
and then apply a label set like
`label define formissing .a "MCAR" .b "MAR" .c "MNAR"`
and finally apply the set of labels **formissing** to your variable:
`label values myvar formissing`
This method works, but you will have to do it one variable at a time (or use a foreach loop). Perhaps better is to use the command `mvdecode`, like this:
`mvdecode myvarone myvartwo, mv(-99=.a \ -98=.b \ -97=.c)`
This encodes two variables in one instruction. Indeed, if your missing value code scheme is uniform across variables, you can use the construct **_all** to apply the encoding to all variables in the data:
`mvdecode _all, mv(-99=.a \ -98=.b \ -97=.c)`
While a few missing values may not critically impact your analysis and may allow for simply dropping cases, it is worthwhile thinking about how to encode missing values: not only because incomplete data may require special handling, but also because data may be missing for different reasons. Consider for example a survey that asks respondents for their date of birth: no answer is very different from an answer 'unknown'.
Supposing that you wish to exclude all missing values from some procedure (maybe a plot, for example), then you can use an `if` condition with a comparison operator. Missing values sort higher than any ordinary number, so
`if myvar < .`
keeps only the non-missing values (equivalently, `if !missing(myvar)`).
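For example, a brief sketch using the hypothetical **myvar**:

    count if missing(myvar)        // how many cases carry a missing value?
    summarize myvar if myvar < .   // explicitly restrict a procedure to non-missing values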
## Outliers
We should remember that an outlier is not an error, nor even a 'bad' data point. There is no reason to suppose that one should always exclude outliers from analysis: to do so is a statistical decision. Moreover, automatic detection of outliers is inherently limited and crude.
Let us assume for the moment Tukey's 1977 characterisation of an outlier [@tukey1977a]: a value beyond the whiskers, where the upper whisker is at $quartile\;3 + 1.5\times IQR$ and the lower is at $quartile\;1 - 1.5\times IQR$.
    tabstat maths, statistics(p25 p50 p75 iqr) save
    matrix statsc = r(StatTotal)       // rows follow statistics(): p25, p50, p75, iqr
    local p75 = statsc[3,1]
    local p25 = statsc[1,1]
    local IQR = statsc[4,1]
    disp "`p75'"
    disp "`IQR'"
    local upperwhisker = `p75' + 1.5*`IQR'
    local lowerwhisker = `p25' - 1.5*`IQR'
    disp "`upperwhisker'"
    disp "`lowerwhisker'"
    list surname maths if maths >= `upperwhisker'    // note: missing values also satisfy >=
    list surname maths if maths <= `lowerwhisker'
But it is very important not to rely on the output of this (or any other computation of extreme values) automatically. It is critical that you consider carefully what constitutes an extreme or otherwise interesting value and decide what to do about them. Graphical approaches such as scatter and box plots are often more conducive to understanding the data.
Notice also that this only detects univariate outliers - you can have combinations of values for a case that together make the case of interest or problematic and this procedure would not detect those.
With small data samples and a bivariate relationship, it can be useful to create a scatter plot, run a regression analysis and then examine the leverage-versus-residuals plot. So, imagining for a moment data with 30 rows like the following:
| surname | sex | class | maths | english | history |
|-----------|-----|-------|-------|---------|---------|
| ADAMS | 2 | 1 | 55 | 2 | 99 |
| ALI | 2 | 1 | 52 | 46 | 35 |
| BAGAL | 1 | 3 | 51 | 58 | 55 |
| BENJAMIN | 1 | 2 | 59 | 70 | 68 |
| BLAKEMORE | 2 | 2 | 56 | 38 | 40 |
We can investigate any very influential points with this code:
scatter english history
reg history english
capture gen id=_n
lvr2plot, mlabel(id)
predict l, leverage
hilo l english history id, high show(5)
This allows us to consider any outliers and their effect.
## Duplicates
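Stata's built-in `duplicates` command reports, lists and tags duplicated observations; a brief sketch:

    duplicates report                 // counts of duplicated observations
    duplicates list surname           // observations duplicated on surname
    duplicates tag, generate(dup)     // flag duplicates in a new variable dup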
## Constraints I: uniqueness
The simplest uniqueness constraint can be illustrated by considering records that have a unique id - for example patient id in a clinical trial. However, uniqueness constraints can be complex: it may be some combination of values that must be unique.
Look at **contract** in Stata help. <!-- amend results data to contain an id field with duplicates and check with contract surname -->
`contract varlist [if] [in] [weight] [, options]`
`contract` replaces the dataset in memory with a new dataset consisting of all combinations of varlist that exist in the data, plus a new variable that contains the frequency of each combination. As usual with processes that replace data, you should wrap this in `preserve` and `restore`.
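For example, a sketch of a simple uniqueness check, assuming (hypothetically) an identifier variable **id**:

    preserve
    contract id, freq(n)    // one row per distinct id, with its frequency n
    list id if n > 1        // any id listed here occurs more than once
    restore

Stata's `isid id` performs a similar check directly, exiting with an error if **id** does not uniquely identify the observations.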
## Constraints II: cross value constraints
If students' grades are based on their marks, then ensure that the correct relationship holds. How to check? Write code that correctly generates grade from marks and then compare with your original data, as in the sketch below. *Do not assume that your calculation is always correct - if you discover an anomaly you may need to investigate rather than just assume.* For example, a student's grade may have improved due to some mitigation process.
<!-- add a stream variable to the original results.dta file, after import check the imported stream variable with one calculated in a do file; or create treatment data where M should only ever get T1, never T2 and check with contract -->
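A sketch of such a check, assuming (hypothetically) that grade 'a' is awarded from an average of 60, 'b' from 50, and 'c' below that, using the average score **avxm** created later in this document:

    gen checkgrade = cond(avxm >= 60, "a", cond(avxm >= 50, "b", "c")) if !missing(avxm)
    list surname grade checkgrade if grade != checkgrade    // anomalies to investigate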
# Creating new variables in your data
Think carefully about **when** you create any derived variables: before or after checking? If you create them before checking, errors in the raw data may propagate into your derived variables; if you create them after checking, you will still need to check the derived variables themselves for correctness.
## Aggregate or summary variables
Once you have your data, you may need to create some variables derived from the raw data. One of the most useful Stata commands for this task is `egen`.
Let's take a simple example. Look at the head of our original data:
| surname | sex | class | maths | english | history |
|-----------|-----|-------|-------|---------|---------|
| ADAMS | 2 | 1 | 55 | 63 | 65 |
| ALI | 2 | 1 | 52 | 46 | 35 |
| BAGAL | 1 | 3 | 51 | 58 | 55 |
| BENJAMIN | 1 | 2 | 59 | 70 | 68 |
| BLAKEMORE | 2 | 2 | 56 | 38 | 40 |
Suppose that you would like to calculate for each student a measure of their overall examination performance. We can calculate the average score as
`(maths + english + history)/3`
These sorts of summaries of data **by case** (rather than by variable) are better created with `egen`.
First, create the average examination score for each student:
`egen avxm = rowmean(maths english history)`
Now, use **cut** to break this into groups of scores. I have chosen groups from zero to less than 50, from 50 to less than 60, and from 60 and above. The `icodes` option creates numeric codes (starting at 0) for each resulting group:
`egen stream = cut(avxm), at(0,50,60,101) icodes`
To check that the result is correct we can run this line:
`table stream, contents(min avxm max avxm)`
which will show us the lowest and highest data points in each group for our data.
| stream | min(avxm) | max(avxm) |
|--------|-----------|-----------|
| 0 | 41.66667 | 49.66667 |
| 1 | 51.66667 | 57.66667 |
| 2 | 60.33333 | 67 |
# Dates
Stata dates are **elapsed dates** with a reference date **0** of January 1, 1960. So, dates are represented internally as integers counting back or forward from this date.
Any other true date representation is a formatting of this representation.
This means that data you import into Stata which might look like, for example
`20-12-2022`
may not be correctly recognised as a date by Stata.
There are Stata functions to convert dates to elapsed dates and to format dates.
Let's imagine a file of data that looks like this:
    dateone      datetwo
    01-01-2000   01-01-2000
    03-02-2001   03-02-2005
    01-01-2000   01-01-1960
First, I will assume that the data are in two columns of a table in a single Excel worksheet and import them with this command:
`import excel https://www.ucl.ac.uk/~ccaajim/datesstataexample.xlsx, firstrow`
(You can use this command if you want to grab this small sheet from my website).
Now, how has Stata recognised the data? Check with `desc`, which shows the storage type and display format of the two variables. And now, let's try some simple mathematics with these dates:
`gen interval = datetwo - dateone`
Stata creates a variable of type `float`, with format `%9.0g`.
Since this interval is expressed in elapsed days, we can convert it to years by dividing by 365.25.
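For example, a one-line sketch (the new variable name is my own):

    gen intervalyears = interval/365.25    // elapsed days converted to (fractional) years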
## Converting strings to dates
Stata will convert many different string representations to dates. The function is `date()`. For example, suppose that we have a variable **idate** with a single row
Jan, 31, 2001
we can convert this with
gen newidate = date(idate, "MDY")
This will generate a float that is the elapsed date for January 31st, 2001.
The string in quotes in `date()` is called the mask. It is used to specify in what order the *month*, *day* and *year* elements occur in the input string.
## Extracting parts of dates
From any date variable, you can extract the *month*, *day* and *year* elements with the functions `month(datevar)`, `day(datevar)` and `year(datevar)`.
For example,
`gen yearalone = year(date1)`
`gen dayalone = day(date1)`
`gen monthalone = month(date1)`
## Formatting dates
Stata has a display format for dates, `%td`. This is applied with the `format` command:
format datedata %td
where **datedata** is a variable containing a recognised Stata date; it can also be a list of variables.
## Date mathematics
Stata has a function **td()** that allows literal dates to be understood as Stata elapsed dates. We can combine this with the function **datediff()** to calculate durations:
display datediff(td(1January2000), td(1January2001), "month")
gives the result 12. We can combine this with the function **today()** like so:
display datediff(td(1Mar2020), today(), "month")
and we can calculate fractional durations
display datediff_frac(td(1Mar2020), today(), "month")
More detail on date manipulation and calculation (including formatting) can be found using **help dates**.
# References